I've been fascinated by reinforcement learning ever since the AlphaGo days. As an avid gamer, I'm strongly drawn to the concept of learning through repetitive play. But despite my programming experience, I always viewed reinforcement learning as "high science" – the hardest aspect of machine learning, beyond my reach.
That changed recently. With the rise of powerful language models enabling what Andrej Karpathy calls "vibe coding" - where you fully give in to the vibes and let LLMs handle the technical details - I started wondering: Could I use this approach to train my own reinforcement learning model?
Here's the story of my "side side quest" (yes, that's a side quest to my side projects) and what I've learned so far.
Two recent developments sparked my curiosity:
Both had something in common: they used reinforcement learning techniques for reasoning adaptation. If smaller teams could achieve these results, could I—with LLM assistance—also explore this space?
The first challenge was identifying a suitable problem. I work with many LLM-powered applications, but wanted to be strategic about what to tackle.
Working with Claude, I analyzed one of my agent projects, breaking it down into a detailed decision tree as shown above. This revealed a perfect candidate: the hypothesis validation component of my research system.
My current setup has:
This last judgment step seemed ideal—a narrow classification problem requiring meaningful reasoning but limited enough in scope for my experiment.
I developed a straightforward plan:
For research, I asked Claude to help analyze papers like DeepSeek's, distilling their key technical decisions and reasoning about how they might apply to my context. I wasn't aiming for novel research—just applying established techniques to my specific domain.
Since I feel that o1 represents current state-of-the-art reasoning, I decided to use it to generate my training data. I created:
For this curriculum learning approach, I designed a progressive difficulty system as shown below:
The curriculum starts with Level 1 (clear-cut cases with unambiguous evidence) before progressing to more nuanced scenarios. This structured approach allowed me to generate training data with the right balance of validated (150-200), rejected (100-150), and refined (50-100) examples, each with appropriate confidence scores.
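As a rough illustration of that setup, the difficulty levels and target counts can be captured in a small config like the sketch below. The names and the level descriptions beyond level 1 are my assumptions, not the project's actual definitions.

```typescript
// Illustrative curriculum configuration (names and higher-level descriptions are assumptions).
type Decision = "VALIDATED" | "REJECTED" | "REFINED";

interface CurriculumLevel {
  level: number;        // 1 = easiest
  description: string;  // what makes examples at this level easy or hard
}

const CURRICULUM: CurriculumLevel[] = [
  { level: 1, description: "Clear-cut cases with unambiguous evidence" },
  { level: 2, description: "More nuanced scenarios with mixed or partial evidence" },
];

// Target number of examples per expected decision, expressed as [min, max].
const TARGET_COUNTS: Record<Decision, [number, number]> = {
  VALIDATED: [150, 200],
  REJECTED: [100, 150],
  REFINED: [50, 100],
};
```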
To implement this, I defined a clear TypeScript interface for each example in my dataset:
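A sketch of that interface, with field names inferred from the description below, so treat them as assumptions rather than the exact definitions:

```typescript
// Sketch of the dataset example interface; field names are assumptions
// based on the fields described in the text, not the project's exact code.
interface EvidencePiece {
  source: string;     // where the evidence came from
  content: string;    // the evidence text itself
  relevance: number;  // relevance score, e.g. 0-1
}

interface HypothesisExample {
  hypothesis: string;                              // the hypothesis text
  evidence: EvidencePiece[];                       // supporting or contradicting evidence
  expectedDecision: "VALIDATED" | "REJECTED" | "REFINED";
  confidence: number;                              // confidence score for the decision
  metadata: {
    domain: string;                                // e.g. "risk"
    difficultyLevel: number;                       // curriculum level (1 = easiest)
  };
}
```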
Taking the time to design this interface upfront proved to be one of the most valuable strategic decisions in the project. By defining this structure before generating any data, I was able to leverage OpenAI's structured output capabilities in their API, essentially giving the model a precise "template" to fill out.
This approach dramatically streamlined the data generation process. Instead of getting freeform text that would need extensive parsing and cleaning, I received perfectly structured JSON objects that mapped directly to my TypeScript interface. Each example includes the hypothesis text, an array of evidence pieces (with source, content, and relevance scores), expected decision (VALIDATED/REJECTED/REFINED), confidence score, and metadata tracking the domain and difficulty level.
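For a sense of what that looks like in practice, here is a minimal sketch of a structured-output call using the OpenAI Node SDK's zodResponseFormat helper, with a Zod schema mirroring the interface above. The prompt, model name, and schema details are illustrative assumptions, not the project's actual code.

```typescript
// Minimal sketch of structured-output generation with the OpenAI Node SDK.
// Prompt wording, model name, and schema details are illustrative assumptions.
import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const openai = new OpenAI();

// Runtime mirror of the HypothesisExample interface above.
const HypothesisExampleSchema = z.object({
  hypothesis: z.string(),
  evidence: z.array(
    z.object({ source: z.string(), content: z.string(), relevance: z.number() })
  ),
  expectedDecision: z.enum(["VALIDATED", "REJECTED", "REFINED"]),
  confidence: z.number(),
  metadata: z.object({ domain: z.string(), difficultyLevel: z.number() }),
});

async function generateExample(domain: string, level: number) {
  const completion = await openai.beta.chat.completions.parse({
    model: "o1", // assumption: whichever o1-family model was actually used
    messages: [
      {
        role: "user",
        content: `Generate one level-${level} hypothesis validation example in the ${domain} domain.`,
      },
    ],
    // The schema acts as the "template" the model has to fill out.
    response_format: zodResponseFormat(HypothesisExampleSchema, "hypothesis_example"),
  });
  return completion.choices[0].message.parsed;
}
```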
This is vibe coding at its best - spending your mental energy on designing the right structure, then letting the AI handle the repetitive generation work while maintaining perfect consistency. It made it possible to systematically generate, evaluate, and eventually train on these examples with minimal friction.
I also implemented a quality control step where the model itself would judge whether generated examples were good, creating a feedback loop until I had solid samples.
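The loop itself can be as simple as generate-then-judge with a retry budget, roughly like this sketch. The `generate` and `judge` callbacks stand in for the two model calls and are illustrative names, not the project's actual functions.

```typescript
// Illustrative quality-control loop: regenerate until a judge call accepts the
// candidate or the attempt budget runs out. Callback names are assumptions.
type Verdict = { pass: boolean; reason?: string };

async function generateWithQa<T>(
  generate: () => Promise<T>,
  judge: (candidate: T) => Promise<Verdict>,
  maxAttempts = 3
): Promise<T | null> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const candidate = await generate();
    const verdict = await judge(candidate);
    if (verdict.pass) return candidate;
  }
  return null; // exhausted attempts; flag the prompt for manual review
}
```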
Interestingly, hypotheses in the risk domain kept getting rejected by the QA system. After discussing this with Claude, we concluded that risk might be inherently incompatible with "level one" difficulty's requirements for unambiguous evidence and clear signal-to-noise ratio—a fascinating insight to shelve for later.
For benchmarks, I selected two models:
My hypothesis was that if I could significantly improve the smaller model's performance, it would deliver real value through faster inference and lower costs.
To guide this experiment, I developed a structured plan of key technical decisions:
These decisions reflect my approach: I researched state-of-the-art techniques like QLoRA and Proximal Policy Optimization (PPO), which let me plan a sophisticated reinforcement learning project despite not being an ML specialist.
With Claude's help, I set up a Python environment using Hugging Face transformers. It was great: I followed suggestions and let the LLM guide me through unfamiliar libraries, keeping just enough critical thinking to ask the right questions. We built:
A quick sniff test with five samples showed the smaller model predictably performing worse than QwQ-32B. But how much worse? That's what I'm waiting to find out as the full benchmark runs on my laptop (estimated 30 hours).
Depending on the gap between models, I'll decide whether reinforcement learning is worth pursuing. If we're already above 85% accuracy, maybe not—but if there's significant room for improvement, I'll continue the journey.
The most striking part of this experiment is how accessible it's been. I've made significant progress with only 6-7 hours of total investment. While I'm not an ML expert, I feel surprisingly confident in what I'm doing.
Five years ago, trying to train your own reinforcement learning model as a side project would have seemed absurd without specialized knowledge. Today, with the right LLM partners, you can tackle domains that were previously impenetrable.
I'm not sure where this side quest will lead, but I'm already convinced that the barriers to technical experimentation have fundamentally changed. These first steps into reinforcement learning would have been impossible for me without LLM assistance.
In future posts, I'll share the benchmark results, training process, and what I learn about whether this vibe coding approach can produce a genuinely useful model. If you've been curious about a technical domain but thought "that's not for me"—maybe it's time to reconsider.