Learn more: https://bit.ly/43p1WIa

DeepSeek has put reinforcement learning at the top of the minds of developers, machine learning engineers, and data-driven professionals in the AI space. That’s why we’re happy to launch a new short course: Reinforcement Fine-Tuning LLMs with GRPO, built in collaboration with @Predibase and taught by Travis Addair, its Co-Founder and CTO, and Arnav Garg, its Senior Machine Learning Engineer.

Many LLM applications rely on reasoning, whether in solving math problems, generating code, or completing multi-step tasks. But fine-tuning models for reasoning is often constrained by the availability of high-quality labeled examples.

This course introduces a different approach: Reinforcement Fine-Tuning (RFT) using Group Relative Policy Optimization (GRPO). GRPO is a scalable reinforcement learning algorithm that lets you train models using reward functions instead of human-labeled data or preference scores.

You’ll learn:
- When reinforcement fine-tuning is a better fit than supervised fine-tuning
- How to build and use programmable reward functions in GRPO
- How to guide model behavior on structured tasks like the Wordle game
- How to evaluate subjective outputs, like summaries, using LLMs as judges
- How to avoid reward hacking by combining reward and penalty signals
- How to implement GRPO loss: token ratios, clipping, advantages, and KL divergence
- How to run RFT jobs using Predibase’s training platform

By the end of the course, you’ll know how to fine-tune LLMs for complex reasoning tasks without needing large datasets or manual preference data.

Enroll now: https://bit.ly/43p1WIa
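To make the pieces named above (programmable reward functions, advantages, token ratios, clipping, and KL divergence) a little more concrete, here is a minimal sketch in plain Python. It is not taken from the course or from Predibase's platform. The function names, the Wordle-style scoring rule, and the default values for `clip_eps` and `beta` are all illustrative assumptions, and the loss follows one commonly published per-token formulation of GRPO rather than any specific implementation.

```python
import math
from statistics import mean, pstdev


def wordle_reward(guess: str, secret: str) -> float:
    """Hypothetical reward: fraction of letters in the correct position.

    Malformed guesses (wrong length, non-letters) score 0.
    """
    guess = guess.strip().lower()
    if len(guess) != len(secret) or not guess.isalpha():
        return 0.0
    return sum(g == s for g, s in zip(guess, secret)) / len(secret)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_token_loss(logp_new, logp_old, logp_ref, advantage,
                    clip_eps=0.2, beta=0.04):
    """Per-token loss: clipped ratio times advantage, minus a KL penalty.

    clip_eps and beta are illustrative values, not recommendations.
    """
    # Token ratio between the current policy and the policy that sampled the token.
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)
    # Non-negative estimate of the KL divergence from the reference model.
    log_r = logp_ref - logp_new
    kl = math.exp(log_r) - log_r - 1.0
    # Negated because optimizers minimize while the objective is maximized.
    return -(surrogate - beta * kl)


# Score a group of sampled guesses for one prompt, then turn rewards into advantages.
group = ["crane", "crate", "slate", "xx"]
rewards = [wordle_reward(g, "crane") for g in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group mean, a completion is reinforced only when it scores better than its siblings for the same prompt, which is why no human-labeled target is needed in this sketch.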
