OpenAI
LinkedIn lead magnet post · OpenAI
The LLM training recipe has changed.

DeepSeek-V3.2 was post-trained using 1.8k RL environments; MiniMax M2.1 used over 100k environments...

This reflects a shift: from learning on static data to learning through interaction.

𝗕𝘂𝘁 𝗵𝗼𝘄 𝗱𝗶𝗱 𝘄𝗲 𝗴𝗲𝘁 𝘁𝗵𝗲𝗿𝗲?

1️⃣ Classic LLM training recipe (InstructGPT)
- Pre-training on internet text → learn to create text completions
- Supervised Fine-Tuning on Q/A pairs → learn new tasks and to follow instructions
- Preference alignment (RLHF with PPO, or later DPO) → align with human preferences

It worked, until it hit a ceiling.
You might remember Ilya Sutskever's talk at NeurIPS 2024: "Pre-training as we know it will end"
Data is finite, and classic post-training (SFT, Preference Alignment) cannot work miracles.
What's next?

2️⃣ OpenAI o1 series hinted at a new direction
They showed that Reinforcement Learning can induce chain-of-thought reasoning, and that performance improves with more train-time or test-time compute.
No details on how to get there...

3️⃣ DeepSeek-R1 showed a concrete approach
Reasoning/CoT improves performance, but teaching it via SFT needs expensive curated data.
Instead, they used Reinforcement Learning with Verifiable Rewards (RLVR):
- the model generates reasoning + answer
- the answer is checked against ground truth
- the reward drives RL training

The idea is more general.
Any task with a verifiable outcome (a won game, a passing test...) can become a training signal.
The model is no longer limited by the quality of examples like in SFT.
By trial and error, it can discover better reasoning strategies on its own.

DeepSeek also introduced GRPO: instead of PPO's expensive/unstable setup with a separate value model, generate a group of responses, score them, and use the group's average reward as the baseline.
Simpler, lighter, works well with RLVR.

4️⃣ The mapping from classic RL to LLMs
The Language Model is the Agent, its response is the Action.
The Environment is everything needed to check (and possibly train) the model on the task: data, harnesses, scoring rules.
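
To make the Agent / Action / Environment mapping concrete, here is a minimal sketch of a verifiable reward plus a GRPO-style group baseline. Everything in it is illustrative: `MathEnvironment`, the `Answer:` output format, and the hard-coded responses are assumptions, not DeepSeek's actual code, and normalizing by the population standard deviation is one possible choice.

```python
import re
import statistics
from dataclasses import dataclass


@dataclass
class MathEnvironment:
    """Hypothetical toy environment: a prompt plus a rule to verify answers."""
    prompt: str
    ground_truth: str

    def reward(self, response: str) -> float:
        # Verifiable reward: 1.0 only if the final answer matches ground truth.
        match = re.search(r"Answer:\s*(\S+)", response)
        return 1.0 if match and match.group(1) == self.ground_truth else 0.0


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style baseline: compare each reward to the group mean,
    scaled by the group standard deviation (no value model needed)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # Every response scored the same: this group carries no signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


env = MathEnvironment(prompt="What is 12 * 7?", ground_truth="84")

# Stand-ins for a group of responses sampled from the model (the Agent).
responses = [
    "12 * 7 = 12 * 5 + 12 * 2 = 60 + 24. Answer: 84",
    "12 * 7 is about 80. Answer: 80",
    "7 * 10 + 7 * 2 = 70 + 14. Answer: 84",
    "Answer: 74",
]

rewards = [env.reward(r) for r in responses]
advantages = group_relative_advantages(rewards)

print(rewards)     # [1.0, 0.0, 1.0, 0.0]
print(advantages)  # [1.0, -1.0, 1.0, -1.0]
```

In a real training loop, these advantages would weight the policy-gradient update: responses above the group average get reinforced, those below get pushed down.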
SFT relies on curated datasets.
RLVR requires environments: dynamic systems the model can interact with.
And as LLMs gain access to tools (from APIs to terminals), these environments become more complex and more critical.

As Karpathy puts it:
> environments give the LLM an opportunity to actually interact - take actions, see outcomes, etc.
> This means you can hope to do a lot better than statistical expert imitation

---
📖 For a deeper dive and resources, check the comments.
Lead magnet mechanism
📖 For a deeper dive and resources, check the comments.