Jonathan Pacifico

494 posts

@_jpacifico

Data Scientist @cellenza | Post-training is the key

Paris, France · Joined February 2017
994 Following · 1.4K Followers
Pinned Tweet
Jonathan Pacifico@_jpacifico·
My post-trained 14B model is now #1 on the French gov «Bac» benchmark, built from real national exam questions, ahead of DeepSeek-R1 70B, Mistral Large, Llama 3.3 & more. Started from the Phi-4 base model — model merging + DPO made the difference. Scale isn’t enough. Post-training is the key (right @maximelabonne ?😉)
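The pinned tweet credits model merging plus DPO rather than raw scale. The DPO objective itself is compact enough to sketch; below is a minimal NumPy version of the per-pair loss (an illustration of the objective only, not the actual Chocolatine training code; the log-probability values are invented):

```python
import numpy as np

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Inputs are summed log-probabilities of the chosen/rejected
    completions under the policy and the frozen reference model.
    """
    # Implicit reward margins relative to the reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # -log sigmoid of the margin: pushes chosen above rejected.
    margin = chosen_reward - rejected_reward
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# A policy that already prefers the chosen answer incurs low loss;
# one that prefers the rejected answer incurs high loss.
low = dpo_loss(-5.0, -20.0, -10.0, -10.0)
high = dpo_loss(-20.0, -5.0, -10.0, -10.0)
```

In practice this loss is what libraries like TRL's DPOTrainer minimize over a dataset of (prompt, chosen, rejected) triples.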
Jonathan Pacifico@_jpacifico·
Chocolatine 2.1 is now available on @ollama : a compact 4B open-weight model optimized for French with strong cross-lingual performance, designed for local inference and agent-oriented workflows. ollama.com/jpacifico/choc…
Jonathan Pacifico retweeted
Google Research@GoogleResearch·
Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI
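The tweet doesn't describe TurboQuant's actual algorithm, but the baseline idea it builds on (storing the KV cache at a lower bit-width) is easy to sketch. A toy symmetric per-channel int8 quantizer, my own illustration and not Google's method:

```python
import numpy as np

def quantize_per_channel(x, bits=8):
    """Symmetric per-channel quantization of a KV-cache tensor.

    x: (seq_len, head_dim) float array. Returns int codes plus the
    per-channel scales needed to dequantize.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=0) / qmax          # one scale per channel
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero channels
    codes = np.round(x / scale).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.normal(size=(128, 64)).astype(np.float32)
codes, scale = quantize_per_channel(kv)
err = np.abs(dequantize(codes, scale) - kv).max()
```

Plain int8 already gives 4x over fp32 storage; the 6x+ compression and claimed zero accuracy loss in TurboQuant presumably come from lower bit-widths and a smarter coding scheme than this naive rounding.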
Jonathan Pacifico retweeted
Microsoft France@microsoftfrance·
Alongside @_jpacifico for Cellenza, and with testimonials from Kis, the winning startup of the 3rd edition of Microsoft GenAI Studio, and @cosmotechweb, the discussions focused on technology choices and the challenges of scaling up.
Jonathan Pacifico@_jpacifico·
@bnjmn_marie Same here on a single A100… I ended up killing the eval, probably less patient than you 😅 Evaluating this generation of models is becoming a different challenge.
Benjamin Marie@bnjmn_marie·
Qwen3.5 4B generates ~2.5x more reasoning tokens than the 27B. With a single H100, evals are so long I think I won't evaluate it; the sequences grow too much. I launched the GPQA Diamond eval 9 hours ago, and it's only 33% done.
Jonathan Pacifico retweeted
François Chollet@fchollet·
Reaching AGI won't be beating a benchmark. It will be the end of the human-AI gap.

Benchmarks are simply a way to estimate the current gap, which is why we need to continually release new benchmarks (focused on the remaining gap). Benchmarking is a process, not a fixed point.

We can say we have AGI when it's no longer possible to come up with a test that evidences the gap. When it's no longer possible to point to something that regular humans can do and AI can't. Today, it's still easy. I expect it will become nearly impossible by 2030.
Jonathan Pacifico retweeted
DAIR.AI@dair_ai·
A 3B model outperforms models 10x its size on reasoning benchmarks.

Small language models (SLMs) are often dismissed as fundamentally limited. The belief is that more parameters mean more capability, and that's it. More recent research indicates that the real ceiling isn't parameter count. It's the training methodology.

This technical report introduces Nanbeige4-3B, a family of SLMs trained on 23 trillion high-quality tokens and finetuned on over 30 million diverse instructions. The results challenge assumptions about model scaling. On AIME 2024, Nanbeige4-3B-Thinking scores 90.4% versus Qwen3-32B's 81.4%. On GPQA-Diamond, it achieves 82.2% versus Qwen3-14B's 64.0%. The 3B model consistently outperforms models 4-10x larger.

Here's how they did it:

Fine-Grained WSD scheduler: Rather than uniform data sampling, they split training into stages with progressively refined data mixtures. High-quality data is concentrated in later stages. On a 1B test model, this improved GSM8K from 27.1% to 34.3% versus vanilla scheduling.

Solution refinement with CoT reconstruction: They refine answer quality through iterative critique cycles, then reconstruct a chain of thought that logically leads to the improved solution. This yields SFT examples far better than rejection sampling.

Dual Preference Distillation: The student model simultaneously learns to mimic teacher output distributions while distinguishing high-quality from low-quality responses: token-level distillation combined with sequence-level preference optimization.

Multi-stage RL: Rather than mixed-corpus training, each RL stage targets a specific domain: STEM reasoning with agentic verifiers, coding with synthetic test functions, and human preference alignment with pairwise reward models.

On the WritingBench leaderboard, Nanbeige4-3B-Thinking (79.03) approaches GPT-5 (83.87) and outperforms DeepSeek-R1 (78.92), Grok-4 (74.65), and O4-mini (72.90). The report demonstrates that carefully engineered small models can match or exceed much larger models when training methodology is optimized at every stage.

Paper: arxiv.org/abs/2512.06266
Learn to build with LLMs and AI Agents in our academy: dair-ai.thinkific.com
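The "Fine-Grained WSD scheduler" mentioned above builds on the warmup-stable-decay learning-rate schedule: the long constant-LR plateau is what allows splicing progressively cleaner data mixtures into later stages without restarting a cosine curve. A minimal sketch of the base WSD shape (the hyperparameter values are illustrative, not the paper's):

```python
def wsd_lr(step, total_steps, peak_lr=3e-4,
           warmup_frac=0.05, decay_frac=0.1, min_lr=3e-5):
    """Warmup-Stable-Decay learning-rate schedule.

    Linear warmup, long constant plateau, then linear decay.
    The plateau makes it easy to insert extra data stages mid-run.
    """
    warmup = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup:
        return peak_lr * step / max(warmup, 1)     # linear warmup
    if step < decay_start:
        return peak_lr                             # stable plateau
    frac = (step - decay_start) / max(total_steps - decay_start, 1)
    return peak_lr + (min_lr - peak_lr) * frac     # linear decay
```

The paper's fine-grained variant changes what data is sampled in each stage of the plateau, not just the learning rate.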
Jonathan Pacifico@_jpacifico·
@BarioIsCoding Thanks! Indeed, reasoning/CoT is part of my exploratory work for a version 2.1, possibly based on Qwen3 or the brand-new Ministral3 14B. For now it hasn't fully satisfied me; I'm working on achieving a significant performance improvement.
Bqrio@BarioIsCoding·
@_jpacifico When are you going to release Chocolatine 2 (based on Qwen2.5, possibly Qwen3 👀) with thinking? I don't know how you discovered your datasets, but how do you get such good results and general conversational quality?
Jonathan Pacifico retweeted
Google for Developers@googledevs·
Google Colab is officially coming to @code! ⚡️ You can now connect VS Code notebooks directly to @GoogleColab runtimes. Get the best of both worlds: the editor you love, powered by the compute (GPUs/TPUs) you need. → goo.gle/47QTmnB
Jonathan Pacifico retweeted
Sergio Paniego@SergioPaniego·
fine-tuning a 14B model with TRL + SFT on a free Colab (T4 GPU)? thanks to the latest TRL optimizations, you actually can! sharing a new notebook showing how to do it ⚡😎
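One of the standard throughput tricks that makes SFT feasible on a small GPU like a T4 is sequence packing: concatenating several short samples into one fixed-length row so compute isn't wasted on padding. TRL exposes packing as a config option; the toy sketch below shows the idea only, not TRL's implementation:

```python
def pack_sequences(token_lists, max_len=2048, sep_id=0):
    """Greedily pack short tokenized examples into fixed-size rows.

    Several short samples share one max_len row, separated by
    sep_id, instead of each occupying a mostly-padded row.
    """
    rows, current = [], []
    for toks in token_lists:
        toks = toks[:max_len]                  # truncate oversized samples
        need_sep = len(current) > 0            # separator before appending?
        if len(current) + need_sep + len(toks) > max_len:
            rows.append(current)               # row full: start a new one
            current = list(toks)
        else:
            if need_sep:
                current.append(sep_id)
            current.extend(toks)
    if current:
        rows.append(current)
    return rows
```

For example, three 1000-token samples fit into two 2048-token rows instead of three padded ones, which roughly halves the number of forward passes.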
Jonathan Pacifico retweeted
Maxime Labonne@maximelabonne·
📚 Efficient Language Specialization for Small Language Models

@maxencelsb and @SinoueG have released a preprint about their excellent work on fine-tuning small models in French. It shows a solid post-training pipeline to improve French performance while preserving English capabilities.

→ The Luth-SFT dataset combines 570k samples from translated English datasets (Tülu 3, OpenHermes) + a unique "Scholar" subset containing 30k samples (from French Baccalauréat and CPGE exams).
→ All five Luth models (350M to 1.7B parameters) achieve state-of-the-art French performance in their size categories, with absolute improvements up to +11.26% across six benchmarks compared to base models like Qwen3 and LFM2.
→ Merging the fine-tuned French model with its base version preserves multilingual abilities and boosts performance in both languages.
→ I can't help but notice that Luth-LFM2-1.2B beats Qwen3-1.7B on 4 out of 6 French tasks despite having 500M fewer parameters, and the pattern holds across other model sizes too. 👀

Fine-tuning models to boost a specific language has always been a very popular use case. This paper is super interesting because it provides an excellent recipe with data, training, and evals that works in practice. Great work!
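Merging a fine-tuned model back into its base, as the Luth recipe does, can be as simple as a per-tensor linear interpolation of the two state dicts. Tools like mergekit offer fancier schemes (SLERP, TIES); this sketch shows only the linear case, with an illustrative alpha:

```python
import numpy as np

def linear_merge(base, finetuned, alpha=0.5):
    """Linearly interpolate two model state dicts, tensor by tensor.

    alpha=0 returns the base weights, alpha=1 the fine-tuned ones;
    intermediate values trade off the two specializations.
    """
    assert base.keys() == finetuned.keys()
    return {name: (1 - alpha) * base[name] + alpha * finetuned[name]
            for name in base}

# Tiny fake state dicts standing in for real checkpoints.
base = {"w": np.zeros(4), "b": np.ones(4)}
tuned = {"w": np.full(4, 2.0), "b": np.ones(4)}
merged = linear_merge(base, tuned, alpha=0.25)
```

The intuition matching the tweet: interpolating toward the base pulls the merged model back into the region of weight space where its original multilingual abilities still hold.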
Jonathan Pacifico retweeted
Andrej Karpathy@karpathy·
Excited to release new repo: nanochat! (It's among the most unhinged I've written.)

Unlike my earlier similar repo nanoGPT, which only covered pretraining, nanochat is a minimal, from-scratch, full-stack training/inference pipeline of a simple ChatGPT clone in a single, dependency-minimal codebase. You boot up a cloud GPU box, run a single script, and in as little as 4 hours you can talk to your own LLM in a ChatGPT-like web UI.

It weighs ~8,000 lines of imo quite clean code to:
- Train the tokenizer using a new Rust implementation
- Pretrain a Transformer LLM on FineWeb, evaluate CORE score across a number of metrics
- Midtrain on user-assistant conversations from SmolTalk, multiple choice questions, tool use
- SFT, evaluate the chat model on world knowledge multiple choice (ARC-E/C, MMLU), math (GSM8K), code (HumanEval)
- RL the model optionally on GSM8K with "GRPO"
- Run efficient inference in an Engine with KV cache, simple prefill/decode, and tool use (Python interpreter in a lightweight sandbox); talk to it over CLI or a ChatGPT-like WebUI
- Write a single markdown report card, summarizing and gamifying the whole thing

Even for as low as ~$100 in cost (~4 hours on an 8XH100 node), you can train a little ChatGPT clone that you can kind of talk to, and which can write stories/poems and answer simple questions. About ~12 hours surpasses the GPT-2 CORE metric. As you further scale up towards ~$1000 (~41.6 hours of training), it quickly becomes a lot more coherent and can solve simple math/code problems and take multiple choice tests. E.g. a depth-30 model trained for 24 hours (about equal to the FLOPs of GPT-3 Small 125M and 1/1000th of GPT-3) gets into the 40s on MMLU, the 70s on ARC-Easy, the 20s on GSM8K, etc.

My goal is to get the full "strong baseline" stack into one cohesive, minimal, readable, hackable, maximally forkable repo. nanochat will be the capstone project of LLM101n (which is still being developed). I think it also has potential to grow into a research harness, or a benchmark, similar to nanoGPT before it. It is by no means finished, tuned, or optimized (actually I think there's likely quite a bit of low-hanging fruit), but I think it's at a place where the overall skeleton is ok enough that it can go up on GitHub, where all the parts of it can be improved. Link to repo and a detailed walkthrough of the nanochat speedrun is in the reply.
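The "KV cache, simple prefill/decode" part of an inference engine boils down to appending each new token's key and value once, then attending over the whole cache, instead of recomputing attention from scratch every step. A single-head NumPy sketch of that decode loop (an illustration of the idea, not nanochat's code):

```python
import numpy as np

def attend(q, k_cache, v_cache):
    """Single-head attention of one query over everything cached."""
    scores = q @ k_cache.T / np.sqrt(q.shape[-1])
    probs = np.exp(scores - scores.max())   # stable softmax
    probs /= probs.sum()
    return probs @ v_cache

d = 8
rng = np.random.default_rng(0)
k_cache = np.empty((0, d))
v_cache = np.empty((0, d))
outs = []
for _ in range(5):                      # decode loop: one token per step
    k, v, q = rng.normal(size=(3, d))   # stand-ins for projected activations
    k_cache = np.vstack([k_cache, k])   # append once, never recompute old K/V
    v_cache = np.vstack([v_cache, v])
    outs.append(attend(q, k_cache, v_cache))
```

Each step costs O(cache length) instead of O(cache length squared), which is why every serious inference engine caches K/V this way.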
Jonathan Pacifico retweeted
clem 🤗@ClementDelangue·
If you think @Apple is not doing much in AI, you're getting blindsided by the chatbot hype and not paying enough attention! They just released FastVLM and MobileCLIP2 on @huggingface. The models are up to 85x faster and 3.4x smaller than previous work, enabling real-time vision language model (VLM) applications! It can even do live video captioning 100% locally in your browser 🤯🤯🤯
Maxime Labonne@maximelabonne·
Really impressed by the French finetune of LFM2 made by two students. They created a solid post-training pipeline (FFT + merging) and open-sourced all the code and data. Amazing work by Sinoué Gad and Maxence Lasbordes!
Jonathan Pacifico retweeted
Sebastien Bubeck@SebastienBubeck·
It’s hard to overstate the significance of this. It may end up looking like a “moon‑landing moment” for AI. Just to spell it out as clearly as possible: a next-word prediction machine (because that's really what it is here, no tools no nothing) just produced genuinely creative proofs for hard, novel math problems at a level reached only by an elite handful of pre‑college prodigies.
Alexander Wei@alexwei_

1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
