Jonathan Pacifico

494 posts

@_jpacifico

Data Scientist @cellenza | Post-training is the key

Paris, France · Joined February 2017
994 Following · 1.4K Followers
Pinned Tweet
Jonathan Pacifico@_jpacifico·
My post-trained 14B model is now #1 on the French gov «Bac» benchmark, built from real national exam questions, ahead of DeepSeek-R1 70B, Mistral Large, Llama 3.3 & more. Started from the Phi-4 base model — model merging + DPO made the difference. Scale isn’t enough. Post-training is the key (right @maximelabonne ?😉)
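The pinned tweet credits model merging plus DPO rather than raw scale. The DPO objective itself is compact enough to sketch; below is a minimal NumPy version of the per-pair loss (an illustration of the objective only, not the actual Chocolatine training code; the log-probability values are invented):

```python
import numpy as np

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Inputs are summed log-probabilities of the chosen/rejected
    completions under the policy and the frozen reference model.
    """
    # Implicit reward margins relative to the reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # -log sigmoid of the margin: pushes chosen above rejected.
    margin = chosen_reward - rejected_reward
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# A policy that already prefers the chosen answer incurs low loss;
# one that prefers the rejected answer incurs high loss.
low = dpo_loss(-5.0, -20.0, -10.0, -10.0)
high = dpo_loss(-20.0, -5.0, -10.0, -10.0)
```

In practice this loss is what libraries like TRL's DPOTrainer minimize over a dataset of (prompt, chosen, rejected) triples.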
Jonathan Pacifico@_jpacifico·
Chocolatine 2.1 is now available on @ollama : a compact 4B open-weight model optimized for French with strong cross-lingual performance, designed for local inference and agent-oriented workflows. ollama.com/jpacifico/choc…
Jonathan Pacifico retweeted
Google Research@GoogleResearch·
Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI
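The tweet doesn't describe TurboQuant's actual algorithm, but the baseline idea it builds on (storing the KV cache at a lower bit-width) is easy to sketch. A toy symmetric per-channel int8 quantizer, my own illustration and not Google's method:

```python
import numpy as np

def quantize_per_channel(x, bits=8):
    """Symmetric per-channel quantization of a KV-cache tensor.

    x: (seq_len, head_dim) float array. Returns int codes plus the
    per-channel scales needed to dequantize.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=0) / qmax          # one scale per channel
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero channels
    codes = np.round(x / scale).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.normal(size=(128, 64)).astype(np.float32)
codes, scale = quantize_per_channel(kv)
err = np.abs(dequantize(codes, scale) - kv).max()
```

Plain int8 already gives 4x over fp32 storage; the 6x+ compression and claimed zero accuracy loss in TurboQuant presumably come from lower bit-widths and a smarter coding scheme than this naive rounding.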
Jonathan Pacifico retweeted
Microsoft France@microsoftfrance·
Alongside @_jpacifico for Cellenza, and with testimonials from Kis, the winning startup of the 3rd edition of Microsoft GenAI Studio, and @cosmotechweb, the discussions focused on technology choices and the challenges of scaling up.
Jonathan Pacifico@_jpacifico·
@bnjmn_marie Same here on a single A100… I ended up killing the eval, probably less patient than you 😅 Evaluating this generation of models is becoming a different challenge.
Benjamin Marie@bnjmn_marie·
Qwen3.5 4B generates ~2.5x more reasoning tokens than the 27B. With a single H100, evals are so long I think I won't evaluate it; the sequences grow too much. I launched the GPQA Diamond eval 9 hours ago, and it's only 33% done.
Jonathan Pacifico retweeted
François Chollet@fchollet·
Reaching AGI won't be beating a benchmark. It will be the end of the human-AI gap.

Benchmarks are simply a way to estimate the current gap, which is why we need to continually release new benchmarks (focused on the remaining gap). Benchmarking is a process, not a fixed point.

We can say we have AGI when it's no longer possible to come up with a test that evidences the gap. When it's no longer possible to point to something that regular humans can do and AI can't. Today, it's still easy. I expect it will become nearly impossible by 2030.
Jonathan Pacifico retweeted
DAIR.AI@dair_ai·
A 3B model outperforms models 10x its size on reasoning benchmarks.

Small language models (SLMs) are often dismissed as fundamentally limited. The belief is that more parameters mean more capability, and that's it. More recent research indicates that the real ceiling isn't parameter count. It's the training methodology.

This technical report introduces Nanbeige4-3B, a family of SLMs trained on 23 trillion high-quality tokens and finetuned on over 30 million diverse instructions. The results challenge assumptions about model scaling. On AIME 2024, Nanbeige4-3B-Thinking scores 90.4% versus Qwen3-32B's 81.4%. On GPQA-Diamond, it achieves 82.2% versus Qwen3-14B's 64.0%. The 3B model consistently outperforms models 4-10x larger.

Here's how they did it:

Fine-Grained WSD scheduler: Rather than uniform data sampling, they split training into stages with progressively refined data mixtures. High-quality data is concentrated in later stages. On a 1B test model, this improved GSM8K from 27.1% to 34.3% versus vanilla scheduling.

Solution refinement with CoT reconstruction: They refine answer quality through iterative critique cycles, then reconstruct a chain of thought that logically leads to the improved solution. This yields SFT examples far better than rejection sampling.

Dual Preference Distillation: The student model simultaneously learns to mimic teacher output distributions while distinguishing high-quality from low-quality responses: token-level distillation combined with sequence-level preference optimization.

Multi-stage RL: Rather than mixed-corpus training, each RL stage targets a specific domain: STEM reasoning with agentic verifiers, coding with synthetic test functions, and human preference alignment with pairwise reward models.

On the WritingBench leaderboard, Nanbeige4-3B-Thinking (79.03) approaches GPT-5 (83.87) and outperforms DeepSeek-R1 (78.92), Grok-4 (74.65), and O4-mini (72.90). The report demonstrates that carefully engineered small models can match or exceed much larger models when training methodology is optimized at every stage.

Paper: arxiv.org/abs/2512.06266
Learn to build with LLMs and AI Agents in our academy: dair-ai.thinkific.com
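The "Fine-Grained WSD scheduler" mentioned above builds on the warmup-stable-decay learning-rate schedule: the long constant-LR plateau is what allows splicing progressively cleaner data mixtures into later stages without restarting a cosine curve. A minimal sketch of the base WSD shape (the hyperparameter values are illustrative, not the paper's):

```python
def wsd_lr(step, total_steps, peak_lr=3e-4,
           warmup_frac=0.05, decay_frac=0.1, min_lr=3e-5):
    """Warmup-Stable-Decay learning-rate schedule.

    Linear warmup, long constant plateau, then linear decay.
    The plateau makes it easy to insert extra data stages mid-run.
    """
    warmup = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup:
        return peak_lr * step / max(warmup, 1)     # linear warmup
    if step < decay_start:
        return peak_lr                             # stable plateau
    frac = (step - decay_start) / max(total_steps - decay_start, 1)
    return peak_lr + (min_lr - peak_lr) * frac     # linear decay
```

The paper's fine-grained variant changes what data is sampled in each stage of the plateau, not just the learning rate.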
Jonathan Pacifico@_jpacifico·
@BarioIsCoding Thanks! Indeed, reasoning/CoT is part of my exploratory work for a version 2.1, possibly based on Qwen3 or the brand-new Ministral3 14B. For now it hasn't fully satisfied me; I'm working on achieving a significant performance improvement.
Bqrio@BarioIsCoding·
@_jpacifico When are you going to release Chocolatine 2 (based on Qwen2.5, possibly Qwen3 👀) with thinking? I don't know how you discovered your datasets, but how do you get such good results and general conversational quality?
Jonathan Pacifico retweeted
Google for Developers@googledevs·
Google Colab is officially coming to @code! ⚡️ You can now connect VS Code notebooks directly to @GoogleColab runtimes. Get the best of both worlds: the editor you love, powered by the compute (GPUs/TPUs) you need. → goo.gle/47QTmnB
Jonathan Pacifico retweeted
Sergio Paniego@SergioPaniego·
fine-tuning a 14B model with TRL + SFT on a free Colab (T4 GPU)? thanks to the latest TRL optimizations, you actually can! sharing a new notebook showing how to do it ⚡😎
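One of the standard throughput tricks that makes SFT feasible on a small GPU like a T4 is sequence packing: concatenating several short samples into one fixed-length row so compute isn't wasted on padding. TRL exposes packing as a config option; the toy sketch below shows the idea only, not TRL's implementation:

```python
def pack_sequences(token_lists, max_len=2048, sep_id=0):
    """Greedily pack short tokenized examples into fixed-size rows.

    Several short samples share one max_len row, separated by
    sep_id, instead of each occupying a mostly-padded row.
    """
    rows, current = [], []
    for toks in token_lists:
        toks = toks[:max_len]                  # truncate oversized samples
        need_sep = len(current) > 0            # separator before appending?
        if len(current) + need_sep + len(toks) > max_len:
            rows.append(current)               # row full: start a new one
            current = list(toks)
        else:
            if need_sep:
                current.append(sep_id)
            current.extend(toks)
    if current:
        rows.append(current)
    return rows
```

For example, three 1000-token samples fit into two 2048-token rows instead of three padded ones, which roughly halves the number of forward passes.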
Jonathan Pacifico retweeted
Maxime Labonne@maximelabonne·
📚 Efficient Language Specialization for Small Language Models

@maxencelsb and @SinoueG have released a preprint about their excellent work on fine-tuning small models in French. It shows a solid post-training pipeline to improve French performance while preserving English capabilities.

→ The Luth-SFT dataset combines 570k samples from translated English datasets (Tülu 3, OpenHermes) + a unique "Scholar" subset containing 30k samples (from French Baccalauréat and CPGE exams).
→ All five Luth models (350M to 1.7B parameters) achieve state-of-the-art French performance in their size categories, with absolute improvements up to +11.26% across six benchmarks compared to base models like Qwen3 and LFM2.
→ Merging the fine-tuned French model with its base version preserves multilingual abilities and boosts performance in both languages.
→ I can't help but notice that Luth-LFM2-1.2B beats Qwen3-1.7B on 4 out of 6 French tasks despite having 500M fewer parameters, and the pattern holds across other model sizes too. 👀

Fine-tuning models to boost a specific language has always been a very popular use case. This paper is super interesting because it provides an excellent recipe with data, training, and evals that works in practice. Great work!
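Merging a fine-tuned model back into its base, as the Luth recipe does, can be as simple as a per-tensor linear interpolation of the two state dicts. Tools like mergekit offer fancier schemes (SLERP, TIES); this sketch shows only the linear case, with an illustrative alpha:

```python
import numpy as np

def linear_merge(base, finetuned, alpha=0.5):
    """Linearly interpolate two model state dicts, tensor by tensor.

    alpha=0 returns the base weights, alpha=1 the fine-tuned ones;
    intermediate values trade off the two specializations.
    """
    assert base.keys() == finetuned.keys()
    return {name: (1 - alpha) * base[name] + alpha * finetuned[name]
            for name in base}

# Tiny fake state dicts standing in for real checkpoints.
base = {"w": np.zeros(4), "b": np.ones(4)}
tuned = {"w": np.full(4, 2.0), "b": np.ones(4)}
merged = linear_merge(base, tuned, alpha=0.25)
```

The intuition matching the tweet: interpolating toward the base pulls the merged model back into the region of weight space where its original multilingual abilities still hold.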
Jonathan Pacifico retweeted
Andrej Karpathy@karpathy·
Excited to release new repo: nanochat! (It's among the most unhinged I've written.)

Unlike my earlier similar repo nanoGPT, which only covered pretraining, nanochat is a minimal, from-scratch, full-stack training/inference pipeline of a simple ChatGPT clone in a single, dependency-minimal codebase. You boot up a cloud GPU box, run a single script, and in as little as 4 hours you can talk to your own LLM in a ChatGPT-like web UI.

It weighs ~8,000 lines of imo quite clean code to:
- Train the tokenizer using a new Rust implementation
- Pretrain a Transformer LLM on FineWeb, evaluate CORE score across a number of metrics
- Midtrain on user-assistant conversations from SmolTalk, multiple choice questions, tool use
- SFT, evaluate the chat model on world knowledge multiple choice (ARC-E/C, MMLU), math (GSM8K), code (HumanEval)
- RL the model optionally on GSM8K with "GRPO"
- Run efficient inference in an Engine with KV cache, simple prefill/decode, and tool use (Python interpreter in a lightweight sandbox); talk to it over CLI or a ChatGPT-like WebUI
- Write a single markdown report card, summarizing and gamifying the whole thing

Even for as low as ~$100 in cost (~4 hours on an 8XH100 node), you can train a little ChatGPT clone that you can kind of talk to, and which can write stories/poems and answer simple questions. About ~12 hours surpasses the GPT-2 CORE metric. As you further scale up towards ~$1000 (~41.6 hours of training), it quickly becomes a lot more coherent and can solve simple math/code problems and take multiple choice tests. E.g. a depth-30 model trained for 24 hours (about equal to the FLOPs of GPT-3 Small 125M and 1/1000th of GPT-3) gets into the 40s on MMLU, the 70s on ARC-Easy, the 20s on GSM8K, etc.

My goal is to get the full "strong baseline" stack into one cohesive, minimal, readable, hackable, maximally forkable repo. nanochat will be the capstone project of LLM101n (which is still being developed). I think it also has potential to grow into a research harness, or a benchmark, similar to nanoGPT before it. It is by no means finished, tuned, or optimized (actually I think there's likely quite a bit of low-hanging fruit), but I think it's at a place where the overall skeleton is ok enough that it can go up on GitHub, where all the parts of it can be improved. Link to repo and a detailed walkthrough of the nanochat speedrun is in the reply.
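The "KV cache, simple prefill/decode" part of an inference engine boils down to appending each new token's key and value once, then attending over the whole cache, instead of recomputing attention from scratch every step. A single-head NumPy sketch of that decode loop (an illustration of the idea, not nanochat's code):

```python
import numpy as np

def attend(q, k_cache, v_cache):
    """Single-head attention of one query over everything cached."""
    scores = q @ k_cache.T / np.sqrt(q.shape[-1])
    probs = np.exp(scores - scores.max())   # stable softmax
    probs /= probs.sum()
    return probs @ v_cache

d = 8
rng = np.random.default_rng(0)
k_cache = np.empty((0, d))
v_cache = np.empty((0, d))
outs = []
for _ in range(5):                      # decode loop: one token per step
    k, v, q = rng.normal(size=(3, d))   # stand-ins for projected activations
    k_cache = np.vstack([k_cache, k])   # append once, never recompute old K/V
    v_cache = np.vstack([v_cache, v])
    outs.append(attend(q, k_cache, v_cache))
```

Each step costs O(cache length) instead of O(cache length squared), which is why every serious inference engine caches K/V this way.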
Jonathan Pacifico retweeted
clem 🤗@ClementDelangue·
If you think @Apple is not doing much in AI, you're getting blindsided by the chatbot hype and not paying enough attention! They just released FastVLM and MobileCLIP2 on @huggingface. The models are up to 85x faster and 3.4x smaller than previous work, enabling real-time vision language model (VLM) applications! It can even do live video captioning 100% locally in your browser 🤯🤯🤯
Maxime Labonne@maximelabonne·
Really impressed by the French finetune of LFM2 made by two students. They created a solid post-training pipeline (FFT + merging) and open-sourced all the code and data. Amazing work by Sinoué Gad and Maxence Lasbordes!
Jonathan Pacifico retweeted
Sebastien Bubeck@SebastienBubeck·
It’s hard to overstate the significance of this. It may end up looking like a “moon‑landing moment” for AI. Just to spell it out as clearly as possible: a next-word prediction machine (because that's really what it is here, no tools no nothing) just produced genuinely creative proofs for hard, novel math problems at a level reached only by an elite handful of pre‑college prodigies.
Alexander Wei@alexwei_

1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
