Pinned Tweet
Greg Jennings
11.8K posts

Greg Jennings
@jenningsgreg
VP of Engineering, AI @anacondainc, enabling the next generation of data science and AI-powered applications. Opinions are my own
Austin, TX · Joined October 2010
6.3K Following · 1.2K Followers
Greg Jennings retweeted

Another cool piece of research on Looped Transformers
They ask the question: "Can we loop a frozen, off-the-shelf checkpoint directly at inference time without any modifications?"
Naive repetition pushes hidden states outside the distribution that later layers expect, so performance drops.
But if you treat transformer layers as Euler steps in a residual ODE and replace naive loops with damped Runge–Kutta substeps, it is possible.
This lets frozen models get extra latent compute at test time with no fine-tuning, no new weights, and no architecture changes.
The best gains show up on hard knowledge multiple-choice tasks like MMLU-Pro, GPQA, and ARC.
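
To make the ODE framing above concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a stand-in layer function `f` (hypothetical; a real transformer block would go here), reads "damped" as shrinking the step size so that all substeps together cover one layer's worth of integration time, and uses the midpoint rule as one simple Runge–Kutta scheme. The function names and the `damping` parameter are illustrative assumptions.

```python
import math
import numpy as np

def naive_loop(x, f, n_loops):
    """Repeat the residual update x <- x + f(x); each pass is a full Euler step (h = 1)."""
    for _ in range(n_loops):
        x = x + f(x)
    return x

def damped_rk2_loop(x, f, n_substeps, damping=1.0):
    """Midpoint (RK2) substeps with step size h = damping / n_substeps.

    Assumption: the total integration time is held at `damping`
    instead of growing with the number of passes.
    """
    h = damping / n_substeps
    for _ in range(n_substeps):
        k1 = f(x)                 # slope at the current state
        k2 = f(x + 0.5 * h * k1)  # slope at the midpoint
        x = x + h * k2
    return x

f = lambda x: 0.5 * x  # toy vector field standing in for a frozen layer
x0 = np.ones(3)

naive = naive_loop(x0, f, 4)        # scales x0 by 1.5**4 = 5.0625
damped = damped_rk2_loop(x0, f, 4)  # stays close to the exact flow exp(0.5) * x0
print(naive[0], damped[0], math.exp(0.5))
```

In this toy setting the naive loop drifts far from the scale a single layer would produce, while the damped substeps stay near the one-step solution, which is the intuition the post describes.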


@sakurayukiai Very cool paper. The right quantization basis preserves what attention needs, not just the tensor. Attention-aware geometry FTW.
Greg Jennings retweeted

Every year, I share this video of French caretakers who take sand from Omaha Beach in Normandy and scrub it into the letters to give them their gold coloring.
They do this for all 9,386 US soldiers who died.
France also gave us this land as American soil. #MemorialDayWeekend
Greg Jennings retweeted

There is now a smarter way to pick data for training LLMs!
Enter OPUS!
This is an ICML Oral paper from SJTU, Alibaba, UW–Madison, UIUC, and Mila - Quebec AI Institute.
The proposed method dynamically and intelligently selects the most impactful data for LLM pre-training in every single training iteration, bringing principled, continuous data optimization to the forefront.
This approach aims to significantly boost training efficiency and yield higher-quality LLMs, outperforming conventional static data selection methods across diverse language tasks.
OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration
Paper: arxiv.org/pdf/2602.05400
Our report: mp.weixin.qq.com/s/xzmjviMMwX20…
📬 #PapersAccepted by Jiqizhixin

Greg Jennings retweeted

FACT ALERT 🚨: In modern agentic coding, 42% of the time is spent on the CPU doing tool use such as editing files, running Bash scripts, running linters, etc. Traditional cloud computing charges in $ per CPU core. In the agent economy, the business model is $ per token, so to increase token revenue you need to increase the CPU power you have so that you can generate your tokens.

Greg Jennings retweeted

Added a DeepSeek Sparse Attention (DSA) from-scratch implementation to my LLMs-from-scratch repo thanks to an awesome new reader contrib.
With motivation, overview, and GPT-style model reference implementation as standalone example code: github.com/rasbt/LLMs-fro…

Greg Jennings retweeted

“Probabilistic Tiny Recursive Model”
This paper makes Tiny Recursive Models stochastic at test time by adding Gaussian noise, running parallel rollouts, and using the existing Q head to pick the best answer.
With no retraining and no task-specific tricks, its PPBench score jumps from 62.6% to 91.2%, while its Sudoku-Extreme score jumps from 87.4% to 98.75%.
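
The recipe in the post (inject Gaussian noise, run parallel rollouts, let a scoring head pick the winner) can be sketched as a generic best-of-N loop. This is a toy under stated assumptions, not the paper's code: `step_fn` stands in for one recursive refinement step, `q_fn` stands in for the existing Q head, and `sigma`, `n_rollouts`, and `n_steps` are hypothetical parameters.

```python
import numpy as np

def noisy_best_of_n(step_fn, q_fn, z0, n_rollouts=8, n_steps=4, sigma=0.1, seed=0):
    """Run n_rollouts noisy copies of a recursive model and keep the best-scored one.

    step_fn: maps a batch of latent states (n_rollouts, d) to refined states.
    q_fn:    maps a batch of latent states to one score per rollout.
    """
    rng = np.random.default_rng(seed)
    # Every rollout starts from the same latent state.
    z = np.repeat(z0[None, :], n_rollouts, axis=0)
    for _ in range(n_steps):
        # Perturb each rollout independently, then apply the recursive step.
        z = step_fn(z + sigma * rng.standard_normal(z.shape))
    scores = q_fn(z)
    best = int(np.argmax(scores))
    return z[best], scores

# Toy stand-ins: the step halves the state, the scorer prefers states near zero.
step_fn = lambda z: 0.5 * z
q_fn = lambda z: -np.linalg.norm(z, axis=1)

best_z, scores = noisy_best_of_n(step_fn, q_fn, np.ones(3))
print(scores.shape, scores.max())
```

Setting `sigma=0` makes every rollout identical, which recovers the deterministic model; the extra test-time compute only helps when the noise produces rollouts the scorer can tell apart.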

Greg Jennings retweeted

BitCPM4-CANN is now open source. 1.58-bit ternary LLM, trained at low-bit precision, not just quantized after. Apache 2.0. 🚀
🤖 modelscope.ai/collections/Op…
Weights stay at {-1, 0, 1} throughout training via QAT. 1B/3B/8B retain 95.7%~97.2% of full-precision MiniCPM4. ~6x inference memory reduction. Only 5% training overhead.
Same hardware, 6x more headroom. No special kernels needed.
Four model sizes available: 0.5B, 1B, 3B, 8B. All come with GGUF variants.
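
For readers wondering where "1.58-bit" comes from: a weight restricted to three values carries log2(3) ≈ 1.58 bits. The sketch below shows one common way to map full-precision weights to {-1, 0, 1}. It assumes absmean scaling as popularized by BitNet b1.58; the post does not say which quantizer BitCPM4 uses, and the function name `ternarize` is hypothetical.

```python
import math
import numpy as np

def ternarize(W, eps=1e-8):
    """Map weights to {-1, 0, 1} plus one per-tensor scale.

    Assumption: absmean scaling (scale = mean |W|), not necessarily
    the scheme used by BitCPM4.
    """
    scale = np.abs(W).mean() + eps
    W_q = np.clip(np.round(W / scale), -1, 1)
    return W_q, scale

W = np.array([[0.9, -0.05, -1.2],
              [0.1,  0.6,  -0.4]])
W_q, scale = ternarize(W)

print(W_q)           # ternary weights
print(W_q * scale)   # dequantized approximation of W
print(math.log2(3))  # information content of one ternary weight, in bits
```

In quantization-aware training the forward pass would use the ternary weights while gradients update a latent full-precision copy; that training loop is outside the scope of this sketch.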

Greg Jennings retweeted

The former REPUBLICAN Governor of Texas openly saying Paxton protected a pedophile.
No wonder he was endorsed by the Epstein administration.
Rick Perry @GovernorPerry
Ken Paxton initially offered a plea deal to a MAN WHO ADMITTED TO MOLESTING a child to serve only ONE DAY IN JAIL. #txsen
Greg Jennings retweeted

RL has almost always meant trying to maximize a scalar reward.
Very expressive in theory, but do you have only ONE scalar reward? Preferences & tradeoffs are complex & high-dimensional!
Vector Policy Optimization (VPO) trains LLMs to anticipate diverse environments and goals!
Ryan Bahlous-Boldi @RyanBoldi
Your RL post-training may be sabotaging your LLM’s test-time scaling! Conventional RL pretends that you can collapse all reward signals *upfront* into a single *scalar reward*. We introduce Vector Policy Optimization (VPO), which natively maximizes *vector-valued* rewards, boosting test time search performance, even on the original scalar.
Greg Jennings retweeted

A 0.6B model learned to manage giants.
That is the idea behind TRINITY, a new ICLR 2026 paper by Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, and Yujin Tang.
The paper is not asking:
“How do we build one model that knows everything?”
It is asking something more interesting:
“How do we build a small intelligence layer that knows who should think, who should act, and who should verify?”
TRINITY is a lightweight coordinator for LLMs.
It does not merge weights.
It does not require architectural compatibility.
It does not need access to closed-model internals.
It does not try to turn the coordinator into the smartest model in the room.
Instead, it orchestrates a pool of strong models at test time, including closed and open models.
At each turn, TRINITY chooses a model and gives it one of three roles:
Thinker — plan and decompose
Worker — solve and execute
Verifier — critique and accept/revise
That may sound simple.
It is not.
Too many multi-agent systems are still prompts plus hope.
TRINITY learns the coordination policy.
A compact ~0.6B language model produces hidden-state representations of the conversation. A tiny head then uses those representations to decide the next model-role pair. The authors optimize this coordinator with an evolutionary strategy, sep-CMA-ES, because the problem is expensive, high-dimensional, and reward-sparse.
The result is not just better routing.
It is learned division of labor.
The paper reports that TRINITY outperforms individual models and existing coordination methods across coding, math, reasoning, and domain knowledge tasks. In its full-power setting, it reaches 86.2% on LiveCodeBench and transfers to held-out benchmarks including AIME, BigCodeBench, MT-Bench, and GPQA-D.
The most important idea here is bigger than the benchmark.
The future of AI may not be a single supermodel.
It may be an organization of models.
A small conductor.
A team of specialists.
A protocol for planning, execution, and verification.
An intelligence layer that learns how to allocate cognition.
This feels like a real shift:
from bigger models
to better systems
from raw capability
to coordinated capability
from “which model is best?”
to “what structure makes many models better together?”
Full credit to the authors:
Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, Yujin Tang.
Paper: TRINITY: An Evolved LLM Coordinator
arxiv.org/abs/2512.04695
I’m attaching the first page because the abstract is worth reading closely.
The future of AI may not be monolithic.
It may be coordinated.
#ArtificialIntelligence #LLM #MultiAgentSystems #MachineLearning #EvolutionaryAlgorithms
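
The coordinator step described in the post (a hidden-state representation goes in, a model-role pair comes out) can be sketched as a small linear head. This is an illustrative toy, not the authors' code: the model names, the weight shapes, and the function `choose_model_role` are hypothetical, and the evolutionary search (sep-CMA-ES) that would tune `W` and `b` is not shown.

```python
import numpy as np

ROLES = ("thinker", "worker", "verifier")

def choose_model_role(hidden, W, b, models):
    """Pick the next (model, role) pair from a conversation representation.

    hidden: (d,) hidden-state vector from the small coordinator LM.
    W, b:   head parameters with one logit per (model, role) pair,
            shapes (len(models) * len(ROLES), d) and (len(models) * len(ROLES),).
    """
    logits = W @ hidden + b
    idx = int(np.argmax(logits))
    # Logits are laid out model-major: all roles of model 0, then model 1, ...
    return models[idx // len(ROLES)], ROLES[idx % len(ROLES)]

models = ["model_a", "model_b"]  # placeholder names for the model pool
d = 4
W = np.zeros((len(models) * len(ROLES), d))
b = np.zeros(len(models) * len(ROLES))
b[4] = 1.0  # bias the head toward pair index 4

print(choose_model_role(np.ones(d), W, b, models))  # ('model_b', 'worker')
```

Because the head only emits a discrete choice and the reward arrives after the whole multi-turn episode, a gradient-free optimizer is a natural fit, which matches the post's explanation for using an evolutionary strategy.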

Greg Jennings retweeted

New in-depth blog post: "Dissecting ThunderKittens: Anatomy of a Compact DSL for High-Performance AI Kernels"
This post is my attempt to dissect ThunderKittens from the bottom up. I approached TK by asking what each abstraction is really buying us: which hardware detail it corresponds to, how it maps onto the underlying layouts the GPU actually wants, what boilerplate it removes, and which parts of the GPU programming model still remain visible to us as kernel authors.
The post walks through the tile abstractions TK provides: register, shared, and tensor memory tiles, global layouts, vector abstractions, warp/warpgroup compute, TMA, swizzling, Hopper WGMMA, Blackwell tcgen05, 2xSM MMA, tensor memory, Cluster Launch Control, TK’s pipeline templates, and static persistent tile scheduling.
At the end, I demonstrate TK’s lcf pipeline template by implementing a non-causal attention prefill kernel and benchmarking it against FlashAttention-2 and FlashAttention-3 on an H100 PCIe across different sequence lengths. The kernel beats FA2 across the sweep by ~1.55x on average, and closely tracks FA3, where FA3 is only ~1.05x-1.17x faster on the longer sequence lengths.
Blog link: hamzaelshafie.bearblog.dev/dissecting-thu…
Repo: github.com/HamzaElshafie/…
For interested readers, I also included an extensive list of resources at the end that I found very useful.
Please note: this is my own independent writeup. I’m not affiliated with @HazyResearch, and any mistakes in the post are mine. If you spot any, please reach out!
1 / xx




Greg Jennings retweeted

Gated DeltaNet-2 is here. 🚀
🔥 New paper: Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
Gated DeltaNet-2 outperforms KDA and Mamba-3, the latest and best recurrent architectures, head to head at 1.3B. 🏆
💡 Here's the idea behind it:
Linear attention squeezes an unbounded KV cache into a fixed-size recurrent state. The hard part isn't just what to forget, it's how to edit that memory without scrambling the associations already in it.
Prior delta-rule models like Gated DeltaNet and KDA use one scalar gate to do two jobs at once: erasing old content and writing new content. But these two decisions act on different axes of the state, so tying them together is a real limitation.
Gated DeltaNet-2 decouples them.
✂️ a channel-wise erase gate b_t picks which key-side coordinates to read and remove
✍️ a channel-wise write gate w_t picks which value-side coordinates to commit
🔁 recovers KDA when both gates collapse to a scalar, and Gated DeltaNet when the decay collapses too
⚡ still trains fast: chunkwise WY algorithm with gate-aware backward, fused in Triton
📊 Results:
We train 1.3B models on 100B tokens of FineWeb-Edu, matched in recurrent state size, against Mamba-2, Gated DeltaNet, KDA, and Mamba-3.
Best average on language modeling + commonsense reasoning, in both recurrent and hybrid settings
Biggest gains on long-context RULER retrieval. S-NIAH-3 jumps from 63 to 90 over KDA, and multi-key needle retrieval climbs from 28 to 38
Joint work with @YejinChoinka and @jankautz.
📄 Paper: shorturl.at/AAlVb
💻 Code: github.com/NVlabs/GatedDe…
#LinearAttention #StateSpaceModels #Mamba #LLM
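
To illustrate the erase/write distinction above, here is a toy recurrence on a state matrix `S` of shape (d_k, d_v). The first function is the standard scalar-gate delta rule. The second is only one possible reading of "channel-wise erase on the key side, channel-wise write on the value side": it is an assumption, not the paper's equation, and it omits the decay term the post mentions. It is chosen so that it reduces to the scalar rule when both gates are constant.

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """Scalar-gate delta rule: erase what key k currently reads, then write v."""
    read = k @ S  # value currently stored under k, shape (d_v,)
    return S - beta * np.outer(k, read) + beta * np.outer(k, v)

def decoupled_step(S, k, v, erase_gate, write_gate):
    """Hypothetical decoupled update (illustrative, not from the paper).

    erase_gate: (d_k,) per-key-channel strength of the erase.
    write_gate: (d_v,) per-value-channel strength of the write.
    """
    read = k @ S
    erased = S - np.outer(erase_gate * k, read)
    return erased + np.outer(k, write_gate * v)

rng = np.random.default_rng(0)
d_k, d_v = 3, 2
S = rng.standard_normal((d_k, d_v))
k = np.array([1.0, 0.0, 0.0])  # unit-norm key
v = np.array([2.0, 3.0])

scalar = delta_rule_step(S, k, v, 0.5)
channel = decoupled_step(S, k, v, np.full(d_k, 0.5), np.full(d_v, 0.5))
print(np.allclose(scalar, channel))  # True: constant gates recover the scalar rule
```

With a scalar gate, erasing and writing are forced to share one strength; separate gates let the update, for example, fully clear a key's old content while committing only some value channels.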
