Greg Jennings

11.8K posts

@jenningsgreg

VP of Engineering, AI @anacondainc, enabling the next generation of data science and AI-powered applications. Opinions are my own

Austin, TX · Joined October 2010
6.3K Following · 1.2K Followers
Greg Jennings reposted
Richard Hanania @RichardHanania
Paxton winning is a real “they’re showing you who they are” moment. The guy is such a slimeball there is no way he would’ve survived as AG before MAGA came along, much less be in line for higher office. The purest case imaginable showing that these are simply bad people.
53 replies · 154 reposts · 1.3K likes · 20.4K views
Greg Jennings reposted
Tyler @rezoundous
I'm not ready for Codex 2x limits to end
98 replies · 18 reposts · 1K likes · 68.2K views
Greg Jennings reposted
alphaXiv @askalphaxiv
More cool research on Looped Transformers. They ask: "Can we loop a frozen, off-the-shelf checkpoint directly at inference time without any modifications?" Naive repetition pushes hidden states outside the distribution that later layers expect, so performance drops. But if you treat transformer layers as Euler steps in a residual ODE and replace naive loops with damped Runge–Kutta substeps, it becomes possible. This lets frozen models get extra latent compute at test time with no fine-tuning, no new weights, and no architecture changes. The best gains show up on hard knowledge multiple-choice tasks like MMLU-Pro, GPQA, and ARC.
8 replies · 30 reposts · 178 likes · 9.2K views
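The "layers as Euler steps" framing in the alphaXiv post can be sketched in a few lines. This is a toy illustration only, not the paper's method: `f` is a made-up stand-in for a frozen residual branch, Heun's method is used as one example of a second-order Runge-Kutta scheme, and the `damping` step size is an assumed parameter.

```python
def f(x):
    """Toy stand-in for a frozen residual branch (hypothetical): dx/dt = -x."""
    return [-xi for xi in x]

def euler_step(x, h=1.0):
    """Plain residual update x + h*f(x); h=1 matches an ordinary residual layer."""
    return [xi + h * fi for xi, fi in zip(x, f(x))]

def heun_step(x, h):
    """One second-order Runge-Kutta (Heun) step of size h."""
    k1 = f(x)
    x_pred = [xi + h * ki for xi, ki in zip(x, k1)]
    k2 = f(x_pred)
    return [xi + 0.5 * h * (a + b) for xi, a, b in zip(x, k1, k2)]

def damped_loop(x, loops, damping=0.5):
    """Re-apply the same block `loops` times, each as a damped RK substep."""
    for _ in range(loops):
        x = heun_step(x, damping)
    return x

x0 = [1.0, -2.0]

naive = x0
for _ in range(3):
    naive = euler_step(naive)        # naive repetition with the full step size

damped = damped_loop(x0, loops=3)    # damped second-order substeps

print("naive loop: ", naive)
print("damped loop:", damped)
```

On this toy ODE the naive loop jumps straight to zero on the first pass, while the damped substeps shrink the state gradually. Whether that mirrors the hidden-state drift described in the post depends on the real model, which this sketch does not capture.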
Greg Jennings @jenningsgreg
@sakurayukiai Very cool paper. The right quantization basis preserves what attention needs, not just the tensor. Attention-aware geometry FTW.
0 replies · 0 reposts · 0 likes · 118 views
Sakura Yuki @sakurayukiai
Quantizing the KV cache to 2-bit usually destroys the model because standard rotation math is blind to what attention heads actually want. OSCAR profiles the covariance offline and spectrally aligns the vectors before compressing. 128k context surviving in INT2 is so clean ✨
3 replies · 9 reposts · 55 likes · 3.8K views
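The idea of profiling covariance offline and aligning vectors before 2-bit compression can be illustrated with a toy 2-D case. This is not OSCAR's algorithm: the calibration points are made up, the "alignment" is a plain rotation onto the leading covariance eigenvector, and the quantizer is a simple uniform min/max scheme chosen for brevity. In a real KV cache the rotation would be a d×d orthogonal matrix rather than a single angle.

```python
import math

def principal_angle(points):
    """Angle of the leading covariance eigenvector of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return 0.5 * math.atan2(2.0 * cxy, cxx - cyy)

def rotate(p, theta):
    """Rotate a 2-D point counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])

def quantize_2bit(values):
    """Uniform 2-bit quantization: integer codes in 0..3 plus (lo, step)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / 3.0 or 1.0  # avoid a zero step for constant columns
    return [round((v - lo) / step) for v in values], lo, step

def dequantize(codes, lo, step):
    """Map integer codes back to floats on the quantization grid."""
    return [lo + c * step for c in codes]

# Offline "profiling": made-up calibration vectors lying along y = x.
calib = [(-3.0, -3.0), (-1.0, -1.0), (1.0, 1.0), (3.0, 3.0)]
theta = principal_angle(calib)

# Rotate so the dominant direction lines up with the first axis, then quantize.
aligned = [rotate(p, -theta) for p in calib]
packed = [quantize_2bit(list(col)) for col in zip(*aligned)]

# Dequantize each axis and rotate back to the original basis.
restored_cols = [dequantize(codes, lo, step) for codes, lo, step in packed]
restored = [rotate(p, theta) for p in zip(*restored_cols)]
max_err = max(abs(a - b) for p, q in zip(calib, restored) for a, b in zip(p, q))
print("rotation angle:", theta, "max reconstruction error:", max_err)
```

After rotation, nearly all of the variance sits on one axis, so the four available levels are spent where the data actually varies.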
Greg Jennings reposted
Youssof Altoukhi @Youssofal_
After spending time with Qwen 3.6 27B in Cursor I’ve come to realise the constraint on local models isn’t intelligence but the harnesses. Local model harnesses are TERRIBLE. Pi, open code etc are genuinely bad. As a community, we need to do better than this.
168 replies · 17 reposts · 687 likes · 68K views
Greg Jennings reposted
Spencer Hakimian @SpencerHakimian
🚨JUST IN: Trump’s approval rating hits 28%, a new record low.
458 replies · 1.8K reposts · 17.6K likes · 330.7K views
Greg Jennings reposted
Danny Deraney @DannyDeraney
Every year, I share this video of French caretakers who take sand from Omaha Beach in Normandy and scrub it into the letters to give them their gold coloring. They do this for all 9,386 US soldiers who died. France also gave us this land as American soil. #MemorialDayWeekend
1.6K replies · 26K reposts · 184.9K likes · 8.1M views
Greg Jennings reposted
机器之心 JIQIZHIXIN
There is now a smarter way to pick data for training LLMs! Enter OPUS! This is an ICML Oral paper from SJTU, Alibaba, UW–Madison, UIUC, and Mila - Quebec AI Institute. The proposed method dynamically and intelligently selects the most impactful data for LLM pre-training in every single training iteration, bringing principled, continuous data optimization to the forefront. This approach aims to significantly boost training efficiency and yield higher-quality LLMs, outperforming conventional static data selection methods across diverse language tasks. OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration Paper: arxiv.org/pdf/2602.05400 Our report: mp.weixin.qq.com/s/xzmjviMMwX20… 📬 #PapersAccepted by Jiqizhixin
4 replies · 59 reposts · 348 likes · 88.4K views
Greg Jennings reposted
SemiAnalysis @SemiAnalysis_
FACT ALERT 🚨: In modern agentic coding, 42% of the time is spent on the CPU doing tool use such as editing files, running Bash scripts, running linters, etc. Traditional cloud computing charges $ per CPU core. In the agent economy, the business model is $ per token, so to increase token revenue you need to increase the amount of CPU power you have so that you can generate your tokens.
50 replies · 85 reposts · 801 likes · 209.9K views
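The arithmetic behind the SemiAnalysis claim can be worked through with Amdahl's law. This is a back-of-the-envelope sketch under loud assumptions: tool use and token generation are treated as strictly serial, the 42% figure is taken from the post as given, and `cpu_speedup` is a hypothetical factor by which extra CPU capacity shortens the tool-use share.

```python
def end_to_end_speedup(cpu_fraction, cpu_speedup):
    """Amdahl's law: only the CPU-bound share of wall time gets faster."""
    return 1.0 / ((1.0 - cpu_fraction) + cpu_fraction / cpu_speedup)

CPU_FRACTION = 0.42  # share of agentic coding time spent on CPU tool use, per the post

for s in (1, 2, 4, 8):
    gain = end_to_end_speedup(CPU_FRACTION, s)
    print(f"{s}x faster tool use -> {gain:.2f}x tokens per unit of wall time")

# Upper bound if tool use became free: generation is all that remains.
ceiling = 1.0 / (1.0 - CPU_FRACTION)
print(f"ceiling: {ceiling:.2f}x")
```

Under these assumptions, doubling tool-use speed yields roughly a 1.27x gain in tokens per unit of wall time, and the ceiling is about 1.72x.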
Greg Jennings reposted
Sebastian Raschka
Added a DeepSeek Sparse Attention (DSA) from-scratch implementation to my LLMs-from-scratch repo thanks to an awesome new reader contrib. With motivation, overview, and GPT-style model reference implementation as standalone example code: github.com/rasbt/LLMs-fro…
41 replies · 239 reposts · 1.8K likes · 70.9K views
Greg Jennings reposted
alphaXiv @askalphaxiv
“Probabilistic Tiny Recursive Model” This paper makes Tiny Recursive Models stochastic at test time by adding Gaussian noise, running parallel rollouts, and using the existing Q head to pick the best answer. With no retraining and no task-specific tricks, its PPBench score jumps from 62.6% to 91.2%, while its Sudoku-Extreme score jumps from 87.4% to 98.75%.
6 replies · 71 reposts · 461 likes · 19K views
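The recipe in the post (add Gaussian noise, run several rollouts, let the Q head pick) amounts to a best-of-K selection loop. The sketch below shows only that loop: `toy_model`, its latent, and its confidence score are invented stand-ins, and `k` and `sigma` are assumed parameters, not values from the paper.

```python
import random

def toy_model(z):
    """Hypothetical stand-in for a recursive model: latent -> (answer, q score)."""
    answer = round(z)
    q = -abs(z - 3.0)  # made-up confidence; a real Q head is a learned output
    return answer, q

def best_of_k(z0, k=8, sigma=0.5, seed=0):
    """Perturb the latent with Gaussian noise k times and keep the top-q rollout."""
    rng = random.Random(seed)
    rollouts = [toy_model(z0 + rng.gauss(0.0, sigma)) for _ in range(k)]
    best = max(rollouts, key=lambda r: r[1])
    return best, rollouts

deterministic = toy_model(2.4)       # single pass, no noise
best, rollouts = best_of_k(2.4)      # stochastic rollouts plus selection
print("deterministic:", deterministic)
print("best of 8:    ", best)
```

The rollouts are independent of one another, so in practice they could run as one batch rather than a Python loop.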
Greg Jennings reposted
ModelScope @ModelScope2022
BitCPM4-CANN is now open source. 1.58-bit ternary LLM, trained at low-bit precision, not just quantized after. Apache 2.0. 🚀
🤖 modelscope.ai/collections/Op…
Weights stay at {-1, 0, 1} throughout training via QAT.
1B/3B/8B retain 95.7%~97.2% of full-precision MiniCPM4.
~6x inference memory reduction. Only 5% training overhead.
Same hardware, 6x more headroom. No special kernels needed.
Four model sizes available: 0.5B, 1B, 3B, 8B. All come with GGUF variants.
6 replies · 8 reposts · 107 likes · 6.8K views
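For readers wondering where "1.58-bit" comes from: a weight restricted to {-1, 0, 1} carries log2(3) ≈ 1.585 bits. The sketch below shows one common way to map weights to ternary values, the absmean rule from BitNet b1.58. Whether BitCPM4 uses this exact rule is an assumption here, and the example weights are made up.

```python
import math

def ternarize(weights, eps=1e-8):
    """Absmean ternary quantization (BitNet b1.58 style; assumed, not confirmed for BitCPM4)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from ternary codes and the shared scale."""
    return [qi * scale for qi in q]

weights = [0.9, -0.05, 0.4, -1.2]  # made-up example values
q, scale = ternarize(weights)
bits_per_weight = math.log2(3)

print("ternary codes:", q)
print("scale:", round(scale, 4))
print("bits per weight:", round(bits_per_weight, 3))
```

In quantization-aware training, a step like this would typically run in the forward pass with gradients passed through the rounding. That part is omitted here.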
Greg Jennings reposted
Theo - t3.gg @theo
I refuse to support any policy that makes it harder for the world's smartest people to come to the US.
54 replies · 154 reposts · 4.4K likes · 398.9K views
Greg Jennings reposted
Omar Khattab @lateinteraction
RL has almost always meant trying to maximize a scalar reward. Very expressive in theory, but do you have only ONE scalar reward? Preferences & tradeoffs are complex & high-dimensional! Vector Policy Optimization (VPO) trains LLMs to anticipate diverse environments and goals!
Ryan Bahlous-Boldi @RyanBoldi

Your RL post-training may be sabotaging your LLM’s test-time scaling! Conventional RL pretends that you can collapse all reward signals *upfront* into a single *scalar reward*. We introduce Vector Policy Optimization (VPO), which natively maximizes *vector-valued* rewards, boosting test time search performance, even on the original scalar.

7 replies · 40 reposts · 443 likes · 41.8K views
Greg Jennings reposted
MONTREAL.AI @Montreal_AI
A 0.6B model learned to manage giants. That is the idea behind TRINITY, a new ICLR 2026 paper by Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, and Yujin Tang.

The paper is not asking: “How do we build one model that knows everything?” It is asking something more interesting: “How do we build a small intelligence layer that knows who should think, who should act, and who should verify?”

TRINITY is a lightweight coordinator for LLMs. It does not merge weights. It does not require architectural compatibility. It does not need access to closed-model internals. It does not try to turn the coordinator into the smartest model in the room. Instead, it orchestrates a pool of strong models at test time, including closed and open models.

At each turn, TRINITY chooses a model and gives it one of three roles:
Thinker: plan and decompose
Worker: solve and execute
Verifier: critique and accept/revise

That may sound simple. It is not. Too many multi-agent systems are still prompts plus hope. TRINITY learns the coordination policy. A compact ~0.6B language model produces hidden-state representations of the conversation. A tiny head then uses those representations to decide the next model-role pair. The authors optimize this coordinator with an evolutionary strategy, sep-CMA-ES, because the problem is expensive, high-dimensional, and reward-sparse.

The result is not just better routing. It is learned division of labor. The paper reports that TRINITY outperforms individual models and existing coordination methods across coding, math, reasoning, and domain knowledge tasks. In its full-power setting, it reaches 86.2% on LiveCodeBench and transfers to held-out benchmarks including AIME, BigCodeBench, MT-Bench, and GPQA-D.

The most important idea here is bigger than the benchmark. The future of AI may not be a single supermodel. It may be an organization of models. A small conductor. A team of specialists. A protocol for planning, execution, and verification. An intelligence layer that learns how to allocate cognition.

This feels like a real shift:
from bigger models to better systems
from raw capability to coordinated capability
from “which model is best?” to “what structure makes many models better together?”

Full credit to the authors: Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, Yujin Tang.

Paper: TRINITY: An Evolved LLM Coordinator arxiv.org/abs/2512.04695

I’m attaching the first page because the abstract is worth reading closely. The future of AI may not be monolithic. It may be coordinated.

#ArtificialIntelligence #LLM #MultiAgentSystems #MachineLearning #EvolutionaryAlgorithms
5 replies · 50 reposts · 268 likes · 12.9K views
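The "tiny head picks the next model-role pair" step described in the post can be sketched as a linear scorer over all (model, role) combinations. Everything concrete below is hypothetical: the model names, the two-dimensional hidden state, and the hand-set head weights. In the paper the hidden state comes from a ~0.6B language model and the head is optimized with sep-CMA-ES. Neither is reproduced here.

```python
from itertools import product

MODELS = ["model_a", "model_b"]             # hypothetical model pool
ROLES = ["thinker", "worker", "verifier"]   # the three roles named in the post
ACTIONS = list(product(MODELS, ROLES))      # every (model, role) pair

def pick_action(hidden, head):
    """Linear head: one score per (model, role) pair, then take the argmax."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in head]
    best = max(range(len(ACTIONS)), key=scores.__getitem__)
    return ACTIONS[best], scores

# Made-up hidden state and hand-set head weights (one row per action).
hidden = [1.0, 0.0]
head = [
    [0.1, 0.0],   # model_a as thinker
    [0.2, 0.0],   # model_a as worker
    [0.0, 0.5],   # model_a as verifier
    [0.3, 0.0],   # model_b as thinker
    [0.9, 0.0],   # model_b as worker
    [0.0, 0.9],   # model_b as verifier
]

action, scores = pick_action(hidden, head)
print("next turn:", action)
```

Because the head only emits a discrete choice, its weights receive no gradient from the downstream models. That is consistent with the post's note that an evolutionary strategy is used instead.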
Greg Jennings reposted
Jan Tempus @Jan55028368
In our new paper, we reinterpret tokenisation as a problem in high-dimensional geometry (100M dims to be precise!), which we can solve efficiently to get a globally near-optimal tokeniser! Our method consistently improves language models over BPE. See 🧵for details.
14 replies · 75 reposts · 601 likes · 69.7K views
Greg Jennings reposted
Hamza Elshafie @hamzaelshafie
New in-depth blog post: "Dissecting ThunderKittens: Anatomy of a Compact DSL for High-Performance AI Kernels" This post is my attempt to dissect ThunderKittens from the bottom up. I approached TK by asking what each abstraction is really buying us: which hardware detail it corresponds to, how it maps onto the underlying layouts the GPU actually wants, what boilerplate it removes, and which parts of the GPU programming model still remain visible to us as kernel authors. The post walks through the tile abstractions TK provides: register, shared, and tensor memory tiles, global layouts, vector abstractions, warp/warpgroup compute, TMA, swizzling, Hopper WGMMA, Blackwell tcgen05, 2xSM MMA, tensor memory, Cluster Launch Control, TK’s pipeline templates, and static persistent tile scheduling. At the end, I demonstrate TK’s lcf pipeline template by implementing a non-causal attention prefill kernel and benchmarking it against FlashAttention-2 and FlashAttention-3 on an H100 PCIe across different sequence lengths. The kernel beats FA2 across the sweep by ~1.55x on average, and closely tracks FA3, where FA3 is only ~1.05x-1.17x faster on the longer sequence lengths. Blog link: hamzaelshafie.bearblog.dev/dissecting-thu… Repo: github.com/HamzaElshafie/… I also put an extensive list of resources at the end, which I found very useful for interested readers. Please note: this is my own independent writeup. I’m not affiliated with @HazyResearch, and any mistakes in the post are mine. If you spot any please reach out! 1 / xx
3 replies · 42 reposts · 353 likes · 37.4K views
Greg Jennings reposted
Ali Hatamizadeh @ahatamiz1
Gated DeltaNet-2 is here. 🚀

🔥 New paper: Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Gated DeltaNet-2 outperforms KDA and Mamba-3, the latest and best recurrent architectures, head to head at 1.3B. 🏆

💡 Here's the idea behind it: Linear attention squeezes an unbounded KV cache into a fixed-size recurrent state. The hard part isn't just what to forget, it's how to edit that memory without scrambling the associations already in it. Prior delta-rule models like Gated DeltaNet and KDA use one scalar gate to do two jobs at once: erasing old content and writing new content. But these two decisions act on different axes of the state, so tying them together is a real limitation.

Gated DeltaNet-2 decouples them.
✂️ a channel-wise erase gate b_t picks which key-side coordinates to read and remove
✍️ a channel-wise write gate w_t picks which value-side coordinates to commit
🔁 recovers KDA when both gates collapse to a scalar, and Gated DeltaNet when the decay collapses too
⚡ still trains fast: chunkwise WY algorithm with gate-aware backward, fused in Triton

📊 Results: We train 1.3B models on 100B tokens of FineWeb-Edu, matched in recurrent state size, against Mamba-2, Gated DeltaNet, KDA, and Mamba-3.
Best average on language modeling + commonsense reasoning, in both recurrent and hybrid settings
Biggest gains on long-context RULER retrieval. S-NIAH-3 jumps from 63 to 90 over KDA, and multi-key needle retrieval climbs from 28 to 38

Joint work with @YejinChoinka and @jankautz.

📄 Paper: shorturl.at/AAlVb
💻 Code: github.com/NVlabs/GatedDe…

#LinearAttention #StateSpaceModels #Mamba #LLM
21 replies · 99 reposts · 645 likes · 184K views
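To make the erase/write distinction in the post concrete, here is a small sketch of a gated delta-rule state update next to a decoupled variant. The scalar-gate form follows the published Gated DeltaNet recurrence, S' = αS(I − βkkᵀ) + βvkᵀ. The decoupled form is only a guess at what "channel-wise erase gate on the key side, channel-wise write gate on the value side" might look like. It is not taken from the Gated DeltaNet-2 paper, and the decay is kept scalar for brevity.

```python
def matvec(S, x):
    """Multiply a (d_v x d_k) state matrix by a length-d_k vector."""
    return [sum(s * xi for s, xi in zip(row, x)) for row in S]

def outer(a, b):
    """Outer product a b^T as a nested list."""
    return [[ai * bj for bj in b] for ai in a]

def combine(S, alpha, erase, write):
    """Return alpha*S - erase + write, elementwise."""
    return [[alpha * s - e + w for s, e, w in zip(rs, re, rw)]
            for rs, re, rw in zip(S, erase, write)]

def scalar_gated_delta(S, k, v, alpha, beta):
    """Gated delta rule: one scalar beta controls both erasing and writing."""
    erase = outer([alpha * beta * x for x in matvec(S, k)], k)
    write = outer([beta * x for x in v], k)
    return combine(S, alpha, erase, write)

def decoupled_delta(S, k, v, alpha, b, w):
    """HYPOTHETICAL decoupling: b gates key-side reads, w gates value-side writes."""
    gated_k = [bi * ki for bi, ki in zip(b, k)]
    erase = outer([alpha * x for x in matvec(S, gated_k)], k)
    write = outer([wi * vi for wi, vi in zip(w, v)], k)
    return combine(S, alpha, erase, write)

# Made-up state, key, value, and gate values.
S = [[0.5, -0.2], [0.1, 0.3]]
k, v = [0.6, 0.8], [1.0, -1.0]
alpha, beta = 0.9, 0.5

tied = scalar_gated_delta(S, k, v, alpha, beta)
collapsed = decoupled_delta(S, k, v, alpha, [beta, beta], [beta, beta])
split = decoupled_delta(S, k, v, alpha, [1.0, 0.0], [0.0, 1.0])
print("tied gates:     ", tied)
print("collapsed gates:", collapsed)
print("split gates:    ", split)
```

When both gate vectors are set to the same scalar, this variant reproduces the scalar-gate update, which mirrors the collapse property the post describes.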