alex wortega

376 posts

@justALEXWORTEGA

my opinions

Joined May 2016
612 Following · 130 Followers
alex wortega retweeted
Pavlo Molchanov @PavloMolchanov
🚀 Self-speculation brings 6.75× real speedup for LLM generation with SGLang inference!
Same model drafts future tokens in Diffusion mode → then verifies them in AR (causal) mode. One model and one KV cache. Just different attention masks.
Thanks to perfect alignment, we get 2× longer acceptance lengths than MTP techniques (Eagle-3, MTP, dFlash). We run 2 forward passes… but the 2× higher acceptance means we break even - and with zero overhead from extra drafter, KV cache, or LM head that comes with MTP - those are not free.
Last week we released Nemotron-Labs-Diffusion + Tri-mode LLMs! We did continued pre-training on Ministral-3 models by switching attention patterns (block causal <> bidirectional). Result: one model that runs AR mode, Diffusion mode, and Self-Speculation.
Diffusion mode already shows high benchmark accuracy - excited to see what happens when someone beats left-to-right acceptance! 🔥
Github: github.com/NVlabs/Nemotro…
Paper: d1qx31qr3h6wln.cloudfront.net/publications/N…
SGLang inference: github.com/sgl-project/sg…
Try the models on HF: huggingface.co/collections/nv…
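The draft-then-verify step in the post above can be sketched with a toy greedy acceptance rule. This is a minimal illustration of standard greedy speculative decoding, not the Nemotron or SGLang implementation: `accept_draft`, `next_token`, and the toy verifier are hypothetical names, and a real system would score all draft positions in one causal forward pass instead of a Python loop.

```python
from typing import Callable, List


def accept_draft(
    prefix: List[int],
    draft: List[int],
    next_token: Callable[[List[int]], int],
) -> List[int]:
    """Keep the longest run of draft tokens the verifier agrees with.

    `next_token` stands in for the AR (causal) pass: given a context it
    returns the verifier's greedy choice. (Hypothetical interface.)
    """
    accepted: List[int] = []
    context = list(prefix)
    for tok in draft:
        expected = next_token(context)
        if tok != expected:
            # First mismatch: take the verifier's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        context.append(tok)
    # Whole draft accepted: the verifier still yields one extra token.
    accepted.append(next_token(context))
    return accepted


def toy_verifier(context: List[int]) -> int:
    """Toy AR model (assumption for the demo): counts upward modulo 10."""
    return (context[-1] + 1) % 10


# The draft gets two tokens right, then diverges at the third position.
result = accept_draft([1, 2, 3], [4, 5, 7, 8], toy_verifier)
print(result)
```

The number of tokens returned per call is the acceptance length the post refers to: the longer the agreed run, the more tokens each draft-plus-verify pair of passes produces.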
alex wortega @justALEXWORTEGA
I always wonder how they are that creative
Sakana AI @SakanaAILabs

Introducing DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation
pub.sakana.ai/diffusionblocks
What if we didn’t have to hold an entire neural network in memory to train it?
Standard neural net training optimizes all parameters jointly. As a result, the memory required during training grows linearly with the depth of the network. In our #ICLR2026 paper, we propose DiffusionBlocks, a principled framework to train networks one block at a time, drastically reducing memory requirements while matching end-to-end performance.
With DiffusionBlocks, we split the network into blocks and train them one at a time, so you only need memory for a single block. How? We explicitly assign each block a role: to move the representation a little closer to the target than the block before it did. That role turns out to be precisely what a diffusion model does, step by step. Each block only needs to optimize its own objective and can be trained independently.
We validated this across five different architectures:
• ViT
• DiT
• Masked diffusion
• Autoregressive transformers
• Recurrent-depth transformers
In each case, performance is competitive with end-to-end training while using a fraction of the memory.
This perspective also extends naturally to recurrent-depth (Looped) transformers, which apply the same network iteratively and normally require expensive backpropagation through time (BPTT). Viewed through DiffusionBlocks, we can replace those multiple iterations with a single forward pass during training.
Read our paper and code to learn more.
Paper: arxiv.org/abs/2506.14202
GitHub: github.com/SakanaAI/Diffu… 🐟

alex wortega retweeted
Shuo Yang @Andy_ShuoYang
Flash-KMeans was only the beginning. Today, from the Flash-KMeans team, we are releasing FlashLib — a GPU library for fast, predictable, agent-ready classical ML operators. Up to 26× on KMeans, 19× on KNN, 40× on HDBSCAN, 208× on TruncatedSVD, 47× on PCA, 147× on exact t-SNE, and 49× on MultinomialNB over state-of-the-art (cuML). Blog: flashml-org.github.io Code: github.com/FlashML-org/fl…
alex wortega @justALEXWORTEGA
fully in your browser in zero gpu, 4b qwen based. tuned for pi agent, hits 10% on Terminal Bench 2.
powerful enough to write small projects - it built Tetris in seconds, live, inside a HF Space
huggingface.co/spaces/AlexWor…
alex wortega @justALEXWORTEGA
SkillOpt: train the skill, not the weights (by Microsoft)
Instead of finetuning the model or hand-tuning prompts (who finetunes models these days?), optimize the natural-language skill doc itself. The agent stays frozen, the .md file learns.
The loop looks exactly like GEPA:
• rollout - frozen agent runs tasks, logs scored trajectories
• reflect - a separate optimizer model reads success/fail minibatches, finds reusable rules
• bounded edits - add/delete/replace under a budget = a "textual learning rate", so good rules don't get nuked
• gate - edit is accepted only if held-out selection score goes up
Output is a single best_skill.md that transfers across models and harnesses (Codex-trained skill → Claude Code, +31.8). Best-or-tied in 52/52 model×benchmark settings, 7 target models, 6 benchmarks.
microsoft.github.io/SkillOpt/
[image]
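The rollout / reflect / bounded edits / gate loop described above can be sketched as a small text-optimization loop. This is a toy under stated assumptions, not SkillOpt's actual code or API: every name here (`optimize_skill`, `apply_edits`, `make_toy_proposer`, `toy_score`) is hypothetical, the "skill" is just a list of rule strings, and the same toy scorer stands in for both the rollout score and the held-out selection score.

```python
from typing import Callable, List, Tuple

Skill = List[str]
Edit = Tuple[str, ...]  # ("add", rule) | ("delete", rule) | ("replace", old, new)


def apply_edits(skill: Skill, edits: List[Edit]) -> Skill:
    """Return a copy of the skill with add/delete/replace edits applied."""
    out = list(skill)
    for e in edits:
        if e[0] == "add" and e[1] not in out:
            out.append(e[1])
        elif e[0] == "delete" and e[1] in out:
            out.remove(e[1])
        elif e[0] == "replace" and e[1] in out:
            out[out.index(e[1])] = e[2]
    return out


def optimize_skill(
    skill: Skill,
    propose_edits: Callable[[Skill, float], List[Edit]],
    run_rollouts: Callable[[Skill], float],
    heldout_score: Callable[[Skill], float],
    steps: int = 5,
    edit_budget: int = 1,
) -> Tuple[Skill, float]:
    best, best_score = list(skill), heldout_score(skill)
    for _ in range(steps):
        feedback = run_rollouts(best)            # rollout: frozen agent is scored
        edits = propose_edits(best, feedback)    # reflect: optimizer suggests edits
        candidate = apply_edits(best, edits[:edit_budget])  # bounded edits
        cand_score = heldout_score(candidate)
        if cand_score > best_score:              # gate: keep only improvements
            best, best_score = candidate, cand_score
    return best, best_score


# Toy setup (all values are made up for illustration).
GOOD = {"run tests before committing", "read the error log first"}
BAD = {"always rewrite the whole file"}
POOL = ["always rewrite the whole file", *sorted(GOOD)]


def toy_score(skill: Skill) -> float:
    return sum(r in GOOD for r in skill) - sum(r in BAD for r in skill)


def make_toy_proposer(pool: List[str]) -> Callable[[Skill, float], List[Edit]]:
    queue = list(pool)

    def propose(skill: Skill, feedback: float) -> List[Edit]:
        # A real optimizer model would read trajectories; this pops the next rule.
        return [("add", queue.pop(0))] if queue else []

    return propose


best_skill, best_score = optimize_skill([], make_toy_proposer(POOL), toy_score, toy_score)
print(best_skill, best_score)
```

In this toy run the harmful rule is proposed first and rejected by the gate, while the two helpful rules are accepted one edit at a time, which is the role the post assigns to the edit budget and the held-out check.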
alex wortega retweeted
Pavlo Molchanov @PavloMolchanov
We’re releasing Nemotron-Labs-Diffusion - the first Tri-mode LM family (3B/8B/14B) that switches between 1⃣ Autoregressive, 2⃣ Diffusion, and 3⃣ Self-Speculation decoding by simply changing the attention pattern/mask.
One model. Three decoding modes. No extra draft models. No architecture changes. Just significantly better efficiency across different concurrency levels. Up to 4× higher real throughput for a single user.
🤗 HF Collection: huggingface.co/collections/nv…, open license
🛜 Project page: research.nvidia.com/publication/20…
📰 Tech report: bit.ly/Nemotron-Labs-…
Details below 👇
[image]
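To make "changing the attention pattern/mask" concrete, here is a small sketch of three common boolean attention masks over the same sequence. These are textbook definitions, and whether Nemotron-Labs-Diffusion uses exactly these layouts is an assumption; the function names and the `block` parameter are illustrative only. `mask[i][j]` is `True` when position `i` may attend to position `j`.

```python
from typing import List

Mask = List[List[bool]]


def causal_mask(n: int) -> Mask:
    """Autoregressive: each position sees itself and earlier positions."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> Mask:
    """Fully bidirectional: every position sees every position."""
    return [[True] * n for _ in range(n)]


def block_causal_mask(n: int, block: int) -> Mask:
    """Block causal: full attention inside a block, causal across blocks."""
    return [[(j // block) <= (i // block) for j in range(n)] for i in range(n)]


def show(mask: Mask) -> str:
    """Render a mask as rows of 1/0 for quick inspection."""
    return "\n".join("".join("1" if v else "0" for v in row) for row in mask)


print(show(causal_mask(4)))
print()
print(show(block_causal_mask(4, block=2)))
```

With `block=1` the block-causal mask reduces to the causal one, and with `block=n` it becomes fully bidirectional, which is one way to see how a single set of weights could be run under different patterns.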
alex wortega @justALEXWORTEGA
AI safety is amazing
[image]
alex wortega @justALEXWORTEGA
"this feature will take 4 weeks to implement" - says Claude, and spends 30 minutes on it
[image]
alex wortega @justALEXWORTEGA
Happy Eastern Europe transport Saturday
[3 images]
alex wortega retweeted
Leandro von Werra @lvwerra
We released physics-intern: a simple harness for science problems! It gets models like Gemini 3.1 Pro to go from 17.7 -> 31.4, thus beating GPT 5.5 Pro.
The physics-intern harness can wrap any model and, via a dedicated subagent, boost the performance of vanilla reasoning models. While I think more and more of these harness capability gains will be absorbed into the models (like prompting tricks disappeared over time), there is a lot to be gained right now by building good scaffolds for those models and integrating tools well.
Interestingly, the one exception we found is that GPT 5.5 Pro actually didn't benefit from the physics-intern harness!
Read more about it here: huggingface.co/spaces/hugging…
PS: I think the Harness[Model] notation is kind of nice.
alex wortega retweeted
Alexander S @devdef
[image]