Peter Szemraj

184 posts

Peter Szemraj

@ten3br1s

research interests: metallic intuition

ATL Katılım Aralık 2022

172 Takip Edilen35 Takipçiler

Peter Szemraj retweetledi

eishan@eishanlawrence5·20 Mar

@Kimi_Moonshot @cursor_ai

QME

103

1.6K

113.5K

Alexander Doria@Dorialexander·5d

So great news for tokenizer(-less) nerds: we have new byte latent transformers including a diffusion variant.

Julie Kallini ✨@JulieKallini

Fast Byte Latent Transformer is accepted to ICML 2026! ⚡🥪 Byte-level LMs promise to free us from subword tokenizers, but decoding one byte at a time is super slow. We make BLT generation more efficient with BLT-D: text diffusion for parallel byte decoding. 1/

English

343

25.1K

Peter Szemraj@ten3br1s·5d

@Dorialexander Super cool. I was hoping for evabyte to have a full release given that they were still working on it as of April last year + never released a fully pretrained model.. But any work in this space (that isn't Bolmo) is good work #issuecomment-3648193501" target="_blank" rel="nofollow noopener">github.com/OpenEvaByte/ev…

English

Peter Szemraj@ten3br1s·5d

@_albertgu Yeah, congrats on Raven! maybe the timing just ended up not being ideal but I don't understand what or how Mamba 3 is trying to claim (selectively against GDN, by the way) given that you already posted Raven before..

English

Albert Gu@_albertgu·7 May

Introducing a new sequence model Raven which pushes the boundary of fixed-state-size sequence models! Raven bridges popular linear-time models with constant state capacity, like SSMs and sliding window attention (SWA). Like SWA, its state is a finite set of slots; unlike SWA, Raven learns to selectively choose which slots to update with each new token it caches. This is a much more principled update mechanism that leads to dramatically better retrieval abilities than prior linear models. I personally don't think SWA is a very principled model - but it's convenient and works well empirically - and am most excited to see Raven be used as a strictly better drop-in replacement. More broadly the framework it develops hopefully introduces more ideas to combine the strengths of SSM-like and attention-like models. This work was led by @rshia_afz and @avivbick

Arshia Afzal@rshia_afz

1/ SSMs struggle on recall benchmarks due to their fixed-size state. But are current models actually storing context “wisely”? Introducing Raven 🐦‍⬛, the first SSM with selective memory allocation! Raven achieves SOTA performance on recall-heavy tasks with the highest length generalization, extending up to 16× beyond its training sequence length. Raven is a strict upgrade over SWA in the way it stores past context! This is the most elegant model I’ve been involved in designing so far shoutout to @avivbick and @_albertgu for their trust and amazing work! Check out how Raven bridges between SWA and SSM👇

English

310

40.8K

Peter Szemraj retweetledi

dr. jack morris@jxmnop·7 May

when my phd advisor asks me for a weekly update on my experiments

English

122

3.5K

111.4K

Peter Szemraj@ten3br1s·7 May

Thank you, Anthropic. This is so ethical. We can now pump methane emissions into the communities around Memphis twice as fast!

Claude@claudeai

We’ve agreed to a partnership with @SpaceX that will substantially increase our compute capacity. This, along with our other recent compute deals, means that we’ve been able to increase our usage limits for Claude Code and the Claude API.

English

Peter Szemraj@ten3br1s·4 May

OC by yours truly

English

Peter Szemraj@ten3br1s·3 May

who would win: 1.6T params MoE (with 20 tryhard 'innovations') from memeseek or one brave lil 27B dense model? @Alibaba_Qwen #artificial-analysis-long-context-reasoning-benchmark-leaderboard-results" target="_blank" rel="nofollow noopener">artificialanalysis.ai/evaluations/ar…

English

102

Peter Szemraj@ten3br1s·3 May

@Dorialexander oops, last reply. This other chart (raw perf) illustrates not just the inefficiency, but the stagnation. kimi 2.6 is better and didnt add 20 meme tryhard "innovations" that barely improve anything

English

Peter Szemraj@ten3br1s·3 May

@Dorialexander The reason I share this is because no one is really innovating on architecture; they're all just MoE++. They haven't changed anything (perf wise). Look at the charts.. and the MoE is transformer cope in the first place artificialanalysis.ai/evaluations/ar…

English

Alexander Doria@Dorialexander·2 May

So DeepSeek-V4: finally took me the week. Overall the paper is attempting many things at once, not easy to disentangle as it's all surprisingly connected. It's first a serious attempt at briding the gap between close and open LLM architecture. It is generally rumored that Opus and [largest model bundled in GPT-5] belong to an entirely different category of models: very large, very sparse mixture of experts, able to holding an unprecendently wide search space while still being servable. Simply put current hardware cannot hold a model on one node, so you have to play with the interconnect and various level of quantization, for different layers, at different stage of training. An important focus of DsV4 is on communication latency, showing it can be hidden through effective management of interconnect (roughly you slide communication time inside computation side). Overall, you cannot simply enter this game without the capability to rewrite kernels from scratch and the model report relentlessly come back to this. Because this is the frontier game. It's then a radical, but very successful attempt at making long context simultaneously more efficient and more affordable. Long context is literally a "context" problems: what exactly is worth attending? An obvious fix is to prioritize the most recent tokens. This might be sufficient for basic search but not for the new demands of agentic pipelines that require accurate recall of distant yet strategic content. V4 clever approach is to rely on two different axis of memorization by allocating layers to two different attention compression schemes. Like the name suggest, Heavily Compressed Attention is the brute force method collapsing each sequence of 128 tokens to a unique entry and take care of the fuzzy yet global context. Compressed Sparsed Attention rely on a "lighting indexer" to bring the relevant local blocks for query, even when they can be thousands of tokens away. Everything here is optimized for end inference: there is very large head_dim (512) which is costlier for training but allows for even more compressed kv cache which is your actual bottleneck at inference time, especially in prefill mode. End result is very classical DeepSeek play, introducing a new radical disruption of inference economics after DSA. I predict hybrid CSA/HCA (or similar counterparts) will be essentially part of the mainstream arch by the end of this year. Now we come to the more ambitious but also more unfinished part: an attempt at redefining model architecture and the learning signal. Most preeminent part is mHC and hybrid CSA/HCA, but it's actually a long list of less documented innovations: swapping softmax for sqrt(softplus) or using an hybrid two-stage scheme with non-standard values for Muon. Yet the interconnection all of these new components is still unknown and likely to account for the significant training unstabilities: typically "mHC involves a matrix multiplication with an output dimension of only 24" which introduces non-determinism. Even one the best AI labs in the world will run here into ablation combinatorial explosion, so the association of all these choices is likely non-tractable and would require a more consistent theory — which the conclusion gestures at, but does not solve ("In future iterations, we will carry out more comprehensive and principled investigations to distill the architecture down to its most essential designs"). The more limited experiments in post-training are maybe more promising. Significantly, the one lab that popularized the standard RL+reasoning recipe is rethinking the recipe. For now it's a two stage design (RL on specialized model, then on-policy distillation): ever since Self-Principled Critique Tuning DeepSeek has been concerned with expanding the reasoning training signal beyond final sparse reward. I'm not sure this is final say: in this domain everything is a bit in flux and you could even argue the type of verified pipeline we designed for SYNTH is a form of extreme offline RL-like training. There is an even longer term plan (here >3-5 years), which is about redefining hardware. For now it's a way of transforming a constraint into an opportunity: as the leading Chinese labs, DeepSeek was very incentivized to make training work on Ascend and contribute to the national effort for chips autonomy. Very unusually, the report includes a lengthy wishlist for future hardware to come in the report itself. As several experts noted, many of these recommendations don't really hold up for Nvidia but make perfect sense for a newcomer in the GPU hardware business. DeepSeek seem to be anticipating a world where labs have to secure a close hardware partner to retroactively fit the chips to the particular demand of model design or inference. Now there is what DeepSeek did not do yet. The paper hardly mention anything about synthetic pipelines, rephrasing, simulated environment. Training data size (32T tokens) likely involve some significant part of generated data, as this is more quality tokens than the web and other digitized sources could held — so maybe similar synthetic proportions as Trinity (roughly half) or Kimi. Still, it's pretty clear that all their attention was focused on the infra, architecture and scaling side, leaving a proper extensive retraining for later. This is likely not that dissimilar to how Anthropic or OpenAI proceeded: the fact we're still in the middle of the same model series even though significant parts of the model have changed (the tokennizer with Opus 4.7) suggests that a model lifecycle involves multiple rounds of training potentially as large as a pretraining a few years ago. The fact DeepSeek took on multiple Moonshot innovation (and Moonshot in turn has been hugely reliant on DeepSeek) suggest we might also have an ecosystem dynamic here. Maybe DeepSeek can exclusively focus on hard infrastructure problems and expect some of the axis of development to be sorted out later.

English

103

798

73.8K

Peter Szemraj@ten3br1s·3 May

@PyTorch looking forward to the subsequent blog post (in an unknown amount of time) where working on multi-node is explicitly tested and confirmed. While you're at it you can fix deepcompile for H100 multinode!!

English

PyTorch@PyTorch·29 Nis

Want to train LLMs on longer contexts without re-engineering your entire systems stack? Introducing AutoSP — the first compiler-based solution that automatically optimizes LLM training for long contexts. Under the hood, AutoSP applies a series of compiler passes that trigger sequence parallelism, paired with a curated activation-checkpointing scheme tailored for long-context training. It's integrated directly into DeepSpeed, so enabling long-context training is just a config change away. No more rewiring your stack to push context lengths. Read the blog to learn more 🖇️ pytorch.org/blog/introduci… ✍ @AhanGupta13, Zhihao W., Neel Dani, @toh_tana, Tunji Ruwase, @_Minjia_Zhang_ #PyTorch #DeepSpeed #AutoSP #OpenSourceAI

English

120

16.3K

Peter Szemraj@ten3br1s·23 Nis

.. by using way more reasoning tokens for minimal gain (conveniently not disclosed)

Qwen@Alibaba_Qwen

🚀 Meet Qwen3.6-27B, our latest dense, open-source model, packing flagship-level coding power! Yes, 27B, and Qwen3.6-27B punches way above its weight. 👇 What's new: 🧠 Outstanding agentic coding — surpasses Qwen3.5-397B-A17B across all major coding benchmarks 💡 Strong reasoning across text & multimodal tasks 🔄 Supports thinking & non-thinking modes ✅ Apache 2.0 — fully open, fully yours Smaller model. Bigger results. Community's favorite. ❤️ We can't wait to see what you build with Qwen3.6-27B! 👀 🔗👇 Blog: qwen.ai/blog?id=qwen3.… Qwen Studio: chat.qwen.ai/?models=qwen3.… Github: github.com/QwenLM/Qwen3.6 Hugging Face: huggingface.co/Qwen/Qwen3.6-2… huggingface.co/Qwen/Qwen3.6-2… ModelScope: modelscope.cn/models/Qwen/Qw… modelscope.cn/models/Qwen/Qw…

English

Peter Szemraj@ten3br1s·11 Nis

maybe @anthropic should make Mythos GA given that Opus ignores me saying "i think the issue is network related" for days without any real evidence for doing so

English

Peter Szemraj@ten3br1s·28 Mar

Girls Just Wanna Have Ontological Dread - EP

English

Peter Szemraj@ten3br1s·28 Mar

why is there a new subfield of research for "llms for SVG"? who asked for this? huggingface.co/papers/2603.24…

English

Peter Szemraj@ten3br1s·28 Mar

why did it take a marketing blog post from google for people to realize hadamard rotations (already demonstrated by others) work

BuBBliK@k1rallik

> been paying $200/month for cloud AI APIs > laptop: M2 MacBook, 16GB RAM > tried running models locally, garbage quality after 4K tokens > read this TurboQuant breakdown on Tuesday > applied 3-bit KV cache compression > same MacBook now runs 100K token conversations > quality: identical to cloud > cancelled all API subscriptions Wednesday > it's been 3 days > saved $200/month forever > with a free algorithm from a free paper > my MacBook didn't change. the math did

English

Peter Szemraj retweetledi

tetsuo.cpp (no slop)@tetsuo_cpp·24 Mar

<think>The user appears to have hot singles in his area. I should consider clicking the dialog and signing up.</think>

Claude@claudeai

You can now enable Claude to use your computer to complete tasks. It opens your apps, navigates your browser, fills in spreadsheets—anything you'd do sitting at your desk. Research preview in Claude Cowork and Claude Code, macOS only.

English

401

9.4K

310.9K

Peter Szemraj retweetledi

Silen Naihin@silennai·25 Mar

Me and @okfallah built an open-source repo to apply continual learning to autoresearch with self distillation policy optimization (SDPO) We managed to beat Karpathy's baseline by recursively self improving Qwen3 14b on 8xH100s Results and learnings 👇

Andrej Karpathy@karpathy

I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically nanochat LLM training core stripped down to a single-GPU, one file version of ~630 lines of code, then: - the human iterates on the prompt (.md) - the AI agent iterates on the training code (.py) The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. github.com/karpathy/autor… Part code, part sci-fi, and a pinch of psychosis :)

English

Peter Szemraj@ten3br1s·22 Mar

this should be enough context for your favorite LLM to help you rewrite any ML concept to sound instantly smarter/more generalizable

English

Peter Szemraj@ten3br1s·20 Mar

at what point in RL research do you learn how to big brain rename existing concepts?

English

Keşfet

@Kimi_Moonshot @cursor_ai @Dorialexander @_albertgu @rshia_afz @avivbick @Alibaba_Qwen @PyTorch