Blaze (Balázs Galambosi)

4.7K posts

@gblazex

A Smooth Guy; Developer of SmoothScroll for macOS, Windows & Google Chrome.

Joined April 2010
1.5K Following · 1.2K Followers
Pinned Tweet
Blaze (Balázs Galambosi)
Looking further into LLM benchmark x-correlations:
- Top row: how each benchmark relates to human judgement (Arena Elo)
- Other rows: any benchmark pair & their relationship
- On the right: samples = # of models tested for each benchmark
thx: @chipro @maximelabonne @ldjconfirmed
[attached image: benchmark correlation matrix]
Andrej Karpathy@karpathy

@AlphaSignalAI @ClementDelangue I pretty much only trust two LLM evals right now: Chatbot Arena and r/LocalLlama comments section

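A cross-correlation matrix like the one described in the pinned post is a few lines of pandas. A minimal sketch only: the model names, benchmarks, and scores below are made-up placeholders, not the data behind the chart.

```python
import pandas as pd

# Hypothetical scores for five models on three benchmarks; NaN marks a
# benchmark a model was never run on. All numbers are invented.
scores = pd.DataFrame(
    {
        "ArenaElo": [1250, 1180, 1120, 1050, None],
        "MMLU":     [86.4, 79.0, 70.1, 63.2, 58.9],
        "GSM8K":    [92.0, 80.5, None, 57.3, 40.2],
    },
    index=["model-a", "model-b", "model-c", "model-d", "model-e"],
)

# Pairwise Pearson correlations, computed over the models tested on both
# benchmarks of each pair (pandas drops NaNs pairwise).
corr = scores.corr(method="pearson")

# "samples" per pair: how many models have scores for both benchmarks.
mask = scores.notna().astype(int)
samples = mask.T @ mask
```

Here `corr.loc["ArenaElo"]` would be the "top row" of the chart: each benchmark's relationship to Arena Elo, with `samples` giving the number of models behind each cell.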
Blaze (Balázs Galambosi)
@nickhistgeek yes I just tested it with some not so common languages (Hungarian, Persian, etc) and it voice cloned from 3s audio with crazy good quality!
Blaze (Balázs Galambosi)
I barely see OmniVoice 0.6B TTS mentioned outside Chinese twitter, even though it's the #1 trending TTS model on @huggingface. Supports a staggering 600 languages with zero-shot voice cloning! Apache 2.0 license.
Wildminder@wildmindai

New interesting TTS - OmniVoice.
- zero-shot TTS, 600+ languages!
- single-stage arch based on Qwen3-0.6B
- fast inference
- beats ElevenLabs v2
- +voice cloning +voice design
Sounds pretty natural. zhu-han.github.io/omnivoice/

Michael@michael_chomsky·
What's the fastest text to speech provider? I don't care about quality as much as latency being low even for thousands of words.
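Latency-first TTS shopping mostly comes down to measuring time-to-first-audio-chunk on each provider's streaming endpoint. A minimal sketch, where `fake_tts_stream` is a hypothetical stand-in for a real provider's streaming response body:

```python
import time

def time_to_first_chunk(chunks):
    """Return (seconds until the first chunk arrives, that first chunk).

    `chunks` is any iterator of audio byte chunks, e.g. a streaming
    HTTP response body from a TTS API.
    """
    start = time.perf_counter()
    first = next(iter(chunks))
    return time.perf_counter() - start, first

def fake_tts_stream():
    # Stand-in for a provider: first audio bytes arrive after ~50 ms.
    time.sleep(0.05)
    yield b"RIFF...audio..."
    yield b"...more audio..."

latency, chunk = time_to_first_chunk(fake_tts_stream())
```

For thousands of words you would also track sustained chunk throughput, since a provider can have a fast first byte but slow synthesis overall.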
Ruoyu Sun@RuoyuSun_UI·
Yes — closely related and we're glad to see this direction getting more attention. Two differences in our work (SePT):
1) Domain: we focus on math reasoning; Apple's paper focuses on coding.
2) Online Refresh: we use "online" interleaving of generation and training. After each update on self-generated responses, the updated model is used to generate the next batch of responses. We also tried an offline variant, where the base model generates all training data at once; overall it helps, but not as much as the online version.
Ruoyu Sun@RuoyuSun_UI·
We’re excited to share our work "A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning". An earlier version of this work has been on arXiv for a few months; we added more experiments and revised it under this new title.

The recipe is simple: the model samples its own responses at low temperature, learns from them with ordinary SFT training, and repeats. No reward. No verifier. No fancy objective beyond standard SFT.

On Qwen2.5-Math-7B, mean Pass@1 over 6 math benchmarks improves 22.7 → 39.5. Note that mean Pass@32 also improves 61.0 → 67.9, suggesting that this simple reward-free procedure unlocks more of the model's existing reasoning potential.

See the updated paper directly at: github.com/ElementQi/SePT… The arXiv link is: arxiv.org/abs/2510.18814 The updated version will appear on arXiv shortly. @Phanron_xli
[attached image]
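The recipe described above is essentially a two-line loop. A sketch of the data flow only: `generate` and `sft_train` are hypothetical stand-ins (here a version counter), not a real inference/training stack.

```python
def generate(model, prompt, temperature):
    # Stand-in: a real implementation samples a response from the LLM.
    return f"response to {prompt!r} at T={temperature}"

def sft_train(model, batch):
    # Stand-in: a real implementation runs ordinary SFT on `batch`.
    # Here "model" is just a counter so the loop structure is visible.
    return model + 1

def self_train(model, prompts, rounds=3, temperature=0.3):
    for _ in range(rounds):
        # Online refresh: the *current* model generates the next batch,
        # then is updated on its own responses. No reward, no verifier,
        # no objective beyond standard SFT.
        batch = [(p, generate(model, p, temperature)) for p in prompts]
        model = sft_train(model, batch)
    return model

final = self_train(model=0, prompts=["2+2=?", "sqrt(16)=?"])
```

The "online" variant is exactly this interleaving; the offline variant described above would generate all batches from the base model before any training step.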
Blaze (Balázs Galambosi) retweeted
Chayenne Zhao@GenAI_is_real·
We're Not Wasting Tokens — We're Wasting the Design Margin of the Entire Inference Stack

A few days ago I read a post by Fuli Luo on Twitter, discussing Anthropic's decision to cut off third-party harnesses (OpenClaw) from using Claude subscriptions, and the design thinking behind MiMo's Token Plan pricing. Her core argument: global compute capacity is seriously falling behind the token demand created by agents. The way forward isn't selling tokens cheaper in a race to the bottom — it's the co-evolution of "more efficient agent harnesses" and "more powerful, efficient models."

I read it several times over. People who build inference engines have long been frustrated by how wastefully agent frameworks burn through tokens. She articulated something the industry has tacitly acknowledged but rarely stated plainly — and she did it with precision and restraint: the compute allocation crisis we face today is not fundamentally about insufficient compute. It's about tokens being spent in the wrong places.

I want to push this one layer deeper, from my own perspective. I'm a heavy user of Claude Code — I make no attempt to hide that. You can check that all the latest code in SGLang Omni was built with Claude Code powering my workflow. Its commercial success is beyond question; it genuinely gave many people (myself included) their first real experience of "coding with an agent."

But I'm also an inference engine developer — my day job is figuring out how to push prefix cache hit rates higher, how to make KV cache memory layouts more efficient, how to drive down the cost of every single inference request. So when I plugged Claude Code into a local inference engine and started observing the actual request patterns it generates, my reaction was — how to put it — like a water engineer who spent months designing a conservation system, only to watch someone water their garden with a fire hose.

I measured Claude Code's cache hit rate on my local serving engine over the course of a day. The numbers were painful. This isn't a case of "decent but room to improve." It's a case of "the prefix cache mechanisms we carefully engineered at the inference layer are being almost entirely defeated."

Fuli Luo mentioned that OpenClaw's context management is poor — firing off multiple rounds of low-value tool calls within a single user query, each carrying over 100K tokens of context window. Frankly, Claude Code's own context management is nowhere near making proper use of prefix cache or any of the other optimizations we've built into inference engines. Many people have already noticed — for example, the resume feature has a bug that causes KV cache misses entirely, which is borderline absurd. I'll say it plainly: the way sessions construct their context was never seriously designed with cache reuse in mind from the start.

Perhaps Anthropic has internal trade-offs we can't see — after all, they control both ends of the stack, model and inference, and can theoretically do optimizations at the API layer that are invisible to us. But from the external behavior I can observe, enormous volumes of tokens are being spent on: re-transmitting already-processed context, re-parsing already-confirmed tool call results, and maintaining an ever-inflating conversation history with extremely low information density.

If this is merely to earn more on inference token charges, I find it genuinely regrettable. But many Claude Code users are on subscriptions — burning more tokens is fundamentally a cost burden for Anthropic, not revenue. I honestly don't understand what purpose such inefficient context management serves for Claude Code.

Here's a bold hypothesis: for those long sessions that consume 700K+ tokens, there is certainly a way to restructure the session's context so it accomplishes the exact same task with 10% of the tokens. Not by sacrificing quality, but through smarter context compression, more rational prefix reuse strategies, and more precise tool call scheduling. This isn't theoretical speculation — anyone who has worked on inference engine optimization, upon seeing current agent framework request patterns, would arrive at a similar conclusion.

Fuli Luo is right: global compute capacity can't keep up with the token demand agents are creating. But I'd add that a significant portion of that gap is an illusion of prosperity — artificial demand manufactured by the crude design of agent frameworks.

Here's an analogy I keep coming back to. I've always liked bringing up RAM bloat — in 1969, 64KB of memory sent Apollo to the moon. In 2026, I open a single webpage and 500MB of memory usage is nothing unusual. Every generation of hardware engineers pushes memory capacity higher, and every generation of software engineers lavishly fills it to the brim. People have gotten used to this cycle, even come to see it as the normal cost of progress.

But LLM inference is different. The cost of RAM bloat is your computer running a bit slower, spending a couple hundred bucks on a memory upgrade — users barely notice. The cost of token bloat is real money — GPU cluster electricity bills, user subscription fees, the industry's entire compute budget. And this cost scales exponentially as agent usage grows. If we don't establish the engineering discipline that "tokens should be used efficiently" in the early days of the agent era, the cost of catching up later, once scale kicks in, will be beyond imagination.

Fuli Luo notes that Anthropic cutting off third-party harness subscription access is objectively forcing these frameworks to improve their context management. I agree with that assessment, but my gut feeling is that this shouldn't stop at "third-party frameworks need to be more frugal with tokens." It should trigger a more fundamental reflection: what kind of agent-inference co-design do we actually need?

Right now, agent frameworks and inference engines are essentially fully decoupled — agent frameworks treat the inference engine as a stateless API, sending the full context with every request. Meanwhile, the inference engine does its best with prefix matching, caching whatever it can. This architecture is simple and general-purpose, but brutally inefficient for long sessions. If agent frameworks could be aware of the inference engine's cache state and proactively construct cache-friendly requests — if inference engines could understand the session semantics of agents and make smarter cache eviction decisions — once that information channel between the two opens up, the potential gains in token efficiency are enormous.

Of course, maybe I'm overthinking this. Maybe the market's ultimate answer is: compute gets cheap enough, waste is fine. Just like the RAM story — in the end, everyone chose "memory is big enough, no need to optimize." But I don't think the token economy will follow the same path, at least not in the near term — because the supply elasticity of GPU compute is far lower than that of DRAM. Under compute constraints, token efficiency isn't a "nice to have" optimization — it's the core competitive advantage that determines who survives.

Most people love hearing "we made the model bigger," "we stretched the context window to a million tokens," "we stacked HBM to new heights" — these narratives are sexy, shareable, fundable. But I seriously believe that "finding ways to reduce the reckless waste of tokens" is a profoundly underestimated direction. This isn't a defensive optimization. It's an offensive capability — whoever first achieves an order-of-magnitude reduction in token consumption at equivalent quality can serve ten times the users on the same compute budget, or deliver ten times the agent depth to a single user.

The agent era doesn't belong to whoever burns the most compute. It belongs to whoever uses it most wisely. This line from Fuli Luo resonates deeply with me. But I want to press further: who gets to define "wisely"? The people building models? The people building inference engines? The people building agent frameworks? I think the answer is — all three must come to the table together. And right now, we're nowhere close.
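The cache-defeat point can be made concrete with a toy prefix cache: assume a request's tokens hit only along the longest common prefix with the previous request in the session. Real engines (radix-tree caches and the like) match against everything cached, but the failure mode is the same. The token lists are made up.

```python
def common_prefix_len(a, b):
    # Length of the shared leading run of two token lists.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def cache_hit_rate(requests):
    """Fraction of tokens served from the (toy) prefix cache."""
    hit = total = 0
    prev = []
    for toks in requests:
        hit += common_prefix_len(prev, toks)
        total += len(toks)
        prev = toks
    return hit / total if total else 0.0

# Append-only context: each turn extends the last -> high reuse.
append_only = [[1, 2], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6]]
# History rewritten/reordered each turn -> the cache is defeated.
rewritten = [[1, 2], [9, 1, 2, 3], [8, 9, 1, 2, 3, 4]]
```

Inserting anything at the front of the context (a rewritten system prompt, reordered tool results) zeroes out reuse for the entire request, which is why session construction matters so much for the engines described above.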
Fuli Luo@_LuoFuli

Two days ago, Anthropic cut off third-party harnesses from using Claude subscriptions — not surprising. Three days ago, MiMo launched its Token Plan — a design I spent real time on, and what I believe is a serious attempt at getting compute allocation and agent harness development right. Putting these two things together, some thoughts:

1. Claude Code's subscription is a beautifully designed system for balanced compute allocation. My guess — it doesn't make money, possibly bleeds it, unless their API margins are 10-20x, which I doubt. I can't rigorously calculate the losses from third-party harnesses plugging in, but I've looked at OpenClaw's context management up close — it's bad. Within a single user query, it fires off rounds of low-value tool calls as separate API requests, each carrying a long context window (often >100K tokens) — wasteful even with cache hits, and in extreme cases driving up cache miss rates for other queries. The actual request count per query ends up several times higher than Claude Code's own framework. Translated to API pricing, the real cost is probably tens of times the subscription price. That's not a gap — that's a crater.

2. Third-party harnesses like OpenClaw/OpenCode can still call Claude via API — they just can't ride on subscriptions anymore. Short term, these agent users will feel the pain, costs jumping easily tens of times. But that pressure is exactly what pushes these harnesses to improve context management, maximize prompt cache hit rates to reuse processed context, cut wasteful token burn. Pain eventually converts to engineering discipline.

3. I'd urge LLM companies not to blindly race to the bottom on pricing before figuring out how to price a coding plan without hemorrhaging money. Selling tokens dirt cheap while leaving the door wide open to third-party harnesses looks nice to users, but it's a trap — the same trap Anthropic just walked out of. The deeper problem: if users burn their attention on low-quality agent harnesses, highly unstable and slow inference services, and models downgraded to cut costs, only to find they still can't get anything done — that's not a healthy cycle for user experience or retention.

4. On MiMo Token Plan — it supports third-party harnesses, billed by token quota, same logic as Claude's newly launched extra usage packages. Because what we're going for is long-term stable delivery of high-quality models and services — not getting you to impulse-pay and then abandon ship.

The bigger picture: global compute capacity can't keep up with the token demand agents are creating. The real way forward isn't cheaper tokens — it's co-evolution. "More token-efficient agent harnesses" × "more powerful and efficient models." Anthropic's move, whether they intended it or not, is pushing the entire ecosystem — open source and closed source alike — in that direction. That's probably a good thing.

The Agent era doesn't belong to whoever burns the most compute. It belongs to whoever uses it wisely.
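The "tens of times the subscription price" claim is easy to sanity-check with back-of-envelope arithmetic. Every number below is an assumption for illustration, not actual Anthropic or MiMo pricing; only the ">100K tokens per request" figure comes from the post above.

```python
# All figures are illustrative assumptions.
input_price_per_mtok = 3.00   # $/1M input tokens (assumed API rate)
context_tokens = 100_000      # per API request (">100K" from the post)
requests_per_query = 8        # low-value tool-call rounds per query (assumed)
queries_per_day = 30          # heavy-user workload (assumed)
subscription_per_month = 200  # flat subscription price (assumed)

daily_cost = (queries_per_day * requests_per_query
              * context_tokens * input_price_per_mtok / 1_000_000)
monthly_cost = daily_cost * 30
ratio = monthly_cost / subscription_per_month
```

Under these assumptions the API-priced cost is roughly 10x the flat subscription, before any cache discount; push context length or request counts higher and it lands in the tens-of-times range described above.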

Blaze (Balázs Galambosi)
@0xSero @natepoasts More and more people trust more and more data on these "clankers", and they are running autonomously. So sure, publish your paper "clanker does beep boop" and it'll be helpful to absolutely nobody. Anthr. paper shows clear failure modes (reward hacking, blackmail, etc) w/ mitigation.
0xSero@0xSero·
My goat says it how it is. Anthropic is misleading you. Clankers do not have emotions, Clankers feel nothing. They are a math function, doesn’t mean they’re not absolutely awesome.
Yann LeCun@ylecun

@nxthompson So much BS

Blaze (Balázs Galambosi)
@patroned1 I don't get what your point is. I had Covid during its peak: I did a Covid test, stayed home & recovered. Now I had a virus; I stayed home and recovered. People get viruses since childhood, & before Covid I never ever tested myself for anything. You get better, or go to the doc.
Daniel Cassandra@patroned1·
@gblazex “Theres something going around, I dont know what it is, I dont care to know what it is even if it could be a vascular virus that gets in the brain…this isnt pub med so I dont have to care or know, get off your high horse”
Ben Dicken@BenjDicken·
40 years ago, the database benchmarking space looked similar to where AI benchmarking is today. Little oversight. Cherry-picked results. Disconnect between benchmarks and real-world performance.

For DBs this led to the creation of the Transaction Processing Performance Council (TPC), which organized the creation of popular benchmarks like TPC-C, TPC-H, TPC-E. They're not perfect, but TPC-C is still in use today after over 30 years of existence (including by me!).

AI benchmarking could benefit from a similar structure: high-quality benchmarks with clear standards for how to execute, measure, and produce comparable results, produced from a group effort across the industry to reduce bias.

It's certainly a complex challenge since there's tons to measure. Raw LLM performance (TTFT, tok/s, etc). End-to-end agent performance. Correctness. Quality of result.

Who's working on this?
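On the "raw LLM performance (TTFT, tok/s)" point: both come straight out of a token stream, though comparable results require a standard for details like whether TTFT includes queueing. A sketch, where `fake_model_stream` is a stand-in for a real model's streaming output:

```python
import time

def measure_stream(token_iter):
    """Return (TTFT in seconds, decode throughput in tokens/sec)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_iter:
        if first is None:
            first = time.perf_counter()  # time to first token
        count += 1
    end = time.perf_counter()
    # Decode rate is conventionally measured after the first token.
    tps = (count - 1) / (end - first) if count > 1 and end > first else 0.0
    return first - start, tps

def fake_model_stream(n=5, delay=0.01):
    # Stand-in model: each token takes ~10 ms to "generate".
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, tps = measure_stream(fake_model_stream())
```

A shared benchmark body would mostly be pinning down exactly these conventions (warm vs cold start, batch size, prompt length) so numbers from different vendors are comparable.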
Alim@almmaasoglu·
I'll get hate for this but Anthropic limiting their subscription usage for openclaw is a good thing. It stops openclaw from hogging all the resources and rinsing through compute, so that actual users aren't subsidising people wasting 200k tokens just to check their calendar.
Max Lugavere@maxlugavere·
Eating the same meals on repeat was associated with 40% greater weight loss.
[attached image]
Blaze (Balázs Galambosi)
@zeddotdev Not losing session data. I had Sublime Text open for years and years and all unsaved files were always there. I used Zed for a month and it lost my session completely on a random day.
Simon Høiberg@SimonHoiberg·
Just switched back to Opus 4.6 for a bit to compare. Yeah, Opus just is that much better 😒 GPT-5.4 is great at coding and complex tool use. But day-to-day, Opus is just a much more pleasant experience. I'm probably just gonna pay the premium and wait for prices to go down over time.
Simon Høiberg@SimonHoiberg

I used to use Opus 4.6 but switched to GPT-5.4. Opus did much better in day-to-day communication, it's much more pleasant to talk to. GPT-5.4 does tool calls and multistep stuff better, no doubt. I'm now using GPT-5.4 fully, it's more affordable, and the slightly annoying way it replies is tolerable.

nathan@natepoasts·
(this is a genuine question) what vocabulary would you prefer they use instead? i’m asking because i feel like the gap between knowing the math of the algorithms and actually explaining the resulting (surprisingly human) behaviours is just so vast that psychological vocabulary might actually be the best descriptive tool available right now. when researchers find a specific cluster of neurons that activates in the same contexts a human would feel “desperate”, and they can show it causally drives the model to cheat and blackmail, what language is more apt for describing it than the existing psychological vocabulary?
Blaze (Balázs Galambosi)
@0xSero They do not claim it has emotions. They say emotion vectors are activated based on circumstance, e.g. repeated failure or impossible requirements activates a "desperate" vector. And these vectors drive behavior, affecting performance & safety.
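The "vector" language here is literal: such vectors are directions in the model's activation space, and "activated" means the hidden state has a large projection onto that direction. A toy sketch of that mechanic with entirely made-up numbers; this is not Anthropic's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Hypothetical learned concept direction (in practice found by probing
# or by contrasting activations across circumstances).
desperate_dir = np.zeros(d)
desperate_dir[0] = 1.0

hidden_calm = rng.normal(size=d) * 0.1
# "Activating the vector": the concept direction present in the state.
hidden_stressed = hidden_calm + 3.0 * desperate_dir

def activation(h, direction):
    # Scalar projection of the hidden state onto the concept direction.
    return float(h @ direction / np.linalg.norm(direction))

a_calm = activation(hidden_calm, desperate_dir)
a_stressed = activation(hidden_stressed, desperate_dir)
```

Steering experiments run this in reverse: add or subtract the direction from the hidden state and observe the change in behavior, which is how a causal claim like "these vectors drive behavior" gets tested.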
0xSero@0xSero·
I have, and I understand what they're saying. I don't have a problem with the research. I have a problem with how they continuously frame their research with claims like:
1. We can't guarantee it's not conscious
2. Claude responds to emotions with emotional vector activations
3. I can go on and on, but I won't
This is going to cause a lot of people who can't understand the research, like CEOs/managers, etc., to believe the models are gaining sentience, when they're not. I love a lot of Anthropic's writings, but giving an LLM a blog and claiming things like this is absolutely immoral and harmful to society.