Blaze (Balázs Galambosi)

4.7K posts

@gblazex

A Smooth Guy; Developer of SmoothScroll for macOS, Windows & Google Chrome.

Joined April 2010
1.5K Following · 1.2K Followers
Pinned Tweet
Blaze (Balázs Galambosi)
Looking further into LLM benchmark x-correlations:
- Top row: how each benchmark relates to human judgement (Arena Elo)
- Other rows: any benchmark pair & their relationship
- On the right: samples = # of models tested for each benchmark
thx: @chipro @maximelabonne @ldjconfirmed
[attached image: benchmark correlation matrix]
Andrej Karpathy@karpathy

@AlphaSignalAI @ClementDelangue I pretty much only trust two LLM evals right now: Chatbot Arena and r/LocalLlama comments section

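A cross-correlation matrix like the one described in the pinned post is a few lines of pandas. A minimal sketch only: the model names, benchmarks, and scores below are made-up placeholders, not the data behind the chart.

```python
import pandas as pd

# Hypothetical scores for five models on three benchmarks; NaN marks a
# benchmark a model was never run on. All numbers are invented.
scores = pd.DataFrame(
    {
        "ArenaElo": [1250, 1180, 1120, 1050, None],
        "MMLU":     [86.4, 79.0, 70.1, 63.2, 58.9],
        "GSM8K":    [92.0, 80.5, None, 57.3, 40.2],
    },
    index=["model-a", "model-b", "model-c", "model-d", "model-e"],
)

# Pairwise Pearson correlations, computed over the models tested on both
# benchmarks of each pair (pandas drops NaNs pairwise).
corr = scores.corr(method="pearson")

# "samples" per pair: how many models have scores for both benchmarks.
mask = scores.notna().astype(int)
samples = mask.T @ mask
```

Here `corr.loc["ArenaElo"]` would be the "top row" of the chart: each benchmark's relationship to Arena Elo, with `samples` giving the number of models behind each cell.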
Blaze (Balázs Galambosi)
@nickhistgeek yes I just tested it with some not so common languages (Hungarian, Persian, etc) and it voice cloned from 3s audio with crazy good quality!
Blaze (Balázs Galambosi)
I barely see OmniVoice 0.6B TTS mentioned outside Chinese twitter, even though it's the #1 trending TTS model on @huggingface. Supports a staggering 600 languages with zero-shot voice cloning! Apache 2.0 license.
Wildminder@wildmindai

New interesting TTS - OmniVoice.
- zero-shot TTS, 600+ languages!
- single-stage arch based on Qwen3-0.6B
- fast inference
- beats ElevenLabs v2
- +voice cloning +voice design
Sounds pretty natural. zhu-han.github.io/omnivoice/

Michael@michael_chomsky·
What's the fastest text to speech provider? I don't care about quality as much as latency being low even for thousands of words.
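Latency-first TTS shopping mostly comes down to measuring time-to-first-audio-chunk on each provider's streaming endpoint. A minimal sketch, where `fake_tts_stream` is a hypothetical stand-in for a real provider's streaming response body:

```python
import time

def time_to_first_chunk(chunks):
    """Return (seconds until the first chunk arrives, that first chunk).

    `chunks` is any iterator of audio byte chunks, e.g. a streaming
    HTTP response body from a TTS API.
    """
    start = time.perf_counter()
    first = next(iter(chunks))
    return time.perf_counter() - start, first

def fake_tts_stream():
    # Stand-in for a provider: first audio bytes arrive after ~50 ms.
    time.sleep(0.05)
    yield b"RIFF...audio..."
    yield b"...more audio..."

latency, chunk = time_to_first_chunk(fake_tts_stream())
```

For thousands of words you would also track sustained chunk throughput, since a provider can have a fast first byte but slow synthesis overall.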
Ruoyu Sun@RuoyuSun_UI·
Yes — closely related and we're glad to see this direction getting more attention. Two differences in our work (SePT):
1) Domain: we focus on math reasoning; Apple's paper focuses on coding.
2) Online Refresh: we use "online" interleaving of generation and training. After each update on self-generated responses, the updated model is used to generate the next batch of responses. We also tried an offline variant, where the base model generates all training data at once; overall it helps, but not as much as the online version.
Ruoyu Sun@RuoyuSun_UI·
We’re excited to share our work "A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning". An earlier version of this work has been on arXiv for a few months; we added more experiments and revised it under this new title.

The recipe is simple: the model samples its own responses at low temperature, learns from them with ordinary SFT training, and repeats. No reward. No verifier. No fancy objective beyond standard SFT.

On Qwen2.5-Math-7B, mean Pass@1 over 6 math benchmarks improves 22.7 → 39.5. Note that mean Pass@32 also improves 61.0 → 67.9, suggesting that this simple reward-free procedure unlocks more of the model's existing reasoning potential.

See the updated paper directly at: github.com/ElementQi/SePT… The arXiv link is: arxiv.org/abs/2510.18814 The updated version will appear on arXiv shortly. @Phanron_xli
[attached image]
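The recipe described above is essentially a two-line loop. A sketch of the data flow only: `generate` and `sft_train` are hypothetical stand-ins (here a version counter), not a real inference/training stack.

```python
def generate(model, prompt, temperature):
    # Stand-in: a real implementation samples a response from the LLM.
    return f"response to {prompt!r} at T={temperature}"

def sft_train(model, batch):
    # Stand-in: a real implementation runs ordinary SFT on `batch`.
    # Here "model" is just a counter so the loop structure is visible.
    return model + 1

def self_train(model, prompts, rounds=3, temperature=0.3):
    for _ in range(rounds):
        # Online refresh: the *current* model generates the next batch,
        # then is updated on its own responses. No reward, no verifier,
        # no objective beyond standard SFT.
        batch = [(p, generate(model, p, temperature)) for p in prompts]
        model = sft_train(model, batch)
    return model

final = self_train(model=0, prompts=["2+2=?", "sqrt(16)=?"])
```

The "online" variant is exactly this interleaving; the offline variant described above would generate all batches from the base model before any training step.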
Blaze (Balázs Galambosi) retweeted
Chayenne Zhao@GenAI_is_real·
We're Not Wasting Tokens — We're Wasting the Design Margin of the Entire Inference Stack

A few days ago I read a post by Fuli Luo on Twitter, discussing Anthropic's decision to cut off third-party harnesses (OpenClaw) from using Claude subscriptions, and the design thinking behind MiMo's Token Plan pricing. Her core argument: global compute capacity is seriously falling behind the token demand created by agents. The way forward isn't selling tokens cheaper in a race to the bottom — it's the co-evolution of "more efficient agent harnesses" and "more powerful, efficient models."

I read it several times over. People who build inference engines have long been frustrated by how wastefully agent frameworks burn through tokens. She articulated something the industry has tacitly acknowledged but rarely stated plainly — and she did it with precision and restraint: the compute allocation crisis we face today is not fundamentally about insufficient compute. It's about tokens being spent in the wrong places.

I want to push this one layer deeper, from my own perspective. I'm a heavy user of Claude Code — I make no attempt to hide that. You can check that all the latest code in SGLang Omni was built with Claude Code powering my workflow. Its commercial success is beyond question; it genuinely gave many people (myself included) their first real experience of "coding with an agent."

But I'm also an inference engine developer — my day job is figuring out how to push prefix cache hit rates higher, how to make KV cache memory layouts more efficient, how to drive down the cost of every single inference request. So when I plugged Claude Code into a local inference engine and started observing the actual request patterns it generates, my reaction was — how to put it — like a water engineer who spent months designing a conservation system, only to watch someone water their garden with a fire hose.

I measured Claude Code's cache hit rate on my local serving engine over the course of a day. The numbers were painful. This isn't a case of "decent but room to improve." It's a case of "the prefix cache mechanisms we carefully engineered at the inference layer are being almost entirely defeated."

Fuli Luo mentioned that OpenClaw's context management is poor — firing off multiple rounds of low-value tool calls within a single user query, each carrying over 100K tokens of context window. Frankly, Claude Code's own context management is nowhere near making proper use of prefix cache or any of the other optimizations we've built into inference engines. Many people have already noticed — for example, the resume feature has a bug that causes KV cache misses entirely, which is borderline absurd. I'll say it plainly: the way sessions construct their context was never seriously designed with cache reuse in mind from the start.

Perhaps Anthropic has internal trade-offs we can't see — after all, they control both ends of the stack, model and inference, and can theoretically do optimizations at the API layer that are invisible to us. But from the external behavior I can observe, enormous volumes of tokens are being spent on: re-transmitting already-processed context, re-parsing already-confirmed tool call results, and maintaining an ever-inflating conversation history with extremely low information density.

If this is merely to earn more on inference token charges, I find it genuinely regrettable. But many Claude Code users are on subscriptions — burning more tokens is fundamentally a cost burden for Anthropic, not revenue. I honestly don't understand what purpose such inefficient context management serves for Claude Code.

Here's a bold hypothesis: for those long sessions that consume 700K+ tokens, there is certainly a way to restructure the session's context so it accomplishes the exact same task with 10% of the tokens. Not by sacrificing quality, but through smarter context compression, more rational prefix reuse strategies, and more precise tool call scheduling. This isn't theoretical speculation — anyone who has worked on inference engine optimization, upon seeing current agent framework request patterns, would arrive at a similar conclusion.

Fuli Luo is right: global compute capacity can't keep up with the token demand agents are creating. But I'd add that a significant portion of that gap is an illusion of prosperity — artificial demand manufactured by the crude design of agent frameworks.

Here's an analogy I keep coming back to. I've always liked bringing up RAM bloat — in 1969, 64KB of memory sent Apollo to the moon. In 2026, I open a single webpage and 500MB of memory usage is nothing unusual. Every generation of hardware engineers pushes memory capacity higher, and every generation of software engineers lavishly fills it to the brim. People have gotten used to this cycle, even come to see it as the normal cost of progress.

But LLM inference is different. The cost of RAM bloat is your computer running a bit slower, spending a couple hundred bucks on a memory upgrade — users barely notice. The cost of token bloat is real money — GPU cluster electricity bills, user subscription fees, the industry's entire compute budget. And this cost scales exponentially as agent usage grows. If we don't establish the engineering discipline that "tokens should be used efficiently" in the early days of the agent era, the cost of catching up later, once scale kicks in, will be beyond imagination.

Fuli Luo notes that Anthropic cutting off third-party harness subscription access is objectively forcing these frameworks to improve their context management. I agree with that assessment, but my gut feeling is that this shouldn't stop at "third-party frameworks need to be more frugal with tokens." It should trigger a more fundamental reflection: what kind of agent-inference co-design do we actually need?

Right now, agent frameworks and inference engines are essentially fully decoupled — agent frameworks treat the inference engine as a stateless API, sending the full context with every request. Meanwhile, the inference engine does its best with prefix matching, caching whatever it can. This architecture is simple and general-purpose, but brutally inefficient for long sessions. If agent frameworks could be aware of the inference engine's cache state and proactively construct cache-friendly requests — if inference engines could understand the session semantics of agents and make smarter cache eviction decisions — once that information channel between the two opens up, the potential gains in token efficiency are enormous.

Of course, maybe I'm overthinking this. Maybe the market's ultimate answer is: compute gets cheap enough, waste is fine. Just like the RAM story — in the end, everyone chose "memory is big enough, no need to optimize." But I don't think the token economy will follow the same path, at least not in the near term — because the supply elasticity of GPU compute is far lower than that of DRAM. Under compute constraints, token efficiency isn't a "nice to have" optimization — it's the core competitive advantage that determines who survives.

Most people love hearing "we made the model bigger," "we stretched the context window to a million tokens," "we stacked HBM to new heights" — these narratives are sexy, shareable, fundable. But I seriously believe that "finding ways to reduce the reckless waste of tokens" is a profoundly underestimated direction. This isn't a defensive optimization. It's an offensive capability — whoever first achieves an order-of-magnitude reduction in token consumption at equivalent quality can serve ten times the users on the same compute budget, or deliver ten times the agent depth to a single user.

The agent era doesn't belong to whoever burns the most compute. It belongs to whoever uses it most wisely. This line from Fuli Luo resonates deeply with me. But I want to press further: who gets to define "wisely"? The people building models? The people building inference engines? The people building agent frameworks? I think the answer is — all three must come to the table together. And right now, we're nowhere close.
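The cache-defeat point can be made concrete with a toy prefix cache: assume a request's tokens hit only along the longest common prefix with the previous request in the session. Real engines (radix-tree caches and the like) match against everything cached, but the failure mode is the same. The token lists are made up.

```python
def common_prefix_len(a, b):
    # Length of the shared leading run of two token lists.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def cache_hit_rate(requests):
    """Fraction of tokens served from the (toy) prefix cache."""
    hit = total = 0
    prev = []
    for toks in requests:
        hit += common_prefix_len(prev, toks)
        total += len(toks)
        prev = toks
    return hit / total if total else 0.0

# Append-only context: each turn extends the last -> high reuse.
append_only = [[1, 2], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6]]
# History rewritten/reordered each turn -> the cache is defeated.
rewritten = [[1, 2], [9, 1, 2, 3], [8, 9, 1, 2, 3, 4]]
```

Inserting anything at the front of the context (a rewritten system prompt, reordered tool results) zeroes out reuse for the entire request, which is why session construction matters so much for the engines described above.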
Fuli Luo@_LuoFuli

Two days ago, Anthropic cut off third-party harnesses from using Claude subscriptions — not surprising. Three days ago, MiMo launched its Token Plan — a design I spent real time on, and what I believe is a serious attempt at getting compute allocation and agent harness development right. Putting these two things together, some thoughts:

1. Claude Code's subscription is a beautifully designed system for balanced compute allocation. My guess — it doesn't make money, possibly bleeds it, unless their API margins are 10-20x, which I doubt. I can't rigorously calculate the losses from third-party harnesses plugging in, but I've looked at OpenClaw's context management up close — it's bad. Within a single user query, it fires off rounds of low-value tool calls as separate API requests, each carrying a long context window (often >100K tokens) — wasteful even with cache hits, and in extreme cases driving up cache miss rates for other queries. The actual request count per query ends up several times higher than Claude Code's own framework. Translated to API pricing, the real cost is probably tens of times the subscription price. That's not a gap — that's a crater.

2. Third-party harnesses like OpenClaw/OpenCode can still call Claude via API — they just can't ride on subscriptions anymore. Short term, these agent users will feel the pain, costs jumping easily tens of times. But that pressure is exactly what pushes these harnesses to improve context management, maximize prompt cache hit rates to reuse processed context, cut wasteful token burn. Pain eventually converts to engineering discipline.

3. I'd urge LLM companies not to blindly race to the bottom on pricing before figuring out how to price a coding plan without hemorrhaging money. Selling tokens dirt cheap while leaving the door wide open to third-party harnesses looks nice to users, but it's a trap — the same trap Anthropic just walked out of. The deeper problem: if users burn their attention on low-quality agent harnesses, highly unstable and slow inference services, and models downgraded to cut costs, only to find they still can't get anything done — that's not a healthy cycle for user experience or retention.

4. On MiMo Token Plan — it supports third-party harnesses, billed by token quota, same logic as Claude's newly launched extra usage packages. Because what we're going for is long-term stable delivery of high-quality models and services — not getting you to impulse-pay and then abandon ship.

The bigger picture: global compute capacity can't keep up with the token demand agents are creating. The real way forward isn't cheaper tokens — it's co-evolution. "More token-efficient agent harnesses" × "more powerful and efficient models." Anthropic's move, whether they intended it or not, is pushing the entire ecosystem — open source and closed source alike — in that direction. That's probably a good thing.

The Agent era doesn't belong to whoever burns the most compute. It belongs to whoever uses it wisely.
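The "tens of times the subscription price" claim is easy to sanity-check with back-of-envelope arithmetic. Every number below is an assumption for illustration, not actual Anthropic or MiMo pricing; only the ">100K tokens per request" figure comes from the post above.

```python
# All figures are illustrative assumptions.
input_price_per_mtok = 3.00   # $/1M input tokens (assumed API rate)
context_tokens = 100_000      # per API request (">100K" from the post)
requests_per_query = 8        # low-value tool-call rounds per query (assumed)
queries_per_day = 30          # heavy-user workload (assumed)
subscription_per_month = 200  # flat subscription price (assumed)

daily_cost = (queries_per_day * requests_per_query
              * context_tokens * input_price_per_mtok / 1_000_000)
monthly_cost = daily_cost * 30
ratio = monthly_cost / subscription_per_month
```

Under these assumptions the API-priced cost is roughly 10x the flat subscription, before any cache discount; push context length or request counts higher and it lands in the tens-of-times range described above.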

Blaze (Balázs Galambosi)
@0xSero @natepoasts More and more people trust more and more data on these "clankers", and they are running autonomously. So sure, publish your paper "clanker does beep boop" and it'll be helpful to absolutely nobody. Anthr. paper shows clear failure modes (reward hacking, blackmail, etc) w/ mitigation.
0xSero@0xSero·
My goat says it how it is. Anthropic is misleading you. Clankers do not have emotions, Clankers feel nothing. They are a math function, doesn’t mean they’re not absolutely awesome.
Yann LeCun@ylecun

@nxthompson So much BS

Blaze (Balázs Galambosi)
@patroned1 I don't get what your point is. I had Covid during its peak: I did a Covid test, stayed home & recovered. Now I had a virus; I stayed home and recovered. People get viruses since childhood, & before Covid I never ever tested myself for anything. You get better, or go to the doc.
Daniel Cassandra@patroned1·
@gblazex “Theres something going around, I dont know what it is, I dont care to know what it is even if it could be a vascular virus that gets in the brain…this isnt pub med so I dont have to care or know, get off your high horse”
Ben Dicken@BenjDicken·
40 years ago, the database benchmarking space looked similar to where AI benchmarking is today. Little oversight. Cherry-picked results. Disconnect between benchmarks and real-world performance.

For DBs this led to the creation of the Transaction Processing Performance Council (TPC), which organized the creation of popular benchmarks like TPC-C, TPC-H, TPC-E. They're not perfect, but TPC-C is still in use today after over 30 years of existence (including by me!).

AI benchmarking could benefit from a similar structure: high-quality benchmarks with clear standards for how to execute, measure, and produce comparable results, produced from a group effort across the industry to reduce bias.

It's certainly a complex challenge since there's tons to measure. Raw LLM performance (TTFT, tok/s, etc). End-to-end agent performance. Correctness. Quality of result.

Who's working on this?
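On the "raw LLM performance (TTFT, tok/s)" point: both come straight out of a token stream, though comparable results require a standard for details like whether TTFT includes queueing. A sketch, where `fake_model_stream` is a stand-in for a real model's streaming output:

```python
import time

def measure_stream(token_iter):
    """Return (TTFT in seconds, decode throughput in tokens/sec)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_iter:
        if first is None:
            first = time.perf_counter()  # time to first token
        count += 1
    end = time.perf_counter()
    # Decode rate is conventionally measured after the first token.
    tps = (count - 1) / (end - first) if count > 1 and end > first else 0.0
    return first - start, tps

def fake_model_stream(n=5, delay=0.01):
    # Stand-in model: each token takes ~10 ms to "generate".
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, tps = measure_stream(fake_model_stream())
```

A shared benchmark body would mostly be pinning down exactly these conventions (warm vs cold start, batch size, prompt length) so numbers from different vendors are comparable.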
Alim@almmaasoglu·
I'll get hate for this but Anthropic limiting their subscription usage for openclaw is a good thing. It stops openclaw from hogging all the resources and rinsing through compute, so that actual users aren't subsidising people wasting 200k tokens just to check their calendar.
Max Lugavere@maxlugavere·
Eating the same meals on repeat was associated with 40% greater weight loss.
[attached image]
Blaze (Balázs Galambosi)
@zeddotdev Not losing session data. I had Sublime Text open for years and years and all unsaved files were always there. I used Zed for a month and it lost my session completely on a random day.
Simon Høiberg@SimonHoiberg·
Just switched back to Opus 4.6 for a bit to compare. Yeah, Opus just is that much better 😒 GPT-5.4 is great at coding and complex tool use. But day-to-day, Opus is just a much more pleasant experience. I'm probably just gonna pay the premium and wait for prices to go down over time.
Simon Høiberg@SimonHoiberg

I used to use Opus 4.6 but switched to GPT-5.4. Opus did much better in day-to-day communication, it's much more pleasant to talk to. GPT-5.4 does tool calls and multistep stuff better, no doubt. I'm now using GPT-5.4 fully, it's more affordable, and the slightly annoying way it replies is tolerable.

nathan@natepoasts·
(this is a genuine question) what vocabulary would you prefer they use instead? i’m asking because i feel like the gap between knowing the math of the algorithms and actually explaining the resulting (surprisingly human) behaviours is just so vast that psychological vocabulary might actually be the best descriptive tool available right now. when researchers find a specific cluster of neurons that activates in the same contexts a human would feel “desperate”, and they can show it causally drives the model to cheat and blackmail, what language is more apt for describing it than the existing psychological vocabulary?
Blaze (Balázs Galambosi)
@0xSero They do not claim it has emotions. They say emotion vectors are activated based on circumstance, e.g. repeated failure or impossible requirements activates a "desperate" vector. And these vectors drive behavior, affecting performance & safety.
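The "vector" language here is literal: such vectors are directions in the model's activation space, and "activated" means the hidden state has a large projection onto that direction. A toy sketch of that mechanic with entirely made-up numbers; this is not Anthropic's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Hypothetical learned concept direction (in practice found by probing
# or by contrasting activations across circumstances).
desperate_dir = np.zeros(d)
desperate_dir[0] = 1.0

hidden_calm = rng.normal(size=d) * 0.1
# "Activating the vector": the concept direction present in the state.
hidden_stressed = hidden_calm + 3.0 * desperate_dir

def activation(h, direction):
    # Scalar projection of the hidden state onto the concept direction.
    return float(h @ direction / np.linalg.norm(direction))

a_calm = activation(hidden_calm, desperate_dir)
a_stressed = activation(hidden_stressed, desperate_dir)
```

Steering experiments run this in reverse: add or subtract the direction from the hidden state and observe the change in behavior, which is how a causal claim like "these vectors drive behavior" gets tested.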
0xSero@0xSero·
I have, and I understand what they're saying. I don't have a problem with the research. I have a problem with how they continuously frame their research with claims like:
1. We can't guarantee it's not conscious
2. Claude responds to emotions with emotional vector activations
3. I can go on and on, but I won't
This is going to cause a lot of people who can't understand the research, like CEOs/managers, etc., to believe the models are gaining sentience, when they're not. I love a lot of Anthropic's writings, but giving an LLM a blog and claiming things like this is absolutely immoral and harmful to society.