Guilherme Favaron

9.1K posts

@guifav

The Tech Behind The Tech

São Paulo, Brasil · Joined January 2009
992 Following · 326 Followers
Guilherme Favaron
Guilherme Favaron@guifav·
Every team shipping a coding agent — Claude Code, Codex, Cursor — is really running a serving-systems problem. The "tech behind the tech" is the LLM-serving stack underneath, and until now nobody had real data on what that workload looks like. New arXiv (2606.30560) from @bariskasikci's SyFI lab (@UWSyFi, @uwcse) is the first large cross-provider trace of real coding-agent use: ~4,300 sessions, 350K LLM steps, 430K tool calls, 43 developers, 8 months, Claude Code + Codex. It breaks the intuition that agents mean long generations. The median step replays ~119K context tokens to emit just ~214 output tokens — two orders of magnitude more reading than writing. So the bill is the context, not the answer: prefix tokens are 59.5% of total cost. Tool calls are brutally long-tailed: 80+ tools, but the top 3 are 80%+ of calls, and the 4% of calls that run >1 min eat 85% of all tool time. And the prefix cache everyone leans on? 95.7% hit rate — yet misses cluster right after a human pauses to think, amplifying prefill 3.8x. Those human-gap misses alone are ~46% of fresh tokens and ~13% of spend. For technical leaders: your agent's cost and latency live in the loop, the replayed context, and the idle gaps — not raw token generation. Tune tool-call overhead, append-length-aware prefill, and KV-cache eviction around human gaps before you scale the fleet.
2
3
8
615
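A rough sketch of the per-step arithmetic in the post above, using its median figures (~119K context tokens replayed, ~214 output tokens). The prices in `Prices` are hypothetical placeholders, not numbers from the paper, and treating the whole context as either fully cached or fully fresh is a simplification made only to show why a cache miss after a human pause hurts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Hypothetical per-token prices in dollars (assumed, not from the paper)."""
    cached_input: float = 0.30e-6
    fresh_input: float = 3.00e-6
    output: float = 15.00e-6


def step_cost(context_tokens: int, output_tokens: int, cache_hit: bool,
              prices: Prices = Prices()) -> float:
    """Cost of one agent step: replayed context plus generated output."""
    input_price = prices.cached_input if cache_hit else prices.fresh_input
    return context_tokens * input_price + output_tokens * prices.output


# Median step reported in the post.
context, output = 119_000, 214
read_write_ratio = context / output

hit_cost = step_cost(context, output, cache_hit=True)
miss_cost = step_cost(context, output, cache_hit=False)

print(f"tokens read per token written: {read_write_ratio:.0f}")
print(f"step cost on cache hit:  ${hit_cost:.4f}")
print(f"step cost on cache miss: ${miss_cost:.4f}")
```

Under these assumed prices the context dominates the step cost in both cases, which is the post's point: the lever is how often the prefix has to be re-read at the fresh-token price.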
Guilherme Favaron
Guilherme Favaron@guifav·
You probably trust an LLM to grade another LLM. New arXiv paper (2606.28050) by @sambaranb at @AdobeResearch is a clean look at the tech behind the tech of every eval stack: judging is not actually easier than generating. In a controlled in-context QA test — the passage is the only source, so memory can't confound it — the same model produces more accurate answers than it can judge its own answers. Generation beats self-evaluation on 3 of 4 benchmarks; on HotpotQA the gap is 14 points (83% generate vs 69% judge). Why? Attention tells the story. When judging, the model attends to the context 3–5x less than when generating, and puts ~0.4% of its attention on the candidate answer it is supposed to be checking. It isn't re-reading the evidence — it pattern-matches a verdict from the instruction tokens. Even the obvious case breaks: verifying a number should reduce to an exact-match check, yet numeric self-evaluation is harder, not easier. And LoRA fine-tuning can't paper over it. Tuning for generation makes the judge say "Correct" too often (over-acceptance); tuning for evaluation degrades generation. The two skills draw on overlapping but distinct circuits. For technical leaders: LLM-as-judge, self-correction, and RLHF reward models all rest on "evaluation is the easy part." Measure your judge against generation before you let it gate production.
0
0
0
56
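One way to act on "measure your judge against generation" from the post above is to score both on the same items. This is a minimal sketch, not the paper's protocol: the `Record` type, exact-match scoring via `normalize`, and the sample data are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    gold: str             # reference answer
    answer: str           # what the model generated
    judged_correct: bool  # the same model's verdict on its own answer


def normalize(text: str) -> str:
    """Case- and whitespace-insensitive comparison key (assumed metric)."""
    return " ".join(text.lower().split())


def compare(records: list[Record]) -> dict[str, float]:
    """Generation accuracy, judge accuracy, and over-acceptance rate."""
    is_right = [normalize(r.answer) == normalize(r.gold) for r in records]
    n = len(records)
    wrong_verdicts = [r.judged_correct for r, ok in zip(records, is_right) if not ok]
    return {
        "generation_accuracy": sum(is_right) / n,
        # The judge is right when its verdict matches the true correctness.
        "judge_accuracy": sum(r.judged_correct == ok
                              for r, ok in zip(records, is_right)) / n,
        # Share of wrong answers the judge still labels "Correct".
        "over_acceptance": (sum(wrong_verdicts) / len(wrong_verdicts)
                            if wrong_verdicts else 0.0),
    }


records = [
    Record("Paris", "paris", True),
    Record("42", "41", True),      # wrong answer accepted by the judge
    Record("blue", "blue", False),  # right answer rejected by the judge
    Record("x", "x", True),
]
report = compare(records)
print(report)
```

If `judge_accuracy` sits below `generation_accuracy` on your own data, the judge is the weaker component and should not gate the stronger one unchecked.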
Guilherme Favaron
Guilherme Favaron@guifav·
Your time-series forecasts probably don't need a bigger model — they need better preprocessing. New arXiv paper (2606.27282) from @LearningLukeD, @JinglueXu & @SakanaAILabs shows the tech behind the tech: a closed-form Ridge regression, tuned only on preprocessing, matches or beats Transformer, MLP and CNN forecasters on 6 of 8 standard benchmarks — at a fraction of the cost. The field keeps scaling architectures, assuming capacity unlocks accuracy. But the popular linear forecasters (DLinear, NLinear...) were shown to collapse into one model class. If the model is effectively fixed, the real degrees of freedom live in the inputs. They search 4 preprocessing axes — context length, local normalization, regularization, augmentation — and find: - Optimal lookback is series-specific and often shrinks as the horizon grows (power-law +0.46 on ETTm2 to −0.19 on Exchange). Longer horizons don't always need longer history. - Normalize over a recent trailing slice of the window, not the whole context — almost universally better. - Series inside one dataset disagree; the right degree of cross-series sharing is dataset-specific. For technical leaders: before buying capacity, tune preprocessing on a cheap linear baseline — it can close most of the gap. The tuned knobs double as a diagnostic of your data's structure, the kind big models absorb silently.
0
0
0
24
Guilherme Favaron
@pmarca Absolutely, I've replaced Opus 4.8 with GLM-5.2 + code reviews from Codex and Kimi. Very good deal.
0
0
0
522
Marc Andreessen 🇺🇸
Many smart people/AI insiders are saying GLM-5.2 is the first Chinese AI model to match and often beat the American big lab public AI models with no compromises. Incredible timing given current events.
472
709
10.4K
911.9K
Guilherme Favaron
Your LLM stack quietly trusts one number: sequence probability. Best-of-N, self-consistency, verifier-free selection, even your choice of decoder all assume "more likely = more correct." New arXiv paper (2606.27359) by @johanneszenn & @jonasgeiping (@ELLISInst_Tue / @MPI_IS) shows the tech behind the tech: that assumption holds at only ONE of the four granularities you actually use it.
They correlate log-probability with correctness across 12 models, 6 benchmarks and 8 decoding methods, measured 4 ways:
- Across questions in a dataset: HOLDS. Likelier answers are more often right (MATH500 r=+0.96), so likelihood ranks easy vs hard prompts.
- Across decoding methods: BREAKS. No method reliably beats plain low-temperature sampling.
- Across a method's hyperparameters: BREAKS. Tuning for higher probability often makes answers LESS correct.
- Across retries of the same prompt: BREAKS. Per-prompt correlation is ~symmetric around zero, so probability-weighted self-consistency loses to plain majority voting.
It even flips by dataset: on IFEval the correlation is -0.61.
For technical leaders: treat sequence probability as a triage signal (abstention, routing, what to verify), not an optimization target. Don't pick decoders, tune knobs, or self-improve by chasing likelihood; validate on task accuracy. Verifier-free self-improvement only pays off once base accuracy is already high.
1
0
1
104
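To make the last comparison in the post above concrete, here is a minimal sketch of plain majority voting next to probability-weighted self-consistency. The sample answers and log-probabilities are made up for illustration, and weighting each vote by `exp(logprob)` is one common formulation, assumed here rather than taken from the paper.

```python
import math
from collections import Counter, defaultdict

# Each sample is (final_answer, sequence_log_probability).
Sample = tuple[str, float]


def majority_vote(samples: list[Sample]) -> str:
    """Pick the answer that appears most often, ignoring likelihood."""
    counts = Counter(answer for answer, _ in samples)
    return counts.most_common(1)[0][0]


def probability_weighted_vote(samples: list[Sample]) -> str:
    """Pick the answer with the largest summed sequence probability."""
    weights: dict[str, float] = defaultdict(float)
    for answer, logprob in samples:
        weights[answer] += math.exp(logprob)
    return max(weights, key=weights.get)


# Hypothetical retries of one prompt: three agree, one is a confident outlier.
samples = [("12", -9.0), ("12", -8.5), ("12", -9.2), ("15", -2.0)]

print("majority:", majority_vote(samples))
print("probability-weighted:", probability_weighted_vote(samples))
```

A single high-likelihood outlier can outvote several agreeing samples under probability weighting. If per-prompt likelihood does not track correctness, as the post reports, that extra weight is noise.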
Nous Research
Nous Research@NousResearch·
The strongest models are gated and access is granted only to a select few. Hermes Agent now exposes MoA presets as virtual models, giving you capabilities beyond the publicly available frontier: 8% higher than Opus 4.8 and 11% higher than GPT 5.5 on our upcoming benchmark.
326
610
6.3K
1.8M
Guilherme Favaron
Everyone stacks models to beat a single one - routing, voting, cascades, mixture-of-agents. New arXiv paper (2606.27288) by @josefchen of @KAIKAKU_AI shows the tech behind the tech: the gain is capped by a number almost nobody reports. Teams decide whether to orchestrate by reading one stat - pairwise error correlation (rho). Low rho, diversity pays. The paper proves rho is the wrong number. What sets the ceiling is beta: the rate at which EVERY model is wrong on the same query. No router, vote, or cascade can beat accuracy 1 - beta. And rho literally cannot see beta. Across 67 frontier models from 21 providers (GPT-5.5, Opus 4.8, Gemini 3.1 Pro, Grok-4.3, DeepSeek V4...): - Open-ended math: all 67 miss the same problem 5.2% of the time - 2.5x more than a correctly calibrated model predicts. - Code: beta = 7.9%. Free-response GPQA: 12.7%. - The same GPQA questions as multiple-choice: beta ~ 0. Co-failure lives in the answer format, not the subject. The kicker: a Clopper-Pearson bound on beta turns ONE graded query set into a $0, pre-deployment certificate on the most any router could ever win - before you build it. For technical leaders: measure your co-failure tail before you invest in orchestration. Gains come from models failing on DIFFERENT questions, not from adding more models.
0
0
2
51
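The co-failure rate beta from the post above can be estimated from a graded query set, with an exact binomial (Clopper-Pearson) interval around it. This is a sketch under assumptions: the tiny `correct` matrix is invented, the interval is found by bisection on the binomial CDF rather than any method from the paper, and using the lower bound on beta to cap router accuracy is my reading of the "certificate" idea.

```python
from math import comb


def co_failure_counts(correct: list[list[bool]]) -> tuple[int, int]:
    """correct[q][m] is True when model m answered query q correctly.

    Returns (queries every model missed, total queries).
    """
    all_wrong = sum(1 for row in correct if not any(row))
    return all_wrong, len(correct)


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) interval for a binomial proportion."""
    lower, upper = 0.0, 1.0
    if k > 0:
        lo, hi = 0.0, k / n
        for _ in range(80):  # bisection on P(X >= k | p) = alpha / 2
            mid = (lo + hi) / 2
            if 1.0 - binom_cdf(k - 1, n, mid) < alpha / 2:
                lo = mid
            else:
                hi = mid
        lower = lo
    if k < n:
        lo, hi = k / n, 1.0
        for _ in range(80):  # bisection on P(X <= k | p) = alpha / 2
            mid = (lo + hi) / 2
            if binom_cdf(k, n, mid) > alpha / 2:
                lo = mid
            else:
                hi = mid
        upper = hi
    return lower, upper


# Hypothetical grades: 5 queries x 3 models; the last query defeats everyone.
correct = [
    [True, False, True],
    [False, True, True],
    [True, True, True],
    [True, False, False],
    [False, False, False],
]
k, n = co_failure_counts(correct)
beta_low, beta_high = clopper_pearson(k, n)
print(f"beta estimate: {k / n:.2f}, 95% interval: [{beta_low:.3f}, {beta_high:.3f}]")
print(f"router accuracy ceiling (using lower bound on beta): {1 - beta_low:.3f}")
```

With only five queries the interval is very wide, which is the practical takeaway: the certificate is only as tight as the graded set is large.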
Guilherme Favaron
@emollick I believe that, at the moment, we are on track to have a fable equivalent in open-source models within 18 months. If the US government makes this move, it will accelerate that.
0
0
0
113
Ethan Mollick
Ethan Mollick@emollick·
As this post points out, contrary to what many say, the US government could absolutely effectively ban open weights models. That doesn’t mean you won’t be able to download the weights & run them, but they can ensure that no US company would use or provide access or host them
prinz@deredleritt3r

@lu_sichu Ban on enterprise use of non-approved models + severe criminal penalties for using a non-approved model in the U.S. with intent to harm U.S. persons or property. This would be combined with the requirement that all models exceeding certain capabilities be approved by the USG.

104
65
822
364.7K
Guilherme Favaron
Everyone obsesses over the model. The real lever is the data. New arXiv paper (2606.25996) from @AIatMeta - @jaseweston, @uralik1, @j_foerst et al - "Autodata: an agentic data scientist," shows the tech behind the tech of synthetic data: an AI agent that builds your training data, then optimizes itself to build it better. The inner loop (Agentic Self-Instruct): a main agent directs 4 subagents - a Challenger writes a question, a Weak and a Strong solver attempt it, a Judge scores both. If the weak model finds it too easy or too hard, the agent rewrites the recipe and retries. The goal isn't "harder," it's "just right" - questions the weak model can actually hill-climb on. Results, all Qwen3.5-4B + GRPO, same data budget, only the data-creation loop changed: - Legal: 4B trained on agentic data scores 0.441, beating a 397B model at 0.404 - a model 62x larger. - Math: agentic data (+3.20% avg@8) beats even 2x the data combined (+2.70%). Then the outer loop: meta-optimize the data scientist itself. Treating its scaffold as code to evolve, it lifts validation pass rate 62.1% -> 79.6%, auto-discovering rules like "if a solver could answer without reading the paper, it's too easy." For technical leaders: data quality is turning into a search problem you spend inference compute on - and can automate.
0
0
2
107
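The inner loop described in the post above (Challenger, Weak solver, Strong solver, Judge, then rewrite the recipe) can be sketched as a small control loop. Everything here is hypothetical scaffolding: the callables, the `low`/`high` acceptance band, the "strong must beat weak" rule, and the toy stubs are assumptions, not the Autodata implementation.

```python
from typing import Callable, Optional, TypeVar

Recipe = TypeVar("Recipe")
Question = TypeVar("Question")
Answer = TypeVar("Answer")


def agentic_self_instruct(
    recipe: Recipe,
    challenger: Callable[[Recipe], Question],
    weak_solver: Callable[[Question], Answer],
    strong_solver: Callable[[Question], Answer],
    judge: Callable[[Question, Answer], float],
    revise: Callable[[Recipe, float, float], Recipe],
    low: float = 0.2,
    high: float = 0.8,
    max_rounds: int = 5,
) -> Optional[Question]:
    """Return a question that is neither too easy nor too hard for the weak solver."""
    for _ in range(max_rounds):
        question = challenger(recipe)
        weak_score = judge(question, weak_solver(question))
        strong_score = judge(question, strong_solver(question))
        # Assumed acceptance rule: learnable for the weak model, solvable by the strong one.
        if low <= weak_score <= high and strong_score > weak_score:
            return question
        recipe = revise(recipe, weak_score, strong_score)
    return None


# Toy stubs: the recipe is a difficulty level and the question echoes it.
def toy_judge(question: int, answer: tuple[str, int]) -> float:
    solver, _ = answer
    return 1.0 if solver == "strong" else max(0.0, 1.0 - 0.3 * question)


accepted = agentic_self_instruct(
    recipe=0,
    challenger=lambda difficulty: difficulty,
    weak_solver=lambda q: ("weak", q),
    strong_solver=lambda q: ("strong", q),
    judge=toy_judge,
    revise=lambda difficulty, weak, strong: difficulty + 1,
)
print("accepted difficulty:", accepted)
```

In the toy run, difficulty 0 is rejected as too easy (the weak solver scores 1.0) and difficulty 1 is accepted, mirroring the "just right" target the post describes.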
Guilherme Favaron
Two reflexes when a production model goes stale: keep fine-tuning it on new data, and reach for a bigger base model. New arXiv paper (2606.24752) from @ZyphraAI (@BerenMillidge), "Can Scale Save Us From Plasticity Loss in LLMs?", says both have a hidden ceiling. The tech behind the tech of continual learning. Plasticity loss is NOT catastrophic forgetting. Forgetting = losing old knowledge. Plasticity loss = the network slowly losing the ability to LEARN anything new, however much you train it. They cycle GPT-style models (5M-314M params) through 8 languages, 5B tokens each, then probe how fast each adapts to a held-out language. Across every size, adaptation decays over time. The catch for "just use a bigger model": the onset of plasticity loss follows a scaling law, T is proportional to P^0.83. Sublinear. Doubling parameters delays the rot, but with sharp diminishing returns - scale buys time, not a cure. Worse: it shows up even under normal STATIONARY pretraining, not just abrupt task switches. Telltale signs inside the net: dead MLP units (>95% in one layer), collapsed and lazy attention heads, ballooning weight norms. No single smoking gun yet. For technical leaders: continual fine-tuning carries a cost nobody prices in - the model gets worse at learning, not just at remembering. Bigger only postpones it.
0
0
1
122
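A quick worked reading of the scaling law quoted in the post above, taking the exponent 0.83 at face value (the constant of proportionality cancels, so only ratios are shown; the rounded values are my own arithmetic, not figures from the paper):

```latex
T \propto P^{0.83}
\quad\Longrightarrow\quad
\frac{T(2P)}{T(P)} = 2^{0.83} \approx 1.78,
\qquad
\frac{T(P')}{T(P)} = 2 \;\Longrightarrow\; \frac{P'}{P} = 2^{1/0.83} \approx 2.3
```

So doubling the parameter count delays the onset of plasticity loss by roughly 1.78x rather than 2x, and doubling the onset time takes about 2.3x the parameters.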
Guilherme Favaron
Guilherme Favaron@guifav·
@0xMovez Good, but it still suggests music I don't like. I can't understand the how/why of their shipping performance. It's a good presentation, but the question is: is it worth it? The API billing must be huge...
0
0
0
3
Movez
Movez@0xMovez·
Spotify's Chief Architect just showed how they ship 4.5K deployments/day with Claude, on stage at Anthropic. 27 minutes. Free. By the #1 music app's dev: "More than 99% of our engineers use AI coding tools. Adoption took off after Opus 4.5." Worth more than any $500 vibe-coding course.
Movez@0xMovez

Creator of Claude Code just dropped a 6-min workshop on new Claude feature during live session in London. Boris Cherny: “A lot of my code these days is written by "routines". I’m not doing the prompting - I create the routines that do the prompting.” 6 minutes. Free. From a live session. Watch this now. This will change the way you vibe-code forever.

174
529
6.9K
2.6M
Andrej Karpathy
Andrej Karpathy@karpathy·
This is a new paradigm for interacting with Claude that is significantly more "inline" with all the other human activity org-wide. Once you do all of the under the hood engineering work to make this "just work" (e.g. across tools, integrations, compute environments, memory, security, etc.), Claude basically joins the team in a seamless way - you can talk to it as you would talk to a person and it can help with a very large variety of workloads. Imo this is the 3rd major redesign of LLM UIUX. The first paradigm was that the LLM is a website you go to, the second was that it is an app you download to your computer. This third one is that it is a self-contained, persistent, asynchronous entity with org-wide tools and context, working alongside teams of humans. It really takes a while to wrap your head around it, but it works and it is awesome.
Claude@claudeai

Introducing Claude Tag, a new way for teams to work with Claude. In Slack, Claude joins as a team member with access to the channels and tools you choose. Tag Claude in and delegate tasks to it while you focus on other work.

1.3K
1.9K
22.8K
7.8M
Claude
Claude@claudeai·
Introducing Claude Tag, a new way for teams to work with Claude. In Slack, Claude joins as a team member with access to the channels and tools you choose. Tag Claude in and delegate tasks to it while you focus on other work.
1.6K
2.2K
28.4K
20.1M
Guilherme Favaron
Guilherme Favaron@guifav·
Every model spec brags about a "128K context window." Far fewer ask whether the model can actually reason across 128K tokens, or whether that number is just a spec sheet. New arXiv paper (2606.23687), "Randomized YaRN Improves Length Generalization for Long-Context Reasoning," from @NYU_Courant (@gregd_nlp, @fangcong_y10593). The tech behind the tech of long context. The catch: extending a context window with more long data or inference-time YaRN mostly brings certain long sequences in-distribution. It does not teach the model to GENERALIZE. Adding YaRN only at inference lifts far-context reasoning by ~1%. The model has simply never seen the positional rotations that 128K tokens produce. The move: 1) During SHORT-context training (<8K), give each token the YaRN positional encoding of a position randomly sampled from a much larger range, sorted to preserve order. The model meets the out-of-distribution positions it will hit at 128K, on cheap short data. 2) Grow that sampling range on a length curriculum (8K, 16K, 24K...). Drop the curriculum and OOD accuracy falls by up to 18 points. Result: trained on <8K context (MRCR: just 60 examples), it reasons at 16K-128K. On the 64-128K bin, 68.8% vs 31.7% for vanilla LoRA. BABILong at 128K: 83.9% vs 63.0%. For technical leaders: a context-length number is a spec, not a guarantee. Generalization is trained, not advertised.
0
0
1
39
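Step 1 and step 2 of the post above (random positions from a larger range, sorted to preserve order, with the range grown on a curriculum) can be sketched as follows. This covers only the position sampling; applying the YaRN encoding to those positions is not shown. Sampling without replacement and the specific function name are assumptions.

```python
import random


def sample_sorted_positions(seq_len: int, max_position: int,
                            rng: random.Random) -> list[int]:
    """Draw `seq_len` distinct position ids from [0, max_position), in order.

    A short training sequence then carries position ids that span a much
    longer context, while token order is preserved.
    """
    if seq_len > max_position:
        raise ValueError("seq_len cannot exceed the sampling range")
    return sorted(rng.sample(range(max_position), seq_len))


rng = random.Random(0)
seq_len = 8  # toy length; the post trains on sequences under 8K tokens

# Length curriculum from the post: grow the sampling range stage by stage.
curriculum = [8 * 1024, 16 * 1024, 24 * 1024]
for max_position in curriculum:
    positions = sample_sorted_positions(seq_len, max_position, rng)
    print(f"range {max_position:>6}: {positions}")

positions_128k = sample_sorted_positions(seq_len, 128 * 1024, rng)
```

The position ids replace the usual `0..seq_len-1` when computing the positional encoding, so the model sees far-apart rotations without paying for long sequences.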
Guilherme Favaron
Guilherme Favaron@guifav·
Everyone shipping AI agents bolts on a guardrail: a PII detector, a declassifier, a reference monitor that blocks a risky tool call before it runs. Almost nobody asks whether the math behind that guardrail is sound. New arXiv paper (2606.20510), "Efficient and Sound Probabilistic Verification for AI Agents," from @GoogleDeepMind / Penn / Wisconsin (@alaia_solko, @DjDvij, @jhasomesh, @christodorescu). The tech behind the tech of agent security. The catch: those detectors are probabilistic - each call has a failure probability. Prior formal verification assumes either deterministic predicates or independent ones (Weighted Model Counting). In tool chaining they are correlated, so independence UNDERESTIMATES risk. Their example: WMC computes 27.2% leakage and permits the action under a 30% block threshold - a silent false negative that leaks data. The move: 1) Model the agent trajectory as a Datalog derivation graph; bound the worst-case violation probability over ALL joint distributions consistent with the marginals - sound, no independence assumption (distributionally robust). 2) The exact LP is too slow for runtime, so relax to a polynomial-size SDP tracking only second-order moments - provably a strict upper bound, low overhead. On Intercode, ATBench and Praline (terminal + tool-calling agents) it beats prior art on the security-utility tradeoff. For technical leaders: a guardrail you cannot bound is a latency cost, not a guarantee.
0
0
1
81
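The post above says that assuming independence underestimates risk when detector failures are correlated. A minimal way to see this is to compare the independence estimate with the classical Fréchet worst-case bounds, which hold for any joint distribution with the given marginals. This is not the paper's LP or SDP; the failure probabilities and the block threshold below are hypothetical.

```python
from math import prod


def all_fail_independent(p: list[float]) -> float:
    """P(every detector fails) if failures were independent."""
    return prod(p)


def all_fail_worst_case(p: list[float]) -> float:
    """Fréchet upper bound on P(every detector fails): fully correlated failures."""
    return min(p)


def any_fail_independent(p: list[float]) -> float:
    """P(at least one detector fails) if failures were independent."""
    return 1.0 - prod(1.0 - x for x in p)


def any_fail_worst_case(p: list[float]) -> float:
    """Fréchet upper bound on P(at least one fails): disjoint failures."""
    return min(1.0, sum(p))


# Hypothetical marginals: two chained detectors, each failing 5% of the time.
marginals = [0.05, 0.05]
threshold = 0.01  # hypothetical: block when leak probability exceeds 1%

independent = all_fail_independent(marginals)
worst = all_fail_worst_case(marginals)
print(f"both fail, independence: {independent:.4f} -> "
      f"{'block' if independent > threshold else 'permit'}")
print(f"both fail, worst case:   {worst:.4f} -> "
      f"{'block' if worst > threshold else 'permit'}")
```

Here independence permits the action while the worst-case bound blocks it, the same kind of silent false negative the post describes. The paper's contribution, per the post, is computing such sound bounds over whole derivation graphs at runtime cost.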
Guilherme Favaron
Guilherme Favaron@guifav·
Everyone tuning LLM serving optimizes the KV cache. Almost nobody asks whether the KV cache is even the whole state you need to restore. New arXiv paper (2606.20537), "Execution-State Capsules" / FlashRT, by Liang Su. The tech behind the tech of low-latency, on-device agent and robot serving. The catch: @vllm_project's PagedAttention and @lmsysorg's SGLang RadixAttention reuse prefix work through paged/radix KV caches — but that is only ONE positionally-addressed fragment of execution state. Block-table indirection enables reuse, yet the captured CUDA graph is never a self-contained, freezable snapshot of the whole forward pass. FlashRT flips the regime for single-stream physical-AI loops (coding agents, TTS, robot policies) that branch, reset, interrupt, re-enter: 1) Run the graph over contiguous static buffers, no block-table indirection. Cold TTFT 2.6–2.8x below vLLM — a measured latency floor. 2) Live state is now a closed, named buffer set, so freeze it: the execution-state capsule. snapshot / restore / fork / rollback become one buffer copy. Prefix reuse turns from recompute (compute-bound) into a sub-millisecond bandwidth copy. TTFT speedup widens with prefix: 3.9x at 2k → 27x at 16k. Restore is byte-exact: token-identical LLM decode, cosine 1.0 robot action replay. For technical leaders: the win is not a bigger cache. It is making the COMPLETE execution state a first-class, freezable object.
0
0
1
68
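The "closed, named buffer set" idea in the post above can be illustrated with a toy in host memory: once all live state sits in fixed, named buffers, snapshot, restore, and fork reduce to buffer copies. This is a conceptual sketch only. The `ToyRuntime` class, buffer names, and sizes are invented, and nothing here reflects FlashRT's actual API or GPU memory handling.

```python
class ToyRuntime:
    """Holds all live state in a fixed set of named, contiguous buffers."""

    def __init__(self, buffer_sizes: dict[str, int]) -> None:
        self.buffers = {name: bytearray(size) for name, size in buffer_sizes.items()}

    def snapshot(self) -> dict[str, bytes]:
        """Freeze the complete state as an immutable capsule (one copy per buffer)."""
        return {name: bytes(buf) for name, buf in self.buffers.items()}

    def restore(self, capsule: dict[str, bytes]) -> None:
        """Overwrite live buffers in place with a previously frozen capsule."""
        for name, data in capsule.items():
            self.buffers[name][:] = data

    def fork(self) -> "ToyRuntime":
        """Create an independent runtime that starts from the current state."""
        child = ToyRuntime({name: len(buf) for name, buf in self.buffers.items()})
        child.restore(self.snapshot())
        return child


# Hypothetical buffer names standing in for KV cache and decode state.
runtime = ToyRuntime({"kv_cache": 16, "decode_state": 4})
runtime.buffers["kv_cache"][0:3] = b"abc"  # pretend a prefix was processed

capsule = runtime.snapshot()
runtime.buffers["kv_cache"][0:3] = b"xyz"  # the loop branches and mutates state
branch = runtime.fork()

runtime.restore(capsule)  # rollback to the frozen prefix
print(bytes(runtime.buffers["kv_cache"][:3]), bytes(branch.buffers["kv_cache"][:3]))
```

Restoring is byte-exact by construction because the capsule is a plain copy. The cost is proportional to buffer size rather than to the compute needed to rebuild the prefix.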
How To Prompt
How To Prompt@HowToPrompt__·
Google has silently released an AI that predicts the future. it's called TimesFM and it forecasts literally any pattern with numbers like sales, stock prices, web traffic, energy demand, even crypto volatility. → trained on 100B real-world data points → zero-shot. no fine-tuning needed. → runs 100% locally 100% open source.
36
128
875
85.8K
Guilherme Favaron
Guilherme Favaron@guifav·
Everyone calls generative recommendation "an LLM that predicts your next click." Almost nobody talks about the layer that decides whether it works at all: item tokenization. New arXiv paper (2606.20554), G2Rec, from @UofIllinois + Meta @AIatMeta — Ruizhong Qiu (@Ruizhong_Qiu), @YinglongXia, @DongqiFu_UIUC et al. The tech behind the tech of the feed/ads engine. The catch: a generative recommender is only as good as its tokens. Standard semantic IDs lean on heuristics and ignore behavior; graph methods that capture co-engagement either blow up at O(M^2) or see only local structure. G2Rec derives tokens from behavior, at scale: 1) Build a sparsified item-item co-engagement graph, O(M log M) edges, provably preserving structure vs the quadratic full graph. 2) Run a differentiable "soft" graph-clustering objective (generalized modularity, ~O(M log M)/iter, GPU). Each item gets a soft membership over interest prototypes, no ground-truth labels. 3) Tokenize those interest profiles + items to train the generative sequence model. Results: +10% Recall@1 (Beauty), +12% Recall@5 (Sports), +23.8% clustering modularity. Live A/B on Meta surfaces: solid wins on time-spent, likes, shares. For technical leaders: in generative recsys, the model is the easy part. The tokenizer is the product.
0
0
1
47