Akshit
1.1K posts

Akshit
@akshitwt
assessing ai capabilities. ML @cambridge_uni. previously @precogatiiith, @iiit_hyderabad. futurebound.

Reaching scientific goals: expectation vs. reality, via Florian Aigner.

Are you up for a challenge? openai.com/parameter-golf

🌶️ take that I'll continue to stand by: Automated hill-climbing is useful, but won't lead to the biggest scientific breakthroughs. The real magic is in defining new hills to climb, or in coming up with fundamental, generalizable methods that help across hills, not in stacking tricks together to climb existing ones.

What's exciting is that if we automate the latter, it frees us to be more creative about the former. The question is: how do we get AI to assist us in brainstorming and enhance our creativity in finding new hills? This motivated our work on Training AI CoScientists, arxiv.org/abs/2512.23707. Will release some smol experiments on designing an AI co-explorer interface done with @akshitwt soon :)

I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of ~630 lines of code, then:
- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)

The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (i.e. lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. github.com/karpathy/autor…

Part code, part sci-fi, and a pinch of psychosis :)
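As a rough illustration of the loop described above, here's a minimal Python sketch. This is not the actual repo's code: `propose_edit`, `run_training`, and the `val_loss:` log format are my assumptions, and the commit/revert policy is just one reasonable reading of "accumulate commits as it finds better settings".

```python
import re
import subprocess

def parse_val_loss(log_text: str) -> float:
    """Pull the final 'val_loss: <float>' entry out of a training log."""
    matches = re.findall(r"val_loss:\s*([0-9.]+)", log_text)
    if not matches:
        raise ValueError("no val_loss found in log")
    return float(matches[-1])

def should_commit(new_loss: float, best_loss: float) -> bool:
    """Keep an edit only if it strictly improves validation loss."""
    return new_loss < best_loss

def agent_loop(n_iters: int, propose_edit, run_training) -> float:
    """Outer loop: propose an edit to train.py, run it under a fixed
    time budget, then commit the change or revert it."""
    best = float("inf")
    for _ in range(n_iters):
        propose_edit()            # agent rewrites train.py
        log = run_training()      # e.g. a 5-minute capped training run
        loss = parse_val_loss(log)
        if should_commit(loss, best):
            best = loss
            subprocess.run(["git", "commit", "-am", f"val_loss {loss:.4f}"])
        else:
            subprocess.run(["git", "checkout", "--", "train.py"])
    return best
```

Each dot in the plot would then be one `run_training()` call, and the git history on the feature branch is exactly the sequence of accepted improvements.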


⚡ Excited to announce Gemini 3.1 Flash-Lite! We’ve set a new standard for efficiency and capability to give developers our fastest, most cost-effective Gemini 3 model yet. We engineered this model with thinking levels, allowing it to handle high-volume queries instantly, while scaling up its reasoning for complex edge cases.

By the numbers:
⏱️ 2.5X faster time-to-first-token than 2.5 Flash while being significantly higher quality
📉 $0.25 per 1M input tokens
📊 1432 Elo on LMArena & 86.9% on GPQA Diamond

Thrilled to see what developers build with this kind of speed and quality at scale. Available now in Google AI Studio and Vertex AI. blog.google/innovation-and…

introducing a new, very fun LLM benchmark: the Game-of-Life Bench!

the rules are simple: given an 8x8 grid following Conway's Game of Life rules, the goal is to create an initial pattern with at most 32 live cells that lasts the longest number of turns before dying or repeating.

some results to highlight (with caveats detailed below):
- gpt 5.1 lasts the longest with a 106-step run
- claude models are really bad at this! they refuse to reason about this task and score < 25 points
- deepseek r1 is the best open model with 102 steps

why? because i wanted to create a benchmark that has (i think) no practicality, but is still fun to look at, cheap, and still measures something interesting. i'm also a big fan of the game of life; its absurdly simple rules leading to intractability is extremely cool to me. i've also seen a lot of work with LLMs trying to "predict" the next state in Conway's game of life; i think game-of-life bench is more fun because it's pretty open ended and only asks the LLM for the initial state. i also think this could be an RL env? but idk why you would ever train on this task haha

i don't think this is a "serious" benchmark because it doesn't measure anything practical, but i still think it's a hard benchmark exactly because you can't predict what happens with your initial state many turns into the future; this is why i was initially expecting all LLMs to be bad at it, but it turns out some are clearly better than others (the ordering may surprise you!)

reminder: this is still a work-in-progress:
(1) i am gpu-poor so could only do 10 runs for each model, even though the total running cost is relatively low. maybe with some more credits i can run more seeds for each model.
(2) i handpicked models which i think are at the frontier right now, plus some others that were on my mind. so, if you'd like to see a model on here, let me know.
(3) i currently only do an 8x8 grid because i thought that by itself would be pretty hard for current LLMs, but of course we can increase grid sizes!
(4) the coolest thing is, i don't think we can practically calculate the max possible number of steps (yay intractability!) you can go without repeating, so this is essentially a no-ceiling task, which is pretty cool!

again, i did this mostly out of a desire to make LLMs do something fun. if this keeps me entertained for a few more days, i'll likely release a blog post on it. if it keeps me entertained for a week (and someone sponsors me), i'll put more work into it :P

lastly, this is fully open sourced, so feel free to run this on your own!
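since the rules fit in a few lines, here's a minimal Python sketch of how such a scoring function could work. this is my own reading, not the actual repo's code: i'm assuming a bounded grid where off-grid cells are dead, and counting the generation at which the pattern dies or first revisits a past state as its score.

```python
from collections import Counter

N = 8  # the bench uses an 8x8 grid

def step(live: frozenset) -> frozenset:
    """One Game of Life generation; cells outside the N x N grid stay dead."""
    # Count how many live neighbours each candidate cell has.
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Standard B3/S23 rules: born with 3 neighbours, survive with 2 or 3.
    return frozenset(
        cell for cell, n in counts.items()
        if 0 <= cell[0] < N and 0 <= cell[1] < N
        and (n == 3 or (n == 2 and cell in live))
    )

def score(initial: frozenset, max_steps: int = 10_000) -> int:
    """Steps survived before the grid empties or repeats an earlier state."""
    assert len(initial) <= 32, "bench allows at most 32 live cells"
    seen = {initial}
    state = initial
    for t in range(1, max_steps + 1):
        state = step(state)
        if not state or state in seen:
            return t
        seen.add(state)
    return max_steps
```

under this reading, a blinker (three cells in a row) scores 2, since it returns to its starting state after two generations, and a 2x2 still-life block scores 1. whether the official scorer counts the dying/repeating step itself may differ, so treat this as one reasonable convention.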

BullshitBench v2 is out! It is one of the few benchmarks where models are generally not getting better (except Claude) and where reasoning isn't helping.

What's new: 100 new questions by domain (coding (40 Q's), medical (15), legal (15), finance (15), physics (15)), 70+ model variants tested. BullshitBench is already at 380 stars on GitHub - all questions, scripts, responses and judgements are there, so check it out.

TL;DR:
- Results replicated
- @AnthropicAI latest models are scoring exceptionally well
- @Alibaba_Qwen is another very strong performer
- OpenAI and Google models are not doing well and are not improving
- Domains do not show much difference - rates of BS detection are about the same across all domains
- Reasoning, if anything, has a negative effect
- Newer models don't do that much better than older ones (except Anthropic)

Links:
- Data explorer: petergpt.github.io/bullshit-bench…
- GitHub: github.com/petergpt/bulls…

Highly recommend the data explorer, where you can study the data and the questions & sample answers.

In November, we outlined our approach to deprecating and preserving older Claude models. We noted we were exploring keeping certain models available to the public post-retirement, and giving past models a way to pursue their interests. With Claude Opus 3, we’re doing both.
