Akshit

1.1K posts

@akshitwt

ml @cambridge_uni. previously @precogatiiith, @iiit_hyderabad. futurebound.

23 // del | cam
Joined June 2023
858 Following · 3.3K Followers
Pinned Tweet
Akshit
Akshit@akshitwt·
a skill that i am really proud of is my ability to iterate on experiments fast, and write "good" code. writing code is an important skill to have as a researcher, and in this post i discuss some tips to hopefully help you get better at it!
19
37
777
65.2K
Akshit retweeted
Akshit
Akshit@akshitwt·
adding a very popular gem to my tech collection thanks to @Prolific! my dad will enjoy this a lot now that he spends his time toying with claude and codex :p Another perk of attending conferences like ICLR :)
0
0
12
389
Akshit
Akshit@akshitwt·
unbelievably good series on how traditional RL evolved into everything we see today!!! my rule of thumb for good educational content is combining technical details with the context/history surrounding them; it gives a much richer understanding of why things are the way they are
4
14
190
5.7K
LearningPoint (24x7 Chatterboxes)
Almost everyone i know from IIITH was on the software Engg/ Engg manager/ lead track all the way past 35 None of the diversity and elite roles of CS/EC/mnc grads from IIT/BITs.. quant finance, VCs, product managers, business leadership roles, research post PhD, M7 MBAs - MBB etc
9
4
150
13.7K
Akshit retweeted
Ryan Bahlous-Boldi
Ryan Bahlous-Boldi@RyanBoldi·
Your RL post-training may be sabotaging your LLM’s test-time scaling! Conventional RL pretends that you can collapse all reward signals *upfront* into a single *scalar reward*. We introduce Vector Policy Optimization (VPO), which natively maximizes *vector-valued* rewards, boosting test time search performance, even on the original scalar.
35
120
846
203K
Akshit
Akshit@akshitwt·
@ShashwatGoel7 is this only applicable for pretraining? or further training also
1
0
0
259
Akshit retweeted
Maksym Andriushchenko
Maksym Andriushchenko@maksym_andr·
💥Today we release InferenceBench, our next benchmark after PostTrainBench that measures progress on AI R&D automation. AI R&D automation will very likely unfold gradually, starting from “boring” tasks like inference speed optimization that are very easily verifiable (accuracy + inference time). We show a rather negative result for current frontier agents. They are not good at system-level engineering and managing complex dependencies. They do show non-trivial performance, but they fail compared to a simple baseline: hyperparameter tuning of vLLM/SGLang hyperparameters. Importantly, InferenceBench tests *open-ended* inference optimization capabilities. This is different from more narrow benchmarks like KernelBench that only let agents optimize kernels (which is a very valuable task, too!). The benchmark is intentionally open-ended, so the poor performance of the agents is not an underelicitation issue. The agents have everything needed to succeed, but they still fail because they are not yet reliable enough for this task. Our results suggest an inverse scaling phenomenon: Claude Sonnet 4.6 and GLM-5 rank highly because they more often preserve simple, valid, high-performing final servers, while several larger models show stronger peak runs but lose utility through brittle final-state choices. This contrasts with benchmarks where rankings track raw capability (e.g., SWE-Bench, Terminal-Bench, PostTrainBench, FrontierSWE). One of the primary bottlenecks we have clearly observed is the lack of diversity of strategies: nearly all agents just use vLLM, without exploring alternatives. Overall, proper exploration is lacking: the current agents are not ready to tackle broad enough goals and get stuck after the first found solution (such as vLLM). I’m sure future agents will do much better, but here is where we are now. This benchmark is our 2nd one in a suite of benchmarks that will track the progress on AI R&D automation. 
We will develop many more benchmarks that will cover different aspects of AI R&D automation, culminating in recursive self-improvement. Stay tuned!
12
47
349
40K
Akshit retweeted
METR
METR@METR_Evals·
Could an AI company lose control of its own agents? To find out, Anthropic, Google, Meta, and OpenAI let us (1) test their best internal models with CoT access, (2) review non-public info about capabilities, alignment, and control. The result: our first Frontier Risk Report.
29
190
858
289.2K
Maksym Andriushchenko
Maksym Andriushchenko@maksym_andr·
@akshitwt it's actually interesting that Anthropic, apparently, just doesn't care about this :-) like if it's not something that advances automated AI research (i.e., things like speech recognition, voice mode, image gen, etc.), it doesn't really matter.
3
0
3
463
Akshit
Akshit@akshitwt·
just had to try the claude voice mode (because i ran out of wispr flow usage) and man... that is by far the worst product i've used by anthropic. could not be worse if they tried. would not recommend
3
0
9
1.2K
Akshit retweeted
Shashwat Goel
Shashwat Goel@ShashwatGoel7·
Continual learning is bottlenecked by realistic evaluations Introducing FutureSim, which replays real-world events in the temporal order they occurred We benchmark frontier agents at updating predictions about how our world evolves, in native harnesses like Codex, Claude Code
21
65
528
110.8K
Akshit retweeted
Joykirat
Joykirat@joykiratsingh·
🚨Excited to announce Agent-BRACE! LLM agents in long-horizon POMDPs either blow up their context with raw history or summarize it, discarding uncertainty by collapsing belief into a point estimate. Agent-BRACE decouples the agent into belief state + policy models, jointly trained via RL. Key takeaways: 1️⃣ 🎯The belief state model produces a structured approximation of the belief distribution as a set of atomic natural-language claims with ordinal verbalized certainty labels ranging from certain to unknown. The policy conditions on this compact belief rather than the full history. 2️⃣ 📈 Outperforms strong RL baselines on long-horizon partially observable embodied language environments while maintaining a near-constant context window independent of episode length. 3️⃣ 🔄 The learned belief becomes increasingly calibrated as evidence accumulates, and epistemic belief decreases over time: the proportion of claims that the agent has the strongest level of belief in grows from 21% → 52% over an episode. 👇🧵
2
39
67
15.6K
Akshit
Akshit@akshitwt·
Jonas’s group doing incredibly cool and timely work as always☝️makes you think how stunted we’ve become because we’re accustomed to the status quo
Jonas Geiping@jonasgeiping

We’re training models wrong and it’s due to chatGPT. Even the modern coding agents used daily still use message-based exchanges: They send messages to users, to themselves (CoT) and to tools, and receive messages in turn. This bottlenecks even very intelligent agents to a single stream. The models cannot read while writing, cannot act while thinking and cannot think while processing information. In our new paper, see below, we discuss LLMs with parallel streams. We show that multi-stream LLMs can … 🔵Be created by instruction-tuning for the stream format 🔵Simplify user and tool use UX removing many pain points with agents and chat models (such as having to interrupt the model to get a word in) 🔵Multi-Stream LLMs are fast, they can predict+read tokens in all streams in parallel in each forward pass, improving latency 🔵 LLMs with multiple streams have an easier time encoding a separation of concerns, improving security 🔵 LLMs with many internal streams provide a legible form of parallel/cont. reasoning. Even if the main CoT stream is accidentally pressured or too focused on a particular task to voice concerns, other internal streams can subvocalize concerns that would otherwise not be verbalized. Does this sound related to a recent thinky post :) - Yes, but I don’t feel so bad about being outshipped with such a cool report on their side by 23 hours. I’ll link a 2nd thread below with a more direct comparison. I actually think both are complementary in interesting ways.

0
0
7
1.1K
lovish
lovish@louvishh·
wrapped up my phd at meta/ucl early and joined an amazing set of folks @recursive_si we're building self-improving superintelligence using the science of scaling and open-endedness. come build it with us!
Tim Rocktäschel@_rockt

Excited to co-found Recursive (@recursive_si) with an exceptional team in London and SF to create AI that experiments on how to safely improve itself, turning compute into knowledge that accumulates in an open-ended process of endless, automated scientific discoveries.

26
13
181
14.4K
Akshit
Akshit@akshitwt·
@srijatwt yes its a largely uncontrollable mess lol
0
0
2
197
srija
srija@srijatwt·
largely agree with this. it is incredibly hard right now to filter out the right signals from sooooo much noise (neurips this time has around 40k submissions?) also i think it is a sort of recursive problem, because if lower quality papers get in, it also means that those authors will soon be the ones reviewing the next batches of papers, which means that the review process will also end up being more and more noisy/low signal
1
0
3
325
Akshit
Akshit@akshitwt·
as sad as i am saying this, conference acceptance just do not equal good paper anymore. i think most high alpha people that i’ve talked to feel the same way. there’s a lot to talk about, i have a blog draft about this. wondering if people are interested? some initial thoughts⬇️
7
0
34
4.2K
Akshit
Akshit@akshitwt·
does not* stupid typo, at least its clear i don’t use ai to write my tweets 😋☝️
0
0
1
502