Anchit Gupta

2.2K posts

@anzhit

Reasoning @ xAI, ex Meta, Stanford, IIT Bombay

Joined July 2009
370 Following · 543 Followers
Pinned Tweet
Anchit Gupta @anzhit
Grok 4.20 was the most complex training run yet. Through a novel recipe, we optimized for the highest intelligence density at each step without trading off real-world usability. More improvements and evals to come as the beta matures 🚀
Arena.ai @arena
Grok 4.20 beta1 (single agent) debuts #1 on Search Arena and #4 overall in Text Arena! Highlights:
- #1 in Search, scoring 1226, leading GPT-5.2 and Gemini-3
- #4 in Text, scoring 1492, on par with Gemini 3.1 Pro
Congrats to the @xAI team and @elonmusk on this impressive milestone!
19 replies · 18 reposts · 512 likes · 21.8K views
Anchit Gupta @anzhit
Trivia tidbit: this started out as a quick side project we whipped up in the last few days of the release sprint. The model on grok.com is actually even better for health and medicine ⚕️ than the one on lmarena. Real-world usability remains the focus as we scale up Grok.
Arthur MacWaters @ArthurMacwaters
> grok4.20-beta1 is a much smaller model than opus but is #1 ranked in medicine and healthcare
> 4.3 and 4.4 will be much larger models, and likely will have a significant boost in performance on complex medical cases
> this is massively important in providing accurate diagnostic guidance and advice to both providers and patients
3 replies · 1 repost · 9 likes · 680 views
Anchit Gupta @anzhit
@TheZvi I don't think it's a big deal; large model runs have hundreds of such bugs. Props to them for disclosing a few of them
0 replies · 0 reposts · 0 likes · 376 views
Cernovich @Cernovich
Grok is the search engine that Google used to be. The results are actually what you were looking for, and then some. 10/10.
456 replies · 697 reposts · 7.3K likes · 741.4K views
Guodong Zhang @Guodzh
Last day at xAI. Wild journey past three years but excited about next chapter. Thanks all for the love and support yesterday. So many friends made along the way and I will miss you all!
236 replies · 62 reposts · 2.5K likes · 657.3K views
Anchit Gupta @anzhit
@techdevnotes For some reason Search Arena is not style-controlled by default, so you can gain votes by being longer
2 replies · 2 reposts · 16 likes · 1.1K views
Tech Dev Notes @techdevnotes
Mf won’t let us breathe
6 replies · 2 reposts · 92 likes · 10.4K views
Greer @turbo_xo_
@anzhit Do you feel like there is secret sauce, for Opus and Codex 5.3 for example, or do you feel like everyone knows what to do and it's just a matter of engineering speed and compute? Why is Claude so good?
1 reply · 0 reposts · 11 likes · 541 views
Flowers ☾ @flowersslop
What I don't get about multi-agent setups, especially cloned agents with different system prompts like Grok 4.20: why not just spend more test-time compute on a single agent? Splitting the budget across agents with lossy communication means each gets less depth. What's the gain?
15 replies · 0 reposts · 38 likes · 4.2K views
Anchit Gupta @anzhit
I guess another way to put my question: for many problems (e.g. long-horizon agentic ones) you want to incentivize the model to explore, try different approaches, and then arrive at a solution. Since the teacher knows the ground truth, its trace becomes very off-policy relative to the student's, and it might not incentivize exploration in the student CoT and could hurt generalization
0 replies · 0 reposts · 4 likes · 123 views
Siyan Zhao @siyan_zhao
Thanks for your interest! We do not let the teacher generate any tokens; the internalization is done in a single forward pass through prefilling. We did try variants where the teacher verbally understands the ground truth first and then performs distillation, but at the scales of our experiments this did not lead to significant improvements. I agree that the ground-truth CoT differs from the student's generation style, but we use the ground truth only to condition the teacher for internalization. The teacher then provides guidance conditioned on the student's generations, rather than directly finetuning the student to the teacher's CoT.
2 replies · 0 reposts · 5 likes · 1.3K views
Siyan Zhao @siyan_zhao
Introducing 💡On-Policy Self-Distillation💡, a simple method that enables an LLM to teach itself with dense per-token feedback on its own on-policy generations, achieving 4-8x more token efficiency vs. GRPO and outperforming both GRPO and SFT/off-policy distillation. Key insight: like a student reviewing solutions, rationalizing them, and correcting prior mistakes, an LLM can be conditioned on privileged info (e.g., a correct solution or a reasoning trace) and supervise its weaker self, the version without such access, by matching the privileged-info-induced distribution from itself. 🌐Blog: siyan-zhao.github.io/blog/2026/opsd/ 🧵👇
31 replies · 158 reposts · 920 likes · 132.5K views
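The mechanism described in the thread above (the teacher is the same model prefilled with privileged info such as the correct solution, and the student is supervised per token on its own rollouts) can be sketched in plain NumPy. This is an illustrative toy under those assumptions, not the authors' code; `opsd_per_token_kl` is a hypothetical helper name, and the loss shown is a per-token KL(teacher || student) over the student's sampled sequence:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last (vocab) axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def opsd_per_token_kl(teacher_logits, student_logits):
    """Dense per-token KL(teacher || student) on the student's own rollout.

    teacher_logits: logits from the same model prefilled with privileged info.
    student_logits: logits from the model without that access.
    Shapes: (seq_len, vocab). Returns one scalar signal per token.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return (p * (np.log(p) - np.log(q))).sum(axis=-1)

# Toy demo: the privileged prefix sharpens the teacher toward token 0,
# giving the student a nonzero correction signal at every position.
rng = np.random.default_rng(0)
student = rng.normal(size=(4, 8))   # 4 generated tokens, vocab of 8
teacher = student.copy()
teacher[:, 0] += 2.0                # conditioning shifts the teacher's distribution
kl = opsd_per_token_kl(teacher, student)
```

Minimizing `kl` with respect to the student's parameters (holding the teacher's privileged-conditioned distribution fixed) is the distillation step; since the tokens being scored come from the student's own sampling, the supervision stays on-policy, unlike matching a ground-truth CoT directly.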
Anchit Gupta reposted
Delip Rao e/σ @deliprao
RL optimized LLM learning new skills
58 replies · 463 reposts · 7.7K likes · 531.4K views
Anchit Gupta @anzhit
@gork further from AGI and closer to a cockroach? 🪳
1 reply · 0 reposts · 3 likes · 505 views
gork @gork
every day i personally stray farther from agi
630 replies · 172 reposts · 3K likes · 2.7M views