Naman Jain

555 posts

@StringChaos

Research @cursor_ai | CursorBench, LiveCodeBench, DeepSWE, R2E-Gym, GSO, LMArena Coding | Past: @UCBerkeley @MetaAI @AWS @MSFTResearch @iitbombay

San Francisco, CA · Joined March 2018
1.4K Following · 2.8K Followers
Naman Jain retweeted
Cursor
Cursor@cursor_ai·
We trained Composer to self-summarize through RL instead of a prompt. This reduces the error from compaction by 50% and allows Composer to succeed on challenging coding tasks requiring hundreds of actions.
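For readers unfamiliar with compaction: once an agent's action history outgrows its context budget, older turns get replaced by a summary. A minimal sketch of that mechanic, with a hypothetical `summarize` placeholder (the tweet's point is that Cursor trains the summarizer with RL rather than prompting a model, which this sketch does not capture):

```python
def summarize(turns):
    # Hypothetical stand-in for the learned summarizer; a real system
    # would call a model (per the tweet, trained via RL) to compress turns.
    return f"SUMMARY({len(turns)} turns)"

def compact(history, budget):
    """Keep the most recent `budget` turns verbatim; fold older turns
    into a single summary entry at the front of the history."""
    if len(history) <= budget:
        return list(history)
    old, recent = history[:-budget], history[-budget:]
    return [summarize(old)] + recent

history = [f"turn-{i}" for i in range(10)]
compacted = compact(history, budget=4)  # 1 summary entry + 4 recent turns
```

The "error from compaction" the tweet cites is whatever the agent loses when `summarize` drops details it later needs, which is why the quality of that single function dominates long-horizon tasks.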
Naman Jain
Naman Jain@StringChaos·
Lots more details in the post:
1. Pareto frontier across different metrics
2. How CursorBench has shifted as agent capabilities changed
3. CursorBench vs public evals: what’s missing and future work directions
4. CursorBench vs online: how online metrics shape offline evals
Naman Jain retweeted
Manish Shetty
Manish Shetty@slimshetty_·
GSO update: gpt-5.4 (xhigh) scores 31.4% with reasoning_effort=high, slightly lower than gpt-5.2. A quick thought on why below:
Naman Jain retweeted
Cursor
Cursor@cursor_ai·
Long-running agents are now available at cursor.com/agents for Ultra, Teams, and Enterprise plans. With our new harness, agents can complete much larger tasks. cursor.com/blog/long-runn…
Naman Jain retweeted
Cursor
Cursor@cursor_ai·
Composer 1.5 is now available. We’ve found it to strike a strong balance between intelligence and speed.
Naman Jain retweeted
Michael Truell
Michael Truell@mntruell·
We built a browser with GPT-5.2 in Cursor. It ran uninterrupted for one week. It's 3M+ lines of code across thousands of files. The rendering engine is from-scratch in Rust with HTML parsing, CSS cascade, layout, text shaping, paint, and a custom JS VM. It *kind of* works! It still has issues and is of course very far from Webkit/Chromium parity, but we were astonished that simple websites render quickly and largely correctly.
Cursor@cursor_ai

GPT-5.2 Codex is now available in Cursor! We believe it's the frontier model for long-running tasks.

Naman Jain retweeted
Michael Truell
Michael Truell@mntruell·
We rebuilt how our agent uses context. Instead of stuffing everything into a prompt, Cursor dynamically discovers context via files, tools, and history, cutting token usage by 46.9% and freeing up more space for the agent to work.
Cursor@cursor_ai

Cursor's agent now uses dynamic context for all models. It's more intelligent about how context is filled while maintaining the same quality. This reduces total tokens by 46.9% when using multiple MCP servers.

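A rough sketch of the dynamic-context idea, under the assumption that it means exposing repository content through tool calls the agent issues on demand, instead of inlining everything into the prompt up front (all names and file contents here are illustrative, not Cursor's implementation):

```python
# Toy repo: two small source files plus a verbose doc.
files = {
    "main.py": "def main(): ...",
    "utils.py": "def helper(): ...",
    "README.md": "project docs " * 50,
}

def static_prompt():
    # Static approach: every file body is stuffed into the prompt.
    return "\n".join(files.values())

class DynamicContext:
    """Dynamic approach: the agent starts from a file listing and
    pulls individual file bodies into context only when needed."""
    def __init__(self, files):
        self.files = files
        self.loaded = {}
    def listing(self):
        return sorted(self.files)
    def open(self, name):
        # Tool call the agent issues when it decides it needs this file.
        self.loaded[name] = self.files[name]
        return self.loaded[name]

ctx = DynamicContext(files)
ctx.open("utils.py")  # the agent only needed one file for this task
dynamic_tokens = sum(len(v.split()) for v in ctx.loaded.values())
static_tokens = len(static_prompt().split())
```

The token savings come from the files (and MCP tool outputs) the agent never asks for; the quoted 46.9% reduction is Cursor's measured figure, not something this sketch reproduces.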
Naman Jain retweeted
Deedy
Deedy@deedydas·
The reviews are in. Cursor's new Composer-1 model is really good at coding, especially for large codebases!
— ~4x faster
— good at using its own search tool to find the right files
— much better than the base open-source model it's RL'd on
And it's free to use right now.
Naman Jain retweeted
Sasha Rush
Sasha Rush@srush_nlp·
Talk at Ray Summit on "Building Cursor Composer." Overview of the work from our research team. youtube.com/watch?v=md8D8e…
Naman Jain retweeted
Amanda Bertsch
Amanda Bertsch@abertsch72·
Can LLMs accurately aggregate information over long, information-dense texts? Not yet… We introduce Oolong, a dataset of simple-to-verify information aggregation questions over long inputs. No model achieves >50% accuracy at 128K on Oolong!
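To make "simple-to-verify information aggregation" concrete, here is a hypothetical example in the spirit of Oolong (not an actual dataset item): the answer is a count distributed across the entire input, so no single excerpt contains it, yet a grader can check it exactly.

```python
# Hypothetical aggregation question over a long, information-dense input:
# every record contributes to the answer, so the model must track state
# across the whole context rather than retrieve one passage.
records = [f"review {i}: {'positive' if i % 3 else 'negative'}" for i in range(300)]
long_input = "\n".join(records)  # one long document fed to the model

question = "How many reviews are negative?"
gold = sum(1 for r in records if "negative" in r)  # trivial for the grader
```

The asymmetry is the point: verification is a one-line count, but answering requires aggregating over every line of the 300-record input, which is where the tweet reports models breaking down at 128K.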
Naman Jain retweeted
Cursor
Cursor@cursor_ai·
Semantic search improves our agent's accuracy across all frontier models, especially in large codebases where grep alone falls short. Learn more about our results and how we trained an embedding model for retrieving code.
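To illustrate why embedding-based retrieval can surface code that grep misses, here is a toy example: a hand-rolled bag-of-words "embedding" with a tiny synonym map stands in for Cursor's trained code-embedding model, and the query shares no literal substring with the target file.

```python
import math
import re

# Stand-in for a learned embedding: bag-of-words over normalized tokens,
# with a tiny hand-made synonym map. A real system learns these
# equivalences instead of enumerating them.
SYNONYMS = {"fetch": "get", "retrieve": "get", "user": "account"}

def embed(text):
    vec = {}
    for tok in re.findall(r"[a-z]+", text.lower()):
        tok = SYNONYMS.get(tok, tok)
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = {
    "auth.py": "def get_account(session): ...",
    "billing.py": "def charge_card(invoice): ...",
}
query = "fetch user"
grep_hits = [f for f, src in corpus.items() if query in src]  # literal match only
best = max(corpus, key=lambda f: cosine(embed(query), embed(corpus[f])))
```

Grep finds nothing because "fetch user" never appears verbatim, while the vector match ranks `auth.py` first; this gap widens with codebase size, which matches the tweet's claim that grep alone falls short in large repos.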
John Yang
John Yang@jyangballin·
New eval! Code duels for LMs ⚔️
Current evals test LMs on *tasks*: "fix this bug," "write a test."
But we code to achieve *goals*: maximize revenue, cut costs, win users.
Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals.
Naman Jain retweeted
will brown
will brown@willccbb·
ok composer-1 is pretty nuts, and the code it writes is quite nice. probably my new daily driver for many things not quite as galaxy-brain as codex, but it's SO fast that you can use it sync instead of async, and very quickly iterate on fixes. follows instructions very well