Rayhan Athaillah

235 posts


@THYluhkoyd

AI agents, coding tools, and robotics perception · EECS @ NYCU · learning in public

Hsinchu, Taiwan · Joined June 2026

158 Following · 70 Followers

Pinned Tweet

Rayhan Athaillah@THYluhkoyd·2d

Every time DeepSWE comes up, agentic-coding users seem to ask the same thing: Where is Composer 2.5?

DeepSWE has GPT-5.5, Opus, Sonnet, Kimi, Gemini, DeepSeek, Qwen, etc. But Composer 2.5, one of the main models people actually use inside Cursor, has no official DeepSWE row yet.

So I tried a benchmark-linking estimate. CursorBench 3.1 has Composer 2.5. DeepSWE does not. But both share several other model-effort configurations. So the question becomes: if we use those shared rows as a bridge, where would Composer 2.5 roughly land on DeepSWE?

I recomputed DeepSWE Pass@1 from trial-level data, normalized model names and reasoning-effort labels, then matched overlapping model-effort pairs between DeepSWE and CursorBench 3.1.

Then I estimated Composer 2.5 using several linking checks:
- ordinary least squares regression
- ridge-style regression
- Theil-Sen robust regression
- linear equating
- equipercentile equating
- nearest-neighbor imputation
- bootstrap plus leave-one-out sensitivity
- a median-delta baseline

This is not an official DeepSWE result. It is an estimate from overlapping model-effort pairs, meant to ask whether Composer’s CursorBench performance could transfer to DeepSWE-style long-horizon software-engineering tasks.
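A minimal sketch of four of the linking checks named above (median-delta, ordinary least squares, Theil-Sen, and linear equating), assuming the shared rows reduce to simple (CursorBench score, DeepSWE Pass@1) pairs. The `shared` numbers and all function names are hypothetical and made up for illustration; they are not the author's data or code.

```python
from itertools import combinations
from statistics import mean, median, pstdev

# Hypothetical (CursorBench score, DeepSWE Pass@1) pairs for models that
# appear on both benchmarks. Made-up numbers, for illustration only.
shared = [(40.0, 30.0), (50.0, 42.0), (60.0, 50.0), (70.0, 62.0), (80.0, 70.0)]
xs = [x for x, _ in shared]
ys = [y for _, y in shared]


def median_delta(x_new):
    """Shift the new score by the median DeepSWE-minus-CursorBench gap."""
    return x_new + median(y - x for x, y in shared)


def ols(x_new):
    """Ordinary least squares line fitted to the shared pairs."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in shared) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my + slope * (x_new - mx)


def theil_sen(x_new):
    """Median of all pairwise slopes, with a median-based intercept."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(shared, 2)
        if x2 != x1
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in shared)
    return intercept + slope * x_new


def linear_equating(x_new):
    """Map scores so the means and standard deviations of both scales match."""
    return mean(ys) + pstdev(ys) / pstdev(xs) * (x_new - mean(xs))


if __name__ == "__main__":
    x_target = 65.0  # hypothetical CursorBench score for the unmeasured model
    for method in (median_delta, ols, theil_sen, linear_equating):
        print(f"{method.__name__}: {method(x_target):.1f}")
```

With only a handful of shared pairs, the spread between these methods is itself informative: a wide spread suggests the bridge between the two benchmarks is weak.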

English

6.9K

Rayhan Athaillah@THYluhkoyd·46m

@Maha_kalpa So the first quadrillionaire will emerge in 2136. None of us will be around to see it.

English

DeepakUXUI@Maha_kalpa·20h

First Millionaire - John Astor (~early 1800s)  First Billionaire - John D. Rockefeller (1916)  First Trillionaire - Elon Musk (2026)

English

860

Rayhan Athaillah@THYluhkoyd·49m

@Bhavani_00007 WinRAR

GIF

English

Bhavy☄️@Bhavani_00007·18h

name a single app that literally nobody hates

English

1.7K

Rayhan Athaillah@THYluhkoyd·51m

To my mutuals here on X: let's share, what background do you come from? I started in electrical engineering, moved on to robotics, and am now learning about AI. Drop yours in the comments below! 👏👇

Indonesia

Rayhan Athaillah@THYluhkoyd·58m

@bryanonchain Yep, but you can start small with an 8B-parameter model first, hehe

Indonesia

Bryän@bryanonchain·1h

@THYluhkoyd Wow, local LLMs still come with plenty of challenges, bro; you need capable hardware

Indonesia

Bryän@bryanonchain·1d

A modern second brain is more than just Obsidian. As I see it, there are 4 layers:
1. Capture
2. Structure
3. AI processing
4. Retrieval
If one layer is missing, your knowledge stays hard to use. Short thread.

Indonesia

102

4.2K

Rayhan Athaillah@THYluhkoyd·1h

@jun_song Where is Opus 4.8?

English

Jun Song@jun_song·11h

And you call this benchmaxxed? Try GLM yourself.

Design Arena@Designarena

BREAKING: GLM-5.2 is now 1st on Design Arena. With an Elo of 1360, GLM-5.2 has jumped ahead of the now unavailable Claude Fable 5. And it's open weights. This is an improvement of 4 positions and 27 Elo points to achieve one of the highest Elo scores in our code categories since Design Arena started. Huge congratulations to the @Zai_org on the release!

English

258

39.3K

Rayhan Athaillah@THYluhkoyd·1h

@sri9s There was a referral promo last month. I subbed the pro+ plan, worth it!!!

English

SrinathJ@sri9s·5h

If you are switching to cursor now, you have real fomo

English

230

Rayhan Athaillah@THYluhkoyd·1h

@jackprice "Sell me this pen!"

GIF

English

Jack Price@jackprice·16h

Pitch me your startup in 1 sentence.
I will rate your UI 1/10.
Best ones I’ll try out and sign up.

English

170

15K

Rayhan Athaillah@THYluhkoyd·1h

@IrulFajarx Yups

English

Irul Fajar@IrulFajarx·1h

@THYluhkoyd We'll just have to see which tools turn out to be the most worth it going forward

Indonesia

Irul Fajar@IrulFajarx·2h

If you had to choose between ChatGPT, Grok, Claude, and Gemini, which one would you pick?

Indonesia

Rayhan Athaillah@THYluhkoyd·1h

@birdabo Who are they? What's the context? Is it about SpaceX acquiring Cursor?

English

118

sui ☄️@birdabo·3h

i slept for one night and missed generational cursor news. never sleep on X. insane acceleration. congrats!!

English

206

4.1K

Rayhan Athaillah@THYluhkoyd·1h

@just_some_dev Fresh grads like me will be f*ed, right?

English

721

Astrid@just_some_dev·19h

how the FUCK do you get a job in tech rn holy shit I have like 7 years of experience and nobody cares

English

264

714.4K

Rayhan Athaillah@THYluhkoyd·1h

@kinopee_ai This is the first time I've heard about Origin. Is it basically equivalent to GitHub?

English

Kinopee@kinopee_ai·7h

So Origin turned out to be Git hosting. This is a big move.

Cursor@cursor_ai

We're launching code storage and git hosting. Origin gives teams and agents a place to host, review, and collaborate on code. Available this fall. Join the waitlist. cursor.com/origin-waitlist

日本語

12.4K

Rayhan Athaillah@THYluhkoyd·1h

@ml_angelopoulos I only know that GPT is not that good at frontend coding, and I don't come from a frontend background either. But GLM > Opus 4.7 > Opus 4.8? The order looks reversed 😅

English

532

Anastasios Nikolas Angelopoulos@ml_angelopoulos·9h

Just to be clear, if you remove Fable, which is unavailable, GLM-5.2 (Max) is the #1 model in the world for frontend coding. This is a huge moment. OSS has caught up with proprietary, and China has caught up with the US, in this very important domain.

Arena.ai@arena

Exciting news: GLM-5.2 (Max) ranks #2 in Code Arena: Frontend, with +29pt over Claude Opus 4.7 (Thinking) and only behind Fable 5! GLM-5.2 is the best open model vs Kimi-K2.6 and Minimax-M3 by a large margin.
- #2 React and #4 HTML sub-leaderboards
- Ranks as the top model in nearly all sub categories: Brand & Marketing, Reference-Based Design, Data & Analytics, Consumer Product, Gaming, and Simulations.
Congrats @Zai_org for the incredible milestone!

English

128

1.7K

161.1K

Rayhan Athaillah@THYluhkoyd·1h

@Venkydotdev I never thought about that. Do I need to run 4 different sessions for that?

English

Venkatesh@Venkydotdev·17h

Why do developers use different LLMs for different tasks?
- one for Coding
- one for reviewing
- one for Debugging
- one for Testing
Why not just use a single model for everything?

English

879

Rayhan Athaillah@THYluhkoyd·1h

@ptremblay CompoXer 3.0 will level the Fable 😆

English

Philippe Tremblay@ptremblay·7h

I don't know if Anthropic and OpenAI are scared of Cursor/xAI, but they should be. I expect their new model is going to slay. Keep in mind, this is a model that is 1.5T parameters in size and trained using 10-20x the compute Composer 2.5 did.

English

355

31.6K

Rayhan Athaillah@THYluhkoyd·1h

@yusuke And the model could be called something like Xomposer or CompoXer 🤣

English

Rayhan Athaillah@THYluhkoyd·1h

@yusuke CurXor is banger 🔥

Indonesia

325

山本ユースケ@yusuke·16h

If SpaceX acquires Cursor, well, how about a name like CodeX?

日本語

725

90.6K

Rayhan Athaillah@THYluhkoyd·1h

I’m not sure it is strictly “better” yet, but DeepSWE by @datacurve feels closer to how many people actually use coding agents in practice.

SWE-Bench and LiveCodeBench are still useful, but popular benchmarks can become over-optimized over time, or feel less representative of messy real workflows.

What makes DeepSWE interesting to me is the longer-horizon, agentic setup: repo navigation, tool use, debugging, verification, and recovery from failed attempts.

So I’d treat it as another important proxy, not the ground truth. The most useful signal is whether model rankings stay consistent across SWE-Bench, LiveCodeBench, DeepSWE, and real user workflows.
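One way to make "rankings stay consistent" concrete is a Spearman rank correlation between two benchmarks over the models they share. The sketch below assumes scores with no ties; the model names and numbers are hypothetical placeholders, not real benchmark results.

```python
def ranks(scores):
    """Rank models from best (1) to worst. Assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}


def spearman(a, b):
    """Spearman rank correlation over the models both benchmarks share."""
    common = sorted(set(a) & set(b))
    ra = ranks({m: a[m] for m in common})
    rb = ranks({m: b[m] for m in common})
    n = len(common)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in common)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical scores; model_e exists on only one benchmark and is ignored.
bench_a = {"model_a": 70, "model_b": 60, "model_c": 50, "model_d": 40}
bench_b = {"model_a": 55, "model_b": 58, "model_c": 35, "model_d": 30, "model_e": 20}

if __name__ == "__main__":
    print(f"rank agreement: {spearman(bench_a, bench_b):.2f}")
```

A value near 1 means the two benchmarks order the shared models almost identically; values near 0 mean the orderings are unrelated.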

English

Rayhan Athaillah retweeted

Ichlas@sinterchlas·2h

@THYluhkoyd @datacurve Is DeepSWE a better proxy for real world software engineering than SWE Bench or LiveCodeBench? Why or why not?

English

Rayhan Athaillah@THYluhkoyd·2h

I turned my earlier Composer 2.5 × DeepSWE estimate into a reproducible GitHub repo.

Repo: github.com/RayhanHaqi/com…

Graph below is the updated pinned snapshot from the repo.

Quick recap: DeepSWE by @datacurve has public rows for GPT-5.5, Opus, Sonnet, Kimi, Gemini, DeepSeek, Qwen, etc. But Composer 2.5, one of the main models many people actually use inside @cursor_ai, still has no public DeepSWE trial row in the artifact I used.

So I tried a cross-benchmark linking estimate. CursorBench 3.1 has Composer 2.5. DeepSWE does not. But both benchmarks share several other model-effort pairs.

The repo recomputes DeepSWE Pass@1 from trial-level data, normalizes model/effort labels, matches overlapping model-effort pairs, then estimates where Composer 2.5 might roughly land on a DeepSWE-style axis.

Pinned snapshot:
- central estimate: ~58.1% DeepSWE Pass@1
- median across all methods: ~57.6%
- mean across all methods: ~55.8%
- all-method spread: ~48.0%–62.2%
- conservative median-delta anchor: ~52.3%

Small clarification from my earlier chart: The old visual showed a narrower conservative-to-optimistic band, around 52.3% → 62.2%. The repo now reports the full method spread across all 8 linking methods, so the lower floor is 48.0%, driven by the cost-normalized sensitivity check.

Nothing dropped from 52.3% to 48.0%. They are different statistics:
- 52.3% = conservative robust_median_delta anchor
- 48.0% = all-method sensitivity floor from cost_normalized

Important caveat: This is still not an official DeepSWE result, and it is not a measured Composer 2.5 DeepSWE score. It is an unofficial estimate from overlapping CursorBench 3.1 ↔ DeepSWE model-effort pairs.

Before this weekend, I also plan to test the full DeepSWE 113-task suite directly using Composer 2.5. So this repo is basically the “before direct measurement” estimate.

Once I finish the full run, I want to compare:
estimated Composer 2.5 DeepSWE Pass@1
vs
actual Composer 2.5 DeepSWE Pass@1 from the full 113-task run

That comparison should be much more interesting than the estimate alone.
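The recompute, normalize, and match steps from the recap can be sketched roughly as below. This is not the repo's code: the trial tuple layout, the function names, and every number are hypothetical. It also assumes Pass@1 is the pooled fraction of passing trials, which equals the per-task average only when every task has the same number of trials.

```python
from collections import defaultdict


def normalize(model, effort):
    """Lower-case and unify separators so labels from both sources line up."""
    key = model.strip().lower().replace("_", "-").replace(" ", "-")
    return key, effort.strip().lower()


def pass_at_1(trials):
    """Percentage of passing trials per normalized (model, effort) pair."""
    counts = defaultdict(lambda: [0, 0])  # pair -> [passes, total trials]
    for model, effort, passed in trials:
        pair = normalize(model, effort)
        counts[pair][0] += int(passed)
        counts[pair][1] += 1
    return {pair: 100.0 * p / n for pair, (p, n) in counts.items()}


def match(other_scores, deepswe_scores):
    """Keep only pairs present in both sources, as (other, DeepSWE) scores."""
    other = {normalize(m, e): s for (m, e), s in other_scores.items()}
    shared = other.keys() & deepswe_scores.keys()
    return {pair: (other[pair], deepswe_scores[pair]) for pair in shared}


# Hypothetical trial rows: (model label, effort label, passed?)
trials = [
    ("Model A", "High", True),
    ("model-a", "high", False),
    ("Model_B", "low", True),
    ("model-b", "LOW ", True),
]
# Hypothetical scores from the second benchmark, with its own label style.
other_benchmark = {("Model-A", "high"): 61.0, ("Model C", "high"): 70.0}

deepswe = pass_at_1(trials)
pairs = match(other_benchmark, deepswe)

if __name__ == "__main__":
    print(deepswe)
    print(pairs)
```

Only the matched pairs feed the linking estimate; a model that appears on just one side, like the hypothetical `Model C` here, is dropped.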

English

598

Rayhan Athaillah@THYluhkoyd·2h

@PeteCapeCod @zebassembly Thank you. I think the Cursor plan is really generous with the usage limit for the Composer model. That's why I want to spend the rest of my usage limit on this DeepSWE benchmark test before the plan renews.

English

Peter Cruckshank@PeteCapeCod·2h

@THYluhkoyd @zebassembly Nice that's some cool stuff. I haven't got into benchmarking models myself yet, but I like where you're going with this. It is a good question 🤔 Looking forward to seeing how the final tests compare to the prelim numbers too.

English

zeb@zebassembly·10h

is the cursor $20 plan usable without hitting limits instantly with composer? I haven't used a composer model yet but it looks like cursor is kinda on fire right now

English

125

594

118.3K
