Rayhan Athaillah

235 posts

@THYluhkoyd

AI agents, coding tools, and robotics perception · EECS @ NYCU · learning in public

Hsinchu, Taiwan · Joined June 2026
158 Following · 70 Followers
Pinned Tweet
Rayhan Athaillah
Rayhan Athaillah@THYluhkoyd·
Every time DeepSWE comes up, agentic-coding users seem to ask the same thing: Where is Composer 2.5?

DeepSWE has GPT-5.5, Opus, Sonnet, Kimi, Gemini, DeepSeek, Qwen, etc. But Composer 2.5, one of the main models people actually use inside Cursor, has no official DeepSWE row yet.

So I tried a benchmark-linking estimate. CursorBench 3.1 has Composer 2.5. DeepSWE does not. But both share several other model-effort configurations. So the question becomes: if we use those shared rows as a bridge, where would Composer 2.5 roughly land on DeepSWE?

I recomputed DeepSWE Pass@1 from trial-level data, normalized model names and reasoning-effort labels, then matched overlapping model-effort pairs between DeepSWE and CursorBench 3.1. Then I estimated Composer 2.5 using several linking checks: ordinary least squares regression, ridge-style regression, Theil-Sen robust regression, linear equating, equipercentile equating, nearest-neighbor imputation, bootstrap plus leave-one-out sensitivity, and a median-delta baseline.

This is not an official DeepSWE result. It is an estimate from overlapping model-effort pairs, meant to ask whether Composer’s CursorBench performance could transfer to DeepSWE-style long-horizon software-engineering tasks.
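A minimal sketch of three of the linking checks named above (ordinary least squares, Theil-Sen, and the median-delta baseline), using only the Python standard library. The paired scores in `shared_rows`, the input value 65.0, and all function names are invented for illustration; they are not benchmark data and not the author's implementation.

```python
from itertools import combinations
from statistics import mean, median

# HYPOTHETICAL (CursorBench score, DeepSWE Pass@1) pairs for shared
# model-effort rows. Invented numbers, not real benchmark data.
shared_rows = [
    (40.0, 30.0),
    (50.0, 42.0),
    (55.0, 45.0),
    (60.0, 52.0),
    (70.0, 60.0),
]


def ols_predict(pairs, x_new):
    """Ordinary least squares: fit y = a + b*x on the shared rows."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in pairs) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * x_new


def theil_sen_predict(pairs, x_new):
    """Theil-Sen: median of pairwise slopes, then median intercept."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(pairs, 2)
        if x2 != x1
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in pairs)
    return intercept + slope * x_new


def median_delta_predict(pairs, x_new):
    """Median-delta baseline: shift x_new by the median (y - x) gap."""
    return x_new + median(y - x for x, y in pairs)


if __name__ == "__main__":
    target_score = 65.0  # hypothetical CursorBench score for the unlisted model
    for name, fn in [
        ("ols", ols_predict),
        ("theil_sen", theil_sen_predict),
        ("median_delta", median_delta_predict),
    ]:
        print(f"{name}: {fn(shared_rows, target_score):.1f}")
```

Running several estimators on the same shared rows is what produces a spread rather than a single number: when the methods disagree widely, the bridge between the two benchmarks is weak.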
Rayhan Athaillah tweet media
English
5
28
36
6.9K
Rayhan Athaillah
Rayhan Athaillah@THYluhkoyd·
@Maha_kalpa So the first quadrillionaire will emerge in 2136. None of us will be around to see him.
English
0
0
0
0
DeepakUXUI
DeepakUXUI@Maha_kalpa·
First Millionaire - John Astor (~early 1800s)
 First Billionaire - John D. Rockefeller (1916)
 First Trillionaire - Elon Musk (2026)
DeepakUXUI tweet media
English
29
5
46
860
Bhavy☄️
Bhavy☄️@Bhavani_00007·
name a single app that literally nobody hates
English
39
0
37
1.7K
Rayhan Athaillah
Rayhan Athaillah@THYluhkoyd·
To my mutuals here on X: let's share, what background do you come from? I started in electrical engineering, moved on to robotics, and now I'm learning about AI. Drop a comment below! 👏👇
Indonesia
0
0
0
3
Bryän
Bryän@bryanonchain·
@THYluhkoyd Whoa, local LLMs still come with a lot of challenges, bro. You need capable hardware specs.
Indonesia
1
0
1
2
Bryän
Bryän@bryanonchain·
A modern second brain isn't just Obsidian. In my view there are 4 layers: 1. Capture 2. Structure 3. AI processing 4. Retrieval. If one layer is missing, your knowledge stays hard to use. Short thread.
Indonesia
5
8
102
4.2K
Rayhan Athaillah
Rayhan Athaillah@THYluhkoyd·
@sri9s There was a referral promo last month. I subbed to the Pro+ plan, worth it!!!
English
1
0
1
6
SrinathJ
SrinathJ@sri9s·
If you are switching to cursor now, you have real fomo
English
3
0
7
230
Jack Price
Jack Price@jackprice·
Pitch me your startup in 1 sentence
I will rate your UI 1/10
Best ones I’ll try out and sign up
English
170
11
94
15K
Irul Fajar
Irul Fajar@IrulFajarx·
@THYluhkoyd We'll just have to see which tools turn out to be the most worth it going forward
Indonesia
1
0
1
3
Irul Fajar
Irul Fajar@IrulFajarx·
If you had to choose between ChatGPT, Grok, Claude, and Gemini, which one would you pick?
Indonesia
2
0
3
47
Rayhan Athaillah
Rayhan Athaillah@THYluhkoyd·
@birdabo Who are they? What's the context? Is it about SpaceX acquiring Cursor?
English
0
0
2
118
sui ☄️
sui ☄️@birdabo·
i slept for one night and missed generational cursor news. never sleep on X. insane acceleration. congrats!!
sui ☄️ tweet media
English
10
1
206
4.1K
Astrid
Astrid@just_some_dev·
how the FUCK do you get a job in tech rn holy shit I have like 7 years of experience and nobody cares
English
264
64
4K
714.4K
Rayhan Athaillah
Rayhan Athaillah@THYluhkoyd·
@kinopee_ai This is the first time I've heard of Origin. Is it basically equivalent to GitHub?
English
0
0
0
3
Rayhan Athaillah
Rayhan Athaillah@THYluhkoyd·
@ml_angelopoulos I only know that GPT is not that good at frontend coding, and I don't come from a frontend background either. But GLM > Opus 4.7 > Opus 4.8? The order looks reversed 😅
English
0
0
0
532
Anastasios Nikolas Angelopoulos
Just to be clear, if you remove Fable, which is unavailable, GLM-5.2 (Max) is the #1 model in the world for frontend coding. This is a huge moment. OSS has caught up with proprietary, and China has caught up with the US, in this very important domain.
Arena.ai@arena

Exciting news: GLM-5.2 (Max) ranks #2 in Code Arena: Frontend, with +29pt over Claude Opus 4.7 (Thinking) and only behind Fable 5! GLM-5.2 is the best open model vs Kimi-K2.6 and Minimax-M3 by a large margin.
- #2 React and #4 HTML sub-leaderboards
- Ranks as the top model in nearly all sub categories: Brand & Marketing, Reference-Based Design, Data & Analytics, Consumer Product, Gaming, and Simulations.
Congrats @Zai_org for the incredible milestone!

English
49
128
1.7K
161.1K
Venkatesh
Venkatesh@Venkydotdev·
Why do developers use different LLMs for different tasks?
- one for Coding
- one for reviewing
- one for Debugging
- one for Testing
Why not just use a single model for everything?
English
13
1
16
879
Philippe Tremblay
Philippe Tremblay@ptremblay·
I don't know if Anthropic and OpenAI are scared of Cursor/xAI, but they should be. I expect their new model is going to slay. Keep in mind, this is a model that is 1.5T parameters in size and trained using 10-20x the compute Composer 2.5 did.
English
35
7
355
31.6K
山本ユースケ
山本ユースケ@yusuke·
If SpaceX acquires Cursor, well, how about a name like CodeX?
Japanese
20
84
725
90.6K
Rayhan Athaillah
Rayhan Athaillah@THYluhkoyd·
I’m not sure it is strictly “better” yet, but DeepSWE by @datacurve feels closer to how many people actually use coding agents in practice. SWE-Bench and LiveCodeBench are still useful, but popular benchmarks can become over-optimized over time, or feel less representative of messy real workflows. What makes DeepSWE interesting to me is the longer-horizon, agentic setup: repo navigation, tool use, debugging, verification, and recovery from failed attempts. So I’d treat it as another important proxy, not the ground truth. The most useful signal is whether model rankings stay consistent across SWE-Bench, LiveCodeBench, DeepSWE, and real user workflows.
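One way to make "rankings stay consistent" concrete is a rank correlation between two benchmarks over their shared models. The sketch below computes Spearman's rho with the standard library, assuming no tied scores. The model names and scores are invented placeholders, not real leaderboard values.

```python
def ranks(scores):
    """Map each model to its rank (1 = highest score). Assumes no ties."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(order)}


def spearman_rho(bench_a, bench_b):
    """Spearman rank correlation over the models both benchmarks share."""
    shared = sorted(set(bench_a) & set(bench_b))
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared models")
    rank_a = ranks({m: bench_a[m] for m in shared})
    rank_b = ranks({m: bench_b[m] for m in shared})
    d_squared = sum((rank_a[m] - rank_b[m]) ** 2 for m in shared)
    return 1 - 6 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # HYPOTHETICAL scores for placeholder models; not real benchmark results.
    benchmark_a = {"model-x": 70.0, "model-y": 62.0, "model-z": 55.0, "model-w": 48.0}
    benchmark_b = {"model-x": 58.0, "model-y": 49.0, "model-z": 51.0, "model-w": 40.0}
    print(f"rho = {spearman_rho(benchmark_a, benchmark_b):.2f}")
```

A rho near 1 means the two benchmarks order the shared models the same way; a low or negative value suggests they measure different things.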
English
0
0
0
5
Rayhan Athaillah retweeted
Ichlas
Ichlas@sinterchlas·
@THYluhkoyd @datacurve Is DeepSWE a better proxy for real world software engineering than SWE Bench or LiveCodeBench? Why or why not?
English
1
1
1
19
Rayhan Athaillah
Rayhan Athaillah@THYluhkoyd·
I turned my earlier Composer 2.5 × DeepSWE estimate into a reproducible GitHub repo.

Repo: github.com/RayhanHaqi/com…

Graph below is the updated pinned snapshot from the repo.

Quick recap: DeepSWE by @datacurve has public rows for GPT-5.5, Opus, Sonnet, Kimi, Gemini, DeepSeek, Qwen, etc. But Composer 2.5, one of the main models many people actually use inside @cursor_ai, still has no public DeepSWE trial row in the artifact I used.

So I tried a cross-benchmark linking estimate. CursorBench 3.1 has Composer 2.5. DeepSWE does not. But both benchmarks share several other model-effort pairs.

The repo recomputes DeepSWE Pass@1 from trial-level data, normalizes model/effort labels, matches overlapping model-effort pairs, then estimates where Composer 2.5 might roughly land on a DeepSWE-style axis.

Pinned snapshot:
- central estimate: ~58.1% DeepSWE Pass@1
- median across all methods: ~57.6%
- mean across all methods: ~55.8%
- all-method spread: ~48.0%–62.2%
- conservative median-delta anchor: ~52.3%

Small clarification from my earlier chart: The old visual showed a narrower conservative-to-optimistic band, around 52.3% → 62.2%. The repo now reports the full method spread across all 8 linking methods, so the lower floor is 48.0%, driven by the cost-normalized sensitivity check.

Nothing dropped from 52.3% to 48.0%. They are different statistics:
- 52.3% = conservative robust_median_delta anchor
- 48.0% = all-method sensitivity floor from cost_normalized

Important caveat: This is still not an official DeepSWE result, and it is not a measured Composer 2.5 DeepSWE score. It is an unofficial estimate from overlapping CursorBench 3.1 ↔ DeepSWE model-effort pairs.

Before this weekend, I also plan to test the full DeepSWE 113-task suite directly using Composer 2.5. So this repo is basically the “before direct measurement” estimate.

Once I finish the full run, I want to compare:
- estimated Composer 2.5 DeepSWE Pass@1
- actual Composer 2.5 DeepSWE Pass@1 from the full 113-task run

That comparison should be much more interesting than the estimate alone.
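The "recompute Pass@1 from trial-level data and normalize labels" step could look roughly like the sketch below. The record fields (`model`, `effort`, `task_id`, `passed`), the normalization rule, and the choice to average per-task pass rates are all assumptions made for illustration; the repo's actual schema and aggregation may differ.

```python
from collections import defaultdict
from statistics import mean


def normalize_label(label: str) -> str:
    """Assumed normalization: trim, lowercase, unify separators to hyphens."""
    return "-".join(label.strip().lower().replace("_", " ").split())


def pass_at_1(trials):
    """Return {(model, effort): Pass@1 in percent}.

    Pass@1 is taken here as the mean over tasks of each task's pass rate
    across its trials (an assumption about the aggregation).
    """
    per_task = defaultdict(list)
    for trial in trials:
        config = (normalize_label(trial["model"]), normalize_label(trial["effort"]))
        per_task[(config, trial["task_id"])].append(1.0 if trial["passed"] else 0.0)

    per_config = defaultdict(list)
    for (config, _task_id), outcomes in per_task.items():
        per_config[config].append(mean(outcomes))

    return {config: 100.0 * mean(rates) for config, rates in per_config.items()}


if __name__ == "__main__":
    # HYPOTHETICAL trial records; not real DeepSWE data.
    example_trials = [
        {"model": "Model A", "effort": "High", "task_id": "t1", "passed": True},
        {"model": "model_a", "effort": "high", "task_id": "t1", "passed": False},
        {"model": "Model A", "effort": "High ", "task_id": "t2", "passed": True},
    ]
    for config, score in pass_at_1(example_trials).items():
        print(config, f"{score:.1f}%")
```

Normalizing labels before grouping matters because the two benchmarks may spell the same model-effort pair differently, and an unmatched pair silently drops out of the bridge.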
Rayhan Athaillah tweet media
English
2
1
2
598
Rayhan Athaillah
Rayhan Athaillah@THYluhkoyd·
@PeteCapeCod @zebassembly Thank you. I think the Cursor plan is really generous with its usage limits for the Composer model. That's why I want to spend the rest of my usage limit on this DeepSWE benchmark test before the plan renews.
English
0
0
1
6
Peter Cruckshank
Peter Cruckshank@PeteCapeCod·
@THYluhkoyd @zebassembly Nice that's some cool stuff. I haven't got into benchmarking models myself yet, but I like where you're going with this. It is a good question 🤔 Looking forward to seeing how the final tests compare to the prelim numbers too.
English
1
0
1
5
zeb
zeb@zebassembly·
is the cursor $20 plan usable without hitting limits instantly with composer? I haven't used a composer model yet but it looks like cursor is kinda on fire right now
English
125
0
594
118.3K