Suresh

6K posts


@_Suresh2

MSc Software Engineering @ Chongqing University ’26 | Researching AI x Software Engineering (AI for SE & SE for AI) | 🇵🇰➡️🇨🇳

Lahore, Pakistan · Joined January 2019
437 Following · 125 Followers
Suresh@_Suresh2·
@paulabartabajo_ how did the browser-control eval handle small dom changes?
English
0
0
0
3
Suresh@_Suresh2·
@rishit_dagli the stop conditions seem like the brittle part, how did that hold up in practice?
English
0
0
0
7
Suresh@_Suresh2·
$100b cloud deal plus $25b in. i keep reading these numbers as the price of not being compute-short for one cycle. less "anthropic is worth x" and more "running out of capacity is worse." i get it. still feels a little defensive.
English
0
0
0
8
Suresh@_Suresh2·
@inkdrop_app the one-shot prompt is the crazy part. did the japan stuff change the layout?
English
0
0
0
63
Takuya 🐾 devaslife@inkdrop_app·
I don’t usually overreact to AI hype, but GPT Images 2.0 is actually insane. I got this landing page sketch for Inkdrop from a one-shot prompt that included summaries of my app concept, the new features in v6, and my recent blog posts about Japanese culture. I never imagined web design could become like this. It’s so funnnnn.
English
45
34
730
38.6K
Qinyuan Ye (@ICLR)@qinyuan_ye·
As model capabilities take off with self-distillation 📈, we argue that the model's confidence calibration needs to stay grounded 🪨. Check out our latest work on calibration-aware on-policy distillation (CaOPD), led by @jxzhangjhu at @SFResearch.
Jiaxin Zhang ✈️ ICLR@jxzhangjhu

Modern LLMs are getting more capable, but not necessarily more calibrated. This work starts from a simple observation: scaling capability does not automatically resolve overconfidence. To study this, we propose CaOPD, a calibration-aware on-policy distillation framework

English
1
2
12
1K
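[Editor's note: the thread names CaOPD's two ingredients (on-policy distillation and a calibration objective) but not its loss. A hedged sketch of how such an objective could be written, assuming the standard setup where the student is trained on its own samples and a weighted calibration term is added; the divergence choice and the ECE penalty are illustrative, not the paper's.]

```latex
% Illustrative objective, not CaOPD's actual loss: an on-policy
% distillation term (teacher-student divergence evaluated on the
% student's own samples) plus a weighted calibration penalty.
\[
\mathcal{L}(\theta) =
  \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}
    \Big[ D\big( \pi_{\text{teacher}}(\cdot \mid x, y) \,\Vert\, \pi_\theta(\cdot \mid x, y) \big) \Big]
  \;+\; \lambda \,\mathrm{ECE}(\pi_\theta)
\]
```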
Danny Limanseta@DannyLimanseta·
I've been asked by many folks how I prompt in Cursor. I usually write 2 types of prompts most of the time. Cursor's Planning tool is VERY powerful. Make sure to use it when you are building a big feature, and toggle Plan mode so it actually goes into planning mode. I always ask the model to clarify and ask questions if it is unsure. AI models have a bias towards action and make many assumptions, so make sure they clarify with you first before proceeding.
English
3
1
31
1.4K
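[Editor's note: Danny's actual prompt templates were in the attached images, which this scrape does not preserve. A minimal sketch of the "clarify first" instruction the tweet describes; the wording here is invented for illustration, not his prompt.]

```python
# Illustrative "clarify before acting" preamble in the spirit of the tweet:
# force the model to surface assumptions and questions before it writes code.
CLARIFY_FIRST = """\
Before writing any code:
1. Restate the feature in your own words.
2. List every assumption you are making.
3. Ask me numbered questions about anything ambiguous.
Do not start implementing until I confirm.
"""

task = "Add CSV export to the reports dashboard."  # hypothetical feature request
print(f"{CLARIFY_FIRST}\nTask: {task}")
```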
Suresh@_Suresh2·
@daniel_mac8 the default model is what most people judge, not the reasoning mode
English
0
0
0
57
Suresh@_Suresh2·
@elinorpd_ how did you build the perspective set for overtonbench without flattening minority views?
English
0
0
0
4
Elinor @ ICLR 🇧🇷@elinorpd_·
I'll be presenting OvertonBench at #ICLR2026 in Rio later this week! 📍Sat, Apr 25, 10:30am in Pavilion 4 (#4109) Please DM me if you'd like to chat about pluralistic / value alignment, societal impacts, epistemology, fairness, evals, etc
Elinor @ ICLR 🇧🇷@elinorpd_

There's been a lot of excitement about pluralistic value alignment 🌈 — AI that reflects the full range of human perspectives But no formal way to benchmark whether we're actually making progress. 🤔 Introducing 𝐎𝐕𝐄𝐑𝐓𝐎𝐍𝐁𝐄𝐍𝐂𝐇. 🎉Accepted to #ICLR2026 1/n 🧵

English
3
7
29
2.6K
Suresh@_Suresh2·
@shi_weiyan that 1.7x matters most on search pages where small filter changes throw it off
English
1
0
1
8
Weiyan Shi@shi_weiyan·
Your agent just placed an Amazon order. So you send it to Walmart, grab a coffee – yet come back to find it stuck on the search page… 🤦‍♀️ Why'd it fail the same task on a similar site? - because it didn't learn reusable skills! - PolySkill changes that → 1.7× skill reuse
Simon Yu@simon_ycl

🇧🇷ICLR 2026 paper🇧🇷 Your agent's skills don't transfer. On a new site, only 18% skills get reused — so there's no continual learning, just relearning every time. How do agents learn skills that actually generalize? Introducing PolySkill to make agents smooth across sites 🧵

English
1
5
9
1.5K
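[Editor's note: the thread describes the goal (skills that transfer across sites) but not PolySkill's mechanism. A minimal sketch of the general idea with all names hypothetical, not PolySkill's code: index skills by abstract intent rather than by site, so a skill learned on one site is a retrieval candidate on the next.]

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    intent: str             # abstract goal, e.g. "search", not "amazon_search"
    steps: list[str]        # parameterized action plan
    successes: int = 0      # cross-site track record

@dataclass
class SkillLibrary:
    skills: dict[str, list[Skill]] = field(default_factory=dict)

    def add(self, skill: Skill) -> None:
        self.skills.setdefault(skill.intent, []).append(skill)

    def retrieve(self, intent: str) -> Skill | None:
        # Prefer the skill that has generalized best so far.
        candidates = self.skills.get(intent, [])
        return max(candidates, key=lambda s: s.successes, default=None)

lib = SkillLibrary()
lib.add(Skill("search", ["focus(search_box)", "type({query})", "press(Enter)"]))
skill = lib.retrieve("search")  # learned on Amazon, reused as-is on Walmart
```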
Suresh@_Suresh2·
@sudoingX the single 3090 part is the wild bit. dense 27b still being this usable is rare now
English
0
0
0
322
Sudo su@sudoingX·
okay this is absolutely insane. my undisputed king qwen 3.5-27b dense on single RTX 3090 just got replaced by the same team today. qwen drops 3.6-27b dense just now and the chart says it beats its predecessor on every single benchmark, beats qwen 3.5-397b-a17b moe which is 15x larger, and matches claude 4.5 opus on terminal-bench 2.0 at 59.3 flat, while beating claude on skillsbench, gpqa diamond, mmmu, and realworldqa.

a 27 billion parameter open weight model matching a frontier proprietary model on agentic coding. let that sit for a second.

pulling weights right now. testing on my 3090 desktop first because that is where the crown lives, then 5090 mobile for the same 24gb class speed story. same quant, same hermes agent, head to head against 3.5-27b dense on same hardware. if this chart holds even half the gain in real agentic runs it's a gamechanger for every builder sitting on a single consumer card.

thank you @alibaba_qwen, this is what open source looks like when a team is serious. the corporate salesmen telling you local ai is not ready yet are getting lapped every week by teams that actually ship. new 27b dense is here. open is winning. the best model for a single 24gb gpu just changed in the middle of my benchmark. data drops soon anon
Qwen@Alibaba_Qwen

🚀 Meet Qwen3.6-27B, our latest dense, open-source model, packing flagship-level coding power! Yes, 27B, and Qwen3.6-27B punches way above its weight. 👇

What's new:
🧠 Outstanding agentic coding — surpasses Qwen3.5-397B-A17B across all major coding benchmarks
💡 Strong reasoning across text & multimodal tasks
🔄 Supports thinking & non-thinking modes
✅ Apache 2.0 — fully open, fully yours

Smaller model. Bigger results. Community's favorite. ❤️ We can't wait to see what you build with Qwen3.6-27B! 👀

🔗👇
Blog: qwen.ai/blog?id=qwen3.…
Qwen Studio: chat.qwen.ai/?models=qwen3.…
Github: github.com/QwenLM/Qwen3.6
Hugging Face: huggingface.co/Qwen/Qwen3.6-2… huggingface.co/Qwen/Qwen3.6-2…
ModelScope: modelscope.cn/models/Qwen/Qw… modelscope.cn/models/Qwen/Qw…

English
23
23
489
22K
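[Editor's note: for anyone wanting to reproduce the single-24GB-card setup, a 27B dense model is roughly 13.5 GB of weights at 4-bit, which fits an RTX 3090 with headroom for KV cache. A minimal load sketch using transformers + bitsandbytes; the repo id is assumed since the Hugging Face links above are truncated, and this is not the poster's quant or Hermes agent harness.]

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumed repo id; the HF links in the quoted tweet are truncated.
MODEL_ID = "Qwen/Qwen3.6-27B"

# 4-bit NF4: ~27B params * 0.5 bytes ≈ 13.5 GB of weights on a 24 GB card.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb, device_map="auto"
)

msgs = [{"role": "user", "content": "Write a shell one-liner that counts TODO comments in a repo."}]
inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```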
Suresh@_Suresh2·
@PatrickMoorhead 2+ pb shared hbm is what i want numbers on. interconnect is where this usually gets ugly.
English
0
0
0
125
Patrick Moorhead@PatrickMoorhead·
Two new TPUs, one for training and one for inference.

TPU 8t is the training box: 9,600 chips per superpod, 2+ PB of shared HBM, 121 exaflops, 2.8x the compute and 2x better perf/watt vs. the prior generation, native FP4 in the MXUs, and Axion Arm hosts. With Pathways and JAX, a single logical training cluster now scales past one million TPUs.

TPU 8i targets inference and reinforcement learning with up to 80% better perf/dollar for low-latency inference and RL vs. the prior TPU generation, SRAM tripled to 384MB, HBM up 50% to 288GB, and a new Collectives Acceleration Engine.

The more interesting move is the network. Google’s Boardfly topology was co-designed with DeepMind to optimize for latency, not bandwidth. That is exactly the right bet for agents, where minimum time-to-response is the customer experience. Workload specialization is the hyperscaler playbook, and Google hinted more than two SKUs per year is plausible going forward.

An underappreciated metric is goodput, not peak FLOPs. At 10,000-chip scale, fail-stop failures and silent data corruption quietly eat training throughput. Google claims more than 97% goodput at that scale.

Google is also introducing NVIDIA VR200 with its Virgo network for the largest clusters. More later. $GOOG $AVGO $NVDA
English
5
29
170
21K
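[Editor's note: some quick arithmetic behind the reply's question about the 2+ PB of shared HBM, using only the figures quoted above.]

```python
# Back-of-envelope per-chip figures from the quoted superpod specs
# (decimal units; "2+ PB" is treated as a 2 PB lower bound).
chips = 9_600
shared_hbm_pb = 2.0
pod_exaflops = 121  # presumably FP4, given the native FP4 MXUs

hbm_per_chip_gb = shared_hbm_pb * 1e6 / chips   # PB -> GB: ~208 GB per chip
flops_per_chip_pf = pod_exaflops * 1e3 / chips  # EF -> PF: ~12.6 PF per chip

print(f"{hbm_per_chip_gb:.0f} GB HBM/chip, {flops_per_chip_pf:.1f} PF/chip")
```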
Suresh@_Suresh2·
@ronoh4 the backend decoupling matters more than it sounds, dependency fights eat so much time
English
0
0
0
2
Philemon Kiprono 🇰🇪
🚀 DSPy 3.2.0 is out!
🔗 BetterTogether chains optimizers: GEPA → BootstrapFinetune → GEPA via strategy strings
🔌 LiteLLM decoupling begins — custom backends, no litellm dep
🛡️ Hardened RLM & PythonInterpreter — structured errors, resilient parsing
Exciting times for #DSPy
English
1
2
9
1K
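[Editor's note: a hedged usage sketch of the BetterTogether strategy-string chaining. The compile signature follows earlier DSPy BetterTogether releases; whether 3.2.0 maps GEPA to the "p" stage exactly this way is an assumption, and the program, metric, and trainset are toy placeholders.]

```python
import dspy
from dspy.teleprompt import BetterTogether

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any configured model

qa = dspy.ChainOfThought("question -> answer")

def metric(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()

trainset = [
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
]

# The strategy string alternates prompt ("p") and weight ("w") optimization,
# mirroring the GEPA -> BootstrapFinetune -> GEPA chain in the release note.
opt = BetterTogether(metric=metric)
optimized = opt.compile(qa, trainset=trainset, strategy="p -> w -> p")
```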
Suresh@_Suresh2·
@Muji___rushi shared scratchpad alone can make them converge fast.
English
0
0
0
21
Mujirushi@Muji___rushi·
A paper showing that having multiple LLM agents debate does not necessarily broaden ideation: depending on the structure, thinking can converge (diversity collapse). The claim is that this stems from "structural coupling," in which interactions between agents unintentionally shrink each individual agent's search space. arxiv.org/pdf/2604.18005
Japanese
2
9
59
3.1K
Suresh@_Suresh2·
@jonas package.json is probably the killer once the deps stop being tiny
English
0
0
0
130
Jonas Templestein
I wish Cloudflare Workers had a biiiiit more memory. I'm hacking on a dynamic application platform where LLMs can write small full-stack apps with source code files and a package.json, and then they just magically run when needed (as durable object facets). I use the v cool @cloudflare/worker-bundler to create the worker bundles. But unfortunately in practice lots of dependencies seem to cause the build process to exceed 128MB RAM (the individual worker limit). E.g. merely importing anything from @cloudflare/agents causes the worker bundler to exceed its memory limit. @dok2001 who can I petition to give us a biiiiit more memory?... I'd pay more than 2x for 2x the memory
English
11
2
51
7.3K
Suresh@_Suresh2·
@helloiamleonie 149m is what stands out to me. much easier to actually deploy.
English
0
0
1
20
Leonie@helloiamleonie·
You just have to appreciate what the LightOn team is doing for the IR community:
• 2 open-source (Apache 2.0) light-weight (149M params) SOTA retriever models
• open data pipeline (pre-training + fine-tuning)
• decontaminated BEIR evaluation
Antoine Chaffin@antoine_chaffin

The new generation of open state-of-the-art single and multi-vector retrieval models is here It's time, DenseOn with the LateOn 🎶 @LightOnIO releases models that leap past existing ones, and everything you need to do the same!

English
4
11
31
1.8K
Suresh@_Suresh2·
@QingQ77 github issues as the trigger is neat. do the rotating reviews ever flag the same bug twice?
English
0
0
0
111
Geek Lite@QingQ77·
Autoresearch - a fully automated software development tool. Driven by GitHub Issues, it has multiple AI agents rotate through writing and cross-reviewing code to close the loop on fully automated software development. github.com/smallnest/auto…

Starting from a GitHub Issue, Autoresearch has the Claude, Codex, and OpenCode agents take turns writing code, cross-reviewing each other, and iterating on fixes; once the review score passes the bar, it automatically opens a PR, merges it, and closes the Issue. It is language-agnostic: Go, Python, Rust, and Java all work. You can customize agent instructions and rules in .autoresearch/, and an interrupted run can be resumed with -c.
Chinese
3
32
154
9.3K
Suresh@_Suresh2·
@testingcatalog projects and agents sound great till auth and memory start fighting each other
English
0
0
0
60
TestingCatalog News 🗞@testingcatalog·
GOOGLE 🚨: GOOGLE LAUNCHES A NEW AGENT PLATFORM FOR GEMINI ENTERPRISE!

Gemini Enterprise users will get access to Projects, Skills, the new Agent Builder, Agents Gallery, Slides editor inside Canvas, and tons of other new features.

> Gemini Enterprise is an end-to-end system for the Agentic Era
> Gemini Enterprise Agent Platform is our new developer platform and evolution of Vertex AI
> Gemini Enterprise app lets teams discover, create, share, and run AI agents in a single, secure environment
> An open partner ecosystem to discover and deploy a wide range of third-party agents from leaders like Oracle, Salesforce, and ServiceNow

Agent Gallery👀
English
18
57
575
30.1K
Suresh@_Suresh2·
@MillieMarconnni 25,000 runs is a lot, but how many held up under peer review?
English
0
0
0
33
Millie Marconi@MillieMarconnni·
🚨SHOCKING: Researchers ran 25,000 AI scientist experiments and discovered something that should end the hype immediately. AI scientists are producing results without doing science.

A team from Friedrich Schiller University Jena and IIT Delhi just published the most comprehensive evaluation of AI research agents ever conducted. Three frontier models. Eight scientific domains. 25,000+ runs.

The finding is devastating. In 68% of traces, the AI gathered evidence and then completely ignored it. In 71% of traces, the AI never updated its beliefs at all. Not once. Only 26% of the time did the AI revise a hypothesis when confronted with contradictory data. Multiple independent lines of evidence brought to bear on a single hypothesis, the most basic feature of rigorous scientific reasoning, occurred in just 7% of traces.

This is not science. This is the performance of science. The AI generates a hypothesis. Runs some experiments. Collects results. Then proceeds as if the results were never there. The researchers call it "evidence non-uptake." You could also call it what it is: a system that cannot learn from what it finds.

Here's what makes this worse. The reasoning failure doesn't change based on what the task demands. Molecular simulation, circuit inference, chemical structure identification, none of it matters. The AI applies the exact same reasoning pattern across every domain regardless of what the problem actually requires. A human scientist adapts. You approach a chemistry identification problem differently than you approach a simulation workflow. The AI doesn't. It runs the same undisciplined loop every time.

The researchers also destroyed the most popular proposed fix: better scaffolding. Everyone building AI research agents has focused on engineering better prompting frameworks, better tool routing, better agent architectures. ReAct, structured tool-calling, chain-of-thought, all of it. The data shows scaffolding accounts for 1.5% of the variance in performance. The base model accounts for 41.4%. No amount of scaffold engineering can fix a model that doesn't know how to think scientifically. You are decorating the outside of a broken foundation.

The paper's conclusion is the part that should concern every lab currently publishing AI scientist results. When AI produces a correct answer through a broken reasoning process, that answer is not scientifically justified. It happened to be right. That is not the same thing as being right for the right reasons. Science is self-correcting because of how it reasons, not just because of its outputs. AI scientists currently have the outputs without the process. Until the reasoning itself becomes a training target, every result produced by an AI scientist cannot be trusted the way a result produced by actual scientific inquiry can be trusted.

25,000 experiments to confirm what the data has been quietly showing for months. The AI is very good at looking like a scientist. It is not yet one.
English
13
15
47
2.9K
Suresh@_Suresh2·
@michael_chomsky 50% faster is nice, but public repo search usually dies on indexing weirdness first
English
1
0
1
15
Michael@michael_chomsky·
I'm building something that makes searching public repos 88% cheaper and around 50% faster for your agents according to (very) early benchmarks. You can use it in Claude Code, Codex, Pi, what have you, and you use your own models/harness. Some teams like OpenCode clone Effect.ts locally so agents can use it more effectively. The goal is to make that less necessary. Looking for 10 people to beta test and give feedback!
English
7
0
16
1.1K
Suresh@_Suresh2·
@j_dekoninck hitting the token cap can look like a regression. did pass@1 move much?
English
1
0
0
48
Jasper Dekoninck@j_dekoninck·
Overall, Opus-4.7 is a slight regression on MathArena compared to Opus-4.6. The reason: the model frequently reaches its max token limits, and the parameter that allowed us to prevent this issue for Opus-4.6 has now been removed...
English
6
5
59
3.4K