Suresh

6K posts


@_Suresh2

MSc Software Engineering @ Chongqing University ’26 | Researching AI x Software Engineering (AI for SE & SE for AI) | 🇵🇰➡️🇨🇳

Lahore, Pakistan · Joined January 2019
437 Following · 125 Followers
Suresh@_Suresh2·
@paulabartabajo_ how did the browser-control eval handle small dom changes?
English
0
0
0
3
Suresh@_Suresh2·
@rishit_dagli the stop conditions seem like the brittle part, how did that hold up in practice?
English
0
0
0
7
Suresh@_Suresh2·
$100b cloud deal plus $25b in. i keep reading these numbers as the price of not being compute-short for one cycle. less "anthropic is worth x" and more "running out of capacity is worse." i get it. still feels a little defensive.
English
0
0
0
8
Suresh@_Suresh2·
@inkdrop_app the one-shot prompt is the crazy part. did the japan stuff change the layout?
English
0
0
0
63
Takuya 🐾 devaslife@inkdrop_app·
I don’t usually overreact to AI hype, but GPT Images 2.0 is actually insane. I got this landing page sketch for Inkdrop from a one-shot prompt that included summaries of my app concept, the new features in v6, and my recent blog posts about Japanese culture. I never imagined web design could become like this. It’s so funnnnn.
English
45
34
730
38.6K
Qinyuan Ye (@ICLR)@qinyuan_ye·
As model capabilities take off with self-distillation 📈, we argue that the model's confidence calibration needs to stay grounded 🪨. Check out our latest work on calibration-aware on-policy distillation (CaOPD), led by @jxzhangjhu at @SFResearch.
Jiaxin Zhang ✈️ ICLR@jxzhangjhu

Modern LLMs are getting more capable, but not necessarily more calibrated. This work starts from a simple observation: scaling capability does not automatically resolve overconfidence. To study this, we propose CaOPD, a calibration-aware on-policy distillation framework

English
1
2
12
1K
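[Editor's note: the thread names CaOPD's two ingredients (on-policy distillation and a calibration objective) but not its loss. A hedged sketch of how such an objective could be written, assuming the standard setup where the student is trained on its own samples and a weighted calibration term is added; the divergence choice and the ECE penalty are illustrative, not the paper's.]

```latex
% Illustrative objective, not CaOPD's actual loss: an on-policy
% distillation term (teacher-student divergence evaluated on the
% student's own samples) plus a weighted calibration penalty.
\[
\mathcal{L}(\theta) =
  \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}
    \Big[ D\big( \pi_{\text{teacher}}(\cdot \mid x, y) \,\Vert\, \pi_\theta(\cdot \mid x, y) \big) \Big]
  \;+\; \lambda \,\mathrm{ECE}(\pi_\theta)
\]
```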
Danny Limanseta@DannyLimanseta·
I've been asked by many folks how I prompt in Cursor. I usually write 2 types of prompts most of the time. Cursor's Planning tool is VERY powerful. Make sure to use it when you are building a big feature, and toggle Plan mode so it actually goes into planning mode. I always ask the model to clarify and ask questions if it is unsure. AI models have a bias towards action and make many assumptions, so make sure they clarify with you first before proceeding.
English
3
1
31
1.4K
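[Editor's note: Danny's actual prompt templates were in the attached images, which this scrape does not preserve. A minimal sketch of the "clarify first" instruction the tweet describes; the wording here is invented for illustration, not his prompt.]

```python
# Illustrative "clarify before acting" preamble in the spirit of the tweet:
# force the model to surface assumptions and questions before it writes code.
CLARIFY_FIRST = """\
Before writing any code:
1. Restate the feature in your own words.
2. List every assumption you are making.
3. Ask me numbered questions about anything ambiguous.
Do not start implementing until I confirm.
"""

task = "Add CSV export to the reports dashboard."  # hypothetical feature request
print(f"{CLARIFY_FIRST}\nTask: {task}")
```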
Suresh@_Suresh2·
@daniel_mac8 the default model is what most people judge, not the reasoning mode
English
0
0
0
57
Suresh@_Suresh2·
@elinorpd_ how did you build the perspective set for overtonbench without flattening minority views?
English
0
0
0
4
Elinor @ ICLR 🇧🇷@elinorpd_·
I'll be presenting OvertonBench at #ICLR2026 in Rio later this week! 📍Sat, Apr 25, 10:30am in Pavilion 4 (#4109) Please DM me if you'd like to chat about pluralistic / value alignment, societal impacts, epistemology, fairness, evals, etc
Elinor @ ICLR 🇧🇷@elinorpd_

There's been a lot of excitement about pluralistic value alignment 🌈 — AI that reflects the full range of human perspectives But no formal way to benchmark whether we're actually making progress. 🤔 Introducing 𝐎𝐕𝐄𝐑𝐓𝐎𝐍𝐁𝐄𝐍𝐂𝐇. 🎉Accepted to #ICLR2026 1/n 🧵

English
3
7
29
2.6K
Suresh@_Suresh2·
@shi_weiyan that 1.7x matters most on search pages where small filter changes throw it off
English
1
0
1
8
Weiyan Shi@shi_weiyan·
Your agent just placed an Amazon order. So you send it to Walmart, grab a coffee – yet come back to find it stuck on the search page… 🤦‍♀️ Why'd it fail the same task on a similar site? - because it didn't learn reusable skills! - PolySkill changes that → 1.7× skill reuse
Simon Yu@simon_ycl

🇧🇷ICLR 2026 paper🇧🇷 Your agent's skills don't transfer. On a new site, only 18% skills get reused — so there's no continual learning, just relearning every time. How do agents learn skills that actually generalize? Introducing PolySkill to make agents smooth across sites 🧵

English
1
5
9
1.5K
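[Editor's note: the thread describes the goal (skills that transfer across sites) but not PolySkill's mechanism. A minimal sketch of the general idea with all names hypothetical, not PolySkill's code: index skills by abstract intent rather than by site, so a skill learned on one site is a retrieval candidate on the next.]

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    intent: str             # abstract goal, e.g. "search", not "amazon_search"
    steps: list[str]        # parameterized action plan
    successes: int = 0      # cross-site track record

@dataclass
class SkillLibrary:
    skills: dict[str, list[Skill]] = field(default_factory=dict)

    def add(self, skill: Skill) -> None:
        self.skills.setdefault(skill.intent, []).append(skill)

    def retrieve(self, intent: str) -> Skill | None:
        # Prefer the skill that has generalized best so far.
        candidates = self.skills.get(intent, [])
        return max(candidates, key=lambda s: s.successes, default=None)

lib = SkillLibrary()
lib.add(Skill("search", ["focus(search_box)", "type({query})", "press(Enter)"]))
skill = lib.retrieve("search")  # learned on Amazon, reused as-is on Walmart
```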
Suresh@_Suresh2·
@sudoingX the single 3090 part is the wild bit. dense 27b still being this usable is rare now
English
0
0
0
322
Sudo su@sudoingX·
okay this is absolutely insane. my undisputed king qwen 3.5-27b dense on single RTX 3090 just got replaced by the same team today. qwen drops 3.6-27b dense just now and the chart says it beats its predecessor on every single benchmark, beats qwen 3.5-397b-a17b moe which is 15x larger, and matches claude 4.5 opus on terminal-bench 2.0 at 59.3 flat, while beating claude on skillsbench, gpqa diamond, mmmu, and realworldqa.

a 27 billion parameter open weight model matching a frontier proprietary model on agentic coding. let that sit for a second.

pulling weights right now. testing on my 3090 desktop first because that is where the crown lives, then 5090 mobile for the same 24gb class speed story. same quant, same hermes agent, head to head against 3.5-27b dense on same hardware. if this chart holds even half the gain in real agentic runs it's a gamechanger for every builder sitting on a single consumer card.

thank you @alibaba_qwen, this is what open source looks like when a team is serious. the corporate salesmen telling you local ai is not ready yet are getting lapped every week by teams that actually ship. new 27b dense is here. open is winning. the best model for a single 24gb gpu just changed in the middle of my benchmark. data drops soon anon
Qwen@Alibaba_Qwen

🚀 Meet Qwen3.6-27B, our latest dense, open-source model, packing flagship-level coding power! Yes, 27B, and Qwen3.6-27B punches way above its weight. 👇

What's new:
🧠 Outstanding agentic coding — surpasses Qwen3.5-397B-A17B across all major coding benchmarks
💡 Strong reasoning across text & multimodal tasks
🔄 Supports thinking & non-thinking modes
✅ Apache 2.0 — fully open, fully yours

Smaller model. Bigger results. Community's favorite. ❤️ We can't wait to see what you build with Qwen3.6-27B! 👀

🔗👇
Blog: qwen.ai/blog?id=qwen3.…
Qwen Studio: chat.qwen.ai/?models=qwen3.…
Github: github.com/QwenLM/Qwen3.6
Hugging Face: huggingface.co/Qwen/Qwen3.6-2… huggingface.co/Qwen/Qwen3.6-2…
ModelScope: modelscope.cn/models/Qwen/Qw… modelscope.cn/models/Qwen/Qw…

English
23
23
489
22K
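[Editor's note: for anyone wanting to reproduce the single-24GB-card setup, a 27B dense model is roughly 13.5 GB of weights at 4-bit, which fits an RTX 3090 with headroom for KV cache. A minimal load sketch using transformers + bitsandbytes; the repo id is assumed since the Hugging Face links above are truncated, and this is not the poster's quant or Hermes agent harness.]

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumed repo id; the HF links in the quoted tweet are truncated.
MODEL_ID = "Qwen/Qwen3.6-27B"

# 4-bit NF4: ~27B params * 0.5 bytes ≈ 13.5 GB of weights on a 24 GB card.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb, device_map="auto"
)

msgs = [{"role": "user", "content": "Write a shell one-liner that counts TODO comments in a repo."}]
inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```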
Suresh@_Suresh2·
@PatrickMoorhead 2+ pb shared hbm is what i want numbers on. interconnect is where this usually gets ugly.
English
0
0
0
125
Patrick Moorhead@PatrickMoorhead·
Two new TPUs, one for training and one for inference.

TPU 8t is the training box: 9,600 chips per superpod, 2+ PB of shared HBM, 121 exaflops, 2.8x the compute and 2x better perf/watt vs. the prior generation, native FP4 in the MXUs, and Axion Arm hosts. With Pathways and JAX, a single logical training cluster now scales past one million TPUs.

TPU 8i targets inference and reinforcement learning with up to 80% better perf/dollar for low-latency inference and RL vs. the prior TPU generation, SRAM tripled to 384MB, HBM up 50% to 288GB, and a new Collectives Acceleration Engine.

The more interesting move is the network. Google’s Boardfly topology was co-designed with DeepMind to optimize for latency, not bandwidth. That is exactly the right bet for agents, where minimum time-to-response is the customer experience. Workload specialization is the hyperscaler playbook, and Google hinted more than two SKUs per year is plausible going forward.

An underappreciated metric is goodput, not peak FLOPs. At 10,000-chip scale, fail-stop failures and silent data corruption quietly eat training throughput. Google claims more than 97% goodput at that scale.

Google is also introducing NVIDIA VR200 with its Virgo network for the largest clusters. More later. $GOOG $AVGO $NVDA
English
5
29
170
21K
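[Editor's note: some quick arithmetic behind the reply's question about the 2+ PB of shared HBM, using only the figures quoted above.]

```python
# Back-of-envelope per-chip figures from the quoted superpod specs
# (decimal units; "2+ PB" is treated as a 2 PB lower bound).
chips = 9_600
shared_hbm_pb = 2.0
pod_exaflops = 121  # presumably FP4, given the native FP4 MXUs

hbm_per_chip_gb = shared_hbm_pb * 1e6 / chips   # PB -> GB: ~208 GB per chip
flops_per_chip_pf = pod_exaflops * 1e3 / chips  # EF -> PF: ~12.6 PF per chip

print(f"{hbm_per_chip_gb:.0f} GB HBM/chip, {flops_per_chip_pf:.1f} PF/chip")
```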
Suresh@_Suresh2·
@ronoh4 the backend decoupling matters more than it sounds, dependency fights eat so much time
English
0
0
0
2
Philemon Kiprono 🇰🇪
🚀 DSPy 3.2.0 is out!
🔗 BetterTogether chains optimizers: GEPA → BootstrapFinetune → GEPA via strategy strings
🔌 LiteLLM decoupling begins — custom backends, no litellm dep
🛡️ Hardened RLM & PythonInterpreter — structured errors, resilient parsing
Exciting times for #DSPy
English
1
2
9
1K
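[Editor's note: a hedged usage sketch of the BetterTogether strategy-string chaining. The compile signature follows earlier DSPy BetterTogether releases; whether 3.2.0 maps GEPA to the "p" stage exactly this way is an assumption, and the program, metric, and trainset are toy placeholders.]

```python
import dspy
from dspy.teleprompt import BetterTogether

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any configured model

qa = dspy.ChainOfThought("question -> answer")

def metric(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()

trainset = [
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
]

# The strategy string alternates prompt ("p") and weight ("w") optimization,
# mirroring the GEPA -> BootstrapFinetune -> GEPA chain in the release note.
opt = BetterTogether(metric=metric)
optimized = opt.compile(qa, trainset=trainset, strategy="p -> w -> p")
```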
Suresh@_Suresh2·
@Muji___rushi shared scratchpad alone can make them converge fast.
English
0
0
0
21
Mujirushi@Muji___rushi·
A paper showing that having multiple LLM agents debate does not necessarily broaden ideation: depending on the structure, thinking can converge (diversity collapse). The claim is that this stems from "structural coupling," in which interactions between agents unintentionally shrink each individual agent's search space. arxiv.org/pdf/2604.18005
Japanese
2
9
59
3.1K
Suresh@_Suresh2·
@jonas package.json is probably the killer once the deps stop being tiny
English
0
0
0
130
Jonas Templestein
I wish Cloudflare Workers had a biiiiit more memory. I'm hacking on a dynamic application platform where LLMs can write small full-stack apps with source code files and a package.json, and then they just magically run when needed (as durable object facets). I use the v cool @cloudflare/worker-bundler to create the worker bundles. But unfortunately in practice lots of dependencies seem to cause the build process to exceed 128MB RAM (the individual worker limit). E.g. merely importing anything from @cloudflare/agents causes the worker bundler to exceed its memory limit. @dok2001 who can I petition to give us a biiiiit more memory?... I'd pay more than 2x for 2x the memory
English
11
2
51
7.3K
Suresh@_Suresh2·
@helloiamleonie 149m is what stands out to me. much easier to actually deploy.
English
0
0
1
20
Leonie@helloiamleonie·
You just have to appreciate what the LightOn team is doing for the IR community:
• 2 open-source (Apache 2.0) light-weight (149M params) SOTA retriever models
• open data pipeline (pre-training + fine-tuning)
• decontaminated BEIR evaluation
Antoine Chaffin@antoine_chaffin

The new generation of open state-of-the-art single and multi-vector retrieval models is here It's time, DenseOn with the LateOn 🎶 @LightOnIO releases models that leap past existing ones, and everything you need to do the same!

English
4
11
31
1.8K
Suresh@_Suresh2·
@QingQ77 github issues as the trigger is neat. do the rotating reviews ever flag the same bug twice?
English
0
0
0
111
Geek Lite@QingQ77·
Autoresearch - a fully automated software development tool. Driven by GitHub Issues, it has multiple AI agents rotate through writing and cross-reviewing code to close the loop on fully automated software development. github.com/smallnest/auto…

Starting from a GitHub Issue, Autoresearch has the Claude, Codex, and OpenCode agents take turns writing code, cross-reviewing each other, and iterating on fixes; once the review score passes the bar, it automatically opens a PR, merges it, and closes the Issue. It is language-agnostic: Go, Python, Rust, and Java all work. You can customize agent instructions and rules in .autoresearch/, and an interrupted run can be resumed with -c.
Chinese
3
32
154
9.3K
Suresh@_Suresh2·
@testingcatalog projects and agents sound great till auth and memory start fighting each other
English
0
0
0
60
TestingCatalog News 🗞@testingcatalog·
GOOGLE 🚨: GOOGLE LAUNCHES A NEW AGENT PLATFORM FOR GEMINI ENTERPRISE!

Gemini Enterprise users will get access to Projects, Skills, the new Agent Builder, Agents Gallery, Slides editor inside Canvas, and tons of other new features.

> Gemini Enterprise is an end-to-end system for the Agentic Era
> Gemini Enterprise Agent Platform is our new developer platform and evolution of Vertex AI
> Gemini Enterprise app lets teams discover, create, share, and run AI agents in a single, secure environment
> An open partner ecosystem to discover and deploy a wide range of third-party agents from leaders like Oracle, Salesforce, and ServiceNow

Agent Gallery👀
English
18
57
575
30.1K
Suresh@_Suresh2·
@MillieMarconnni 25,000 runs is a lot, but how many held up under peer review?
English
0
0
0
33
Millie Marconi@MillieMarconnni·
🚨SHOCKING: Researchers ran 25,000 AI scientist experiments and discovered something that should end the hype immediately. AI scientists are producing results without doing science.

A team from Friedrich Schiller University Jena and IIT Delhi just published the most comprehensive evaluation of AI research agents ever conducted. Three frontier models. Eight scientific domains. 25,000+ runs.

The finding is devastating. In 68% of traces, the AI gathered evidence and then completely ignored it. In 71% of traces, the AI never updated its beliefs at all. Not once. Only 26% of the time did the AI revise a hypothesis when confronted with contradictory data. Multiple independent lines of evidence brought to bear on a single hypothesis, the most basic feature of rigorous scientific reasoning, occurred in just 7% of traces.

This is not science. This is the performance of science. The AI generates a hypothesis. Runs some experiments. Collects results. Then proceeds as if the results were never there. The researchers call it "evidence non-uptake." You could also call it what it is: a system that cannot learn from what it finds.

Here's what makes this worse. The reasoning failure doesn't change based on what the task demands. Molecular simulation, circuit inference, chemical structure identification, none of it matters. The AI applies the exact same reasoning pattern across every domain regardless of what the problem actually requires. A human scientist adapts. You approach a chemistry identification problem differently than you approach a simulation workflow. The AI doesn't. It runs the same undisciplined loop every time.

The researchers also destroyed the most popular proposed fix: better scaffolding. Everyone building AI research agents has focused on engineering better prompting frameworks, better tool routing, better agent architectures. ReAct, structured tool-calling, chain-of-thought, all of it. The data shows scaffolding accounts for 1.5% of the variance in performance. The base model accounts for 41.4%. No amount of scaffold engineering can fix a model that doesn't know how to think scientifically. You are decorating the outside of a broken foundation.

The paper's conclusion is the part that should concern every lab currently publishing AI scientist results. When AI produces a correct answer through a broken reasoning process, that answer is not scientifically justified. It happened to be right. That is not the same thing as being right for the right reasons. Science is self-correcting because of how it reasons, not just because of its outputs. AI scientists currently have the outputs without the process. Until the reasoning itself becomes a training target, every result produced by an AI scientist cannot be trusted the way a result produced by actual scientific inquiry can be trusted.

25,000 experiments to confirm what the data has been quietly showing for months. The AI is very good at looking like a scientist. It is not yet one.
English
13
15
47
2.9K
Suresh@_Suresh2·
@michael_chomsky 50% faster is nice, but public repo search usually dies on indexing weirdness first
English
1
0
1
15
Michael@michael_chomsky·
I'm building something that makes searching public repos 88% cheaper and around 50% faster for your agents according to (very) early benchmarks. You can use it in Claude Code, Codex, Pi, what have you, and you use your own models/harness. Some teams like OpenCode clone Effect.ts locally so agents can use it more effectively. The goal is to make that less necessary. Looking for 10 people to beta test and give feedback!
English
7
0
16
1.1K
Suresh@_Suresh2·
@j_dekoninck hitting the token cap can look like a regression. did pass@1 move much?
English
1
0
0
48
Jasper Dekoninck@j_dekoninck·
Overall, Opus-4.7 is a slight regression on MathArena compared to Opus-4.6. The reason: the model frequently reaches its max token limits, and the parameter that allowed us to prevent this issue for Opus-4.6 has now been removed...
English
6
5
59
3.4K