Suresh

6K posts

@_Suresh2

MSc Software Engineering @ Chongqing University ’26 | Researching AI x Software Engineering (AI for SE & SE for AI) | 🇵🇰➡️🇨🇳

Lahore, Pakistan · Joined January 2019
437 Following · 125 Followers
Suresh@_Suresh2·
@shi_weiyan that 1.7x matters most on search pages where small filter changes throw it off
Weiyan Shi@shi_weiyan·
Your agent just placed an Amazon order. So you send it to Walmart, grab a coffee – yet come back to find it stuck on the search page… 🤦‍♀️ Why'd it fail the same task on a similar site? - because it didn't learn reusable skills! - PolySkill changes that → 1.7× skill reuse
Simon Yu@simon_ycl

🇧🇷ICLR 2026 paper🇧🇷 Your agent's skills don't transfer. On a new site, only 18% skills get reused — so there's no continual learning, just relearning every time. How do agents learn skills that actually generalize? Introducing PolySkill to make agents smooth across sites 🧵

Suresh@_Suresh2·
@sudoingX the single 3090 part is the wild bit. dense 27b still being this usable is rare now
Sudo su@sudoingX·
okay this is absolutely insane. my undisputed king qwen 3.5-27b dense on single RTX 3090 just got replaced by the same team today. qwen drops 3.6-27b dense just now and the chart says it beats its predecessor on every single benchmark, beats qwen 3.5-397b-a17b moe which is 15x larger, and matches claude 4.5 opus on terminal-bench 2.0 at 59.3 flat, while beating claude on skillsbench, gpqa diamond, mmmu, and realworldqa. a 27 billion parameter open weight model matching a frontier proprietary model on agentic coding. let that sit for a second. pulling weights right now. testing on my 3090 desktop first because that is where the crown lives, then 5090 mobile for the same 24gb class speed story. same quant, same hermes agent, head to head against 3.5-27b dense on same hardware. if this chart holds even half the gain in real agentic runs it's a gamechanger for every builder sitting on a single consumer card. thank you @alibaba_qwen, this is what open source looks like when a team is serious. the corporate salesmen telling you local ai is not ready yet are getting lapped every week by teams that actually ship. new 27b dense is here. open is winning. the best model for a single 24gb gpu just changed in the middle of my benchmark. data drops soon anon
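For anyone trying the same single-24GB-card setup, a minimal sketch of one common way to fit a ~27B dense model: 4-bit quantization via transformers + bitsandbytes. The repo id is a placeholder (the Hugging Face links below are truncated), and the quant choice is an assumption, not how the benchmark above was actually run.

```python
# Minimal sketch: loading a ~27B dense model in 4-bit on a single 24 GB GPU.
# The repo id is a placeholder -- the exact Hugging Face path is truncated in
# the announcement links, so substitute the real one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/<qwen3.6-27b-repo>"  # placeholder, not a verified repo name

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # NF4 puts ~27B params near ~14 GB of weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for quality
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                      # place layers on the GPU automatically
)

prompt = "Write a shell one-liner that counts lines of Python in a repo."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```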
Qwen@Alibaba_Qwen

🚀 Meet Qwen3.6-27B, our latest dense, open-source model, packing flagship-level coding power! Yes, 27B, and Qwen3.6-27B punches way above its weight. 👇

What's new:
🧠 Outstanding agentic coding — surpasses Qwen3.5-397B-A17B across all major coding benchmarks
💡 Strong reasoning across text & multimodal tasks
🔄 Supports thinking & non-thinking modes
✅ Apache 2.0 — fully open, fully yours

Smaller model. Bigger results. Community's favorite. ❤️ We can't wait to see what you build with Qwen3.6-27B! 👀

🔗👇
Blog: qwen.ai/blog?id=qwen3.…
Qwen Studio: chat.qwen.ai/?models=qwen3.…
Github: github.com/QwenLM/Qwen3.6
Hugging Face: huggingface.co/Qwen/Qwen3.6-2… huggingface.co/Qwen/Qwen3.6-2…
ModelScope: modelscope.cn/models/Qwen/Qw… modelscope.cn/models/Qwen/Qw…

Suresh@_Suresh2·
@PatrickMoorhead 2+ pb shared hbm is what i want numbers on. interconnect is where this usually gets ugly.
Patrick Moorhead@PatrickMoorhead·
Two new TPUs, one for training and one for inference. TPU 8t is the training box: 9,600 chips per superpod, 2+ PB of shared HBM, 121 exaflops, 2.8x the prior generation and 2x better perf/watt vs. prior gen, native FP4 in the MXUs, and Axion Arm hosts. With Pathways and JAX, a single logical training cluster now scales past one million TPUs. TPU 8i targets inference and reinforcement learning with up to 80% better perf/dollar for low-latency inference and RL vs. the prior TPU generation, SRAM tripled to 384MB, HBM up 50% to 288GB, and a new Collectives Acceleration Engine. The more interesting move is the network. Google’s Boardfly topology was co-designed with DeepMind to optimize for latency, not bandwidth. That is exactly the right bet for agents, where minimum time-to-response is the customer experience. Workload specialization is the hyperscaler playbook, and Google hinted more than two SKUs per year is plausible going forward. An underappreciated metric is goodput, not peak FLOPs. At 10,000-chip scale, fail-stop failures and silent data corruption quietly eat training throughput. Google claims more than 97% goodput at that scale. Google is also introducing NVIDIA VR200 with its Virgo network for the largest clusters. More later. $GOOG $AVGO $NVDA
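Since the reply above asks for numbers on that 2+ PB figure, a quick back-of-envelope using only the chip count and capacity quoted in the post; treat it as illustrative, not a vendor spec.

```python
# Back-of-envelope from the figures quoted above (2+ PB shared HBM, 9,600 chips
# per superpod). Purely illustrative -- not vendor-confirmed numbers.
chips_per_superpod = 9_600
shared_hbm_pb = 2.0                     # "2+ PB", taken as a lower bound

hbm_per_chip_gb = shared_hbm_pb * 1_000_000 / chips_per_superpod
print(f"~{hbm_per_chip_gb:.0f} GB of HBM per chip")   # ~208 GB

# For comparison, the post quotes 288 GB of HBM per chip on the inference part (TPU 8i).
```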
Suresh@_Suresh2·
@ronoh4 the backend decoupling matters more than it sounds, dependency fights eat so much time
Philemon Kiprono 🇰🇪
🚀 DSPy 3.2.0 is out!
🔗 BetterTogether chains optimizers: GEPA → BootstrapFinetune → GEPA via strategy strings
🔌 LiteLLM decoupling begins — custom backends, no litellm dep
🛡️ Hardened RLM & PythonInterpreter — structured errors, resilient parsing
Exciting times for #DSPy
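A rough sketch of what the optimizer-chaining line could look like in code. This assumes the 3.2.0 strategy-string interface resembles existing BetterTogether usage ("p" for the prompt optimizer, "w" for the weight optimizer); argument names and GEPA's metric/reflection setup may differ, so treat it as a shape, not the documented API.

```python
# Rough sketch of chaining optimizers with a BetterTogether-style strategy string.
# Assumes the DSPy 3.2.0 interface resembles earlier BetterTogether usage; exact
# argument names (and GEPA's expected metric/reflection-LM setup) may differ.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any configured LM

def exact_match(example, pred, trace=None):
    return example.answer == pred.answer

trainset = [dspy.Example(question="2+2?", answer="4").with_inputs("question")]
program = dspy.ChainOfThought("question -> answer")

optimizer = dspy.BetterTogether(
    metric=exact_match,
    prompt_optimizer=dspy.GEPA(metric=exact_match),               # the "p" steps
    weight_optimizer=dspy.BootstrapFinetune(metric=exact_match),  # the "w" step
)

# "p -> w -> p" ~ GEPA, then BootstrapFinetune, then GEPA again.
compiled = optimizer.compile(program, trainset=trainset, strategy="p -> w -> p")
```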
Suresh@_Suresh2·
@Muji___rushi shared scratchpad alone can make them converge fast.
Mujirushi@Muji___rushi·
A paper showing that having multiple LLM agents debate doesn't necessarily broaden ideas: depending on the structure, thinking can instead converge (diversity collapse). The claim is that this is driven by "structural coupling," where interactions between agents unintentionally shrink the search space each individual agent explores. arxiv.org/pdf/2604.18005
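To make the structural-coupling mechanism concrete, a toy sketch contrasting independent sampling with agents that all condition on a shared scratchpad (the setup the reply above refers to); `generate` is a stand-in for any LLM call, not a specific API.

```python
# Toy sketch of the coupling the paper describes: when every agent conditions on a
# shared scratchpad, later agents see (and tend to echo) earlier agents' outputs,
# shrinking the effective search space. `generate` is a placeholder for any LLM call.
from typing import Callable

def debate(question: str, n_agents: int, rounds: int,
           generate: Callable[[str], str], shared_scratchpad: bool) -> list[str]:
    scratchpad: list[str] = []
    answers: list[str] = []
    for _ in range(rounds):
        for agent in range(n_agents):
            context = "\n".join(scratchpad) if shared_scratchpad else ""
            prompt = f"{context}\nQuestion: {question}\nAgent {agent}, give a distinct answer:"
            answer = generate(prompt)
            scratchpad.append(f"Agent {agent}: {answer}")
            answers.append(answer)
    return answers

# Diversity can be tracked as the fraction of unique answers per round; with
# shared_scratchpad=True it typically drops faster than with independent sampling.
```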
Suresh@_Suresh2·
@jonas package.json is probably the killer once the deps stop being tiny
Jonas Templestein
I wish cloudflare workers had a biiiiit more memory.

The dynamic worker I'm hacking on: a dynamic application platform where LLMs can write small full-stack apps with source code files and a package.json, and then they just magically run when needed (as durable object facets). I use the v cool @cloudflare/worker-bundler to create the worker bundles.

But unfortunately in practice lots of dependencies seem to cause the build process to exceed 128mb ram (the individual worker limit). E.g. merely importing anything from @cloudflare/agents causes the worker bundler to exceed its memory limit.

@dok2001 who can I petition to give us a biiiiit more memory?... I'd pay more than 2x for 2x the memory
Suresh@_Suresh2·
@helloiamleonie 149m is what stands out to me. much easier to actually deploy.
Leonie@helloiamleonie·
You just have to appreciate what the LightOn team is doing for the IR community: • 2 open-source (Apache 2.0) light-weight (149M params) SOTA retriever models • open data pipeline (pre-training + fine-tuning) • decontaminated BEIR evaluation
Antoine Chaffin@antoine_chaffin

The new generation of open state-of-the-art single and multi-vector retrieval models is here It's time, DenseOn with the LateOn 🎶 @LightOnIO releases models that leap past existing ones, and everything you need to do the same!
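The single- vs multi-vector distinction above comes down to how query and document embeddings are scored. A small numpy sketch with random vectors standing in for real retriever outputs: pooled dot-product scoring on one side, ColBERT-style MaxSim (late interaction) on the other.

```python
# Dense (single-vector) vs late-interaction (multi-vector) scoring, with random
# token embeddings standing in for real retriever outputs.
import numpy as np

rng = np.random.default_rng(0)
q_tokens = rng.standard_normal((8, 128))    # 8 query tokens, 128-dim
d_tokens = rng.standard_normal((50, 128))   # 50 document tokens

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

q_tokens, d_tokens = normalize(q_tokens), normalize(d_tokens)

# Single-vector: pool each side to one embedding, score with a dot product.
dense_score = float(normalize(q_tokens.mean(0)) @ normalize(d_tokens.mean(0)))

# Multi-vector (late interaction): for each query token, take its best-matching
# document token, then sum -- the ColBERT-style MaxSim operator.
late_score = float((q_tokens @ d_tokens.T).max(axis=1).sum())

print(f"dense={dense_score:.3f}  late-interaction={late_score:.3f}")
```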

Suresh@_Suresh2·
@QingQ77 github issues as the trigger is neat. do the rotating reviews ever flag the same bug twice?
Geek Lite@QingQ77·
Autoresearch - a fully automated software development tool. Driven by GitHub Issues, it uses multiple AI agents rotating through cross-review to close the loop on fully automated software development. github.com/smallnest/auto… Autoresearch starts from a GitHub Issue and has Claude, Codex, and OpenCode take turns writing code, cross-reviewing, and iterating on fixes; once the review score threshold is reached it automatically opens a PR, merges it, and closes the Issue. It's language-agnostic: Go, Python, Rust, Java all work. You can customize agent instructions and rules under .autoresearch/, and an interrupted run can be resumed with -c.
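The closed loop described above (issue in, rotating authorship, cross-review, score gate, auto-PR) could be sketched roughly like this. It is an illustration of the workflow, not Autoresearch's actual code; the agent and GitHub calls are placeholders.

```python
# Illustration of the rotate-and-cross-review loop described above -- not
# Autoresearch's implementation. `write_patch`, `review`, and `open_pr` are
# placeholders for calls into the respective agent CLIs and the GitHub API.
AGENTS = ["claude", "codex", "opencode"]
SCORE_THRESHOLD = 8.0       # hypothetical review-score gate
MAX_ITERATIONS = 10

def run_issue(issue_text: str, write_patch, review, open_pr) -> bool:
    patch = None
    for i in range(MAX_ITERATIONS):
        author = AGENTS[i % len(AGENTS)]                    # rotate the author role
        reviewers = [a for a in AGENTS if a != author]      # the others cross-review
        patch = write_patch(author, issue_text, patch)      # write or revise the patch
        scores = [review(r, issue_text, patch) for r in reviewers]
        if min(scores) >= SCORE_THRESHOLD:                  # all reviewers satisfied
            open_pr(patch)                                  # auto PR -> merge -> close issue
            return True
    return False
```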
Suresh@_Suresh2·
@testingcatalog projects and agents sound great till auth and memory start fighting each other
TestingCatalog News 🗞@testingcatalog·
GOOGLE 🚨: GOOGLE LAUNCHES A NEW AGENT PLATFORM FOR GEMINI ENTERPRISE!

Gemini Enterprise users will get access to Projects, Skills, the new Agent Builder, Agents Gallery, Slides editor inside Canvas, and tons of other new features.

> Gemini Enterprise is an end-to-end system for the Agentic Era
> Gemini Enterprise Agent Platform is our new developer platform and evolution of Vertex AI
> Gemini Enterprise app lets teams discover, create, share, and run AI agents in a single, secure environment
> An open partner ecosystem to discover and deploy a wide range of third-party agents from leaders like Oracle, Salesforce, and ServiceNow

Agent Gallery👀
Suresh@_Suresh2·
@MillieMarconnni 25,000 runs is a lot, but how many held up under peer review?
Millie Marconi@MillieMarconnni·
🚨SHOCKING: Researchers ran 25,000 AI scientist experiments and discovered something that should end the hype immediately. AI scientists are producing results without doing science. A team from Friedrich Schiller University Jena and IIT Delhi just published the most comprehensive evaluation of AI research agents ever conducted. Three frontier models. Eight scientific domains. 25,000+ runs. The finding is devastating. In 68% of traces, the AI gathered evidence and then completely ignored it. In 71% of traces, the AI never updated its beliefs at all. Not once. Only 26% of the time did the AI revise a hypothesis when confronted with contradictory data. Multiple independent lines of evidence brought to bear on a single hypothesis, the most basic feature of rigorous scientific reasoning, occurred in just 7% of traces. This is not science. This is the performance of science. The AI generates a hypothesis. Runs some experiments. Collects results. Then proceeds as if the results were never there. The researchers call it "evidence non-uptake." You could also call it what it is: a system that cannot learn from what it finds. Here's what makes this worse. The reasoning failure doesn't change based on what the task demands. Molecular simulation, circuit inference, chemical structure identification, none of it matters. The AI applies the exact same reasoning pattern across every domain regardless of what the problem actually requires. A human scientist adapts. You approach a chemistry identification problem differently than you approach a simulation workflow. The AI doesn't. It runs the same undisciplined loop every time. The researchers also destroyed the most popular proposed fix: better scaffolding. Everyone building AI research agents has focused on engineering better prompting frameworks, better tool routing, better agent architectures. ReAct, structured tool-calling, chain-of-thought, all of it. The data shows scaffolding accounts for 1.5% of the variance in performance. The base model accounts for 41.4%. No amount of scaffold engineering can fix a model that doesn't know how to think scientifically. You are decorating the outside of a broken foundation. The paper's conclusion is the part that should concern every lab currently publishing AI scientist results. When AI produces a correct answer through a broken reasoning process, that answer is not scientifically justified. It happened to be right. That is not the same thing as being right for the right reasons. Science is self-correcting because of how it reasons, not just because of its outputs. AI scientists currently have the outputs without the process. Until the reasoning itself becomes a training target, every result produced by an AI scientist cannot be trusted the way a result produced by actual scientific inquiry can be trusted. 25,000 experiments to confirm what the data has been quietly showing for months. The AI is very good at looking like a scientist. It is not yet one.
Suresh@_Suresh2·
@michael_chomsky 50% faster is nice, but public repo search usually dies on indexing weirdness first
Michael@michael_chomsky·
I'm building something that makes searching public repos 88% cheaper and around 50% faster for your agents according to (very) early benchmarks. You can use it in Claude Code, Codex, Pi, what have you, and you use your own models/harness. Some teams like OpenCode clone Effect.ts locally so agents can use it more effectively. The goal is to make that less necessary. Looking for 10 people to beta test and give feedback!
Suresh@_Suresh2·
@j_dekoninck hitting the token cap can look like a regression. did pass@1 move much?
Jasper Dekoninck@j_dekoninck·
Overall, Opus-4.7 is a slight regression on MathArena compared to Opus-4.6. The reason: the model frequently reaches its max token limits, and the parameter that allowed us to prevent this issue for Opus-4.6 has now been removed...
Suresh@_Suresh2·
@heygurisingh 13 is tiny, but zero still means trust breaks fast on real code
Guri Singh@heygurisingh·
Vibe coders are not going to like this. UC San Diego just published the first real field study of experienced developers using AI agents. They watched 13 of them code in the wild and surveyed 99 more. Zero of them vibe coded. Not one developer "fully gave in to the vibes." Not one trusted the agent to ship. The researchers found the opposite of what every Cursor demo on your timeline implies. Experienced devs plan before they prompt. They load the agent with heavy context. They verify every diff and refuse to merge code they haven't actually read. "Flow and joy" coding, the whole Karpathy vibe coding pitch, got quietly rejected by every professional in the study. They said it's fine for throwaway prototypes. Not for anything that ships. The devs still liked using agents. They just don't let the agent drive. Turns out the people who've shipped software for a decade know something the vibe coding influencers don't. Huang et al., UC San Diego. December 2025. Paper in comments.
Suresh@_Suresh2·
"buys openai shares" is one of those headlines where i immediately want the line item. primary or secondary. employee liquidity or markup theater. i don't even have a take until that part is clear.
Suresh@_Suresh2·
@f14bertolotti grokking is the fun test case, but does the metric shift before the phase change?
Francesco Bertolotti@f14bertolotti·
In this work, the authors treat training as a chaotic dynamical system and derive risk bounds w.r.t. where the weights can go if trained for a long time. They use the theory to define a generalization metric and test it on grokking. Dense but very interesting! 🔗arxiv.org/abs/2604.19740
Suresh@_Suresh2·
@beirmug does lateon still hold #1 after beir decontam, or mainly on this split?
Suresh@_Suresh2·
@vivek_2332 @Apple feasibility is manageable. checking whether the action matched the goal is the messy part.
Vivek@vivek_2332·
been really interested in synthetic environments lately. diving deep into the research, starting with AutoPlay from @Apple. here are my notes:

1. Problem
-> if you want to train UI agents at scale, you need data. lots of it. diverse, feasible, verifiable. human annotation doesn't scale and is expensive.
-> for multimodal, prompting an LLM without showing it the actual app produces hallucinated tasks. it references entities that don't exist and features that work differently than assumed. so what do you do?

2. Solution
-> explore first, then generate.
-> Stage 1: send a multimodal LLM explorer into the app with no specific goal. just click around, open menus, discover features, find what data exists. run multiple rounds with memory so it doesn't repeat itself. output: exploration trajectories showing what the app actually contains.
-> Stage 2: feed those trajectories + task guideline prompts to a task generator. because the generator has seen real screenshots and real data, it produces grounded tasks.

3. Guidelines & Scale
-> steer diversity without manual task writing: feature-use (CRUD), info retrieval, feature composition (multi-step) and subtask repetition.
-> one exploration trajectory can produce many tasks across different guideline categories.
-> 20k tasks across 20 android apps, 10k across 13 ubuntu apps. zero human annotation anywhere.

4. Training & Results
-> SFT on verified trajectories + RL (GRPO) using a multimodal LLM verifier as reward. the verifier just sees screenshots and judges success/failure. no privileged environment access needed.
-> autoplay-3B nearly matches qwen2.5-VL-72B base, and autoplay-72B beats the GPT-4o executor that collected its own training data. student surpasses teacher.
-> RL with the multimodal LLM verifier adds +5.7% on top of SFT. the full pipeline runs end-to-end without humans in the loop.

5. Final Thoughts
-> this is a solid implementation of synthetic env generation for agents. the environment is a real app, the "generation" is structured exploration + grounded task synthesis. the full loop (task gen, execution, verification, SFT/RL) runs without humans.
-> the bigger question: can this generalize beyond UI to coding, browser, and tool-use agents?
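A schematic of the two-stage pipeline from the notes above, with every model call left as a placeholder; this paraphrases the described workflow and is not Apple's implementation.

```python
# Schematic of the explore-then-generate pipeline described above -- not Apple's
# actual AutoPlay code. All callables (explorer, task_generator, executor, verifier)
# are placeholders for multimodal-LLM calls against a real app environment.
def build_dataset(app, explorer, task_generator, executor, verifier,
                  exploration_rounds=3,
                  guidelines=("feature-use", "info-retrieval",
                              "feature-composition", "subtask-repetition")):
    # Stage 1: goal-free exploration with memory, so the explorer keeps finding new screens.
    memory, trajectories = [], []
    for _ in range(exploration_rounds):
        traj = explorer(app, memory)          # screenshots + actions actually observed
        memory.extend(traj)
        trajectories.append(traj)

    # Stage 2: grounded task generation -- the generator only sees real screens and data,
    # which is what keeps tasks from referencing features that don't exist.
    tasks = []
    for traj in trajectories:
        for guideline in guidelines:
            tasks.extend(task_generator(traj, guideline))

    # Execute and keep only trajectories the screenshot-only verifier marks successful;
    # these feed SFT, and the same verifier doubles as the reward signal for RL (GRPO).
    verified = []
    for task in tasks:
        rollout = executor(app, task)
        if verifier(task, rollout.screenshots):
            verified.append((task, rollout))
    return verified
```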
Suresh@_Suresh2·
@VizuaraAI the memory cost is the real bottleneck once the context window gets long
Vizuara@VizuaraAI·
KV cache speeds up LLM inference. Dr. Sreedath Panat explains it stores past Keys and Values so models avoid recomputing context each step, reducing latency. Trade-off: higher memory use. Learn more: inference.vizuara.ai
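The memory trade-off is easy to put numbers on. A back-of-envelope for a hypothetical 32-layer, 32-KV-head, head-dim-128 decoder in fp16 (all assumed values, not any specific model):

```python
# Back-of-envelope KV-cache memory for a hypothetical decoder-only model.
# Per token per layer we store one Key and one Value vector: 2 * n_kv_heads * head_dim values.
def kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                   seq_len=32_768, batch=1, bytes_per_value=2):   # fp16/bf16
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

print(f"{kv_cache_bytes() / 2**30:.1f} GiB")   # ~16 GiB at a 32k context
```

This is why long contexts hit memory before they hit compute: the cache grows linearly with sequence length and batch size, independent of how cheap each decode step becomes.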
Suresh@_Suresh2·
@cellinlab keeping the sprite sheet clean is the hard part, models always invent filler tiles
Cell 细胞@cellinlab·
Prompt: Analyze all the scenes in Contra, then generate the assets used as a pixel-art sprite sheet in one large image. Be careful not to include unrelated elements, so it's easy for me to locate and crop them.