Suresh

6K posts

@_Suresh2

MSc Software Engineering @ Chongqing University ’26 | Researching AI x Software Engineering (AI for SE & SE for AI) | 🇵🇰➡️🇨🇳

Lahore, Pakistan · Joined January 2019
437 Following · 125 Followers
Suresh@_Suresh2·
@shi_weiyan that 1.7x matters most on search pages where small filter changes throw it off
Weiyan Shi@shi_weiyan·
Your agent just placed an Amazon order. So you send it to Walmart, grab a coffee – yet come back to find it stuck on the search page… 🤦‍♀️ Why'd it fail the same task on a similar site? - because it didn't learn reusable skills! - PolySkill changes that → 1.7× skill reuse
Simon Yu@simon_ycl

🇧🇷ICLR 2026 paper🇧🇷 Your agent's skills don't transfer. On a new site, only 18% skills get reused — so there's no continual learning, just relearning every time. How do agents learn skills that actually generalize? Introducing PolySkill to make agents smooth across sites 🧵

Suresh@_Suresh2·
@sudoingX the single 3090 part is the wild bit. dense 27b still being this usable is rare now
Sudo su@sudoingX·
okay this is absolutely insane. my undisputed king qwen 3.5-27b dense on single RTX 3090 just got replaced by the same team today. qwen drops 3.6-27b dense just now and the chart says it beats its predecessor on every single benchmark, beats qwen 3.5-397b-a17b moe which is 15x larger, and matches claude 4.5 opus on terminal-bench 2.0 at 59.3 flat, while beating claude on skillsbench, gpqa diamond, mmmu, and realworldqa. a 27 billion parameter open weight model matching a frontier proprietary model on agentic coding. let that sit for a second. pulling weights right now. testing on my 3090 desktop first because that is where the crown lives, then 5090 mobile for the same 24gb class speed story. same quant, same hermes agent, head to head against 3.5-27b dense on same hardware. if this chart holds even half the gain in real agentic runs it's a gamechanger for every builder sitting on a single consumer card. thank you @alibaba_qwen, this is what open source looks like when a team is serious. the corporate salesmen telling you local ai is not ready yet are getting lapped every week by teams that actually ship. new 27b dense is here. open is winning. the best model for a single 24gb gpu just changed in the middle of my benchmark. data drops soon anon
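For anyone trying the same single-24GB-card setup, a minimal sketch of one common way to fit a ~27B dense model: 4-bit quantization via transformers + bitsandbytes. The repo id is a placeholder (the Hugging Face links below are truncated), and the quant choice is an assumption, not how the benchmark above was actually run.

```python
# Minimal sketch: loading a ~27B dense model in 4-bit on a single 24 GB GPU.
# The repo id is a placeholder -- the exact Hugging Face path is truncated in
# the announcement links, so substitute the real one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/<qwen3.6-27b-repo>"  # placeholder, not a verified repo name

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # NF4 puts ~27B params near ~14 GB of weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for quality
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                      # place layers on the GPU automatically
)

prompt = "Write a shell one-liner that counts lines of Python in a repo."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```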
Qwen@Alibaba_Qwen

🚀 Meet Qwen3.6-27B, our latest dense, open-source model, packing flagship-level coding power! Yes, 27B, and Qwen3.6-27B punches way above its weight. 👇

What's new:
🧠 Outstanding agentic coding — surpasses Qwen3.5-397B-A17B across all major coding benchmarks
💡 Strong reasoning across text & multimodal tasks
🔄 Supports thinking & non-thinking modes
✅ Apache 2.0 — fully open, fully yours

Smaller model. Bigger results. Community's favorite. ❤️ We can't wait to see what you build with Qwen3.6-27B! 👀

🔗👇
Blog: qwen.ai/blog?id=qwen3.…
Qwen Studio: chat.qwen.ai/?models=qwen3.…
Github: github.com/QwenLM/Qwen3.6
Hugging Face: huggingface.co/Qwen/Qwen3.6-2… huggingface.co/Qwen/Qwen3.6-2…
ModelScope: modelscope.cn/models/Qwen/Qw… modelscope.cn/models/Qwen/Qw…

Suresh@_Suresh2·
@PatrickMoorhead 2+ pb shared hbm is what i want numbers on. interconnect is where this usually gets ugly.
Patrick Moorhead@PatrickMoorhead·
Two new TPUs, one for training and one for inference. TPU 8t is the training box: 9,600 chips per superpod, 2+ PB of shared HBM, 121 exaflops, 2.8x the prior generation and 2x better perf/watt vs. prior gen, native FP4 in the MXUs, and Axion Arm hosts. With Pathways and JAX, a single logical training cluster now scales past one million TPUs. TPU 8i targets inference and reinforcement learning with up to 80% better perf/dollar for low-latency inference and RL vs. the prior TPU generation, SRAM tripled to 384MB, HBM up 50% to 288GB, and a new Collectives Acceleration Engine. The more interesting move is the network. Google’s Boardfly topology was co-designed with DeepMind to optimize for latency, not bandwidth. That is exactly the right bet for agents, where minimum time-to-response is the customer experience. Workload specialization is the hyperscaler playbook, and Google hinted more than two SKUs per year is plausible going forward. An underappreciated metric is goodput, not peak FLOPs. At 10,000-chip scale, fail-stop failures and silent data corruption quietly eat training throughput. Google claims more than 97% goodput at that scale. Google is also introducing NVIDIA VR200 with its Virgo network for the largest clusters. More later. $GOOG $AVGO $NVDA
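Since the reply above asks for numbers on that 2+ PB figure, a quick back-of-envelope using only the chip count and capacity quoted in the post; treat it as illustrative, not a vendor spec.

```python
# Back-of-envelope from the figures quoted above (2+ PB shared HBM, 9,600 chips
# per superpod). Purely illustrative -- not vendor-confirmed numbers.
chips_per_superpod = 9_600
shared_hbm_pb = 2.0                     # "2+ PB", taken as a lower bound

hbm_per_chip_gb = shared_hbm_pb * 1_000_000 / chips_per_superpod
print(f"~{hbm_per_chip_gb:.0f} GB of HBM per chip")   # ~208 GB

# For comparison, the post quotes 288 GB of HBM per chip on the inference part (TPU 8i).
```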
Suresh@_Suresh2·
@ronoh4 the backend decoupling matters more than it sounds, dependency fights eat so much time
Philemon Kiprono 🇰🇪
🚀 DSPy 3.2.0 is out!
🔗 BetterTogether chains optimizers: GEPA → BootstrapFinetune → GEPA via strategy strings
🔌 LiteLLM decoupling begins — custom backends, no litellm dep
🛡️ Hardened RLM & PythonInterpreter — structured errors, resilient parsing
Exciting times for #DSPy
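A rough sketch of what the optimizer-chaining line could look like in code. This assumes the 3.2.0 strategy-string interface resembles existing BetterTogether usage ("p" for the prompt optimizer, "w" for the weight optimizer); argument names and GEPA's metric/reflection setup may differ, so treat it as a shape, not the documented API.

```python
# Rough sketch of chaining optimizers with a BetterTogether-style strategy string.
# Assumes the DSPy 3.2.0 interface resembles earlier BetterTogether usage; exact
# argument names (and GEPA's expected metric/reflection-LM setup) may differ.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any configured LM

def exact_match(example, pred, trace=None):
    return example.answer == pred.answer

trainset = [dspy.Example(question="2+2?", answer="4").with_inputs("question")]
program = dspy.ChainOfThought("question -> answer")

optimizer = dspy.BetterTogether(
    metric=exact_match,
    prompt_optimizer=dspy.GEPA(metric=exact_match),               # the "p" steps
    weight_optimizer=dspy.BootstrapFinetune(metric=exact_match),  # the "w" step
)

# "p -> w -> p" ~ GEPA, then BootstrapFinetune, then GEPA again.
compiled = optimizer.compile(program, trainset=trainset, strategy="p -> w -> p")
```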
Suresh@_Suresh2·
@Muji___rushi shared scratchpad alone can make them converge fast.
Mujirushi@Muji___rushi·
A paper showing that having multiple LLM agents debate doesn't necessarily broaden ideas: depending on the structure, thinking can instead converge (diversity collapse). The claim is that this is driven by "structural coupling," where interactions between agents unintentionally shrink the search space each individual agent explores. arxiv.org/pdf/2604.18005
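To make the structural-coupling mechanism concrete, a toy sketch contrasting independent sampling with agents that all condition on a shared scratchpad (the setup the reply above refers to); `generate` is a stand-in for any LLM call, not a specific API.

```python
# Toy sketch of the coupling the paper describes: when every agent conditions on a
# shared scratchpad, later agents see (and tend to echo) earlier agents' outputs,
# shrinking the effective search space. `generate` is a placeholder for any LLM call.
from typing import Callable

def debate(question: str, n_agents: int, rounds: int,
           generate: Callable[[str], str], shared_scratchpad: bool) -> list[str]:
    scratchpad: list[str] = []
    answers: list[str] = []
    for _ in range(rounds):
        for agent in range(n_agents):
            context = "\n".join(scratchpad) if shared_scratchpad else ""
            prompt = f"{context}\nQuestion: {question}\nAgent {agent}, give a distinct answer:"
            answer = generate(prompt)
            scratchpad.append(f"Agent {agent}: {answer}")
            answers.append(answer)
    return answers

# Diversity can be tracked as the fraction of unique answers per round; with
# shared_scratchpad=True it typically drops faster than with independent sampling.
```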
Suresh@_Suresh2·
@jonas package.json is probably the killer once the deps stop being tiny
Jonas Templestein
I wish cloudflare workers had a biiiiit more memory.

The dynamic worker I'm hacking on: a dynamic application platform where LLMs can write small full-stack apps with source code files and a package.json, and then they just magically run when needed (as durable object facets). I use the v cool @cloudflare/worker-bundler to create the worker bundles.

But unfortunately in practice lots of dependencies seem to cause the build process to exceed 128mb ram (the individual worker limit). E.g. merely importing anything from @cloudflare/agents causes the worker bundler to exceed its memory limit.

@dok2001 who can I petition to give us a biiiiit more memory?... I'd pay more than 2x for 2x the memory
Suresh@_Suresh2·
@helloiamleonie 149m is what stands out to me. much easier to actually deploy.
Leonie@helloiamleonie·
You just have to appreciate what the LightOn team is doing for the IR community: • 2 open-source (Apache 2.0) light-weight (149M params) SOTA retriever models • open data pipeline (pre-training + fine-tuning) • decontaminated BEIR evaluation
Antoine Chaffin@antoine_chaffin

The new generation of open state-of-the-art single and multi-vector retrieval models is here It's time, DenseOn with the LateOn 🎶 @LightOnIO releases models that leap past existing ones, and everything you need to do the same!
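The single- vs multi-vector distinction above comes down to how query and document embeddings are scored. A small numpy sketch with random vectors standing in for real retriever outputs: pooled dot-product scoring on one side, ColBERT-style MaxSim (late interaction) on the other.

```python
# Dense (single-vector) vs late-interaction (multi-vector) scoring, with random
# token embeddings standing in for real retriever outputs.
import numpy as np

rng = np.random.default_rng(0)
q_tokens = rng.standard_normal((8, 128))    # 8 query tokens, 128-dim
d_tokens = rng.standard_normal((50, 128))   # 50 document tokens

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

q_tokens, d_tokens = normalize(q_tokens), normalize(d_tokens)

# Single-vector: pool each side to one embedding, score with a dot product.
dense_score = float(normalize(q_tokens.mean(0)) @ normalize(d_tokens.mean(0)))

# Multi-vector (late interaction): for each query token, take its best-matching
# document token, then sum -- the ColBERT-style MaxSim operator.
late_score = float((q_tokens @ d_tokens.T).max(axis=1).sum())

print(f"dense={dense_score:.3f}  late-interaction={late_score:.3f}")
```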

Suresh@_Suresh2·
@QingQ77 github issues as the trigger is neat. do the rotating reviews ever flag the same bug twice?
Geek Lite@QingQ77·
Autoresearch - a fully automated software development tool. Driven by GitHub Issues, it uses multiple AI agents rotating through cross-review to close the loop on fully automated software development. github.com/smallnest/auto… Autoresearch starts from a GitHub Issue and has Claude, Codex, and OpenCode take turns writing code, cross-reviewing, and iterating on fixes; once the review score threshold is reached it automatically opens a PR, merges it, and closes the Issue. It's language-agnostic: Go, Python, Rust, Java all work. You can customize agent instructions and rules under .autoresearch/, and an interrupted run can be resumed with -c.
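The closed loop described above (issue in, rotating authorship, cross-review, score gate, auto-PR) could be sketched roughly like this. It is an illustration of the workflow, not Autoresearch's actual code; the agent and GitHub calls are placeholders.

```python
# Illustration of the rotate-and-cross-review loop described above -- not
# Autoresearch's implementation. `write_patch`, `review`, and `open_pr` are
# placeholders for calls into the respective agent CLIs and the GitHub API.
AGENTS = ["claude", "codex", "opencode"]
SCORE_THRESHOLD = 8.0       # hypothetical review-score gate
MAX_ITERATIONS = 10

def run_issue(issue_text: str, write_patch, review, open_pr) -> bool:
    patch = None
    for i in range(MAX_ITERATIONS):
        author = AGENTS[i % len(AGENTS)]                    # rotate the author role
        reviewers = [a for a in AGENTS if a != author]      # the others cross-review
        patch = write_patch(author, issue_text, patch)      # write or revise the patch
        scores = [review(r, issue_text, patch) for r in reviewers]
        if min(scores) >= SCORE_THRESHOLD:                  # all reviewers satisfied
            open_pr(patch)                                  # auto PR -> merge -> close issue
            return True
    return False
```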
Suresh@_Suresh2·
@testingcatalog projects and agents sound great till auth and memory start fighting each other
TestingCatalog News 🗞@testingcatalog·
GOOGLE 🚨: GOOGLE LAUNCHES A NEW AGENT PLATFORM FOR GEMINI ENTERPRISE!

Gemini Enterprise users will get access to Projects, Skills, the new Agent Builder, Agents Gallery, Slides editor inside Canvas, and tons of other new features.

> Gemini Enterprise is an end-to-end system for the Agentic Era
> Gemini Enterprise Agent Platform is our new developer platform and evolution of Vertex AI
> Gemini Enterprise app lets teams discover, create, share, and run AI agents in a single, secure environment
> An open partner ecosystem to discover and deploy a wide range of third-party agents from leaders like Oracle, Salesforce, and ServiceNow

Agent Gallery👀
Suresh@_Suresh2·
@MillieMarconnni 25,000 runs is a lot, but how many held up under peer review?
Millie Marconi@MillieMarconnni·
🚨SHOCKING: Researchers ran 25,000 AI scientist experiments and discovered something that should end the hype immediately. AI scientists are producing results without doing science. A team from Friedrich Schiller University Jena and IIT Delhi just published the most comprehensive evaluation of AI research agents ever conducted. Three frontier models. Eight scientific domains. 25,000+ runs. The finding is devastating. In 68% of traces, the AI gathered evidence and then completely ignored it. In 71% of traces, the AI never updated its beliefs at all. Not once. Only 26% of the time did the AI revise a hypothesis when confronted with contradictory data. Multiple independent lines of evidence brought to bear on a single hypothesis, the most basic feature of rigorous scientific reasoning, occurred in just 7% of traces. This is not science. This is the performance of science. The AI generates a hypothesis. Runs some experiments. Collects results. Then proceeds as if the results were never there. The researchers call it "evidence non-uptake." You could also call it what it is: a system that cannot learn from what it finds. Here's what makes this worse. The reasoning failure doesn't change based on what the task demands. Molecular simulation, circuit inference, chemical structure identification, none of it matters. The AI applies the exact same reasoning pattern across every domain regardless of what the problem actually requires. A human scientist adapts. You approach a chemistry identification problem differently than you approach a simulation workflow. The AI doesn't. It runs the same undisciplined loop every time. The researchers also destroyed the most popular proposed fix: better scaffolding. Everyone building AI research agents has focused on engineering better prompting frameworks, better tool routing, better agent architectures. ReAct, structured tool-calling, chain-of-thought, all of it. The data shows scaffolding accounts for 1.5% of the variance in performance. The base model accounts for 41.4%. No amount of scaffold engineering can fix a model that doesn't know how to think scientifically. You are decorating the outside of a broken foundation. The paper's conclusion is the part that should concern every lab currently publishing AI scientist results. When AI produces a correct answer through a broken reasoning process, that answer is not scientifically justified. It happened to be right. That is not the same thing as being right for the right reasons. Science is self-correcting because of how it reasons, not just because of its outputs. AI scientists currently have the outputs without the process. Until the reasoning itself becomes a training target, every result produced by an AI scientist cannot be trusted the way a result produced by actual scientific inquiry can be trusted. 25,000 experiments to confirm what the data has been quietly showing for months. The AI is very good at looking like a scientist. It is not yet one.
Suresh@_Suresh2·
@michael_chomsky 50% faster is nice, but public repo search usually dies on indexing weirdness first
Michael@michael_chomsky·
I'm building something that makes searching public repos 88% cheaper and around 50% faster for your agents according to (very) early benchmarks. You can use it in Claude Code, Codex, Pi, what have you, and you use your own models/harness. Some teams like OpenCode clone Effect.ts locally so agents can use it more effectively. The goal is to make that less necessary. Looking for 10 people to beta test and give feedback!
Suresh@_Suresh2·
@j_dekoninck hitting the token cap can look like a regression. did pass@1 move much?
Jasper Dekoninck@j_dekoninck·
Overall, Opus-4.7 is a slight regression on MathArena compared to Opus-4.6. The reason: the model frequently reaches its max token limits, and the parameter that allowed us to prevent this issue for Opus-4.6 has now been removed...
Suresh@_Suresh2·
@heygurisingh 13 is tiny, but zero still means trust breaks fast on real code
Guri Singh@heygurisingh·
Vibe coders are not going to like this. UC San Diego just published the first real field study of experienced developers using AI agents. They watched 13 of them code in the wild and surveyed 99 more. Zero of them vibe coded. Not one developer "fully gave in to the vibes." Not one trusted the agent to ship. The researchers found the opposite of what every Cursor demo on your timeline implies. Experienced devs plan before they prompt. They load the agent with heavy context. They verify every diff and refuse to merge code they haven't actually read. "Flow and joy" coding, the whole Karpathy vibe coding pitch, got quietly rejected by every professional in the study. They said it's fine for throwaway prototypes. Not for anything that ships. The devs still liked using agents. They just don't let the agent drive. Turns out the people who've shipped software for a decade know something the vibe coding influencers don't. Huang et al., UC San Diego. December 2025. Paper in comments.
Suresh@_Suresh2·
"buys openai shares" is one of those headlines where i immediately want the line item. primary or secondary. employee liquidity or markup theater. i don't even have a take until that part is clear.
Suresh@_Suresh2·
@f14bertolotti grokking is the fun test case, but does the metric shift before the phase change?
Francesco Bertolotti@f14bertolotti·
In this work, the authors treat training as a chaotic dynamical system and derive risk bounds w.r.t. where the weights can go if trained for a long time. They use the theory to define a generalization metric and test it on grokking. Dense but very interesting! 🔗arxiv.org/abs/2604.19740
Suresh@_Suresh2·
@beirmug does lateon still hold #1 after beir decontam, or mainly on this split?
Suresh@_Suresh2·
@vivek_2332 @Apple feasibility is manageable. checking whether the action matched the goal is the messy part.
Vivek@vivek_2332·
been really interested in synthetic environments lately. diving deep into the research, starting with AutoPlay from @Apple. here are my notes:

1. Problem
-> if you want to train UI agents at scale, you need data. lots of it. diverse, feasible, verifiable. human annotation doesn't scale and is expensive.
-> for multimodal, prompting an LLM without showing it the actual app produces hallucinated tasks. it references entities that don't exist and features that work differently than assumed. so what do you do?

2. Solution
-> explore first, then generate.
-> Stage 1: send a multimodal LLM explorer into the app with no specific goal. just click around, open menus, discover features, find what data exists. run multiple rounds with memory so it doesn't repeat itself. output: exploration trajectories showing what the app actually contains.
-> Stage 2: feed those trajectories + task guideline prompts to a task generator. because the generator has seen real screenshots and real data, it produces grounded tasks.

3. Guidelines & Scale
-> steer diversity without manual task writing: feature-use (CRUD), info retrieval, feature composition (multi-step) and subtask repetition.
-> one exploration trajectory can produce many tasks across different guideline categories.
-> 20k tasks across 20 android apps, 10k across 13 ubuntu apps. zero human annotation anywhere.

4. Training & Results
-> SFT on verified trajectories + RL (GRPO) using a multimodal LLM verifier as reward. the verifier just sees screenshots and judges success/failure. no privileged environment access needed.
-> autoplay-3B nearly matches qwen2.5-VL-72B base, and autoplay-72B beats the GPT-4o executor that collected its own training data. student surpasses teacher.
-> RL with the multimodal LLM verifier adds +5.7% on top of SFT. the full pipeline runs end-to-end without humans in the loop.

5. Final Thoughts
-> this is a solid implementation of synthetic env generation for agents. the environment is a real app, the "generation" is structured exploration + grounded task synthesis. the full loop (task gen, execution, verification, SFT/RL) runs without humans.
-> the bigger question: can this generalize beyond UI to coding, browser, and tool-use agents?
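A schematic of the two-stage pipeline from the notes above, with every model call left as a placeholder; this paraphrases the described workflow and is not Apple's implementation.

```python
# Schematic of the explore-then-generate pipeline described above -- not Apple's
# actual AutoPlay code. All callables (explorer, task_generator, executor, verifier)
# are placeholders for multimodal-LLM calls against a real app environment.
def build_dataset(app, explorer, task_generator, executor, verifier,
                  exploration_rounds=3,
                  guidelines=("feature-use", "info-retrieval",
                              "feature-composition", "subtask-repetition")):
    # Stage 1: goal-free exploration with memory, so the explorer keeps finding new screens.
    memory, trajectories = [], []
    for _ in range(exploration_rounds):
        traj = explorer(app, memory)          # screenshots + actions actually observed
        memory.extend(traj)
        trajectories.append(traj)

    # Stage 2: grounded task generation -- the generator only sees real screens and data,
    # which is what keeps tasks from referencing features that don't exist.
    tasks = []
    for traj in trajectories:
        for guideline in guidelines:
            tasks.extend(task_generator(traj, guideline))

    # Execute and keep only trajectories the screenshot-only verifier marks successful;
    # these feed SFT, and the same verifier doubles as the reward signal for RL (GRPO).
    verified = []
    for task in tasks:
        rollout = executor(app, task)
        if verifier(task, rollout.screenshots):
            verified.append((task, rollout))
    return verified
```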
Suresh@_Suresh2·
@VizuaraAI the memory cost is the real bottleneck once the context window gets long
Vizuara@VizuaraAI·
KV cache speeds up LLM inference. Dr. Sreedath Panat explains it stores past Keys and Values so models avoid recomputing context each step, reducing latency. Trade-off: higher memory use. Learn more: inference.vizuara.ai
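The memory trade-off is easy to put numbers on. A back-of-envelope for a hypothetical 32-layer, 32-KV-head, head-dim-128 decoder in fp16 (all assumed values, not any specific model):

```python
# Back-of-envelope KV-cache memory for a hypothetical decoder-only model.
# Per token per layer we store one Key and one Value vector: 2 * n_kv_heads * head_dim values.
def kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                   seq_len=32_768, batch=1, bytes_per_value=2):   # fp16/bf16
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

print(f"{kv_cache_bytes() / 2**30:.1f} GiB")   # ~16 GiB at a 32k context
```

This is why long contexts hit memory before they hit compute: the cache grows linearly with sequence length and batch size, independent of how cheap each decode step becomes.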
Suresh@_Suresh2·
@cellinlab keeping the sprite sheet clean is the hard part, models always invent filler tiles
Cell 细胞@cellinlab·
Prompt: Analyze all the scenes in Contra, then generate the assets used as a pixel-art sprite sheet in one large image. Be careful not to include unrelated elements, so it's easy for me to locate and crop them.