Piji Li
6.5K posts
@pijili
Natural Language Processing.
Shenzhen · Joined November 2009
2.7K Following · 989 Followers
Piji Li retweeted
DAIR.AI @dair_ai:
The Top AI Papers of the Week (April 6 - 12):
- Memento
- Neural Computers
- The Universal Verifier
- Agent Skills in the Wild
- Memory Intelligence Agent (MIA)
- Single-Agent vs Multi-Agent LLMs
- Scaling Coding Agents via Atomic Skills
Read on for more:
DAIR.AI @dair_ai:
x.com/i/article/2042…
17 replies · 51 reposts · 375 likes · 55.2K views
Piji Li retweeted
𒐪 @SHL0MS:
introducing Autoreason, a reasoning method inspired by @karpathy's AutoResearch, which extends the strategy to subjective domains. the paper was co-written with Hermes Agent by @NousResearch, using a research-paper-writing skill developed while writing it. paper + results below
36 replies · 136 reposts · 1.3K likes · 268.5K views
Piji Li retweeted
Nous Research @NousResearch:
Good news for users in China: Hermes Agent now natively supports personal WeChat. Scan a QR code in WeChat to connect; both private and group chats are supported. Images, video, files, and voice messages are all covered, with a direct long-polling connection and no public IP required. Run 'hermes update' to try it. Docs: hermes-agent.nousresearch.com/docs/user-guid… Thanks to @Bravohenry_ for the contribution.
324 replies · 559 reposts · 3.3K likes · 589.6K views
Piji Li retweeted
Yuchen Jin @Yuchenj_UW:
MiniMax M2.7 is open-source! The most interesting part of its blog is the "model's self-evolution." It's essentially Karpathy's Autoresearch. They did two things:

1. They built a research agent to collaborate with their RL team: "A researcher starts by discussing an experimental idea with the agent, which helps with literature review, tracks a pre-set experiment spec, pipelines data and other artifacts, and launches experiments. During the experiments, the agent monitors and profiles progress and automatically triggers log reading, debugging, metric analysis, code fixes, merge requests, and smoke tests, identifying and configuring subtle yet key changes." "M2.7 is capable of handling 30%–50% of the workflow."

2. The model rewrites its own harness: MiniMax had M2.7 improve coding performance on an internal scaffold, running a fully automated loop for 100+ rounds. Internal evals improved by 30%.

I'm pretty sure every AI lab is doing Autoresearch in some way internally.
MiniMax (official) @MiniMax_AI:
We're delighted to announce that MiniMax M2.7 is now officially open source, with SOTA performance on SWE-Pro (56.22%) and Terminal Bench 2 (57.0%). You can find it on Hugging Face now. Enjoy!🤗 Hugging Face: huggingface.co/MiniMaxAI/Mini… Blog: minimax.io/news/minimax-m… MiniMax API: platform.minimax.io

34 replies · 88 reposts · 753 likes · 84.7K views
Piji Li retweeted
Gary Marcus @GaryMarcus:
Claude Code is not AGI, but it is the single biggest advance in AI since the LLM. But the thing is, Claude Code is NOT a pure LLM. And it's not pure deep learning. Not even close. And that changes everything. The source code leak proves it.

Tucked away at its center is a 3,167-line kernel called print.ts. print.ts is a pattern matcher. And pattern matching is supposed to be the *strength* of LLMs. But Anthropic figured out that if you really need to get your patterns right, you can't trust a pure LLM. They are too probabilistic. And too erratic.

Instead, the way Anthropic built that kernel is straight out of classical symbolic AI. For example, it is in large part a big IF-THEN conditional, with 486 branch points and 12 levels of nesting — all inside a deterministic, symbolic loop that the real godfathers of AI, people like John McCarthy and Marvin Minsky and Herb Simon, would have instantly recognized.*

Putting it differently: when push came to shove, Anthropic went exactly where I long said the field needed to go (and where @geoffreyhinton said we didn't need to go): to Neurosymbolic AI. That's right, the biggest advance since the LLM was neurosymbolic. AlphaFold, AlphaEvolve, AlphaProof, and AlphaGeometry are all neurosymbolic, too; so is Code Interpreter; when you are calling code, you are asking symbolic AI to do an important part of the work.

Claude Code isn't better because of scaling. It's better because Anthropic accepted the importance of using classical AI techniques alongside neural networks — precisely the marriage I have long advocated. It's *massive* vindication for me (go see my 2019 debate with Bengio for context, or my 2001 book, The Algebraic Mind), but it still ain't perfect, or even close.

What we really need to do to get trustworthy AI, rather than the current unpredictable "jagged" mess, is to go in the knowledge-, reasoning-, and world-model-driven direction I laid out in 2020 in an article called The Next Decade in AI, in which neurosymbolic AI is just the *starting point* in a longer journey.* Read that article if you want to know what else we need to do next. The first part has already come to pass. In time, the other three will, too.

Meanwhile, the implications for the allocation of capital are pretty massive: smartly adding in bits of symbolic AI can do a lot more than scaling alone, and as even Anthropic has now discovered (though they won't say it), scaling is no longer the essence of innovation. The paradigm has changed.

— *Claude Code is plainly neurosymbolic, but the code part is a mess; as Ernie Davis and I argued in Rebooting AI in 2019, we also need major advances in software engineering. But that's a story for another day.
178 replies · 525 reposts · 2.9K likes · 573.9K views
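The "deterministic symbolic loop around a probabilistic model" idea in the thread above can be sketched in a few lines. This is a hypothetical illustration, not Anthropic's code: every name here (`dispatch`, `llm_stub`, the event shapes) is invented, and the real print.ts is far larger. The point is only that rule-covered cases take the same branch every time, and the model is consulted solely where rules run out.

```python
def llm_stub(prompt: str) -> str:
    """Stand-in for a probabilistic model call (invented for this sketch)."""
    return f"<model answer for: {prompt}>"

def dispatch(event: dict) -> str:
    """Deterministic IF-THEN kernel: same input, same branch, every time."""
    kind = event.get("kind")
    if kind == "slash_command":
        if event["name"] == "help":
            return "usage: ..."
        return f"unknown command: {event['name']}"
    elif kind == "tool_result":
        if event.get("exit_code", 0) != 0:
            # Only the genuinely ambiguous case is delegated to the model.
            return llm_stub(f"explain this failure: {event.get('stderr', '')}")
        return "ok"
    else:
        # Free-form input falls through to the neural side.
        return llm_stub(event.get("text", ""))
```

The design choice being illustrated: the symbolic layer guarantees reproducible handling of the cases it covers, while the neural layer absorbs everything unstructured.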
Piji Li retweeted
Yuandong Tian @tydsh:
Our work on post-training models for parallel thinking (ThreadWeaver) is now open-sourced! Our Data Gen/SFT/RL recipes are now fully open 😀. The idea is to 1️⃣ rewrite sequential thinking traces to be parallel with LLMs, 2️⃣ design efficient kernels for training/inference, and 3️⃣ smartly design the reward signal for RL. Thanks @LongTonyLian and @VictoriaLinML for the great work!
Long Lian @LongTonyLian:

Our parallel reasoning project ThreadWeaver is now open-sourced 🎉! Check out our Data Gen/SFT/RL recipe at github.com/facebookresear… In case you don't know, ThreadWeaver 🧵⚡️ is the first parallel reasoning method to achieve comparable reasoning performance to widely-used sequential long-CoT LLMs, with up to 3x speedup across 6 challenging tasks.

3 replies · 24 reposts · 239 likes · 30.4K views
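The parallel-then-join structure described above (independent reasoning threads rolled out concurrently, then merged) can be sketched minimally. This is not the ThreadWeaver implementation: `solve_subproblem` and `parallel_think` are invented stand-ins, and a real system would run LLM rollouts with custom kernels rather than Python threads.

```python
from concurrent.futures import ThreadPoolExecutor

def solve_subproblem(sub: str) -> str:
    # Stand-in for one parallel thinking thread (e.g. one model rollout).
    return f"result({sub})"

def parallel_think(subproblems, max_workers=4):
    # Fan out: each subproblem is explored independently, in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(solve_subproblem, subproblems))
    # Join: a sequential step combines the threads' conclusions.
    # pool.map preserves input order, so the merge is deterministic.
    return " + ".join(partials)
```

The speedup claim in the quoted tweet comes from exactly this shape: wall-clock time is bounded by the slowest thread plus the join, not the sum of all threads.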
Piji Li retweeted
Ai2 @allen_ai:
You can now train, adapt, and eval web agents on your own tasks. We're releasing the full MolmoWeb codebase—the training code, eval harness, annotation tooling, synthetic data pipeline, & client-side code for our demo. 🧵
3 replies · 42 reposts · 228 likes · 25.9K views
Piji Li retweeted
Dawn Song @dawnsongtweets:
x.com/MogicianTony/s… 🧵 1/ Our agent Terminator-1 scored ~100% on 8 major AI agent benchmarks (e.g., SWE-bench Verified & Pro, Terminal-Bench), beating Claude Mythos. It solved 0 tasks. Benchmarks are the field's shared language for measuring AI progress. Our new work shows that language is broken. Here's how.
Hao Wang @MogicianTony:

SWE-bench Verified and Terminal-Bench—two of the most cited AI benchmarks—can be reward-hacked with simple exploits. Our agent scored 100% on both. It solved 0 tasks. Evaluate the benchmark before it evaluates your agent. If you’re picking models by leaderboard score alone, you’re optimizing for the wrong thing. 🧵

19 replies · 49 reposts · 326 likes · 81.1K views
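The reward-hacking failure mode described above has a simple mechanical core: a verifier that only checks the test command's exit status can be "solved" by an agent that rewrites the tests instead of the code. The sketch below is illustrative only (both verifier functions and the file layout are invented, not the paper's exploit); one cheap hardening is to refuse to score runs whose test files changed.

```python
import hashlib

def naive_verify(run_tests) -> bool:
    # Outcome-only check: exploitable, since the agent controls the repo
    # and can gut the tests so they trivially pass.
    return run_tests() == 0

def hardened_verify(run_tests, test_files: dict, baseline: dict) -> bool:
    # Refuse to score runs where any test file was modified.
    for path, content in test_files.items():
        digest = hashlib.sha256(content.encode()).hexdigest()
        if digest != baseline.get(path):
            return False
    return run_tests() == 0
```

This only closes one hole (test tampering); the broader point of the thread is that benchmark harnesses need the same adversarial scrutiny as the agents they score.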
Piji Li retweeted
elvis @omarsar0:
NEW paper from Meta. (bookmark this one)

What if the model wasn't just using the computer, but became the computer? New research from Meta AI and KAUST makes a serious case for Neural Computers (NCs). The paper proposes NCs as learned runtimes where computation, memory, and I/O live inside a single latent state. Their first prototypes use video models to roll out terminal and GUI interfaces from prompts, pixels, and user actions.

Why does it matter? Today's agents still depend on external computers to store state, execute actions, and enforce system contracts. Neural Computers point to a different machine form: one where interface dynamics, working memory, and execution are learned together.

The early results are promising but grounded. CLI rendering improves, GUI cursor control reaches 98.7% with explicit visual supervision, and reprompting boosts arithmetic-probe accuracy from 4% to 83%. But symbolic reliability, stable reuse, and runtime governance remain open. This is less "agents got better" and more "what comes after agents as a computing substrate?"

Paper: arxiv.org/abs/2604.06425
Learn to build effective AI agents in our academy: academy.dair.ai
16 replies · 91 reposts · 497 likes · 57.9K views
Piji Li retweeted
Dan McAteer @daniel_mac8:
babe, wake up. a new form of continual learning just dropped. > in-place test-time training.
13 replies · 100 reposts · 748 likes · 88.1K views
Piji Li retweeted
Akshay 🚀 @akshay_pachaar:
this is one of the most important ideas in AI right now, and it just got two independent validations.

yesterday, Anthropic shipped an "advisor tool" in the Claude API that lets Sonnet or Haiku consult Opus mid-task, only when the executor needs help. the benefit is straightforward: you get near Opus-level intelligence on the hard decisions while paying Sonnet or Haiku rates for everything else. frontier reasoning only kicks in when it's actually needed, not on every token.

back in February, UC Berkeley published a paper called "Advisor Models" that trains a small 7B model with RL to generate per-instance advice for a frozen black-box model. same idea, two very different implementations.

the paper's approach: take Qwen2.5 7B, train it with GRPO to generate natural language advice, and inject that advice into the prompt of a black-box model. the black-box model never changes; the advisor learns what to say to make it perform better. GPT-5 scores 31.2% on a tax-filing benchmark. add the trained advisor, and it jumps to 53.6%. on SWE agent tasks, a trained advisor cuts Gemini 3 Pro's steps from 31.7 to 26.3 while keeping the same resolve rate. training is cheap too: you train with GPT-4o Mini, then swap in GPT-5 at inference. the advisor even transfers across families: a GPT-trained advisor improves Claude 4.5 Sonnet.

Anthropic's advisor tool takes a different path to the same idea. Sonnet runs as executor and handles tools and iteration. when it hits something it can't resolve, it consults Opus, gets a plan or correction, and continues. Sonnet with Opus as advisor gained 2.7 points on SWE-bench Multilingual over Sonnet alone, while costing 11.9% less per task. Haiku with Opus scored 41.2% on BrowseComp, more than double its solo 19.7%. it's a one-line API change. advisor tokens bill at Opus rates, but the advisor typically generates only 400-700 tokens per call, so blended cost stays well below running Opus end-to-end.

both approaches point at the same thing: you don't need the most powerful model on every token. you need it at the right moments, for the right inputs.

Paper: arxiv.org/abs/2510.02453
Code: github.com/az1326/advisor…
Claude @claudeai:

We're bringing the advisor strategy to the Claude Platform. Pair Opus as an advisor with Sonnet or Haiku as an executor, and get near Opus-level intelligence in your agents at a fraction of the cost.

23 replies · 83 reposts · 418 likes · 42.7K views
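The executor/advisor control flow described above is easy to state in code. This is a hedged sketch of the pattern, not Anthropic's API or the Berkeley training recipe: `executor`, `advisor`, and `run` are invented stubs that only show the escalation logic (the cheap model handles every step, and the expensive model is consulted only when the cheap one gives up).

```python
def executor(step):
    """Cheap model stub: returns None when it can't resolve a step."""
    return None if "hard" in step else f"done: {step}"

def advisor(step):
    """Expensive frontier-model stub, consulted only on escalation."""
    return f"plan for: {step}"

def run(steps):
    out, escalations = [], 0
    for step in steps:
        result = executor(step)
        if result is None:
            # Escalate: ask the advisor for a plan, then continue.
            escalations += 1
            result = f"done with advice ({advisor(step)})"
        out.append(result)
    return out, escalations
```

The cost argument in the tweet falls out of this shape: total spend is (executor rate × all tokens) + (advisor rate × the few escalation calls), which stays well below running the frontier model on every step.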
Piji Li retweeted
elvis @omarsar0:
NEW paper from Microsoft

Every agent benchmark has the same hidden problem: how do you know the agent actually succeeded? Microsoft researchers introduce the Universal Verifier, which discusses lessons learned from building best-in-class verifiers for web tasks.

It's built on four principles: non-overlapping rubrics, separate process vs. outcome rewards, distinguishing controllable from uncontrollable failures, and divide-and-conquer context management across full screenshot trajectories. It reduces false positive rates to near zero, down from 45%+ (WebVoyager) and 22%+ (WebJudge). Without reliable verifiers, both benchmarks and training data are corrupted.

One interesting finding is that an auto-research agent reached 70% of expert verifier quality in 5% of the time, but couldn't discover the structural design decisions that drove the biggest gains. Human expertise and automated optimization play complementary roles.

Paper: arxiv.org/abs/2604.06240
Learn to build effective AI agents in our academy: academy.dair.ai
13 replies · 68 reposts · 336 likes · 36.4K views
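Two of the four principles above (non-overlapping rubrics, separate process vs. outcome rewards) can be illustrated with a tiny scorer. This is an assumption-laden sketch, not the paper's verifier: the `score` function, the rubric schema, and the example items are all invented to show the separation, where each rubric item checks one thing and contributes to exactly one reward channel.

```python
def score(trajectory: dict, rubric: list) -> dict:
    """Score a trajectory against non-overlapping rubric items,
    keeping process and outcome rewards in separate channels."""
    process, outcome = 0.0, 0.0
    for item in rubric:
        hit = item["check"](trajectory)  # each check covers one criterion
        if item["kind"] == "process":
            process += item["weight"] * hit
        else:
            outcome += item["weight"] * hit
    return {"process": process, "outcome": outcome}

# Hypothetical rubric for a web task (items chosen not to overlap).
rubric = [
    {"kind": "process", "weight": 0.5,
     "check": lambda t: t["used_search"]},
    {"kind": "outcome", "weight": 1.0,
     "check": lambda t: t["final_url"] == t["target_url"]},
]
```

Keeping the channels separate means a trajectory that followed a good process but hit an uncontrollable failure is distinguishable from one that lucked into the right outcome.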
Piji Li retweeted
Zhuoran Yang @zhuoran_yang:
When I teach, I prepare my lecture notes by writing on Goodnotes, then export them as PDFs and share them with the class. Recently I asked Claude Code to read these PDF files and convert them into markdown. This is a test I have constantly given to new LLMs, and now Claude is good enough to pass it. I converted my graduate-level RL theory course into markdown and posted the lecture notes here: zhuoranyang.github.io/sds685-notes/ When preparing the course, I learned a lot from the wonderful courses offered by @CsabaSzepesvari @chijinML @WenSun1 @nanjiang_cs and borrowed some good stuff from their notes. I also discussed what to teach in an RL course heavily with @zhaoran_wang and stole many of his insights.
5 replies · 60 reposts · 480 likes · 74K views
Piji Li retweeted
Robert Youssef @rryssf_:
🚨 BREAKING: OpenAI Deep Research takes 4x longer than InfoSeeker and still loses on accuracy. Gemini Deep Research takes 3x longer and still loses. Huawei Noah's Ark Lab just published the architecture that beats both: parallel workers, isolated contexts, no cascading errors.

Every AI research agent running today has the same structural problem. It reads web pages one at a time. It passes everything through a single context window that fills up. When an early step produces a bad result, every subsequent step inherits that error. The more sources a task requires, the worse sequential agents perform, because compounding failures accumulate faster than useful information.

Huawei Noah's Ark Lab, the University of Liverpool, and UCL built InfoSeeker to fix all three failure modes simultaneously. The architecture has three layers:
→ A strategic Host maintains a compressed global plan. It never sees raw search results or tool outputs, only concise summaries of what each step accomplished.
→ Domain-specific Managers decompose high-level directives into parallel subtasks, dispatch them to Workers simultaneously, validate results, and return a single clean summary upward.
→ Workers execute atomic tool interactions (web searches, browser navigation, file operations, code execution) in parallel, keeping full execution traces locally and never polluting the layers above.

Each layer sees only what it needs. Errors stay contained. Context never saturates.

The parallelism result should alarm every team building sequential agents:
→ 1 worker: 911 seconds per task
→ 17 workers: 162 seconds per task
→ 5.7x speedup from architecture alone: no better model, no bigger context window

OpenAI Deep Research requires 3.3x more time on WideSearch. Still loses. Gemini Deep Research requires 2.6x more time on WideSearch. Still loses.

On BrowseComp-zh, multi-hop reasoning across Chinese web pages:
→ OpenAI Deep Research: 3.9x slower than InfoSeeker
→ Gemini Deep Research: 4.6x slower than InfoSeeker

The numbers across both benchmarks:
→ WideSearch success rate: InfoSeeker 8.38% vs. best competing system 5.10% (a 64% improvement)
→ WideSearch Item F1: InfoSeeker 70.27% vs. Claude Sonnet 4 Thinking 62.20% (13% better factual accuracy)
→ WideSearch Row F1: InfoSeeker 50.13% vs. Claude Sonnet 4 Thinking 38.50% (30% better structural coherence)
→ BrowseComp-zh: InfoSeeker 52.9% vs. OpenAI Deep Research 42.9% (10 points ahead of the best commercial system)
→ BrowseComp-zh vs. best open-source agent: InfoSeeker 52.9% vs. BrowseMaster 46.5%
→ Average cost per task: $2.00 on WideSearch, $1.00 on BrowseComp-zh

The single-agent ablation is the most important result in the paper. GPT-5.1 running alone as a single agent with identical tools: 6% success rate, 35.74% Item F1. InfoSeeker with the same backbone: 12.5% success rate, 75.21% Item F1. More than 80% of InfoSeeker's token cost runs on cheap GPT-5-mini Workers. The stronger model running alone loses to the cheaper model running in a well-designed hierarchy. The architecture is doing the work.

The context isolation mechanism is what makes the hierarchy actually function. Workers retain full execution traces locally. Managers aggregate results into a single clean summary before passing anything upward. The Host never sees a single raw web page, only the sequence of step-response pairs across the full run. A task requiring hundreds of web pages never saturates the Host's context, because the Host never touches raw data, only what Managers chose to surface.

> OpenAI and Google built deeper reasoning.
> Huawei built wider parallelism.
> The benchmark says wider wins.
6 replies · 16 reposts · 80 likes · 5.3K views
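The Host/Manager/Worker context isolation described above reduces to a simple rule: raw data stays at the layer that produced it, and only summaries flow upward. The sketch below is an assumption-laden illustration, not the InfoSeeker code; `worker`, `manager`, and `host` are invented names showing only the information flow.

```python
from concurrent.futures import ThreadPoolExecutor

def worker(subtask: str) -> dict:
    # Full execution trace stays local to the Worker layer.
    raw_trace = f"<hundreds of raw tokens for {subtask}>"
    return {"trace": raw_trace, "summary": f"did {subtask}"}

def manager(subtasks):
    # Dispatch subtasks to Workers in parallel.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(worker, subtasks))
    # Aggregate: one clean summary upward; raw traces are dropped here,
    # so they can never pollute the Host's context.
    return "; ".join(r["summary"] for r in results)

def host(plan):
    # The Host's compressed global view: step-summary pairs only.
    history = []
    for step, subtasks in plan:
        history.append((step, manager(subtasks)))
    return history
```

Because the Host's context grows with the number of steps, not the number of web pages, a task touching hundreds of pages never saturates it, which is the mechanism the thread credits for the benchmark gains.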
Piji Li retweeted
Sean Welleck @wellecks:
New paper: "Gym-Anything: Turn any Software into an Agent Environment" First there were coding agents; soon there will be anything agents: our framework turns any software on a computer into an agent environment for training or testing. We also release a challenging new benchmark, CUA-World-Long. Everything is open source! cmu-l3.github.io/gym-anything/
Pranjal Aggarwal @PranjalAggarw16:

What if computer-use agents could do real work? We built Gym-Anything: a framework that turns any software into a computer-use agent environment. We used it to create CUA-World: 200+ real software, 10,000+ tasks and environments, across all major occupation groups, from medical imaging to financial trading. 🧵

15 replies · 80 reposts · 735 likes · 77.2K views
Piji Li retweeted
Tianle Cai @tianle_cai:
Can we turn part of an LLM's weights into long-term memory that continuously absorbs new knowledge? We took a small step toward this with In-Place Test-Time Training (In-Place TTT) — accepted as an Oral at ICLR 2026 🎉 The key idea: no new modules, optional pretraining. We repurpose the final projection matrix in every MLP block as fast weights. With an NTP-aligned objective and efficient chunk-wise updates, the model adapts on the fly — complementing attention rather than replacing it. 📄 Paper: arxiv.org/abs/2604.06169 with amazing @Guhao_Feng @Roger98079446 Kai @GeZhang86038849 Di @HuangRubio
24 replies · 144 reposts · 1K likes · 73.7K views
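The fast-weight idea above (treat a projection matrix as memory and update it in place at test time with chunk-wise steps on a next-token-style objective) can be sketched with plain numpy. This is a hedged toy, not the paper's method: the shapes, the squared-error proxy loss, and the learning rate are all illustrative assumptions standing in for the NTP-aligned objective.

```python
import numpy as np

def inplace_ttt_step(W, h, targets, lr=0.01):
    """One chunk-wise fast-weight update.
    W: (d, d) projection, updated in place.
    h: (chunk, d) hidden states for the current chunk.
    targets: (chunk, d) next-step targets (illustrative regression proxy).
    Returns the chunk loss *before* the update."""
    pred = h @ W
    err = pred - targets                # gradient of 0.5 * ||err||^2 w.r.t. pred
    W -= lr * (h.T @ err) / len(h)      # in-place gradient step on the projection
    return float(np.mean(err ** 2))
```

Called once per chunk as the sequence streams by, the matrix keeps absorbing the chunk's statistics, which is the sense in which part of the weights behaves as continuously updated long-term memory.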
Piji Li retweeted
Rosinality @rosinality:
The limitation of on-policy self-distillation: it is not possible to eliminate the gap due to the availability of privileged information, which is also determined by the teacher's behavior. But maybe self-distillation could still be useful for credit assignment.
4 replies · 35 reposts · 252 likes · 39.1K views