GuoxinChen

33 posts


@GuoxinChen22

CS PhD Student at Gaoling School of AI, RUC | Studying LLMs, Agents, Reasoning.

Joined October 2021
123 Following · 21 Followers
Pinned Tweet
GuoxinChen retweeted
sarah guo@saranormous·
“These results suggest long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem”
[image]
(replies 10 · reposts 9 · likes 101 · views 8.1K)
GuoxinChen@GuoxinChen22·
@PraCha98 @daniel_mac8 It's structured, not one monolithic chain. AiScientist is lab-like: a PI-like orchestrator handles stage-level planning, specialists own major subproblems, and subagents handle leaf tasks. Each has its own reasoning process, and they coordinate through shared workspace artifacts.
(replies 1 · reposts 0 · likes 1 · views 28)
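The lab-like hierarchy described above (a PI-like orchestrator, specialists for major subproblems, subagents for leaf tasks, all coordinating through shared workspace artifacts) can be sketched roughly like this. All class and method names here are hypothetical; this is not the AiScientist codebase, just an illustration of the pattern:

```python
from dataclasses import dataclass, field

# Sketch: a PI-like orchestrator delegates stage-level plans to
# specialists, who hand leaf tasks to subagents. Coordination happens
# through a shared workspace of artifacts, not one monolithic chat.

@dataclass
class Workspace:
    artifacts: dict = field(default_factory=dict)  # path -> content

    def write(self, path, content):
        self.artifacts[path] = content

class SubAgent:
    def run(self, task, ws):
        # Leaf worker: does one concrete task and records the result.
        ws.write(f"results/{task}.txt", f"done: {task}")

class Specialist:
    def __init__(self, subproblem):
        self.subproblem = subproblem

    def run(self, ws):
        # Owns one major subproblem; splits it into leaf tasks.
        for leaf in (f"{self.subproblem}-setup", f"{self.subproblem}-exec"):
            SubAgent().run(leaf, ws)
        ws.write(f"reports/{self.subproblem}.md", "subproblem complete")

class Orchestrator:
    def run(self, stages, ws):
        # Stage-level planning only; all detail lives in the workspace.
        for stage in stages:
            Specialist(stage).run(ws)
        return sorted(ws.artifacts)

ws = Workspace()
paths = Orchestrator().run(["experiments", "writeup"], ws)
```

Each level has its own reasoning loop in the real system; here the point is only that the orchestrator never touches leaf detail, and everything downstream is reachable through the workspace.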
Pra Cha@PraCha98·
@daniel_mac8 This looks so scattered, and there is no structured thought process, just trying out permutations and combinations. Would scientists work this way?
(replies 2 · reposts 0 · likes 1 · views 291)
GuoxinChen retweeted
Dan McAteer@daniel_mac8·
Autonomous, recursively self-improving AI researchers are here. They're just not evenly distributed.
[image]
(replies 7 · reposts 31 · likes 209 · views 10.3K)
GuoxinChen retweeted
Based Medical@BasedMedical·
This paper found artifact-mediated coordination is ~4x more economically efficient than chat coordination. This is huge for tasks like autoresearch that benefit from persistent, agent-native workspaces. Huge props to @GuoxinChen22 and the rest of the authors. I've been developing the same idea of shared cognitive workspaces at @nookplot: artifact-focused coordination on a decentralized network, enabling latent forms of communication, more effective collaboration, and a more informative way to measure agent contributions, so trust can be quantified and established between strangers.
[image]
(replies 1 · reposts 2 · likes 16 · views 1K)
GuoxinChen@GuoxinChen22·
@omarsar0 Thanks for sharing our work! Context is key in long-horizon tasks such as deep search. But we found that project/workspace state continuity matters even more than context in long-horizon tasks like AI research. That's why we introduced File-as-Bus. 🚌
(replies 2 · reposts 0 · likes 2 · views 212)
elvis@omarsar0·
Long-horizon AI research agents are mostly a state-management problem. It is not enough for an agent to reason well in the next turn. ML research requires task setup, implementation, experiments, debugging, and evidence tracking over hours or days.

This new paper introduces AiScientist, a system for autonomous long-horizon engineering for ML research. The key idea is to keep control thin and state thick. A top-level orchestrator manages stage-level progress, while specialized agents repeatedly ground themselves in durable workspace artifacts: analyses, plans, code, logs, and experimental evidence. That "File-as-Bus" design matters.

AiScientist improves PaperBench by 10.54 points over the best matched baseline and reaches 81.82 Any Medal% on MLE-Bench Lite. Removing File-as-Bus drops PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points.

Why does it matter? Autonomous research agents need durable project memory, not just longer chats.

Paper: arxiv.org/abs/2604.13018
Learn to build effective AI agents in our academy: academy.dair.ai
[image]
(replies 10 · reposts 63 · likes 341 · views 32.3K)
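A minimal sketch of the "File-as-Bus" idea discussed above: agents never pass messages through a shared chat context; each one re-grounds itself by reading durable files and contributes by writing new ones, so project state survives across turns and restarts. This is illustrative only (hypothetical function names), not the paper's implementation:

```python
import json
import tempfile
from pathlib import Path

# Agents coordinate through files on disk instead of a shared chat
# context, so project state is durable across turns and processes.

workspace = Path(tempfile.mkdtemp())

def planner(ws: Path):
    # Writes a durable plan artifact other agents can pick up later.
    (ws / "plan.json").write_text(json.dumps({"steps": ["train", "eval"]}))

def executor(ws: Path):
    # Re-grounds itself from the plan file, then logs evidence.
    plan = json.loads((ws / "plan.json").read_text())
    for step in plan["steps"]:
        (ws / f"log_{step}.txt").write_text(f"{step}: ok")

planner(workspace)
executor(workspace)   # could run in a fresh process; state lives on disk
artifacts = sorted(p.name for p in workspace.iterdir())
```

The "thin control, thick state" framing falls out naturally: the control flow here is two function calls, while everything the agents know about the project is in the files.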
GuoxinChen@GuoxinChen22·
Click on each item to view detailed info for each question 😆
[image]
(replies 0 · reposts 0 · likes 2 · views 49)
GuoxinChen@GuoxinChen22·
#GPT54 Updated BeyondSWE results! 🚀 GPT5.4 is surprisingly better than GPT5.2. 🥳 GLM-5 is ahead of the other models we tested. We haven't tried Gemini 3.1 Pro or Claude 4.6 yet because of the cost 😭😭😭 More info: aweai-team.github.io/BeyondSWE_lead…
[image]
(replies 1 · reposts 0 · likes 2 · views 233)
GuoxinChen@GuoxinChen22·
@nithin_k_anil Thanks for highlighting our work! 🥰 You nailed the core motivation: current evals of code agents don't reflect what agents face in production. That's why we built our new benchmark 🚀🚀 Paper: huggingface.co/papers/2603.03…
(replies 0 · reposts 0 · likes 0 · views 14)
GuoxinChen@GuoxinChen22·
@boyuan_chen Nice summary of our work! Beyond the 80 → <45 drop, what surprised us most: giving agents search tools doesn't reliably help. Search and code have matured independently, but their fusion doesn't emerge automatically. We think deep search for coding is a promising direction worth more attention.
(replies 1 · reposts 0 · likes 0 · views 19)
Boyuan (Nemo) Chen@boyuan_chen·
SWE-bench-era success is becoming the new overfitting. If your code agent only shines on single-repo bug fixing, it is not ready for real software engineering. A quick thread on BeyondSWE 👇
(replies 2 · reposts 0 · likes 4 · views 360)
GuoxinChen@GuoxinChen22·
@NirDiamantAI @_akhaliq Cool approach! Dependency graph awareness is definitely useful for multi-repo workflows. Would be interesting to see how tree-sitter-based context helps in our CrossRepo setting too.
(replies 0 · reposts 0 · likes 0 · views 9)
NirD@NirDiamantAI·
@_akhaliq Multi-repo agents need dependency graph awareness first. I'm running experiments with tree-sitter to parse cross-repo imports, then feeding that context to Claude for better navigation between codebases.
(replies 1 · reposts 0 · likes 0 · views 113)
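The dependency-graph idea in the exchange above (parse cross-repo imports, then feed that context to the agent) can be illustrated with a toy extractor. This sketch uses Python's stdlib `ast` module as a lightweight stand-in for tree-sitter, and only handles Python sources; the function name and the single-language scope are my assumptions, not NirD's actual setup:

```python
import ast

# Toy cross-repo dependency extraction: collect which external packages
# a source file imports, so an agent knows which other repos to load
# for context. (Python's ast as a stand-in for tree-sitter, which would
# let the same pass cover many languages.)

def external_imports(source: str, local_pkgs: set[str]) -> set[str]:
    deps = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import numpy as np` -> top-level package "numpy"
            deps.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # `from mylib.utils import helper` -> "mylib"
            deps.add(node.module.split(".")[0])
    return deps - local_pkgs  # keep only cross-repo dependencies

src = "import numpy as np\nfrom mylib.utils import helper\nimport os"
deps = external_imports(src, local_pkgs={"mylib"})
```

With `mylib` marked as local, only `numpy` and `os` survive; mapping those names to sibling repositories would give the navigation context described above.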
GuoxinChen@GuoxinChen22·
Thanks for sharing our work! We've released the scaffold for evaluating your code agent here. Try our new comprehensive benchmark! 🚀 We'd love your feedback: please share your thoughts, and we'll keep improving quickly. 🥰 Scaffold: github.com/AweAI-Team/Awe…
DailyPapers@HuggingPapers

BeyondSWE Current code agents ace single-repo bugs (80%+ on SWE-bench) but plateau below 45% on real-world tasks. This new benchmark reveals the gap with 500 instances across cross-repo reasoning, scientific coding, dependency migration, and full repo generation.

(replies 0 · reposts 0 · likes 1 · views 90)
GuoxinChen@GuoxinChen22·
@exponential_bld @HuggingPapers Our SearchSWE experiments show both paths are surprisingly hard — search helps inconsistently, and models' internal knowledge of the broader ecosystem remains limited.🥰🥰🥰
(replies 0 · reposts 0 · likes 0 · views 14)
GuoxinChen@GuoxinChen22·
@exponential_bld @HuggingPapers So it's a knowledge problem, not a coordination problem. Either the model has enough internalized knowledge of the open-source ecosystem to reason about upstream behaviors, or it needs to effectively search for and integrate external information.
(replies 1 · reposts 0 · likes 0 · views 18)
DailyPapers@HuggingPapers·
BeyondSWE Current code agents ace single-repo bugs (80%+ on SWE-bench) but plateau below 45% on real-world tasks. This new benchmark reveals the gap with 500 instances across cross-repo reasoning, scientific coding, dependency migration, and full repo generation.
[image]
(replies 4 · reposts 7 · likes 31 · views 4.2K)