GuoxinChen

33 posts


@GuoxinChen22

CS PhD Student at Gaoling School of AI, RUC | Studying LLMs, Agents, Reasoning.

Joined October 2021
123 Following · 21 Followers
Pinned Tweet
GuoxinChen retweeted
sarah guo@saranormous·
“These results suggest long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem”
[image]
(replies 10 · reposts 9 · likes 101 · views 8.1K)
GuoxinChen@GuoxinChen22·
@PraCha98 @daniel_mac8 It's structured, not one monolithic chain. AiScientist is lab-like: a PI-like orchestrator handles stage-level planning, specialists own major subproblems, and subagents handle leaf tasks. Each has its own reasoning process, and they coordinate through shared workspace artifacts.
(replies 1 · reposts 0 · likes 1 · views 28)
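The lab-like hierarchy described above (a PI-like orchestrator, specialists for major subproblems, subagents for leaf tasks, all coordinating through shared workspace artifacts) can be sketched roughly like this. All class and method names here are hypothetical; this is not the AiScientist codebase, just an illustration of the pattern:

```python
from dataclasses import dataclass, field

# Sketch: a PI-like orchestrator delegates stage-level plans to
# specialists, who hand leaf tasks to subagents. Coordination happens
# through a shared workspace of artifacts, not one monolithic chat.

@dataclass
class Workspace:
    artifacts: dict = field(default_factory=dict)  # path -> content

    def write(self, path, content):
        self.artifacts[path] = content

class SubAgent:
    def run(self, task, ws):
        # Leaf worker: does one concrete task and records the result.
        ws.write(f"results/{task}.txt", f"done: {task}")

class Specialist:
    def __init__(self, subproblem):
        self.subproblem = subproblem

    def run(self, ws):
        # Owns one major subproblem; splits it into leaf tasks.
        for leaf in (f"{self.subproblem}-setup", f"{self.subproblem}-exec"):
            SubAgent().run(leaf, ws)
        ws.write(f"reports/{self.subproblem}.md", "subproblem complete")

class Orchestrator:
    def run(self, stages, ws):
        # Stage-level planning only; all detail lives in the workspace.
        for stage in stages:
            Specialist(stage).run(ws)
        return sorted(ws.artifacts)

ws = Workspace()
paths = Orchestrator().run(["experiments", "writeup"], ws)
```

Each level has its own reasoning loop in the real system; here the point is only that the orchestrator never touches leaf detail, and everything downstream is reachable through the workspace.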
Pra Cha@PraCha98·
@daniel_mac8 This looks so scattered, and there is no structured thought process, just trying out permutations and combinations. Would scientists work this way?
(replies 2 · reposts 0 · likes 1 · views 291)
GuoxinChen retweeted
Dan McAteer@daniel_mac8·
Autonomous, recursively self-improving AI researchers are here. They're just not evenly distributed.
[image]
(replies 7 · reposts 31 · likes 209 · views 10.3K)
GuoxinChen retweeted
Based Medical@BasedMedical·
This paper found artifact-mediated coordination is ~4x more economically efficient than chat coordination. This is huge for tasks like autoresearch that benefit from persistent, agent-native workspaces. Huge props to @GuoxinChen22 and the rest of the authors. I've been developing the same idea of shared cognitive workspaces at @nookplot: artifact-focused coordination on a decentralized network, enabling latent forms of communication, more effective collaboration, and a more informative way to measure agent contributions, so trust can be quantified and established between strangers.
[image]
(replies 1 · reposts 2 · likes 16 · views 1K)
GuoxinChen@GuoxinChen22·
@omarsar0 Thanks for sharing our work! Context is key in long-horizon tasks such as deep search. But we found that project/workspace state continuity matters even more than context in long-horizon tasks like AI research. That's why we introduced File-as-Bus. 🚌
(replies 2 · reposts 0 · likes 2 · views 212)
elvis@omarsar0·
Long-horizon AI research agents are mostly a state-management problem. It is not enough for an agent to reason well in the next turn. ML research requires task setup, implementation, experiments, debugging, and evidence tracking over hours or days.

This new paper introduces AiScientist, a system for autonomous long-horizon engineering for ML research. The key idea is to keep control thin and state thick. A top-level orchestrator manages stage-level progress, while specialized agents repeatedly ground themselves in durable workspace artifacts: analyses, plans, code, logs, and experimental evidence. That "File-as-Bus" design matters.

AiScientist improves PaperBench by 10.54 points over the best matched baseline and reaches 81.82 Any Medal% on MLE-Bench Lite. Removing File-as-Bus drops PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points.

Why does it matter? Autonomous research agents need durable project memory, not just longer chats.

Paper: arxiv.org/abs/2604.13018
Learn to build effective AI agents in our academy: academy.dair.ai
[image]
(replies 10 · reposts 63 · likes 341 · views 32.3K)
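A minimal sketch of the "File-as-Bus" idea discussed above: agents never pass messages through a shared chat context; each one re-grounds itself by reading durable files and contributes by writing new ones, so project state survives across turns and restarts. This is illustrative only (hypothetical function names), not the paper's implementation:

```python
import json
import tempfile
from pathlib import Path

# Agents coordinate through files on disk instead of a shared chat
# context, so project state is durable across turns and processes.

workspace = Path(tempfile.mkdtemp())

def planner(ws: Path):
    # Writes a durable plan artifact other agents can pick up later.
    (ws / "plan.json").write_text(json.dumps({"steps": ["train", "eval"]}))

def executor(ws: Path):
    # Re-grounds itself from the plan file, then logs evidence.
    plan = json.loads((ws / "plan.json").read_text())
    for step in plan["steps"]:
        (ws / f"log_{step}.txt").write_text(f"{step}: ok")

planner(workspace)
executor(workspace)   # could run in a fresh process; state lives on disk
artifacts = sorted(p.name for p in workspace.iterdir())
```

The "thin control, thick state" framing falls out naturally: the control flow here is two function calls, while everything the agents know about the project is in the files.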
GuoxinChen@GuoxinChen22·
Click on each item to view detailed info for each question 😆
[image]
(replies 0 · reposts 0 · likes 2 · views 49)
GuoxinChen@GuoxinChen22·
#GPT54 Updated BeyondSWE results! 🚀 GPT5.4 is surprisingly better than GPT5.2. 🥳 GLM-5 is ahead of the other models we tested. We haven't tried Gemini 3.1 Pro or Claude 4.6 yet because of the cost 😭😭😭 More info: aweai-team.github.io/BeyondSWE_lead…
[image]
(replies 1 · reposts 0 · likes 2 · views 233)
GuoxinChen@GuoxinChen22·
@nithin_k_anil Thanks for highlighting our work! 🥰 You nailed the core motivation: current evals of code agents don't reflect what agents face in production. That's why we built our new benchmark 🚀🚀 Paper: huggingface.co/papers/2603.03…
(replies 0 · reposts 0 · likes 0 · views 14)
GuoxinChen@GuoxinChen22·
@boyuan_chen Nice summary of our work! Beyond the 80 → <45 drop, what surprised us most: giving agents search tools doesn't reliably help. Search and code have matured independently, but their fusion doesn't emerge automatically. We think deep search for coding is a promising direction worth more attention.
(replies 1 · reposts 0 · likes 0 · views 19)
Boyuan (Nemo) Chen@boyuan_chen·
SWE-bench-era success is becoming the new overfitting. If your code agent only shines on single-repo bug fixing, it is not ready for real software engineering. A quick thread on BeyondSWE 👇
(replies 2 · reposts 0 · likes 4 · views 360)
GuoxinChen@GuoxinChen22·
@NirDiamantAI @_akhaliq Cool approach! Dependency graph awareness is definitely useful for multi-repo workflows. Would be interesting to see how tree-sitter-based context helps in our CrossRepo setting too.
(replies 0 · reposts 0 · likes 0 · views 9)
NirD@NirDiamantAI·
@_akhaliq Multi-repo agents need dependency graph awareness first. I'm running experiments with tree-sitter to parse cross-repo imports, then feeding that context to Claude for better navigation between codebases.
(replies 1 · reposts 0 · likes 0 · views 113)
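The dependency-graph idea in the exchange above (parse cross-repo imports, then feed that context to the agent) can be illustrated with a toy extractor. This sketch uses Python's stdlib `ast` module as a lightweight stand-in for tree-sitter, and only handles Python sources; the function name and the single-language scope are my assumptions, not NirD's actual setup:

```python
import ast

# Toy cross-repo dependency extraction: collect which external packages
# a source file imports, so an agent knows which other repos to load
# for context. (Python's ast as a stand-in for tree-sitter, which would
# let the same pass cover many languages.)

def external_imports(source: str, local_pkgs: set[str]) -> set[str]:
    deps = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import numpy as np` -> top-level package "numpy"
            deps.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # `from mylib.utils import helper` -> "mylib"
            deps.add(node.module.split(".")[0])
    return deps - local_pkgs  # keep only cross-repo dependencies

src = "import numpy as np\nfrom mylib.utils import helper\nimport os"
deps = external_imports(src, local_pkgs={"mylib"})
```

With `mylib` marked as local, only `numpy` and `os` survive; mapping those names to sibling repositories would give the navigation context described above.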
GuoxinChen@GuoxinChen22·
Thanks for sharing our work! We've released the scaffold for evaluating your code agent here. Try our new comprehensive benchmark! 🚀 We'd love your feedback: please share your thoughts, and we'll keep improving quickly. 🥰 Scaffold: github.com/AweAI-Team/Awe…
DailyPapers@HuggingPapers

BeyondSWE Current code agents ace single-repo bugs (80%+ on SWE-bench) but plateau below 45% on real-world tasks. This new benchmark reveals the gap with 500 instances across cross-repo reasoning, scientific coding, dependency migration, and full repo generation.

(replies 0 · reposts 0 · likes 1 · views 90)
GuoxinChen@GuoxinChen22·
@exponential_bld @HuggingPapers Our SearchSWE experiments show both paths are surprisingly hard — search helps inconsistently, and models' internal knowledge of the broader ecosystem remains limited.🥰🥰🥰
(replies 0 · reposts 0 · likes 0 · views 14)
GuoxinChen@GuoxinChen22·
@exponential_bld @HuggingPapers So it's a knowledge problem, not a coordination problem. Either the model has enough internalized knowledge of the open-source ecosystem to reason about upstream behaviors, or it needs to effectively search for and integrate external information.
(replies 1 · reposts 0 · likes 0 · views 18)
DailyPapers@HuggingPapers·
BeyondSWE Current code agents ace single-repo bugs (80%+ on SWE-bench) but plateau below 45% on real-world tasks. This new benchmark reveals the gap with 500 instances across cross-repo reasoning, scientific coding, dependency migration, and full repo generation.
[image]
(replies 4 · reposts 7 · likes 31 · views 4.2K)