Xingjian Zhang

37 posts

Xingjian Zhang

@_Jimmy_Zhang_

Ph.D. candidate @UMich. Intern @GoogleDeepMind. Incoming intern @AIatMeta. AI for science, LLM reasoning, and more.

Ann Arbor, MI, USA. Katılım Kasım 2021

109 Takip Edilen353 Takipçiler

Xingjian Zhang@_Jimmy_Zhang_·4d

@Jiaqi_Ma_ Many people have asked how I made these slides, so I added an appendix explaining the setup: cmux + Neovim + Typst + Claude Code. It’s much faster than a Google Slides or LaTeX Beamer workflow. The slide template is open-sourced here: typst.app/universe/packa…

English

510

Jiaqi Ma@Jiaqi_Ma_·5d

This is the best slides I've seen so far about claude code pro tips (esp. for AI researchers): xingjianz.com/assets/talks/a… By @_Jimmy_Zhang_

English

634

54.9K

Xingjian Zhang@_Jimmy_Zhang_·5d

@ghulamio @Jiaqi_Ma_ Thanks! That enables many useful workflows, like model debugging.

English

Ghulam@ghulamio·5d

@Jiaqi_Ma_ @_Jimmy_Zhang_ Good stuff @_Jimmy_Zhang_ TIL there's a wandb MCP

English

2.7K

Xingjian Zhang@_Jimmy_Zhang_·5d

Thanks @Jiaqi_Ma_ for the kind shoutout! Glad the slides are helpful. I plan to update it periodically XD p.s. If you're running Claude Code (or Codex) on Slurm-managed HPC clusters, check out our lab's shared config: github.com/umich-foreseer…

Jiaqi Ma@Jiaqi_Ma_

This is the best slides I've seen so far about claude code pro tips (esp. for AI researchers): xingjianz.com/assets/talks/a… By @_Jimmy_Zhang_

English

Xingjian Zhang@_Jimmy_Zhang_·24 Mar

Had the privilege of building the agentic RL infrastructure (tool use, search, etc.) and helping develop the multimodality framework in Simply during my @GoogleDeepMind internship last summer. Glad to see it out in the open — excited for the community to build on it!

Chen Liang@crazydonkey200

@karpathy Very inspiring as always! We are also open sourcing part of our infra on automated research for Gemini to evolve itself at github.com/google-deepmin… More complex than the nanochat setup but closer to SOTA LLM pre/post-training while staying as minimal as possible. More on the way.

English

953

Xingjian Zhang@_Jimmy_Zhang_·22 Mar

Best use I've found for my Claude Max subscription isn't coding — it's turning #ClaudeCode into a personal research desk that delivers deep briefings to my #RSS reader every morning, on exactly the topics I define. Open source. $0 extra if you already subscribe. 🧵👇

English

901

Xingjian Zhang@_Jimmy_Zhang_·22 Mar

@Xenshinu429 Thanks! That's Reeder, an existing RSS reader app — but cc-deepfeed outputs standard RSS 2.0, so it works with whatever reader you want to try!

English

105

Xueshen Liu@Xenshinu429·22 Mar

@_Jimmy_Zhang_ The UI looks amazing!

English

Xingjian Zhang@_Jimmy_Zhang_·22 Mar

Open source, MIT licensed, ~2 min to set up: github.com/xingjian-zhang… If you have a @claudeai subscription and have been looking for uses beyond coding — try this one. Favorite thing I've built.

English

170

Xingjian Zhang@_Jimmy_Zhang_·22 Mar

The part I'm most excited about: it gets smarter over time. Each run builds on the last — accumulated knowledge, tracked stories, entity memory. By run 10 it knows everything runs 1–9 discovered. Not daily snapshots. Cumulative understanding.

English

172

Xingjian Zhang retweetledi

Jiaqi Ma@Jiaqi_Ma_·18 Oca

The ARC challenge claims to measure "fluid intelligence" through tasks that are "simple for people yet difficult for AI." However, is the AI failure really due to the lack of "fluid intelligence?" Our recent work shows that the answer is NO with a carefully designed diagnostic study. ArXiv: arxiv.org/pdf/2512.21329 Joint work with Xinhe Wang, @JinHuang9306000, @_Jimmy_Zhang_ , @0920wth Our study is motivated by an observation that ARC problems are easy for humans because their representation strongly favors human vision. For example, in the attached figure, the same ARC problem presented in a serialized way becomes much more challenging for humans. 1/

English

5.9K

Xingjian Zhang retweetledi

Accepted papers at TMLR@TmlrPub·22 Şub

Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye, Eytan Adar, Qiaozhu Mei. Action editor: Ilia Sucholutsky. openreview.net/forum?id=1jLQ6… #annotations #an

English

496

Xingjian Zhang retweetledi

Benhao Huang@huskydogewoof·2 Haz

📢 Can LLM agents autonomously detect dataset quality issues? Sharing our new paper DCA-Bench accepted by KDD 2025 DB Track as an oral paper! ➡️ Paper: arxiv.org/pdf/2406.07275 ➡️ Code: github.com/TRAIS-Lab/dca-… ➡️ Dataset: huggingface.co/datasets/trais… Let’s dive in! 🔽 (1/N)

English

Xingjian Zhang@_Jimmy_Zhang_·24 Oca

Now accepted to #NAACL2025! 🎉 HF Dataset is available, too! Check: huggingface.co/datasets/jimmy…

Xingjian Zhang@_Jimmy_Zhang_

Excited to share our recent work on #dataset & #benchmark for AI-assisted scientific workflows. With over 152,000 peer-reviewed CS papers, we provide structured insights from research workflows using LLMs. 🧐 GitHub: github.com/xingjian-zhang… arXiv: arxiv.org/abs/2406.06357

English

637

Xingjian Zhang retweetledi

Mikel Bober-Irizar@mikb0b·22 Ara

You've seen some of the puzzles o3 failed, but have you seen the attempts? Yesterday, @OpenAI's o3 dramatically beat the SOTA at @arcprize. But there were 34 tasks that even it couldn't solve with 16 hours of thinking. I've compiled and analyzed all of o3's mistakes below 🧵

English

154

1.3K

281.7K

Xingjian Zhang retweetledi

Jim Fan@DrJimFan·12 Eyl

OpenAI Strawberry (o1) is out! We are finally seeing the paradigm of inference-time scaling popularized and deployed in production. As Sutton said in the Bitter Lesson, there're only 2 techniques that scale indefinitely with compute: learning & search. It's time to shift focus to the latter. 1. You don't need a huge model to perform reasoning. Lots of parameters are dedicated to memorizing facts, in order to perform well in benchmarks like trivia QA. It is possible to factor out reasoning from knowledge, i.e. a small "reasoning core" that knows how to call tools like browser and code verifier. Pre-training compute may be decreased. 2. A huge amount of compute is shifted to serving inference instead of pre/post-training. LLMs are text-based simulators. By rolling out many possible strategies and scenarios in the simulator, the model will eventually converge to good solutions. The process is a well-studied problem like AlphaGo's monte carlo tree search (MCTS). 3. OpenAI must have figured out the inference scaling law a long time ago, which academia is just recently discovering. Two papers came out on Arxiv a week apart last month: - Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. Brown et al. finds that DeepSeek-Coder increases from 15.9% with one sample to 56% with 250 samples on SWE-Bench, beating Sonnet-3.5. - Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. Snell et al. finds that PaLM 2-S beats a 14x larger model on MATH with test-time search. 4. Productionizing o1 is much harder than nailing the academic benchmarks. For reasoning problems in the wild, how to decide when to stop searching? What's the reward function? Success criterion? When to call tools like code interpreter in the loop? How to factor in the compute cost of those CPU processes? Their research post didn't share much. 5. Strawberry easily becomes a data flywheel. If the answer is correct, the entire search trace becomes a mini dataset of training examples, which contain both positive and negative rewards. This in turn improves the reasoning core for future versions of GPT, similar to how AlphaGo’s value network — used to evaluate quality of each board position — improves as MCTS generates more and more refined training data.

English

135

1.1K

6.1K

799.6K

Xingjian Zhang@_Jimmy_Zhang_·10 Eyl

Co-lead: @yutxie Fund: @LG_AI_Research

English

Xingjian Zhang@_Jimmy_Zhang_·10 Eyl

4. [Rich benchmark tasks] MASSW facilitates multiple novel and benchmarkable machine learning tasks, such as idea generation and outcome prediction. It supports diverse tasks centered on predicting, recommending, and expanding key elements of a scientific workflow.

English

Xingjian Zhang@_Jimmy_Zhang_·10 Eyl

English

2.1K

Keşfet

@Jiaqi_Ma_ @ghulamio @GoogleDeepMind @Xenshinu429 @claudeai @JinHuang9306000 @0920wth @OpenAI