Fan Bai
@loadingfan

77 posts

Building agents @TechAtBloomberg; Previously Postdoc @jhuclsp JHU; PhD @GeorgiaTech

Joined September 2015
251 Following · 177 Followers
Fan Bai@loadingfan·
Excited to share our latest work on the modality gap in multimodal LLMs. Imagine an agent interacting with the real world as humans do. The text it encounters on signs, documents, or screens appears as pixels, not tokens. Yet MLLMs often perform worse when reading the same text from images. We study this gap and show how to close it.
Kaiser Sun@KaiserWhoLearns

Multimodal LLMs can read text in images, but why do they often perform worse than when the same text is given as tokens? Our work studies the modality gap of models perceiving text as pixels and shows how to close it. 📄 arxiv.org/abs/2603.09095 🧵👇 #NLProc #LLM #ComputerVision

Fan Bai retweeted
Zhen Wang@zhenwang9102·
Love this direction from @karpathy. Autonomous agents iterating on training runs feel like a modern renaissance of AutoML, except now the search space includes research decisions themselves: architectures, training setups, analysis pipelines, etc. But if “autoresearch” becomes a thing, the obvious question is: how do we measure progress in “autoresearch”? - Paper generation, single metrics, LLM-as-judge, or traditional reproducibility research? That’s exactly the motivation behind FIRE-Bench (Full-cycle Insight Rediscovery Evaluation): x.com/zhenwang9102/s… Instead of optimizing a single metric or judging papers, FIRE-Bench evaluates whether agents can rediscover real scientific insights from recent NeurIPS / ICLR / ICML papers: - Turn papers into research-problem trees and select a proper research problem to ask (neither too broad nor too narrow) - We mask the original methods, experiments, and conclusions - Ask agents to go from idea → implementation → experiments → conclusions - Then score them via claim-level matching against expert-validated ground truth What we found so far: - Even the strongest agents struggle (best F1 ≈ 46.7) - Performance is extremely high variance (“research lottery”) - Planning and reasoning, not coding, is the main bottleneck If autoresearch systems like this keep improving, benchmarks must evolve alongside them. Better research agents 🤖 need better ways to measure real discovery 🔬 FIRE-Bench: firebench.github.io
Andrej Karpathy@karpathy

I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of ~630 lines of code, then:
- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)

The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc.

You can imagine comparing the research progress of different prompts, different agents, etc.

github.com/karpathy/autor…

Part code, part sci-fi, and a pinch of psychosis :)
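The accept-if-better loop described above can be sketched in a few lines. Everything here is a toy stand-in written for illustration, not code from the actual repo: `proxy_loss` replaces a real 5-minute training run, and the `history` list replaces the git commits the agent accumulates.

```python
import random

def proxy_loss(lr: float) -> float:
    # Stand-in for one full training run: lower is better,
    # minimized at lr = 0.1 in this toy setting.
    return (lr - 0.1) ** 2

def agent_loop(iterations: int = 50, seed: int = 0):
    rng = random.Random(seed)
    best_lr, best_loss = 1.0, proxy_loss(1.0)
    history = []  # stands in for the git log of accepted changes
    for _ in range(iterations):
        candidate = best_lr * rng.uniform(0.5, 2.0)  # perturb current best
        loss = proxy_loss(candidate)
        if loss < best_loss:  # "commit" only when validation loss improves
            best_lr, best_loss = candidate, loss
            history.append((candidate, loss))
    return best_lr, best_loss, history
```

The real search space is far richer (architecture, optimizer, all hyperparameters), but the control flow — propose, run, keep only improvements — is the same.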

Ashutosh Baheti@abaheti95·
Incredibly proud of the team behind this!! KARL solves a genuinely hard problem and it's only the first of many agents we're building @DbrxMosaicAI 🚀🤖
Jonathan Frankle@jefrankle

Meet KARL, an RL'd model for document-centric tasks at frontier quality and open source cost/speed. Great for @databricks customers and scientists (77-page tech report!) As usual, this isn't just one model - it's an RL assembly line to churn out models for us and our customers 🧵

Fan Bai@loadingfan·
Really excited to co-lead this work. 🚀 Agentic AI for scientific discovery is moving fast — but it’s not there yet. Across FIRE-Bench🔥, even SOTA agents (e.g., Claude Code and Codex) stay <50 F1, struggling with research planning and drawing valid conclusions from experiments. More strikingly, performance variance across runs is huge, highlighting how brittle current agents are on long-horizon, full-cycle research tasks. Lots of room to push forward. 🔬🤖
Zhen Wang@zhenwang9102

🤖🔬 Can AI actually do science end-to-end? 🧠📈 And how would we know when it matches, or surpasses, humans? ⚡🧪 AI is rapidly automating scientific discovery, but benchmarking full-cycle discovery, from 💡 ideation → 🧑‍💻 execution → 📊 conclusions, remains unsolved: 🧐🧐🧐 ❌🛠️ Open-ended discovery → manual validation (costly, unscalable) ❌📏 Metric-driven benchmarks (e.g., MLE-Bench) → convenient but narrow (is higher accuracy really enough?) ❌🤖⚖️ LLM-as-judge → useful, but fundamentally risky if used alone 🔥🚀 Introducing FIRE-Bench🔥: Fullcycle Insight Rediscovery Evaluation 👉🌐 firebench.github.io 📚✨ A benchmark that turns fresh, human-verified insights from recent 🏆 NeurIPS / ICLR / ICML papers into masked, end-to-end discovery challenges 🧩 🌍🔐 Constrained open-ended discovery–backed by ground truth. 📌 Key takeaways: 1⃣ 📖🧱 Reference-based evaluation still matters: constrained LLM judging helps, but human-grounded references remain essential until agents can consistently match human conclusions 2⃣ 🏆🧠 Expert-validated ground truth: all tasks come from recent NeurIPS / ICLR / ICML papers, with contamination carefully controlled 3⃣ 🔁🎭 Rediscovery, not reproduction: original 🧪 methods, 📊 experiments, 💻 implementations, and 📈 analyses are fully masked to create real discovery challenges 🔑 Key empirical findings: 💡 The "Science Gap" is Real: Even the best setup (Claude Code + Sonnet-4) caps out at an F1 score of 46.7. On hard tasks, agents struggle to break 30 💡 Success is a "Lottery": Performance has incredibly high variance. Reliability is a major unsolved issue. 💡 Coding is no longer the bottleneck; high-level reasoning and analysis are: ~74% of errors stem from flawed planning, not coding ⚙️ How it works: 🔹 Research-Problem Trees: We parse papers into trees (from broad roots to concrete leaves). This allows us to select intermediate nodes that perfectly balance open-ended exploration with verifiable ground truth. 
🔹 Claim-Level Evaluation: We match AI conclusions against human conclusions using granular claim decomposition (F1 score). 🔹 Creativity Check: We score false positives to see if agents are finding novel truths (Spoiler🚨: they aren’t creative yet). 🔹 New Diagnostic Taxonomy: failures traced across four stages: 🧠 Planning → 🛠️ Implementation → ▶️ Execution → 🧾 Conclusion 🔹 Additional Analyses: cost efficiency, contamination checks, and more. 👀 The Future: 🚀 Live-FIRE-Bench: a live, continuously updated FIRE-Bench to track real-time progress on the latest research (Newest LLMs should be benchmarked with the newest research) 🚀 Stronger scaffolding (search + planning + coding) 🧠🧰 and converting FIRE-Bench into interactive environments for training research agents 🚀 Toward real creativity: We want better systems that can produce genuinely novel conclusions toward creativity 🎨⏳ 🚀 Better systems 🧠✨ and better benchmarks 📏 must co-evolve 🔄 over time 📜🎥 Paper, video, demo, and research trees: 👉🌐 firebench.github.io #AI 🤖 #MachineLearning 📚 #AI4Science 🔬 #LLMs 🧠 #Research 🧪 #AgenticAI 🚀 #FireBench 🔥
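The claim-level F1 scoring mentioned above can be illustrated with a minimal sketch. The actual FIRE-Bench pipeline decomposes free-text conclusions into claims and (presumably) matches them with an LLM judge; this toy takes already-decomposed claim lists and defaults to normalized exact match via a pluggable `match` callable — an assumption for illustration, not the benchmark's real matcher.

```python
def claim_f1(predicted, reference, match=None):
    """Score agent claims against reference claims with set-style F1.

    predicted / reference: lists of claim strings.
    match: optional pairwise matcher; defaults to normalized exact match
    (a stand-in for the LLM-based claim matching used in practice).
    """
    if match is None:
        match = lambda a, b: a.strip().lower() == b.strip().lower()
    # a predicted claim counts as a true positive if it matches any reference claim
    tp = sum(any(match(p, r) for r in reference) for p in predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Precision penalizes unsupported claims (the "creativity check" scores these false positives separately), recall penalizes missed ground-truth insights, and F1 balances the two.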

Fan Bai@loadingfan·
Even against strong fine-tuned models like GoLLIE-34B, DEER (with GPT-4o) stays competitive, all without training 💪 Training-free, label-aware, and robust. LLMs can do NER; they just need better demonstrations. 8/8
Fan Bai@loadingfan·
🤔 LLMs can ace Olympiad math, yet struggle with something as “simple” as NER — even with in-context learning (ICL)? 💡 Our #EMNLP2025 paper answers why: “LLMs are Better Than You Think: Label-Guided In-Context Learning for Named Entity Recognition.” @mdredze @jhuclsp 👉 TL;DR: The issue isn’t the LLM — it’s how we pick demonstrations. 📄 arxiv.org/pdf/2505.23722… 1/8
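The "it's how we pick demonstrations" point can be made concrete with a small sketch. This is a generic label-aware selection heuristic written for illustration, not the paper's actual DEER method: it greedily prefers candidate demonstrations whose gold entity labels add label types the selection has not covered yet.

```python
def label_aware_select(candidates, k):
    """Pick k in-context demonstrations while covering entity label types.

    candidates: list of (text, set_of_entity_labels) pairs with gold labels.
    Greedy coverage: each step takes the example contributing the most
    label types not yet represented in the selection.
    """
    selected, covered = [], set()
    pool = list(candidates)
    while pool and len(selected) < k:
        # favor the example that introduces the most new label types
        best = max(pool, key=lambda ex: len(ex[1] - covered))
        pool.remove(best)
        selected.append(best)
        covered |= best[1]
    return selected
```

Compared with plain similarity-based retrieval, a label-aware pick like this guarantees the prompt actually shows the model what each entity type looks like before it is asked to tag them.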
Fan Bai retweeted
Yao Dou@Yaooo01·
Can LLM-simulated users replace expensive human evaluation for multi-turn conversations? Short answer: yes, if you model the user right. With our SimulatorArena, we find that detailed user profiles (knowledge + message style) improve alignment with real human evaluation by 26% at <3% the cost. #EMNLP2025 [1/6] 🧵
Fan Bai retweeted
Kaiser Sun@KaiserWhoLearns·
What happens when an LLM is asked to use information that contradicts its knowledge? We explore knowledge conflict in a new preprint📑 TLDR: Performance drops, and this could affect the overall performance of LLMs in model-based evaluation.📑🧵⬇️ 1/8 #NLProc #LLM #AIResearch
Fan Bai@loadingfan·
3. Scaling synthetic data narrows the gap with human-generated data, though the label quality of "hard-to-learn" examples remains the central bottleneck, which may be addressed through selective annotation.
Fan Bai@loadingfan·
Glad to see my first project at JHU has been accepted to #ml4h2024. Here are a few key takeaways: 1. Naive prompting often produces "easy-to-learn" examples, and methods that promote syntactic diversity in LLM output don't address this fundamental issue.
Mark Dredze@mdredze

Today at #ML4H2024: Clinical QA can help doctors find critical information in patient records. But where do we get training data for these systems? Generating this data from an LLM is hard. 🧵 @loadingfan

Yang Chen@ychenNLP·
I've successfully defended my PhD! 🎓Really appreciate my advisor @alan_ritter @cocoweixu for everything throughout this journey🥺. Huge thanks to my amazing committee @mchang21 @kartik_goyal_ @Hexiang_Hu 🚙I'll move to CA and join @NVIDIA as a research scientist next month.
Alan Ritter@alan_ritter

Congratulations to @ychenNLP for successfully defending his PhD! Yang has done exciting work advancing both the multilingual and multimodal capabilities of LLMs. Many thanks to his committee: @cocoweixu (co-advisor), @mchang21, @Hexiang_Hu, @kartik_goyal_
