Leon

23 posts

@iamleonli

PhD @nyuniversity Prev @Columbia

Joined July 2024

333 Following · 66 Followers

Leon reposted

Martin Marek@mrtnm·9h

New paper! "Forgetting in Language Models: Capacity, Optimization, and Self-Generated Replay"

Andrew Gordon Wilson@andrewgwils

How much does a language model forget when finetuned on new tasks? We show both model size and optimization matter and forgetting can be nearly eliminated with self-generated replay! arxiv.org/abs/2605.26097 w/@mrtnm @dongkyucho @ShikaiQiu @rumichunara @Pavel_Izmailov 1/8

2.1K

Leon reposted

Pavel Izmailov@Pavel_Izmailov·7h

New paper: arxiv.org/abs/2605.26097 The main idea is that we can use an LLM to generate its own replay data to prevent forgetting, as long as we have spare capacity. Very overtrained models have to forget to learn new information.

6.9K

Leon reposted

Alex N. Wang@alexandernwang·1 May

What happens to planning and control when world models condition on complex actions? For example, precisely controlling a human agent may require specifying the motion of each joint. In this setting, action dimensionality increases, the model becomes difficult to control, and the cost of planning using search-based methods like CEM explodes. We propose a solution: lift the world model to a higher level of abstraction. We use a lightweight policy to map high-level waypoint actions → low-level joint sequences, so you can control and plan in a concise space. Best of all, this is done without finetuning or losing any world model expressiveness. 1/8

185

31K

Leon reposted

Pavel Izmailov@Pavel_Izmailov·21 Apr

Excited to share our new paper! As LLMs get stronger, reliable reward signals get harder to build. We study RLVR generalization under three weak supervision settings (scarce data, noisy rewards, and proxy rewards) across Qwen and Llama on math, science, and graph reasoning. Some models learn to reason. Others just memorize. We show why, and how to fix it 🧵 📄 salmanrahman.net/rlvr-weak-supe…

187

16.8K

Leon reposted

Modal@modal·14 Apr

x.com/i/article/2043…

365

85.6K

Leon reposted

Peter Tong@TongPetersb·4 Mar

Train Beyond Language. We bet on the visual world as the critical next step alongside and beyond language modeling. So, we studied building foundation models from scratch with vision. We share our exploration: visual representations, data, world modeling, architecture, and scaling behavior! [1/9]

221

1.1K

216.8K

Leon reposted

Modal@modal·11 Feb

GLM-5, the latest frontier open model from @Zai_org, is available now on Modal. We partnered with Z.ai to release an endpoint that will be free for a limited time.

225

59K

Leon reposted

Z.ai@Zai_org·11 Feb

Introducing GLM-5: From Vibe Coding to Agentic Engineering

GLM-5 is built for complex systems engineering and long-horizon agentic tasks. Compared to GLM-4.5, it scales from 355B params (32B active) to 744B (40B active), with pre-training data growing from 23T to 28.5T tokens.

Try it now: chat.z.ai
Weights: huggingface.co/zai-org/GLM-5
Tech Blog: z.ai/blog/glm-5
OpenRouter (Previously Pony Alpha): openrouter.ai/z-ai/glm-5

Rolling out from Coding Plan Max users: z.ai/subscribe

314

783

5.3K

1.5M

Leon reposted

Peter Tong@TongPetersb·24 Jan

We have been training with TPUs in academia for two years now (huge thanks to Google TRC!). Works like Cambrian-1, Cambrian-S, RAE, and Scale-RAE would not have been possible without TPUs. We wrote a blog post sharing our experiences, optimizations, and lessons learned: cambrian-mllm.github.io/blog/tpu-train… We hope this helps more people have a smoother experience working with TPUs. They are very powerful!

266

38.1K

Leon reposted

Muratcan Koylan@koylanai·10 Dec

You should NOT use LLMs to generate synthetic human-like profiles.

I just read the NeurIPS paper "LLM Generated Persona is a Promise with a Catch" and it confirms a suspicion we’ve held for a long time: You cannot "invent" a realistic human being using just statistics and an LLM.

Yes, they are a more scalable and cost-effective alternative to human interviews for creating digital expert personas, but this paper also proves that these synthetic profiles contain systematic biases that skew simulation results away from real-world outcomes. The more creative freedom you give an LLM to generate a persona’s backstory, the further it drifts from reality.

Another important finding is that as LLM-generated content increases, simulated personas shift progressively toward left-leaning stances. LLMs also systematically generate personas with overly optimistic outlooks, using positively valenced terms like "love," "proud," and "community" while omitting life challenges or negative experiences. This emotional bias is horrible for strategy and creativity-related decision-making tasks! If you are building AI agents for strategy or decision-making, you don't want an idealized "Yes Man."

This is why I keep posting about the importance of Tacit Knowledge, Context Engineering, and AI Interviewer to extract human knowledge. The research paper critiques the practice of "inventing" people from statistical margins (Census data + LLM imagination), whereas the system should focus on "extracting" people from ground truth (Real Expert + Interview).

After testing and evaluating LLM personas generated from public datasets, we observed that they are not ready for production AI agents. That's why my focus is on building an interviewer experience that extracts as much learning as possible from the human expert, and on creating a context system that grounds that expert's outputs in truth, using a real-time, long-form interview to capture "implicit knowledge" and "distinctive methodologies".

Another architectural difference I see is the heavy reliance on single-pass prompting in current approaches. They feed demographic data into an LLM and ask it to generate a "Descriptive Persona" (a narrative bio). The paper found that this introduces massive bias.

To address these critical flaws in current persona generation, I propose the following to resolve or at least mitigate these specific issues:

1. Addressing the "Joint Distribution" Issue: Researchers report that they cannot precisely simulate an individual due to fragmented datasets (e.g., they have data on "Income" and "Education" separately but lack information on their overlap for a specific person), resulting in "incongruous combinations." By interviewing a real human, you capture the natural joint distribution of their beliefs. You don't have to guess if a "high-income expert" cares about "sustainability"; the expert tells you. We need to bypass the statistical reconstruction problem entirely by building scalable interviewer solutions.

2. Avoiding "Positivity Bias" & "Leftward Drift": The paper proves that when LLMs are asked to write a persona description (Descriptive Persona), they default to "pollyannaish," overly positive, and politically progressive profiles. The interviewer system should be designed to gather insights into "mistakes," "judgment," and "distinctive methodologies" rather than generic best practices. By forcing the model to ingest a transcript of hard-won lessons and failures, you will override the model's default tendency to be "nice" and "generic."

The paper also mentions a lack of "ground truth" to validate whether a persona is accurate. My solution includes a built-in validation loop where the human expert reviews and scores the output. This "Human-in-the-Loop" verification is exactly what the researchers argue is missing from the field.

"Descriptive Personas" generated by LLMs are articulate but statistically flawed. To scale true expertise, we must stop trying to simulate people and start interviewing them.

548

93.4K

Leon reposted

Micah Goldblum@micahgoldblum·11 Dec

For a long time, Yann LeCun and others believed in gradient-based planning, but it didn’t work very well … until now. Here’s how we did it using incredibly simple techniques. But first, an introduction to gradient-based planning: 🧵1/11

175

1.4K

160K

Leon reposted

Sean McLeish@SeanMcleish·11 Nov

Looped latent reasoning models like TRM, HRM, Ouro and Huginn are great for reasoning, but they’re inefficient to train at larger scales. We fix this by post-training regular language models into looped models, achieving higher accuracy per training FLOP. 📜1/7

391

65.1K

Leon reposted

Micah Goldblum@micahgoldblum·11 Nov

🚨We converted pretrained LLMs into looped LLMs that can crank up performance by looping for more iterations. Our looped models surpass the performance of the pretrained models we started out with, showing that existing models benefit from increased computational depth. 📜1/9

151

34.5K

Leon reposted

Preston Zh@pfactorialz·8 Oct

.@relace_ai has raised $23M to build the rails for AI code generation. This round is led by @a16z, with participation from @matrixvc and @ycombinator.

LLMs have proven they can write code, but scaling that code into production still needs better infrastructure. Relace is building exactly that: the infra layer where models and systems are co-optimized for code generation.

We’ve already shipped:
- The fastest apply model on OpenRouter (10k tok/s)
- State-of-the-art code reranking and embedding models

These models have already processed tens of millions of requests from customers like Lovable, Magic Patterns, and Orchids.

Now, we’re taking it a step further: with Relace Repos, we’re working on a new source control system that’s built for the age of AI-generated code, with native retrieval and deep integration into our models.

If you're looking to build code generation into your product, please reach out!

188

144.9K

Leon reposted

Tony Chen@tonychenxyz·1 Oct

In 2024, @iamleonli and I generated voter personas directly from unbiased census data and asked LLMs how they’d vote. Nearly all picked Kamala Harris. We dug into why and uncovered surprising risks, and a cure, in simulating humans with LLMs. 🧵 (1/n)

343

Leon@iamleonli·6 Aug

@olliezliu 🥲

Ollie Liu@olliezliu·6 Aug

when your coauthors have posted a well-reasoned rebuttal that addressed all the reviewer's concerns...😑

478

Leon reposted

Micah Goldblum@micahgoldblum·23 Jul

We trained two models on our dataset: (1) We fine-tuned Anole‑7b and saw significant boosts on both our in-distribution test set and standard VLM benchmarks. (2) We also used our data to train Bagel-7b to generate multimodal reasoning traces. We released both models! 3/n

884

Leon@iamleonli·23 Jul

CoT transformed text reasoning. What about multimodal? 🤔 Check out our new dataset of interleaved text and image reasoning traces. We also show interesting visual CoT examples generated inherently by the model finetuned on our dataset!

Micah Goldblum@micahgoldblum

🚨Announcing Zebra-CoT, a large-scale dataset of high quality interleaved image-text reasoning traces 📜. Humans often draw visual aids like diagrams when solving problems, but existing VLMs reason mostly in pure text. 1/n

1.5K

Leon reposted

tom zollo@SquareZollo·14 Jul

Joint work with Jimmy Wang, Rich Zemel @zemelgroup, and @hsnamkoong.

To check out our work in more detail:
arxiv: arxiv.org/abs/2504.04204
code: github.com/namkoong-lab/a…

We also release a new Twenty Questions dataset to benchmark UQ strategies: huggingface.co/datasets/namko…

512

Leon reposted

Micah Goldblum@micahgoldblum·10 Jul

🚨 Did you know that small-batch vanilla SGD without momentum (i.e. the first optimizer you learn about in intro ML) is virtually as fast as AdamW for LLM pretraining on a per-FLOP basis? 📜 1/n

113

834

396.6K
