Leon

23 posts

@iamleonli

PhD @nyuniversity Prev @Columbia

Joined July 2024
333 Following · 66 Followers
Leon retweeted
Pavel Izmailov @Pavel_Izmailov
New paper: arxiv.org/abs/2605.26097 The main idea is that we can use an LLM to generate its own replay data to prevent forgetting, as long as we have spare capacity. Very overtrained models have to forget to learn new information.
3
16
91
6.9K
Leon retweeted
Alex N. Wang @alexandernwang
What happens to planning and control when world models condition on complex actions? For example, precisely controlling a human agent may require specifying the motion of each joint. In this setting, action dimensionality increases, the model becomes difficult to control, and the cost of planning using search-based methods like CEM explodes. We propose a solution: lift the world model to a higher level of abstraction. We use a lightweight policy to map high-level waypoint actions → low-level joint sequences, so you can control and plan in a concise space. Best of all, this is done without finetuning or losing any world model expressiveness. 1/8
4
26
185
31K
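For context on the search-based planning mentioned in the thread above, here is a minimal sketch of the cross-entropy method (CEM) on a toy problem. It is not the authors' planner: `cem_plan`, the quadratic cost, and the `target` point are all hypothetical stand-ins, and a real planner would score each candidate by rolling it out through a world model. The number of sampled values per iteration grows with `dim`, which hints at why high-dimensional action spaces make this kind of search expensive.

```python
import random
import statistics


def cem_plan(cost_fn, dim, iters=20, pop=64, elite=8, seed=0):
    """Toy CEM: repeatedly refit a diagonal Gaussian to the lowest-cost samples."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [1.0] * dim
    for _ in range(iters):
        # Sample a population of candidate action vectors.
        samples = [
            [rng.gauss(mean[d], std[d]) for d in range(dim)]
            for _ in range(pop)
        ]
        # Keep the candidates with the lowest cost.
        samples.sort(key=cost_fn)
        elites = samples[:elite]
        # Refit the sampling distribution to the elites.
        mean = [statistics.fmean([s[d] for s in elites]) for d in range(dim)]
        std = [statistics.pstdev([s[d] for s in elites]) + 1e-6 for d in range(dim)]
    return mean


# Hypothetical cost: squared distance to a fixed target "action".
target = [0.5, -0.5]


def toy_cost(action):
    return sum((a - t) ** 2 for a, t in zip(action, target))


plan = cem_plan(toy_cost, dim=len(target))
```

Planning in a lower-dimensional action space, as the thread proposes, would correspond to calling the search with a smaller `dim` and letting a separate policy expand the result into low-level actions.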
Leon retweeted
Pavel Izmailov @Pavel_Izmailov
Excited to share our new paper! As LLMs get stronger, reliable reward signals get harder to build. We study RLVR generalization under three weak supervision settings (scarce data, noisy rewards, and proxy rewards) across Qwen and Llama on math, science, and graph reasoning. Some models learn to reason. Others just memorize. We show why, and how to fix it 🧵 📄 salmanrahman.net/rlvr-weak-supe…
6
30
187
16.8K
Leon retweeted
Peter Tong @TongPetersb
Train Beyond Language. We bet on the visual world as the critical next step alongside and beyond language modeling. So, we studied building foundation models from scratch with vision. We share our exploration: visual representations, data, world modeling, architecture, and scaling behavior! [1/9]
35
221
1.1K
216.8K
Leon retweeted
Modal @modal
GLM-5, the latest frontier open model from @Zai_org, is available now on Modal. We partnered with Z.ai to release an endpoint that will be free for a limited time.
10
21
225
59K
Leon retweeted
Z.ai @Zai_org
Introducing GLM-5: From Vibe Coding to Agentic Engineering
GLM-5 is built for complex systems engineering and long-horizon agentic tasks. Compared to GLM-4.5, it scales from 355B params (32B active) to 744B (40B active), with pre-training data growing from 23T to 28.5T tokens.
Try it now: chat.z.ai
Weights: huggingface.co/zai-org/GLM-5
Tech Blog: z.ai/blog/glm-5
OpenRouter (Previously Pony Alpha): openrouter.ai/z-ai/glm-5
Rolling out from Coding Plan Max users: z.ai/subscribe
314
783
5.3K
1.5M
Leon retweeted
Peter Tong @TongPetersb
We have been training with TPUs in academia for two years now (huge thanks to Google TRC!). Works like Cambrian-1, Cambrian-S, RAE, and Scale-RAE would not have been possible without TPUs. We wrote a blog post sharing our experiences, optimizations, and lessons learned: cambrian-mllm.github.io/blog/tpu-train… We hope this helps more people have a smoother experience working with TPUs; they are very powerful!
8
25
266
38.1K
Leon retweeted
Muratcan Koylan @koylanai
You should NOT use LLMs to generate synthetic human-like profiles.
I just read the NeurIPS paper "LLM Generated Persona is a Promise with a Catch" and it confirms a suspicion we’ve held for a long time: you cannot "invent" a realistic human being using just statistics and an LLM.
Yes, they are a more scalable and cost-effective alternative to human interviews for creating digital expert personas, but this paper also proves that these synthetic profiles contain systematic biases that skew simulation results away from real-world outcomes.
The more creative freedom you give an LLM to generate a persona’s backstory, the further it drifts from reality. Another important finding is that as LLM-generated content increases, simulated personas shift progressively toward left-leaning stances. LLMs also systematically generate personas with overly optimistic outlooks, using positively valenced terms like "love," "proud," and "community" while omitting life challenges or negative experiences.
This emotional bias is horrible for strategy and creativity-related decision-making tasks! If you are building AI agents for strategy or decision-making, you don't want an idealized "Yes Man."
This is why I keep posting about the importance of Tacit Knowledge, Context Engineering, and AI Interviewer to extract human knowledge. The research paper critiques the practice of "inventing" people from statistical margins (Census data + LLM imagination), whereas the system should focus on "extracting" people from ground truth (Real Expert + Interview).
After testing and evaluating LLM personas generated from public datasets, we observed that they are not ready for production AI agents. That's why my focus is on building an interviewer experience that extracts as much learning as possible from the human expert, and creating a context system that grounds that expert's outputs in truth, using a real-time, long-form interview to capture "implicit knowledge" and "distinctive methodologies".
Another architectural difference I find is that they rely heavily on single-pass prompting: they feed demographic data into an LLM and ask it to generate a "Descriptive Persona" (a narrative bio). They found this introduces massive bias.
To address these critical flaws in current persona generation, I propose the following to resolve or at least mitigate these specific issues:
1. Addressing the "Joint Distribution" Issue: Researchers report that they cannot precisely simulate an individual due to fragmented datasets (e.g., they have data on "Income" and "Education" separately but lack information on their overlap for a specific person), resulting in "incongruous combinations." By interviewing a real human, you capture the natural joint distribution of their beliefs. You don't have to guess if a "high-income expert" cares about "sustainability"; the expert tells you. We need to bypass the statistical reconstruction problem entirely by building scalable interviewer solutions.
2. Avoiding "Positivity Bias" & "Leftward Drift": The paper proves that when LLMs are asked to write a persona description (Descriptive Persona), they default to "pollyannaish," overly positive, and politically progressive profiles. The interviewer system should be designed to gather insights into "mistakes," "judgment," and "distinctive methodologies" rather than generic best practices. By forcing the model to ingest a transcript of hard-won lessons and failures, you will override the model's default tendency to be "nice" and "generic."
The paper also mentions a lack of "ground truth" to validate if a persona is accurate. My solution includes a built-in validation loop where the human expert reviews and scores the output. This "Human-in-the-Loop" verification is exactly what the researchers argue is missing from the field.
"Descriptive Personas" generated by LLMs are articulate but statistically flawed.
To scale true expertise, we must stop trying to simulate people and start interviewing them.
57
74
548
93.4K
Leon retweeted
Micah Goldblum @micahgoldblum
For a long time, Yann LeCun and others believed in gradient-based planning, but it didn’t work very well … until now. Here’s how we did it using incredibly simple techniques. But first, an introduction to gradient-based planning: 🧵1/11
24
175
1.4K
160K
Leon retweeted
Sean McLeish @SeanMcleish
Looped latent reasoning models like TRM, HRM, Ouro and Huginn are great for reasoning, but they’re inefficient to train at larger scales. We fix this by post training regular language models into looped models, achieving higher accuracy on a per training FLOP basis. 📜1/7
9
65
391
65.1K
Leon retweeted
Micah Goldblum @micahgoldblum
🚨We converted pretrained LLMs into looped LLMs that can crank up performance by looping for more iterations. Our looped models surpass the performance of the pretrained models we started out with, showing that existing models benefit from increased computational depth. 📜1/9
10
26
151
34.5K
Leon retweeted
Preston Zh @pfactorialz
.@relace_ai has raised $23M to build the rails for AI code generation. This round is led by @a16z, with participation from @matrixvc and @ycombinator.
LLMs have proven they can write code, but scaling that code into production still needs better infrastructure. Relace is building exactly that: the infra layer where models and systems are co-optimized for code generation.
We’ve already shipped:
- The fastest apply model on OpenRouter (10k tok/s)
- State-of-the-art code reranking and embedding models
These models have already processed tens of millions of requests from customers like Lovable, Magic Patterns, and Orchids.
Now, we’re taking it a step further: with Relace Repos, we’re working on a new source control system that’s built for the age of AI-generated code, with native retrieval and deep integration into our models.
If you're looking to build code generation into your product, please reach out!
50
35
188
144.9K
Leon retweeted
Tony Chen @tonychenxyz
In 2024, @iamleonli and I generated voter personas directly from unbiased census data and asked LLMs how they’d vote. Nearly all picked Kamala Harris. We dug into why and uncovered surprising risks, and a cure, in simulating humans with LLMs. 🧵 (1/n)
1
2
5
343
Ollie Liu @olliezliu
when your coauthors have posted a well-reasoned rebuttal that addressed all the reviewer's concerns...😑
1
0
11
478
Leon retweeted
Micah Goldblum @micahgoldblum
We trained two models on our dataset: (1) We fine-tuned Anole‑7b and saw significant boosts on both our in-distribution test set and standard VLM benchmarks. (2) We also used our data to train Bagel-7b to generate multimodal reasoning traces. We released both models! 3/n
1
1
10
884
Leon @iamleonli
CoT transformed text reasoning. What about multimodal? 🤔 Check out our new dataset of interleaved text and image reasoning traces. We also show interesting visual CoT examples generated inherently by the model finetuned on our dataset!
Micah Goldblum @micahgoldblum

🚨Announcing Zebra-CoT, a large-scale dataset of high quality interleaved image-text reasoning traces 📜. Humans often draw visual aids like diagrams when solving problems, but existing VLMs reason mostly in pure text. 1/n
0
2
11
1.5K
Leon retweeted
Micah Goldblum @micahgoldblum
🚨 Did you know that small-batch vanilla SGD without momentum (i.e. the first optimizer you learn about in intro ML) is virtually as fast as AdamW for LLM pretraining on a per-FLOP basis? 📜 1/n
27
113
834
396.6K
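As a reminder of what "small-batch vanilla SGD without momentum" refers to in the thread above, here is a minimal sketch of the update rule, applied with batch size 1 to a toy linear regression. It illustrates only the optimizer step and says nothing about LLM pretraining speed; `sgd_linear_fit`, the learning rate, and the data are hypothetical choices made for the example.

```python
import random


def sgd_linear_fit(xs, ys, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x + b with plain SGD: no momentum, no adaptive scaling."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:  # batch size 1
            err = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * err**2 with respect to w and b.
            w -= lr * err * xs[i]
            b -= lr * err
    return w, b


# Hypothetical noiseless data drawn from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = sgd_linear_fit(xs, ys)
```

The only state carried between steps is the parameters themselves, in contrast to AdamW, which also tracks running first and second moment estimates for every parameter.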