Jasper Lu

102 posts

@lu__jasper

teaching models to design @figma, formerly @nuro. Knicks in 5

NYC · Joined July 2009

101 Following · 90 Followers

Jasper Lu reposted

Josh Hart@joshhart·11h

FROM NOW ON ADDRESS ME AS CHAMP! 🧡💙

4.1K

27.7K

225.1K

2.6M

Jasper Lu reposted

Mayor Zohran Kwame Mamdani@NYCMayor·11h

Parade. Thursday. Manhattan.

2.6K

30.4K

344.4K

Jasper Lu@lu__jasper·11h

@LegionHoops @ChrisBHaynes What club are we going to

3.7K

Legion Hoops@LegionHoops·12h

BREAKING: Knicks will be flying back to New York to celebrate tonight, per @ChrisBHaynes

829

19.9K

1.1M

Jasper Lu@lu__jasper·18h

@ar0cket1 Have you tried this in the top-k distillation setting as well (as opposed to just sampled tokens)?

156

ar0cket1@ar0cket1·1d

x.com/i/article/2065…

172

58.7K

Jasper Lu@lu__jasper·21h

@QiaochuYuan You can probably test this hypothesis by comparing a diffusion LLM vs. an autoregressive one of the same family, e.g. DiffusionGemma vs. the usual Gemma

QC@QiaochuYuan·2d

interesting hypothesis that the "not X, but Y" LLMism is an artifact of "not" being a high-probability completion since it can continue in so many different ways, and that other LLMisms can be understood similarly. anyone know if any work has been done on this?

247

11.4K

Jasper Lu@lu__jasper·22h

@barrowjoseph One fun direction I've been wanting to play around with (once I figure out how to do it without breaking the bank) is to turn indexing into precomputing KV caches over an entire corpus and then dumping them into an object store for faster filtering

Joe Barrow@barrowjoseph·22h

@lu__jasper Thinking the same! Thankfully “sonnet-level models” are getting cheaper and smaller.

206

Jasper Lu@lu__jasper·22h

Been thinking about this topic a lot while playing around with OBLIQ-bench. IMO, hard search will increasingly converge towards map-filter workflows. As small models get smarter and compute gets cheaper, it's hard to imagine that search doesn't just become: have an agent retrieve as many relevant docs as possible, then filter through all of them with a Sonnet-level model.

Joe Barrow@barrowjoseph

x.com/i/article/2065…

947
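A minimal sketch of the map-filter workflow described in the post above, under stated assumptions: `map_filter_search`, `toy_retrieve`, `toy_judge`, and `CORPUS` are hypothetical names invented for illustration. The keyword-overlap retriever stands in for an agent doing high-recall retrieval, and the strict word-containment judge stands in for a per-document call to a Sonnet-level model. Neither reflects any real system or API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

# Hypothetical toy corpus, used only to make the sketch runnable.
CORPUS = [
    "kv cache indexing",
    "knicks parade",
    "kv cache eviction",
    "cache invalidation",
]


def toy_retrieve(query: str) -> List[str]:
    """High-recall stand-in for agentic retrieval: any shared word counts."""
    words = set(query.lower().split())
    return [doc for doc in CORPUS if words & set(doc.lower().split())]


def toy_judge(query: str, doc: str) -> bool:
    """Stand-in for a model-based relevance check: all query words must appear."""
    return set(query.lower().split()) <= set(doc.lower().split())


def map_filter_search(
    query: str,
    retrieve: Callable[[str], Iterable[str]],
    judge: Callable[[str, str], bool],
    max_workers: int = 8,
) -> List[str]:
    """Over-retrieve candidates, then judge every candidate independently."""
    # Deduplicate while preserving retrieval order.
    candidates = list(dict.fromkeys(retrieve(query)))
    # The "map" step: one independent judgment per candidate, run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        verdicts = list(pool.map(lambda doc: judge(query, doc), candidates))
    # The "filter" step: keep only the candidates the judge accepted.
    return [doc for doc, keep in zip(candidates, verdicts) if keep]


if __name__ == "__main__":
    print(map_filter_search("kv cache", toy_retrieve, toy_judge))
```

The retriever is deliberately loose and the judge deliberately strict. In this framing, precision comes from the filter step, so the retrieval step only needs to avoid missing relevant documents.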

Jasper Lu@lu__jasper·1d

@edwardzhou_ /goal make me a benchmark to test /goal loops to the point of performance degradation

Edward@edwardzhou_·1d

now that loops are trendy… are there any benchmarks where we test a model's ability to extend its TTC infinitely via standard loops & measure the point of performance degradation, e.g. how good is it at following a minimal /goal setup?

Jasper Lu@lu__jasper·1d

@signulll Excited for this. Once on-device AI is good enough, I think we'll start to see intelligence embedded into apps in more fun ways than just being a chatbot

235

signüll@signulll·1d

my lord i am convinced on device ai will be good enough very very soon which will finally enable zero marginal cost ai products. that means network effects can actually take place. this will be a huge shift for consumer experiences.

993

90.8K

Jasper Lu@lu__jasper·1d

Is there a name for this kind of collage aesthetic I've been seeing lately

Jasper Lu@lu__jasper·1d

I've noticed in my own daily use that previous LLMs are pretty bad at writing complex SFT / RL pipelines unless I send VERY detailed prompts. From their report, it seems like these are the use cases they were targeting with nerfs. But... competing labs probably already have the right talent in-house, so these nerfs probably wouldn't hurt that much.

1.4K

Jack Morris@jxmnop·1d

An underrated part of this discussion is that (a) there's huge leverage in improving data, and (b) there's no way Anthropic could safeguard this. xAI could instruct Fable to look through EVERY row of pretraining data and fix any typos and errors. This is probably the single highest-leverage activity for a lab playing catchup, and it's not possible for Anthropic to prevent this without completely kneecapping the model itself, because data quality work looks like any other kind of knowledge work ("check this text for errors", "rewrite this in a formal tone")

Max Zeff@ZeffMax

NEW: Anthropic is walking back Claude Fable 5's policy to covertly degrade performance for competing AI researchers, after facing fierce backlash. “We’re changing Fable 5’s safeguards for frontier LLM development to make them visible,” Anthropic tells WIRED. “We made the wrong tradeoff and we apologize for not getting the balance right.”

161

107.4K

Jasper Lu@lu__jasper·1d

The new test of whether you're really doing "cutting-edge work" is if Fable nerfs itself for you

Jasper Lu@lu__jasper·1d

Interesting that Google is investing in post-training their models towards such a narrow domain instead of betting on all-around scaling

Google Research@GoogleResearch

🚀 Introducing Gemini-SQL2, our breakthrough text-to-SQL capability powered by Gemini 3.1 Pro! We've achieved state-of-the-art results on the highly competitive BIRD benchmark, translating natural language into execution-ready SQL queries. 🧵👇

Jasper Lu@lu__jasper·2d

@SaiMandhan Always found it a little odd that RL env companies command such high multiples. Manual creation of environments has always felt a little bitter lesson pilled to me.

Sai Mandhan@SaiMandhan·2d

I’m curious how durable these RL env / human data companies are long term. They’re essentially just selling shovels until the mine learns to dig itself. If RSI takes off, models will generate, solve, critique, and expand their own curricula faster than any human can design new environments. Feels very much like a business model with an expiration date. They print money tho lol

104

22.4K

Jasper Lu@lu__jasper·2d

@teortaxesTex Their choice of benchmarks is a little... odd

273

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·2d

I want to see this compared with Composer 2.5. Like, really hard. Cursor has a ton of proprietary data, a large head start, and threw a Colossus at RLing Kimi K2.5 checkpoint. What is the gap now?

Kimi.ai@Kimi_Moonshot

🌘 Kimi-K2.7-Code, our latest coding model, is now released and open-sourced!
🔷 Improved coding & agent performance over K2.6: +21.8% on Kimi Code Bench v2, +11.0% on Program Bench, and +31.5% on MLS Bench Lite.
🔷 Reasoning efficiency: Less overthinking, with 30% lower reasoning-token usage compared to K2.6.
🔷 Long-horizon coding: Improved instruction following, higher end-to-end coding task success rates.
⚡️ 6x High-Speed Mode coming soon!
🔌 Available today via Kimi API and Kimi Code.
🔗 Kimi Code: kimi.com/code
🔗 API: platform.moonshot.ai

493

42.8K

Jasper Lu@lu__jasper·2d

Links!
Pedagogical RL - noahziems.com/pedagogical-rl
Trajectory-Refined Distillation - arxiv.org/pdf/2606.08432
Speculative Knowledge Distillation - arxiv.org/pdf/2410.11325

Jasper Lu@lu__jasper·2d

Starting to see more work targeting a key failure mode in on-policy (self) distillation: when the student's rollout drifts too far from the teacher's distribution, the reward signal on later tokens can get noisy.

The common thread people seem to be converging on is the idea of teacher intervention: instead of training on raw student rollouts, you set up a teacher to help shape the trajectory before distillation.

Three techniques that stood out to me:

1/ Pedagogical RL
You first RL a copy of the base model to be good at taking in privileged information (e.g. an answer key) and generating distillation-friendly rollouts. Then, you sample from this model during training instead of the student and run distillation.

2/ Trajectory-Refined Distillation
For each training example, you first sample a rollout from the student, then ask a teacher model to rewrite the trajectory, potentially using some privileged information. Finally, you distill the rewritten rollout into the student.

3/ Speculative Knowledge Distillation
A little older than the current wave of OP(S)D techniques, but still interesting: during student sampling, you compare each token with the teacher's top-k. If it falls outside, you sample the token from the teacher instead. This helps keep the rollout from going too far off-track.

My thoughts:
- All three approaches sort of blur the line of what "on-policy" really means.
- Pedagogical RL feels the most elegant, but IMO requires a little too much effort to gain widespread adoption.
- To me, something like SKD but rebuilt for self-distillation really feels like the most natural next step.

Anything out there doing this already?
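The token-level check described for Speculative Knowledge Distillation can be sketched roughly as follows. This is a toy illustration of the idea as summarized in the post, not the paper's implementation: models are stand-in functions that return a token-to-probability dict, and `interleaved_rollout`, `toy_teacher`, and `toy_student` are hypothetical names made up for this example.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Dist = Dict[str, float]                       # token -> probability
NextTokenFn = Callable[[Sequence[str]], Dist]  # context -> next-token distribution


def top_k_tokens(dist: Dist, k: int) -> List[str]:
    """Return the k most probable tokens under dist."""
    return sorted(dist, key=lambda tok: dist[tok], reverse=True)[:k]


def sample(dist: Dist, rng: random.Random) -> str:
    """Draw one token according to its probability."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def interleaved_rollout(
    student: NextTokenFn,
    teacher: NextTokenFn,
    prompt: Sequence[str],
    k: int,
    max_new_tokens: int,
    rng: random.Random,
    eos: str = "<eos>",
) -> Tuple[List[str], List[bool]]:
    """Sample from the student, falling back to the teacher per token.

    Returns the generated tokens and a parallel list of flags marking
    which positions were replaced by a teacher sample.
    """
    context = list(prompt)
    generated: List[str] = []
    replaced: List[bool] = []
    for _ in range(max_new_tokens):
        proposal = sample(student(context), rng)
        teacher_dist = teacher(context)
        if proposal in top_k_tokens(teacher_dist, k):
            token, was_replaced = proposal, False   # student token is acceptable
        else:
            token, was_replaced = sample(teacher_dist, rng), True  # teacher steps in
        context.append(token)
        generated.append(token)
        replaced.append(was_replaced)
        if token == eos:
            break
    return generated, replaced


def toy_teacher(context: Sequence[str]) -> Dist:
    """Hypothetical teacher: prefers 'a', tolerates 'b', never emits 'c'."""
    return {"a": 0.7, "b": 0.3, "c": 0.0}


def toy_student(context: Sequence[str]) -> Dist:
    """Hypothetical student that has drifted toward 'c'."""
    return {"c": 0.8, "a": 0.2}


if __name__ == "__main__":
    tokens, flags = interleaved_rollout(
        toy_student, toy_teacher, ["<bos>"], k=2, max_new_tokens=8,
        rng=random.Random(0),
    )
    print(tokens, flags)
```

The resulting mixed rollout is what would then be fed to the distillation loss. The `replaced` flags make it easy to measure how far the data has moved away from purely on-policy sampling.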

247

Jasper Lu@lu__jasper·2d

Awesome to see that Google is pushing on open-source diffusion LLMs. I've long thought that dLLMs could be a better fit than autoregressive LLMs for tasks like frontend design, and have been patiently waiting for open-source models to mature so that I can play around with them.

Google Gemma@googlegemma

Meet DiffusionGemma! An experimental open model that explores a fast approach to text generation, released under an Apache 2.0 license. Moving beyond sequential, token-by-token processes to generate entire blocks of text simultaneously. Here’s what’s new with DiffusionGemma: 👇

126

Jasper Lu reposted

Figma@figma·2d

Imagine a world where you could copy/paste websites into editable Figma layers (jk you don’t have to imagine you can do this now with our Chrome extension)

GIF

201

555

5.9K

2.3M
