Jasper Lu

106 posts

@lu__jasper

teaching models to design @figma, formerly @nuro. Knicks in 5

NYC · Joined July 2009

101 Following · 93 Followers

Jasper Lu@lu__jasper·9h

Huge if these numbers hold up in real use. Long-horizon capability felt like one of the more durable moats for closed frontier models.

Z.ai@Zai_org

For GLM-5.2, we strengthened 1M-context training for coding agents across large-scale implementation, automated research, performance optimization, and complex debugging. The result is a long-context system that is both broad in scope and reliable in execution.

Jasper Lu@lu__jasper·1d

@immortaldip Oh nice! What do you do for the actual score assignment when using gemini flash btw? My approach right now reuses the embedding scores but just uses gpt to filter out irrelevant docs.

immortal@immortaldip·1d

We use it in production (the reranker is Gemini Flash). In fact, if you look at the latency difference, it's not much in the whole end-to-end pipeline; cost is only a problem if your customer can't digest it. And as you said about turning knobs, you can inject additional guidelines per customer, which none of the non-LLM methods can provide.

Jasper Lu@lu__jasper·1d

Some early numbers to back this up (blasted through my usage limits to get these). I tested a simple "retrieve top 1k using embeddings and filter with GPT 5.5" approach over 5 rows of the OBLIQ-bench wildchat split. It's a tiny sample, but the signal is pretty clear: on these kinds of hard samples, filtering beats cross-encoders on every metric. I found the results pretty consistent across a number of different reranker models. The beauty of modern LLMs as it relates to hard search is that we now have a generic classifier whose precision / recall we can tune at will. In comparison, today's rerankers just aren't smart enough to parse these kinds of hard queries, and offer no real knobs to turn.

Jasper Lu@lu__jasper

Been thinking about this topic a lot while playing around with OBLIQ-bench. IMO, hard search will increasingly converge towards map-filter workflows in the future. As small models get smarter and compute gets cheaper, it's hard to imagine that search doesn't just become: have an agent retrieve as many relevant docs as possible and then filter through all of them with a Sonnet-level model.
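To make the "retrieve with embeddings, then filter with an LLM" idea concrete, here is a minimal sketch of a map-filter pipeline. It is not the setup used for the numbers above. The names `Doc`, `retrieve_top_k`, `map_filter`, and `keyword_judge` are hypothetical, and `keyword_judge` is a toy stand-in for an LLM relevance call. As described in the thread, the final ranking reuses the embedding scores, and the judge only decides what to drop. The `threshold` parameter is the assumed precision / recall knob.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Doc:
    doc_id: str
    text: str
    embedding: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(
    query_emb: Sequence[float], docs: List[Doc], k: int = 1000
) -> List[Tuple[float, Doc]]:
    """Stage 1: rank every doc by embedding similarity and keep the top k."""
    scored = [(cosine(query_emb, d.embedding), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def map_filter(
    query: str,
    query_emb: Sequence[float],
    docs: List[Doc],
    judge: Callable[[str, str], float],
    k: int = 1000,
    threshold: float = 0.5,
) -> List[Tuple[float, Doc]]:
    """Stage 2: drop candidates the judge scores below `threshold`.

    Survivors keep their embedding score and order. Raising the threshold
    trades recall for precision, lowering it does the opposite.
    """
    candidates = retrieve_top_k(query_emb, docs, k)
    return [(s, d) for s, d in candidates if judge(query, d.text) >= threshold]


def keyword_judge(query: str, text: str) -> float:
    """Toy stand-in for an LLM call: fraction of query words found in text."""
    words = query.lower().split()
    hits = sum(w in text.lower() for w in words)
    return hits / len(words) if words else 0.0


docs = [
    Doc("a", "knicks parade route in manhattan", [1.0, 0.0]),
    Doc("b", "reranker latency benchmarks", [0.9, 0.1]),
    Doc("c", "manhattan parade street closures", [0.6, 0.8]),
]
strict = map_filter("manhattan parade", [1.0, 0.0], docs, keyword_judge, k=3, threshold=1.0)
loose = map_filter("manhattan parade", [1.0, 0.0], docs, keyword_judge, k=3, threshold=0.0)
print([d.doc_id for _, d in strict])  # only docs the judge accepts
print([d.doc_id for _, d in loose])   # everything retrieved
```

In a real pipeline the `judge` callable would wrap a model call, and per-customer guidelines could be added to its prompt without touching the retrieval stage.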

Jasper Lu@lu__jasper·1d

Agreed. There are two interesting ways to improve retrieval right now:
- Better embedding models, which can be harder than it seems
- Agentic retrieval before the filter step
I think search is always a balance of cost (in man-hours or compute) against the quality you're after, and right now we're a little too conservative on cost.

Marek Galovic@marek_galovic·1d

@lu__jasper Assuming small models are still orders of magnitude more expensive, you’ll benefit from better retrieval before giving the model stuff to filter through.

Jasper Lu@lu__jasper·3d

Joe Barrow@barrowjoseph

x.com/i/article/2065…

Jasper Lu reposted

Mayor Zohran Kwame Mamdani@NYCMayor·3d

Parade. Thursday. Manhattan.

Jasper Lu@lu__jasper·3d

@LegionHoops @ChrisBHaynes What club are we going to

Legion Hoops@LegionHoops·3d

BREAKING: Knicks will be flying back to New York to celebrate tonight, per @ChrisBHaynes

Jasper Lu@lu__jasper·3d

@ar0cket1 Have you tried this in the top-k distillation setting as well (as opposed to just sampled tokens)?

ar0cket1@ar0cket1·3d

x.com/i/article/2065…

Jasper Lu@lu__jasper·3d

@QiaochuYuan You can probably test this hypothesis by comparing a diffusion LLM vs an autoregressive one of the same family, e.g., diffusion Gemma vs the usual one

QC@QiaochuYuan·5d

interesting hypothesis that the "not X, but Y" LLMism is an artifact of "not" being a high-probability completion since it can continue in so many different ways, and that other LLMisms can be understood similarly. anyone know if any work has been done on this?

Jasper Lu@lu__jasper·3d

@barrowjoseph One fun direction I've been wanting to play around with (once I figure out how to do it without breaking the bank) is to turn indexing into precomputing KV caches over an entire corpus and then dumping them into an object store for faster filtering

Joe Barrow@barrowjoseph·3d

@lu__jasper Thinking the same! Thankfully “sonnet-level models” are getting cheaper and smaller.

Jasper Lu@lu__jasper·4d

@edwardzhou_ /goal make me a benchmark to test /goal loops to the point of performance degradation

Edward@edwardzhou_·4d

now that loops are trendy… are there any benchmarks where we test a model's ability to extend its TTC infinitely via standard loops & measure the point of performance degradation? e.g. how good is it at following a minimal /goal setup?

Jasper Lu@lu__jasper·4d

@signulll Excited for this. Once on-device AI is good enough, I think we'll start to see intelligence embedded into apps in more fun ways than just being a chatbot

signüll@signulll·4d

my lord i am convinced on device ai will be good enough very very soon which will finally enable zero marginal cost ai products. that means network effects can actually take place. this will be a huge shift for consumer experiences.

Jasper Lu@lu__jasper·4d

Is there a name for this kind of collage aesthetic I've been seeing lately

Jasper Lu@lu__jasper·4d

I've noticed in my own daily use that previous LLMs are pretty bad at writing complex SFT / RL pipelines unless I send VERY detailed prompts. From their report, it seems like these are the use cases they were targeting with nerfs. But... competing labs probably already have the right talent in-house, so these nerfs probably wouldn't hurt that much.

Jack Morris@jxmnop·4d

An underrated part of this discussion is that (a) there's huge leverage in improving data, and (b) there's no way Anthropic could safeguard this. xAI could instruct Fable to look through EVERY row of pretraining data and fix any typos and errors. This is probably the single highest-leverage activity for a lab playing catch-up, and it's not possible for Anthropic to prevent this without completely kneecapping the model itself, because data quality work looks like any other kind of knowledge work ("check this text for errors", "rewrite this in a formal tone").

Max Zeff@ZeffMax

NEW: Anthropic is walking back Claude Fable 5's policy to covertly degrade performance for competing AI researchers, after facing fierce backlash. “We’re changing Fable 5’s safeguards for frontier LLM development to make them visible,” Anthropic tells WIRED. “We made the wrong tradeoff and we apologize for not getting the balance right.”

Jasper Lu@lu__jasper·4d

The new test of if you're really doing "cutting edge work" is whether Fable nerfs itself for you

Jasper Lu@lu__jasper·4d

Interesting that Google is investing in post-training their models towards such a narrow domain instead of betting on all-around scaling

Google Research@GoogleResearch

🚀 Introducing Gemini-SQL2, our breakthrough text-to-SQL capability powered by Gemini 3.1 Pro! We've achieved state-of-the-art results on the highly competitive BIRD benchmark, translating natural language into execution-ready SQL queries. 🧵👇

Jasper Lu@lu__jasper·4d

@SaiMandhan Always found it a little odd that RL env companies command such high multiples. Manual creation of environments has always felt a little bitter lesson pilled to me.

Sai Mandhan@SaiMandhan·5d

I’m curious how durable these RL env / human data companies are long term. They’re essentially just selling shovels until the mine learns to dig itself. If RSI takes off, models will generate, solve, critique, and expand their own curricula faster than any human can design new environments. Feels very much like a business model with an expiration date. They print money tho lol

Jasper Lu@lu__jasper·4d

@teortaxesTex Their choice of benchmarks is a little... odd

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·4d

I want to see this compared with Composer 2.5. Like, really hard. Cursor has a ton of proprietary data, a large head start, and threw a Colossus at RLing a Kimi K2.5 checkpoint. What is the gap now?

Kimi.ai@Kimi_Moonshot

🌘 Kimi-K2.7-Code, our latest coding model, is now released and open-sourced! 🔷 Improved coding & agent performance over K2.6: +21.8% on Kimi Code Bench v2, +11.0% on Program Bench, and +31.5% on MLS Bench Lite. 🔷 Reasoning efficiency: Less overthinking, with 30% lower reasoning-token usage compared to K2.6. 🔷 Long-horizon coding: Improved instruction following, higher end-to-end coding task success rates. ⚡️ 6x High-Speed Mode coming soon! 🔌 Available today via Kimi API and Kimi Code. 🔗 Kimi Code: kimi.com/code 🔗 API: platform.moonshot.ai
