Jasper Lu

106 posts

@lu__jasper

teaching models to design @figma, formerly @nuro. Knicks in 5

NYC · Joined July 2009
101 Following · 93 Followers
Jasper Lu
Jasper Lu@lu__jasper·
@immortaldip Oh nice! What do you do for the actual score assignment when using Gemini Flash btw? My approach right now reuses the embedding scores but just uses GPT to filter out irrelevant docs.
1
0
0
30
immortal
immortal@immortaldip·
We use it in production (the reranker is Gemini Flash). In fact, if you look at the latency difference, it's not much in the whole end-to-end pipeline; cost is only a problem if your customer can't digest it. And as you said about turning knobs, you can inject additional guidelines per customer, which none of the non-LLM methods can provide.
1
0
1
106
Jasper Lu
Jasper Lu@lu__jasper·
Some early numbers to back this up (blasted through my usage limits to get these). I tested a simple "retrieve top 1k using embeddings and filter with GPT 5.5" approach over 5 rows of the OBLIQ-bench wildchat split. It's a tiny sample, but the signal is pretty clear: on these kinds of hard samples, filtering beats cross-encoders on every metric. I found the results pretty consistent across a number of different reranker models. The beauty of modern LLMs as it relates to hard search is that we now have a generic classifier whose precision / recall we can tune at will. In comparison, today's rerankers just aren't smart enough to parse these kinds of hard queries, and offer no real knobs to turn.
Jasper Lu tweet media
Jasper Lu@lu__jasper

Been thinking about this topic a lot while playing around with OBLIQ-bench. IMO, hard search will increasingly converge towards map-filter workflows in the future. As small models get smarter and compute gets cheaper, it's hard to imagine that search doesn't just become: have an agent retrieve as many relevant docs as possible and then filter through all of them with a Sonnet-level model.

2
2
17
4.1K
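A minimal sketch of the "retrieve the top k with embeddings, then filter with an LLM" idea from the post above, assuming documents are already embedded. The `judge` callable is a hypothetical stand-in for the LLM relevance check (no real model API is called here), and the function and parameter names are made up for illustration. As in the reply further up, the surviving documents keep their embedding scores for ranking; the judge only decides what gets dropped.

```python
import math
from typing import Callable, List, Sequence, Tuple

Doc = Tuple[str, str, Sequence[float]]  # (doc_id, text, embedding)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_then_filter(
    query_text: str,
    query_vec: Sequence[float],
    docs: List[Doc],
    judge: Callable[[str, str], bool],  # hypothetical LLM relevance check
    top_k: int = 1000,
) -> List[Tuple[str, float]]:
    """Return (doc_id, embedding_score) pairs that pass the judge."""
    # Map step: score every document by embedding similarity, keep the top k.
    scored = sorted(
        ((cosine(query_vec, vec), doc_id, text) for doc_id, text, vec in docs),
        key=lambda item: item[0],
        reverse=True,
    )[:top_k]
    # Filter step: the judge only removes documents; ranking stays embedding-based.
    return [(doc_id, score) for score, doc_id, text in scored if judge(query_text, text)]


# Toy data; the keyword check below stands in for a real LLM call.
docs = [
    ("a", "knicks win the series", [1.0, 0.0]),
    ("b", "reranker latency notes", [0.9, 0.1]),
    ("c", "embedding models for search", [0.0, 1.0]),
]
result = retrieve_then_filter(
    "reranker latency",
    [1.0, 0.0],
    docs,
    judge=lambda query, text: "knicks" not in text,
    top_k=2,
)
print(result)
```

In this sketch the strictness of `judge` is the knob: a stricter check trades recall for precision, a looser one does the opposite.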
Jasper Lu
Jasper Lu@lu__jasper·
Agreed. There are two interesting ways to improve retrieval right now:
- Better embedding models, which can be harder than it seems
- Agentic retrieval before the filter step
I think search is always a balance of how much cost (in man-hours or compute) you’re willing to pay vs. the quality you get, and right now we’re a little too conservative in terms of cost.
0
0
0
30
Marek Galovic
Marek Galovic@marek_galovic·
@lu__jasper Assuming small models are still orders of magnitude more expensive, you’ll benefit from better retrieval before giving the model stuff to filter through.
1
0
0
68
Jasper Lu
Jasper Lu@lu__jasper·
Been thinking about this topic a lot while playing around with OBLIQ-bench. IMO, hard search will increasingly converge towards map-filter workflows in the future. As small models get smarter and compute gets cheaper, it's hard to imagine that search doesn't just become: have an agent retrieve as many relevant docs as possible and then filter through all of them with a Sonnet-level model.
Joe Barrow@barrowjoseph

x.com/i/article/2065…

3
0
10
5.4K
Legion Hoops
Legion Hoops@LegionHoops·
BREAKING: Knicks will be flying back to New York to celebrate tonight, per @ChrisBHaynes
62
835
20.3K
1.4M
Jasper Lu
Jasper Lu@lu__jasper·
@ar0cket1 Have you tried this in the top-k distillation setting as well (as opposed to just sampled tokens)?
1
0
0
281
Jasper Lu
Jasper Lu@lu__jasper·
@QiaochuYuan You can probably test this hypothesis by comparing a diffusion LLM vs. an autoregressive one of the same family, e.g., diffusion Gemma vs. the usual one
0
0
0
38
QC
QC@QiaochuYuan·
interesting hypothesis that the "not X, but Y" LLMism is an artifact of "not" being a high-probability completion since it can continue in so many different ways, and that other LLMisms can be understood similarly. anyone know if any work has been done on this?
QC tweet media
23
15
248
11.5K
Jasper Lu
Jasper Lu@lu__jasper·
@barrowjoseph One fun direction I've been wanting to play around with (once I figure out how to do it without breaking the bank) is to turn indexing into precomputing KV caches over an entire corpus and then dumping them into an object store for faster filtering
1
0
1
69
Joe Barrow
Joe Barrow@barrowjoseph·
@lu__jasper Thinking the same! Thankfully “sonnet-level models” are getting cheaper and smaller.
1
0
1
283
Jasper Lu
Jasper Lu@lu__jasper·
@edwardzhou_ /goal make me a benchmark to test /goal loops to the point of performance degradation
0
0
0
48
Edward
Edward@edwardzhou_·
now that loops are trendy… are there any benchmarks where we test a model’s ability to extend its TTC infinitely via standard loops & measure the point of performance degradation? e.g. how good is it at following a minimal /goal setup?
1
0
0
93
Jasper Lu
Jasper Lu@lu__jasper·
@signulll Excited for this. Once on-device AI is good enough, I think we’ll start to see intelligence embedded into apps in some more fun ways than just being a chatbot
0
0
0
238
signüll
signüll@signulll·
my lord i am convinced on device ai will be good enough very very soon which will finally enable zero marginal cost ai products. that means network effects can actually take place. this will be a huge shift for consumer experiences.
79
33
994
94.1K
Jasper Lu
Jasper Lu@lu__jasper·
Is there a name for this kind of collage aesthetic I've been seeing lately
Jasper Lu tweet mediaJasper Lu tweet media
0
0
2
93
Jasper Lu
Jasper Lu@lu__jasper·
I've noticed in my own daily use that previous LLMs are pretty bad at writing complex SFT / RL pipelines unless I send VERY detailed prompts. From their report, it seems like these are the use cases they were targeting with nerfs. But... competing labs probably already have the right talent in-house, so these nerfs probably wouldn't hurt that much.
0
0
0
1.4K
Jack Morris
Jack Morris@jxmnop·
An underrated part of this discussion is that (a) there's huge leverage in improving data, and (b) there's no way Anthropic could safeguard this. xAI could instruct Fable to look through EVERY row of pretraining data and fix any typos and errors. This is probably the single highest-leverage activity for a lab playing catch-up, and it's not possible for Anthropic to prevent this without completely kneecapping the model itself, because data quality work looks like any other kind of knowledge work ("check this text for errors", "rewrite this in a formal tone").
Max Zeff@ZeffMax

NEW: Anthropic is walking back Claude Fable 5's policy to covertly degrade performance for competing AI researchers, after facing fierce backlash. “We’re changing Fable 5’s safeguards for frontier LLM development to make them visible,” Anthropic tells WIRED. “We made the wrong tradeoff and we apologize for not getting the balance right.”

15
2
161
108.8K
Jasper Lu
Jasper Lu@lu__jasper·
The new test of if you're really doing "cutting edge work" is whether Fable nerfs itself for you
0
0
1
56
Jasper Lu
Jasper Lu@lu__jasper·
@SaiMandhan Always found it a little odd that RL env companies command such high multiples. Manual creation of environments has always felt a little bitter lesson pilled to me.
0
0
0
74
Sai Mandhan
Sai Mandhan@SaiMandhan·
I’m curious how durable these RL env / human data companies are long term. They’re essentially just selling shovels until the mine learns to dig itself. If RSI takes off, models will generate, solve, critique, and expand their own curricula faster than any human can design new environments. Feels very much like a business model with an expiration date. They print money tho lol
18
1
104
22.6K