Haidar Khan

312 posts

@haidarkk1

Research Scientist (Currently @Meta, previously @SDAIA_SA, @Amazon). PhD CS @rpi. Scale smarter, not harder. Opinions generated from an independent LLM.

Atlanta, GA · Joined January 2017
138 Following · 303 Followers
Pinned Tweet
Haidar Khan
Haidar Khan@haidarkk1·
Here is some news I'm happy to share before the @iclr_conf FOMO really starts to set in 😢. We have been playing with the idea of using games as LLM evals (pun intended) for a while now and it's finally ready! ZeroSumEval is a scalable evaluation methodology that pits models against each other and calculates ratings. Paper 📜, code 💻, and details 🗒️ in the 🧵.
Here is the TL;DR:
- Since it's PvP ⚔️, the hardness scales with model capabilities 🤖, making it hard to saturate.
- The evaluations in ZSEval are dynamic 🔄 and verifiable ✅, so it's difficult to overfit. Rote memorization is especially penalized.
- Observing model behavior in games leads to interesting insights 🔍, such as creative attempts (or lack thereof) at jailbreaking other models.
Had a blast working on ZeroSumEval with @HishamAlyahya, @y_alnumay, @sbmaruf, and Bülent Yener (great to collaborate 🤝 with my advisor again).
Haidar Khan tweet media
English
3
7
36
6.5K
Haidar Khan
Haidar Khan@haidarkk1·
@imranye Can you show what the final result looks like?
English
0
0
0
241
Haidar Khan
Haidar Khan@haidarkk1·
It's trained on a good chunk of coding and math data as well as scientific papers. It's pretty good at answering factual questions that require straight recall. It doesn't have reasoning, so multi-step QA will be difficult. The intention was to transfer these capabilities to Arabic, and according to benchmarks it worked. More details in the paper (arxiv.org/pdf/2407.15390); this table shows the pretraining data distribution by domain:
Haidar Khan tweet media
English
0
0
2
23
Joe
Joe@joebradford·
@haidarkk1 As far as the capabilities of this model go, what do you think it is best suited for in light of what other models have to offer (although not trained on Arabic)?
English
1
0
0
36
Joe
Joe@joebradford·
Who out there from my followers is fine-tuning LLMs on an Arabic corpus? I'm looking for groups and communities I can join to get advice and better understand the space.
English
10
0
23
5.1K
Joe
Joe@joebradford·
@haidarkk1 Oh interesting. Did they ever release it publicly?
English
1
0
0
347
Haidar Khan
Haidar Khan@haidarkk1·
@imranye And somehow we still have the muffintop going :D
English
0
0
6
1.9K
imran
imran@imranye·
for suhoor i made a kirkland water bottle and a medjool date
English
25
461
10.1K
211.4K
Ian Goodfellow
Ian Goodfellow@goodfellow_ian·
I'd like to thank @daniel_rossett for his help in my recovery from the POTS version of Long COVID. Daniel was key in bringing me back from highly disabled and suffering to being able to do what I want to again. This X account is mostly focused on ML / AI. From that point of view, many of you know that in December 2024, I wasn't able to do the test of time award talk at NeurIPS, even by video call. Daniel started working with me in March 2025. By April, I started to have days of no POTS symptoms, by June I was off all heart rate lowering medications, by September I was back to work. I'm back to full exercise, running, lifting weights, mountain biking, and have even done things I hadn't done before I got sick, like riding Whistler Mountain Bike Park. I'm now getting the word out to help Daniel build a company that will bring this approach to more people.
English
171
83
2.7K
206.3K
Haidar Khan
Haidar Khan@haidarkk1·
Anthropic = Apple
OpenAI = Microsoft
Both win.
English
0
0
0
87
Haidar Khan
Haidar Khan@haidarkk1·
Ask Claude to review your AWS infrastructure to save costs. You’re welcome.
English
0
0
0
73
Haidar Khan
Haidar Khan@haidarkk1·
@sbmaruf I don't think that's a credit to the city LOL
English
1
0
0
41
Haidar Khan
Haidar Khan@haidarkk1·
@jyangballin Nice work, I like the coding focus. In ZeroSumEval we saw similar results, especially in the Pyjail simulation. x.com/haidarkk1/stat…
English
1
0
2
172
John Yang
John Yang@jyangballin·
New eval! Code duels for LMs ⚔️
Current evals test LMs on *tasks*: "fix this bug," "write a test"
But we code to achieve *goals*: maximize revenue, cut costs, win users
Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals
English
31
99
416
101.9K
SemiAnalysis
SemiAnalysis@SemiAnalysis_·
Zareen is one of the go-to places for many SF Bay Area AI researchers to get a quick bite. Most of the food is very good, and the restaurant was even in the Michelin Guide in 2020. AI researchers not experienced with Indian cuisine will commonly order their chicken tikka masala with garlic naan and mango lassi.
SemiAnalysis tweet media
English
51
14
593
242.8K
Haidar Khan
Haidar Khan@haidarkk1·
@xeophon This is an interesting benchmark. As others have pointed out, it's a mistake to allow silent correction. Should actually penalize for that.
English
1
0
3
287
Florian Brand
Florian Brand@xeophon·
After thinking about this problem for months, I am so happy to finally introduce DetailBench! It answers a simple question: How good are current LLMs at finding small errors, when they are *not* explicitly asked to do so? (Yes, the graph is right!)
Florian Brand tweet media
English
77
63
928
141K
Dylan Patel
Dylan Patel@dylan522p·
The children yearn to be working in fabs
Students at this year's Taiwan high school science exhibition are discussing 1.5nm Gate All Around transistor structure optimization
The kids are unbelievably cracked
Dylan Patel tweet media
Haarlemmermeer, Nederland 🇳🇱 English
40
83
1.2K
94.8K
Haidar Khan
Haidar Khan@haidarkk1·
@jxmnop Right, also good to remember that this wasn't possible before the current generation of models. We couldn’t just get better by defining the task and getting data. That's new and a reason to be excited.
English
0
0
1
795
Jack Morris
Jack Morris@jxmnop·
How frontier AI models are built
1. identify task model cannot solve
2. create task eval
3. collect new *eval-specific* training data
4. train model
5. model can do the task now
6. if not (AGI achieved): goto step 1
GLUE, MMLU, MATH, AIME, HLE, GPQA, ARC
a tale as old as time
English
27
25
671
39.2K
xAI
xAI@xai·
The Grok 4 livestream will begin soon. Stay tuned.
English
4.4K
1.9K
20.4K
5.8M
Chase Passive Income
Chase Passive Income@chasedownleads·
Jeff Bezos is rich for one simple reason:
He's BALD and has saved $238.4 billion by never needing to pay for a haircut
Study the rich if you want to be wealthy
Chase Passive Income tweet media
English
263
2.4K
50.9K
1.5M
Zeyuan Allen-Zhu, Sc.D.
Zeyuan Allen-Zhu, Sc.D.@ZeyuanAllenZhu·
No matter how AI evolves overnight—tech, career, how it may impact me—I remain committed to using "physics of language models" approach to predict next-gen AI. Due to my limited GPU access at Meta, Part 4.1 (+new 4.2) are still in progress, but results on Canon layers are shining
Zeyuan Allen-Zhu, Sc.D. tweet media
Zeyuan Allen-Zhu, Sc.D.@ZeyuanAllenZhu

(1/8)🍎A Galileo moment for LLM design🍎 As Pisa Tower experiment sparked modern physics, our controlled synthetic pretraining playground reveals LLM architectures' true limits. A turning point that might divide LLM research into "before" and "after." physics.allen-zhu.com/part-4-archite…

English
22
65
833
459.2K