Haidar Khan (@haidarkk1) - Twitter Profile

Pinned Tweet

Haidar Khan@haidarkk1·21 Apr

Here is some news I'm happy to share before the @iclr_conf FOMO really starts to set in 😢. We have been playing with the idea of using games as LLM evals (pun intended) for a while now and it's finally ready! ZeroSumEval is a scalable evaluation methodology that pits models against each other and calculates ratings. Paper 📜, code 💻, and details 🗒️ in the 🧵. Here is the TL;DR:
- Since it's PvP ⚔️, the hardness scales with model capabilities 🤖, making it hard to saturate.
- The evaluations in ZSEval are dynamic 🔄 and verifiable ✅, so it's difficult to overfit. Rote memorization is especially penalized.
- Observing model behavior in games leads to interesting insights 🔍, such as creative attempts (or lack thereof) at jailbreaking other models.
Had a blast working on ZeroSumEval with @HishamAlyahya, @y_alnumay, @sbmaruf, and Bülent Yener (great to collaborate 🤝 with my advisor again).

3

7

36

6.5K

Haidar Khan@haidarkk1·24 Mar

@imranye Can you show what the final result looks like?

0

241

imran@imranye·24 Mar

i copied and pasted this tweet into my hermes agent and it one shotted this whole thing 😭

Claude@claudeai

You can now enable Claude to use your computer to complete tasks. It opens your apps, navigates your browser, fills in spreadsheets—anything you'd do sitting at your desk. Research preview in Claude Cowork and Claude Code, macOS only.

9

1

42

13.1K

Haidar Khan@haidarkk1·20 Mar

It's trained on a good chunk of coding and math data as well as scientific papers. It's pretty good at answering factual questions that require straight recall. It doesn't have reasoning, so multi-step QA will be difficult. The intention was to transfer these capabilities to Arabic, and according to benchmarks it worked. More details in the paper (arxiv.org/pdf/2407.15390); this table shows the pretraining data distribution by domain:

0

2

23

Joe@joebradford·18 Mar

@haidarkk1 As far as the capabilities of this model go, what do you think it is best suited for in light of what other models have to offer (although not trained on Arabic)?

1

0

36

Joe@joebradford·15 Mar

Who out there among my followers is fine-tuning LLMs on an Arabic corpus? I'm looking for groups and communities I can join to get advice and better understand the space.

10

0

23

5.1K

Haidar Khan@haidarkk1·15 Mar

@joebradford Yes, the 7B model is open-weights on HF: huggingface.co/humain-ai/ALLa… You can talk to the 34B model on the chat app; I think API access is coming later: chat.humain.ai

1

0

2

80

Joe@joebradford·15 Mar

@haidarkk1 Oh interesting. Did they ever release it publicly?

1

0

347

Haidar Khan@haidarkk1·15 Mar

@imranye And somehow we still have the muffintop going :D

0

6

1.9K

imran@imranye·15 Mar

for suhoor i made a kirkland water bottle and a medjool date

25

461

10.1K

211.4K

Haidar Khan@haidarkk1·23 Feb

@goodfellow_ian @daniel_rossett That's awesome, can you share some of the changes you made?

0

1

455

Ian Goodfellow@goodfellow_ian·23 Feb

I'd like to thank @daniel_rossett for his help in my recovery from the POTS version of Long COVID. Daniel was key in bringing me back from highly disabled and suffering to being able to do what I want to again. This X account is mostly focused on ML / AI. From that point of view, many of you know that in December 2024, I wasn't able to do the test of time award talk at NeurIPS, even by video call. Daniel started working with me in March 2025. By April, I started to have days of no POTS symptoms, by June I was off all heart rate lowering medications, by September I was back to work. I'm back to full exercise, running, lifting weights, mountain biking, and have even done things I hadn't done before I got sick, like riding Whistler Mountain Bike Park. I'm now getting the word out to help Daniel build a company that will bring this approach to more people.

171

83

2.7K

206.3K

Haidar Khan@haidarkk1·6 Feb

Anthropic = Apple
OpenAI = Microsoft
Both win.

0

87

Haidar Khan@haidarkk1·4 Feb

Ask Claude to review your AWS infrastructure to save costs. You’re welcome.

0

73

Haidar Khan@haidarkk1·28 Jan

x.com/i/article/2016…

0

1

75

Haidar Khan@haidarkk1·18 Dec

@sbmaruf I don't think that's a credit to the city LOL

1

0

41

M Saiful Bari (MARUF)@sbmaruf·18 Dec

true. @haidarkk1

mikeBuildsMore@mkliku

Moving to SF is realizing this show wasn't a comedy, it was a documentary.

1

0

1

160

Haidar Khan@haidarkk1·6 Nov

@jyangballin Nice work, I like the coding focus. In ZeroSumEval we saw similar results, especially in the Pyjail simulation. x.com/haidarkk1/stat…

Haidar Khan@haidarkk1

Here is some news I'm happy to share before the @iclr_conf FOMO really starts to set in 😢. We have been playing with the idea of using games as LLM evals (pun intended) for a while now and it's finally ready! ZeroSumEval is a scalable evaluation methodology that pits models against each other and calculates ratings. Paper 📜, code 💻, and details 🗒️ in the 🧵. Here is the TL;DR:
- Since it's PvP ⚔️, the hardness scales with model capabilities 🤖, making it hard to saturate.
- The evaluations in ZSEval are dynamic 🔄 and verifiable ✅, so it's difficult to overfit. Rote memorization is especially penalized.
- Observing model behavior in games leads to interesting insights 🔍, such as creative attempts (or lack thereof) at jailbreaking other models.
Had a blast working on ZeroSumEval with @HishamAlyahya, @y_alnumay, @sbmaruf, and Bülent Yener (great to collaborate 🤝 with my advisor again).

1

0

2

172

John Yang@jyangballin·5 Nov

New eval! Code duels for LMs ⚔️
Current evals test LMs on *tasks*: "fix this bug," "write a test."
But we code to achieve *goals*: maximize revenue, cut costs, win users.
Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals.

31

99

416

101.9K

Haidar Khan@haidarkk1·26 Oct

@SemiAnalysis_ @sbmaruf :)

0

133

SemiAnalysis@SemiAnalysis_·26 Oct

Zareen is one of the go-to places for many SF Bay Area AI researchers to get a quick bite. Most of the food is very good, and it was even in the Michelin Guide in 2020. AI researchers not experienced with Indian cuisine will commonly order their chicken tikka masala with garlic naan and mango lassi.

51

14

593

242.8K

Haidar Khan@haidarkk1·16 Aug

@sbmaruf @cursor_ai @mntruell @sualehasif996 Lol, not really. Just genuinely curious who finds that useful

0

42

M Saiful Bari (MARUF)@sbmaruf·16 Aug

@haidarkk1 @cursor_ai @mntruell @sualehasif996 someone’s mad…!!!! 🤣

1

0

58

Haidar Khan@haidarkk1·15 Aug

why is @cursor_ai agent creating sloppy READMEs for every change? @mntruell @sualehasif996

1

0

177

Haidar Khan@haidarkk1·14 Aug

@xeophon This is an interesting benchmark. As others have pointed out, it's a mistake to allow silent correction; the benchmark should actually penalize that.

1

0

3

287

Florian Brand@xeophon·14 Aug

After thinking about this problem for months, I am so happy to finally introduce DetailBench! It answers a simple question: How good are current LLMs at finding small errors, when they are *not* explicitly asked to do so? (Yes, the graph is right!)

77

63

928

141K

Haidar Khan@haidarkk1·23 Jul

@dylan522p They yearn for the mines

0

2

941

Dylan Patel@dylan522p·22 Jul

The children yearn to be working in fabs.
Entries at this year's Taiwan high school science exhibition are discussing 1.5nm Gate-All-Around transistor structure optimization.
The kids are unbelievably cracked.

Haarlemmermeer, Nederland 🇳🇱

40

83

1.2K

94.8K

Haidar Khan@haidarkk1·11 Jul

@jxmnop Right, also good to remember that this wasn't possible before the current generation of models. We couldn’t just get better by defining the task and getting data. That's new and a reason to be excited.

0

1

795

Jack Morris@jxmnop·11 Jul

How frontier AI models are built
1. identify task model cannot solve
2. create task eval
3. collect new *eval-specific* training data
4. train model
5. model can do the task now
6. if not (AGI achieved): goto step 1
GLUE, MMLU, MATH, AIME, HLE, GPQA, ARC
a tale as old as time

27

25

671

39.2K

Haidar Khan@haidarkk1·10 Jul

@stevenheidel @xai rofl

0

198

Steven Heidel@stevenheidel·10 Jul

@xai does it start at 8 or nein?

121

70

3.9K

110.9K

xAI@xai·10 Jul

The Grok 4 livestream will begin soon. Stay tuned.

4.4K

1.9K

20.4K

5.8M

Haidar Khan@haidarkk1·8 Jul

@karpathy @chasedownleads This is on my timeline because you commented :D

0

2

201

Andrej Karpathy@karpathy·8 Jul

@chasedownleads Why is this on my timeline

425

53

6.3K

163.7K

Chase Passive Income@chasedownleads·7 Jul

Jeff Bezos is rich for one simple reason: He's BALD and has saved $238.4 billion by never needing to pay for a haircut Study the rich if you want to be wealthy

263

2.4K

50.9K

1.5M

Haidar Khan@haidarkk1·3 Jul

@ZeyuanAllenZhu You need more allocation!

0

1.1K

Zeyuan Allen-Zhu, Sc.D.@ZeyuanAllenZhu·2 Jul

No matter how AI evolves overnight—tech, career, how it may impact me—I remain committed to using "physics of language models" approach to predict next-gen AI. Due to my limited GPU access at Meta, Part 4.1 (+new 4.2) are still in progress, but results on Canon layers are shining

Zeyuan Allen-Zhu, Sc.D.@ZeyuanAllenZhu

(1/8)🍎A Galileo moment for LLM design🍎 As Pisa Tower experiment sparked modern physics, our controlled synthetic pretraining playground reveals LLM architectures' true limits. A turning point that might divide LLM research into "before" and "after." physics.allen-zhu.com/part-4-archite…

22

65

833

459.2K
