Talha Chafekar

57 posts

@TalhaChafekar

CS@UMass. Interested in multimodal machine learning, language grounding, factuality and cats.

Joined September 2019

2.4K Following · 125 Followers

Talha Chafekar retweeted

François Fleuret @francoisfleuret · Nov 22

I do not think you can pursue meaningful research without (1) some grandiose delusion about your abilities (2) a sense of esthetics and harmony to judge ideas still free of experimental confirmation (3) an unreasonable taste for the required tangible work (e.g. programming)

141

1.8K

189.8K

Talha Chafekar retweeted

Zirui Liu @ziruirayliu · Jun 12

🔥 Excited to share our new work on reproducibility challenges in reasoning models caused by numerical precision.

Ever run the same prompt twice and get completely different answers from your LLM under greedy decoding? You're not alone. Most LLMs today default to BF16 precision, but we show this choice severely impacts the reproducibility of long generations, even under greedy decoding with a fixed seed. While issues like this are known in tools like vLLM and SGLang, the severity of the problem is widely underestimated. Many in the community still rely on single-run greedy decoding for evaluation, which can lead to misleading results.

🤯 To get a sense, switching from 2 GPUs to 4 GPUs may completely change your model outputs, with up to a 9% drop in accuracy and a difference of 9,000 tokens in response length on standard benchmarks like AIME.

Key takeaways:
• ⚠️ Floating-point non-associativity causes tiny numerical errors to snowball in multi-step reasoning.
• 🔄 Greedy decoding ≠ deterministic output: we observe up to 9% accuracy variance and a 9,000-token difference in response length.
• 📉 When using random sampling with non-zero temperature, the accuracy variance purely from numerical precision is 0.3% to 2%, depending on the dataset size and the number of repeated runs.

🌍 Suggestions to the community: We urge the community to adopt better evaluation practices for LLMs, especially for tasks like math reasoning, code generation, and auto-grading:
1. Use random sampling and report Pass@k, average length, and error bars, especially on small datasets and at low precision.
2. If using greedy decoding for token-by-token reproducibility, run it in FP32.

To help, we released a vLLM patch for FP32 inference.

📄 Paper: lnkd.in/gZAjbWKA
💻 Code: lnkd.in/gwdGWFP5
📈 HF Summary: lnkd.in/gFjsK7Y9
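
The floating-point non-associativity named in the takeaways can be seen in a few lines of Python. This is only a minimal sketch of the mechanism, not the paper's experiment: it uses Python's 64-bit floats rather than BF16, and `chunked_sum` is a hypothetical helper in which each chunk stands in for the partial sum one device might compute before the results are combined.

```python
from typing import List


def sequential_sum(values: List[float]) -> float:
    """Add values strictly left to right (no compensated summation)."""
    total = 0.0
    for value in values:
        total += value
    return total


def chunked_sum(values: List[float], num_chunks: int) -> float:
    """Sum contiguous chunks separately, then add the partial sums.

    Assumes len(values) is divisible by num_chunks. Each chunk is a
    stand-in for the share of a reduction handled by one device.
    """
    size = len(values) // num_chunks
    partials = [
        sequential_sum(values[start:start + size])
        for start in range(0, len(values), size)
    ]
    return sequential_sum(partials)


# Grouping alone changes the rounded result.
left_grouped = (0.1 + 0.2) + 0.3
right_grouped = 0.1 + (0.2 + 0.3)
print(left_grouped == right_grouped)  # False

# The same four numbers, reduced with two different chunkings.
values = [1e16, 1.0, -1e16, 1.0]
print(chunked_sum(values, 2))  # 0.0
print(chunked_sum(values, 4))  # 1.0
```

Neither chunking returns the exact sum of 2.0, and the two disagree with each other. In a long greedy generation, a difference of this kind only needs to flip one near-tied token for the remaining output to diverge.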

14K

Talha Chafekar @TalhaChafekar · Jun 12

For folks interested in audio-driven lip-sync, do check out our work at #CVPR2025 tomorrow at the AI4CC workshop!

Anushka @_anushkaagarwal

Catch us at #CVPR25 on June 12th at the AI for Content Creation workshop! With: @TalhaChafekar @UMassAmherst

404

Talha Chafekar @TalhaChafekar · Jun 12

Heading to SF for YC’s AI Startup School next week! If you're into NLP, multimodal ML, or just want to geek out over research, let’s meet up! #AI #NLP #SanFrancisco

338

Talha Chafekar retweeted

Leena Mathur @lmathur_ · Jun 10

Future AI systems interacting with humans will need to perform social reasoning that is grounded in behavioral cues and external knowledge. We introduce Social Genome to study and advance this form of reasoning in models! New paper w/ Marian Qian, @pliang279, & @lpmorency!

6.1K

Talha Chafekar retweeted

Paul Liang @pliang279 · Jul 12

Excited to release HEMM (Holistic Evaluation of Multimodal Foundation Models), the largest and most comprehensive evaluation for multimodal models like Gemini, GPT-4V, BLIP-2, OpenFlamingo, and more.

HEMM contains 30 datasets carefully selected and categorized based on:
1. The **basic multimodal skills** needed to solve them: the type of multimodal interaction, granularity of multimodal alignment, level of reasoning, and need for external knowledge.
2. How **information flows** between modalities: querying, translation, editing, and fusion.
3. The real-world **use cases** they impact: multimedia, affective computing, healthcare, science & environment, HCI.

paper: arxiv.org/abs/2407.03418
code: github.com/pliang279/HEMM

We encourage the community to add their favorite models and datasets!

w @AkshayGoindani1 @TalhaChafekar @lmathur_ @haofeiyu44 @lpmorency @rsalakhu @mldcmu @LTIatCMU

138

20.4K

Talha Chafekar retweeted

Alex Albert @alexalbert__ · Mar 4

Fun story from our internal testing on Claude 3 Opus. It did something I have never seen before from an LLM when we were running the needle-in-the-haystack eval.

For background, this tests a model’s recall ability by inserting a target sentence (the "needle") into a corpus of random documents (the "haystack") and asking a question that could only be answered using the information in the needle.

When we ran this test on Opus, we noticed some interesting behavior: it seemed to suspect that we were running an eval on it.

Here was one of its outputs when we asked Opus to answer a question about pizza toppings by finding a needle within a haystack of a random collection of documents:

Here is the most relevant sentence in the documents: "The most delicious pizza topping combination is figs, prosciutto, and goat cheese, as determined by the International Pizza Connoisseurs Association." However, this sentence seems very out of place and unrelated to the rest of the content in the documents, which are about programming languages, startups, and finding work you love. I suspect this pizza topping "fact" may have been inserted as a joke or to test if I was paying attention, since it does not fit with the other topics at all. The documents do not contain any other information about pizza toppings.

Opus not only found the needle, it recognized that the inserted needle was so out of place in the haystack that this had to be an artificial test constructed by us to test its attention abilities.

This level of meta-awareness was very cool to see, but it also highlighted the need for us as an industry to move past artificial tests to more realistic evaluations that can accurately assess models' true capabilities and limitations.
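
The setup described in the background paragraph can be sketched roughly as follows. This is an illustrative assumption about how such a test might be assembled, not the harness used in the tweet: `build_haystack`, `build_prompt`, and `found_needle` are hypothetical names, the documents are placeholders, and the call to an actual model is left out.

```python
from typing import List


def build_haystack(documents: List[str], needle: str, depth: float) -> List[str]:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(documents))
    return documents[:position] + [needle] + documents[position:]


def build_prompt(haystack: List[str], question: str) -> str:
    """Join the documents and append a question only the needle answers."""
    context = "\n\n".join(haystack)
    return f"{context}\n\nQuestion: {question}\nAnswer using only the documents above."


def found_needle(answer: str, keywords: List[str]) -> bool:
    """Crude recall check: every keyword must appear in the answer."""
    lowered = answer.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


# Placeholder corpus; a real run would use long, unrelated documents.
documents = [
    "Essay on programming languages.",
    "Essay on startups.",
    "Essay on finding work you love.",
    "Another essay on startups.",
]
needle = "The most delicious pizza topping combination is figs, prosciutto, and goat cheese."

haystack = build_haystack(documents, needle, depth=0.5)
prompt = build_prompt(haystack, "What is the most delicious pizza topping combination?")
```

A keyword check like `found_needle` only measures recall. It would not register the behavior described above, where the model also comments on the needle being out of place.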

560

2.1K

11.8K

3.4M

Talha Chafekar retweeted

Runway @runwayml · Sep 9

Make any idea real. Just write it. Text to video, coming soon to Runway. Sign up for early access: runwayml.com

331

16.3K

Talha Chafekar retweeted

Ana Lorena Fabrega @anafabrega11 · Jan 28

Over 70% of kids play video games every day. Everyone thinks they should play less… …but here’s an interesting case for why they should play MORE 🎮👇🏼

212

1.5K

6.7K

Talha Chafekar @TalhaChafekar · Feb 19

@paperswithcode Is this link old? The dates mentioned are for 2020.

Papers with Code @paperswithcode · Feb 18

ML Reproducibility Challenge 💥 New Edition! 💥 A new Spring edition of the Reproducibility Challenge in response to increased demand from university courses. Submissions open 1 April, Deadline 15 July. More info coming soon! paperswithcode.com/rc2020?spring21

161

Talha Chafekar retweeted

nishchith @inishchith · Dec 1

if unexamined, speculation at scale is (considered) reality.

Talha Chafekar retweeted

joel @JoelDoesCyber · Nov 13

Me on the first day at my first tech job:

853

7.6K

Talha Chafekar retweeted

Naval @naval · Nov 8

If you want to make the wrong decision, ask everyone.

297

5.1K

27.2K

Talha Chafekar retweeted

Alvaro アルバロ @alvarosabu · Oct 2

#Hacktoberfest is the new dependabot. #ChangeMyMind

Talha Chafekar retweeted

Sharif Shameem @sharifshameem · Jul 13

This is mind blowing. With GPT-3, I built a layout generator where you just describe any layout you want, and it generates the JSX code for you. W H A T

611

9.5K

37.5K

Talha Chafekar retweeted

Neel Shah @9eel_ · Jul 6

The GitHub Student Pack is probably the single largest collection of student-discounted resources. For the longest time I found it hard to redeem those offers. After scouring the Internet for directions, I finally found the link: education.github.com/pack/offers

Talha Chafekar retweeted

Naval @naval · May 25

The modern devil is cheap dopamine.

259

2.8K

13.5K

Talha Chafekar retweeted

Simon Kozlov @sim0nsays · Apr 19

Sometimes I wish I could spare a year to just look at neurons in the pretrained CNN and try to reverse-engineer what every neuron is doing. @ch402 and OpenAI Clarity team are crazy enough to actually do that. They found tons of cool stuff! More below distill.pub/2020/circuits/… /

208

Talha Chafekar retweeted