Talha Chafekar

57 posts

@TalhaChafekar

CS@UMass. Interested in multimodal machine learning, language grounding, factuality and cats.

Joined September 2019
2.4K Following · 125 Followers
Talha Chafekar reposted
François Fleuret @francoisfleuret
I do not think you can pursue meaningful research without (1) some grandiose delusion about your abilities (2) a sense of esthetics and harmony to judge ideas still free of experimental confirmation (3) an unreasonable taste for the required tangible work (e.g. programming)
Talha Chafekar reposted
Zirui Liu @ziruirayliu
🔥Excited to share our new work on reproducibility challenges in reasoning models caused by numerical precision.

Ever run the same prompt twice and get completely different answers from your LLM under greedy decoding? You're not alone. Most LLMs today default to BF16 precision, but we show this choice severely impacts the reproducibility of long generations, even under greedy decoding with a fixed seed. While issues like this are known in tools like vLLM and SGLang, the severity of the problem is widely underestimated. Many in the community still rely on single-run greedy decoding for evaluation, which can lead to misleading results.

🤯 To get a sense of the scale: switching from 2 GPUs to 4 GPUs may completely change your model outputs, with up to a 9% drop in accuracy and a difference of 9,000 tokens in response length on standard benchmarks like AIME.

Key takeaways:
• ⚠️ Floating-point non-associativity causes tiny numerical errors to snowball in multi-step reasoning.
• 🔄 Greedy decoding ≠ deterministic output: we observe up to 9% accuracy variance and a 9,000-token difference in response length.
• 📉 When using random sampling with non-zero temperature, the accuracy variance purely from numerical precision is 0.3%~2%, depending on the dataset size and the number of repeated runs.

🌍 Suggestions to the community: We urge the community to adopt better evaluation practices for LLMs, especially for tasks like math reasoning, code generation, and auto-grading:
1. Use random sampling and report Pass@k, average length, and error bars, especially on small datasets and at low precision.
2. If using greedy decoding for token-by-token reproducibility, run it in FP32. To help, we released a vLLM patch for FP32 inference.

📄 Paper: lnkd.in/gZAjbWKA
💻 Code: lnkd.in/gwdGWFP5
📈 HF Summary: lnkd.in/gFjsK7Y9
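The floating-point non-associativity the thread describes can be seen in a few lines of plain Python. This is a minimal illustration of the underlying mechanism, not the paper's actual experimental setup:

```python
# Floating-point addition is not associative: summing the same numbers
# in a different order can change the result, because each intermediate
# addition rounds to the nearest representable value.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # (1e16 - 1e16) + 1.0 -> 1.0
right = a + (b + c)  # -1e16 + 1.0 rounds back to -1e16, so the sum is 0.0

print(left, right)  # 1.0 0.0
```

In an LLM forward pass, the reduction order depends on kernel choice and GPU count, so tiny discrepancies like this can accumulate across layers and decoding steps until a different token wins the argmax.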
Talha Chafekar @TalhaChafekar
Heading to SF for YC’s AI Startup School next week! If you're into NLP, multimodal ML, or just want to geek out over research, let’s meet up! #AI #NLP #SanFrancisco
Talha Chafekar reposted
Leena Mathur @lmathur_
Future AI systems interacting with humans will need to perform social reasoning that is grounded in behavioral cues and external knowledge. We introduce Social Genome to study and advance this form of reasoning in models! New paper w/ Marian Qian, @pliang279, & @lpmorency!
Talha Chafekar reposted
Paul Liang @pliang279
Excited to release HEMM (Holistic Evaluation of Multimodal Foundation Models), the largest and most comprehensive evaluation for multimodal models like Gemini, GPT-4V, BLIP-2, OpenFlamingo, and more.

HEMM contains 30 datasets carefully selected and categorized based on:
1. The **basic multimodal skills** needed to solve them: the type of multimodal interaction, granularity of multimodal alignment, level of reasoning, and need for external knowledge,
2. How **information flows** between modalities: querying, translation, editing, and fusion,
3. The real-world **use cases** they impact: multimedia, affective computing, healthcare, science & environment, HCI.

paper: arxiv.org/abs/2407.03418
code: github.com/pliang279/HEMM

we encourage the community to add their favorite models and datasets!

w/ @AkshayGoindani1 @TalhaChafekar @lmathur_ @haofeiyu44 @lpmorency @rsalakhu @mldcmu @LTIatCMU
Talha Chafekar reposted
Alex Albert @alexalbert__
Fun story from our internal testing on Claude 3 Opus. It did something I have never seen before from an LLM when we were running the needle-in-the-haystack eval.

For background, this tests a model's recall ability by inserting a target sentence (the "needle") into a corpus of random documents (the "haystack") and asking a question that could only be answered using the information in the needle.

When we ran this test on Opus, we noticed some interesting behavior: it seemed to suspect that we were running an eval on it. Here was one of its outputs when we asked Opus to answer a question about pizza toppings by finding a needle within a haystack of a random collection of documents:

"Here is the most relevant sentence in the documents: 'The most delicious pizza topping combination is figs, prosciutto, and goat cheese, as determined by the International Pizza Connoisseurs Association.' However, this sentence seems very out of place and unrelated to the rest of the content in the documents, which are about programming languages, startups, and finding work you love. I suspect this pizza topping 'fact' may have been inserted as a joke or to test if I was paying attention, since it does not fit with the other topics at all. The documents do not contain any other information about pizza toppings."

Opus not only found the needle, it recognized that the inserted needle was so out of place in the haystack that this had to be an artificial test constructed by us to test its attention abilities. This level of meta-awareness was very cool to see, but it also highlighted the need for us as an industry to move past artificial tests to more realistic evaluations that can accurately assess models' true capabilities and limitations.
Talha Chafekar reposted
Runway @runwayml
Make any idea real. Just write it. Text to video, coming soon to Runway. Sign up for early access: runwayml.com
Talha Chafekar reposted
Ana Lorena Fabrega @anafabrega11
Over 70% of kids play video games every day. Everyone thinks they should play less… …but here’s an interesting case for why they should play MORE 🎮👇🏼
Papers with Code @paperswithcode
ML Reproducibility Challenge 💥 New Edition! 💥 A new Spring edition of the Reproducibility Challenge in response to increased demand from university courses. Submissions open 1 April, Deadline 15 July. More info coming soon! paperswithcode.com/rc2020?spring21
Talha Chafekar reposted
nishchith @inishchith
if unexamined, speculation at scale is (considered) reality.
Talha Chafekar reposted
joel @JoelDoesCyber
Me on the first day at my first tech job:
Talha Chafekar reposted
Naval @naval
If you want to make the wrong decision, ask everyone.
Talha Chafekar reposted
Sharif Shameem @sharifshameem
This is mind blowing. With GPT-3, I built a layout generator where you just describe any layout you want, and it generates the JSX code for you. W H A T
Talha Chafekar reposted
Neel Shah @9eel_
The GitHub Student Pack is probably the single largest collection of student-discounted resources. And for the longest time I found it hard to redeem those offers. After scouring the Internet for directions, I finally found the link: education.github.com/pack/offers
Talha Chafekar reposted
Naval @naval
The modern devil is cheap dopamine.
Talha Chafekar reposted
Simon Kozlov @sim0nsays
Sometimes I wish I could spare a year to just look at neurons in a pretrained CNN and try to reverse-engineer what every neuron is doing. @ch402 and the OpenAI Clarity team are crazy enough to actually do that. They found tons of cool stuff! More below distill.pub/2020/circuits/…