Di Wu
@diwuNLP

169 posts

PhD candidate in MT/NLP/ML @UvA_Amsterdam, working with @c_monz.

Amsterdam · Joined June 2019
341 Following · 141 Followers
Di Wu retweeted
Rohan Paul@rohanpaul_ai·
Beautiful @GoogleResearch paper. LLMs can learn in context from examples in the prompt, picking up new patterns while answering, yet their stored weights never change. That behavior looks impossible if learning always means gradient descent, and the mechanisms through which it happens are still largely unknown.

The authors ask whether the transformer's own math hides an update inside the forward pass. They show that each prompt token writes a rank-1 update onto the first weight matrix during the forward pass, turning the context into a temporary patch that steers the model like a one-step finetune.

Because that patch vanishes after the pass, the stored weights stay frozen, yet the model still adapts to the new pattern carried by the prompt. 🧵 Read on 👇
64 replies · 333 reposts · 2.5K likes · 366.5K views
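
The paper's construction is more involved, but the identity at its core is easy to check numerically. A minimal sketch (toy code, not the paper's implementation), assuming attention contributes an additive, context-dependent shift d to the activation x entering a weight matrix W:

```python
# Toy sketch: a context-dependent shift d on the input of a linear layer
# can be absorbed into a temporary rank-1 patch on the (frozen) weights W:
#   (W + dW) @ x == W @ (x + d)   with   dW = outer(W @ d, x) / ||x||^2
import numpy as np

rng = np.random.default_rng(0)
d_model = 16
W = rng.normal(size=(d_model, d_model))  # frozen weight matrix
x = rng.normal(size=d_model)             # token activation without context
d = rng.normal(size=d_model)             # shift contributed by the prompt context

dW = np.outer(W @ d, x) / np.dot(x, x)   # the rank-1 "patch"

assert np.allclose(W @ (x + d), (W + dW) @ x)
print(np.linalg.matrix_rank(dW))         # -> 1
```

The patch exists only inside the forward pass that computed it, which matches the tweet's point: the stored weights never change.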
Di Wu retweeted
Zirui Liu@ziruirayliu·
🔥Excited to share our new work on reproducibility challenges in reasoning models caused by numerical precision.

Ever run the same prompt twice and get completely different answers from your LLM under greedy decoding? You're not alone. Most LLMs today default to BF16 precision, but we show this choice severely impacts the reproducibility of long generations, even under greedy decoding with a fixed seed. While issues like this are known in tools like vLLM and SGLang, the severity of the problem is widely underestimated. Many in the community still rely on single-run greedy decoding for evaluation, which can lead to misleading results. 🤯

To get a sense of the scale: switching from 2 GPUs to 4 GPUs may completely change your model outputs, with up to a 9% drop in accuracy and a 9,000-token difference in response length on standard benchmarks like AIME.

Key takeaways:
• ⚠️ Floating-point non-associativity causes tiny numerical errors to snowball in multi-step reasoning.
• 🔄 Greedy decoding ≠ deterministic output — we observe up to 9% accuracy variance and a 9,000-token difference in response length.
• 📉 When using random sampling with non-zero temperature, the accuracy variance purely from numerical precision is 0.3%~2%, depending on the dataset size and the number of repeated runs.

🌍 Suggestions to the community: we urge better evaluation practices for LLMs, especially for tasks like math reasoning, code generation, and auto-grading:
1. Use random sampling and report Pass@k, average length, and error bars — especially on small datasets and at low precision.
2. If using greedy decoding for token-by-token reproducibility, run it in FP32. To help, we released a vLLM patch for FP32 inference.

📄 Paper: lnkd.in/gZAjbWKA
💻 Code: lnkd.in/gwdGWFP5
📈 HF Summary: lnkd.in/gFjsK7Y9
4 replies · 21 reposts · 95 likes · 14K views
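
The root cause is easy to demonstrate (a toy sketch, not the paper's experiments): floating-point addition is not associative, and changing the reduction order, which is exactly what a different GPU count or kernel tiling does, changes the result in BF16.

```python
# Toy sketch: the same values summed in two different orders disagree in
# bf16, because the chunked reduction rounds its partial sums to bf16.
import torch

torch.manual_seed(0)
v = torch.randn(4096, dtype=torch.bfloat16)

left_to_right = v.sum()                      # one reduction order
chunked = v.view(64, 64).sum(dim=1).sum()    # another order, as in a tiled kernel
print(left_to_right.item(), chunked.item())  # typically not identical in bf16

# The same two orders in fp32 agree far more closely, which is why the
# authors recommend FP32 for reproducible greedy decoding.
v32 = v.float()
print(v32.sum().item(), v32.view(64, 64).sum(dim=1).sum().item())
```

Once such an error flips a single near-tied argmax under greedy decoding, the two generations diverge and the gap compounds over thousands of tokens.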
Di Wu retweeted
Peyman Milanfar@docmilanfar·
Model Distillation
16 replies · 48 reposts · 1.1K likes · 77K views
Di Wu retweeted
Taku Kudo@taku910·
Whitespace-ignoring tokenization is a fundamental feature of SentencePiece, implemented since its early stages (around 2017). Using whitespace yielded better results on MT. It would be helpful if you could mention this. github.com/google/sentenc…
Alisa Liu@alisawuffles

We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words. When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵

2 replies · 28 reposts · 104 likes · 38.7K views
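
For concreteness, a hedged sketch of the behavior Kudo is referring to, assuming the sentencepiece Python package and a hypothetical corpus.txt (SuperBPE's actual training recipe differs):

```python
# Toy sketch: SentencePiece's split_by_whitespace=False lets learned pieces
# cross word boundaries, i.e. "superword" tokens spanning multiple words.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",            # hypothetical corpus path
    model_prefix="superword",
    vocab_size=8000,
    split_by_whitespace=False,     # allow pieces that span whitespace
)

sp = spm.SentencePieceProcessor(model_file="superword.model")
print(sp.encode("by the way", out_type=str))  # may come out as one piece
```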
Di Wu retweeted
HPLT@hplt_eu·
We are happy to announce the second release of HPLT bilingual datasets: - 50 English-centric language pairs = 380M parallel sentences (HPLT) 🤩 - 1,275 non-English-centric language pairs = 16.7B parallel sentences (MultiHPLT) 😮 Available at the HPLT dataset catalogue and OPUS.
0 replies · 12 reposts · 15 likes · 1.3K views
Di Wu retweeted
Dan Deutsch@_danieldeutsch·
🚨New machine translation dataset alert! 🚨We expanded the language coverage of WMT24 from 9 to 55 en->xx language pairs by collecting new reference translations for 46 languages in a dataset called WMT24++ Paper: arxiv.org/abs/2502.12404… Data: huggingface.co/datasets/googl…
3 replies · 24 reposts · 88 likes · 6.8K views
Di Wu retweeted
Longyue Wang@wangly0229·
🎯 ComfyUI-Copilot (AIGC Assistant) is now open-source, brought to you by Alibaba International! 🎉 🍀 Enhance ComfyUI workflow design and optimization with LLM-Agent ✨ Empowering AIGC and exploring Multimodal Agents 🚀 Stay tuned for more features like dynamic parameter optimization and auto workflow generation! github.com/AIDC-AI/ComfyU…
0 replies · 12 reposts · 24 likes · 2.6K views
Di Wu retweeted
John Nguyen@__JohnNguyen__·
🥪New Paper! 🥪Introducing the Byte Latent Transformer (BLT) - a tokenizer-free model that scales better than BPE-based models, with better inference efficiency and robustness. 🧵
12 replies · 62 reposts · 441 likes · 89.6K views
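
BLT's actual patcher segments the byte stream with a small byte-level LM's next-byte entropy. As a toy illustration of the general idea only, the sketch below opens a new patch whenever the next byte is "surprising" under a crude unigram proxy:

```python
# Toy sketch (not the BLT implementation): group raw bytes into
# variable-length patches, starting a new patch before rare bytes.
from collections import Counter
import math

text = b"the cat sat on the mat. the cat sat again."
counts = Counter(text)

def surprisal(b: int) -> float:
    return -math.log2(counts[b] / len(text))  # rare bytes carry more bits

THRESHOLD = 4.0                                # hypothetical boundary threshold
patches, current = [], bytearray()
for b in text:
    if current and surprisal(b) > THRESHOLD:
        patches.append(bytes(current))         # close patch before a surprising byte
        current = bytearray()
    current.append(b)
patches.append(bytes(current))
print(patches)                                 # variable-length byte patches
```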
Di Wu retweeted
Benjamin Marie@bnjmn_marie·
Unsloth has identified and fixed the gradient accumulation issue I reported last week. The problem turned out to be more significant than I expected, impacting multi-GPU training as well. This means we’ve likely been training models that didn’t perform as well as they could have.

Only Unsloth's Trainer is fixed for now. Hugging Face is working on it.

For a detailed and well-explained breakdown of the issue, check out Unsloth's blog post—it’s definitely worth a read! unsloth.ai/blog/gradient
Daniel Han@danielhanchen

Fixed a bug which caused all training losses to diverge for large gradient accumulation sizes.

1. First reported by @bnjmn_marie, GA is supposed to be mathematically equivalent to full-batch training, but losses did not match.
2. We reproduced the issue, and further investigation showed the L2 norm between bsz=16 and ga=16 was 10x larger.
3. The culprit was the cross-entropy loss normalizer.
4. We ran training runs with denormalized CE loss, and all training losses match.
5. We then re-normalized CE loss with the correct denominator across all gradient accumulation steps, and verified all training loss curves match now.
6. We've already updated @UnslothAI with the fix, and wrote up more details in our blog post here: unsloth.ai/blog/gradient

This issue impacts all libraries which use GA, and simple averaging of GA does not work for varying sequence lengths. This also impacts DDP and multi-GPU training which accumulates gradients.

Please update Unsloth via pip install --upgrade --no-cache-dir unsloth and use from unsloth import unsloth_train

We have a Colab notebook using our fixed GA: colab.research.google.com/drive/1z0XJU2F… and a Kaggle notebook: kaggle.com/code/danielhan…

11 replies · 30 reposts · 222 likes · 37.1K views
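
The arithmetic behind the bug is easy to reproduce (a toy sketch, not Unsloth's fix): with variable sequence lengths, averaging each micro-batch's mean loss is not the same as dividing the summed loss by the total token count, which is what full-batch training does.

```python
# Toy sketch: "mean of per-micro-batch means" vs. the correct full-batch
# normalizer when micro-batches contain different numbers of tokens.
import torch

token_losses = [torch.rand(5), torch.rand(40)]     # two micro-batches, unequal lengths

full_batch = torch.cat(token_losses).mean()        # reference: one big batch

naive_ga = torch.stack([t.mean() for t in token_losses]).mean()  # buggy: mean of means

total_tokens = sum(t.numel() for t in token_losses)
fixed_ga = sum(t.sum() for t in token_losses) / total_tokens     # correct denominator

print(full_batch.item(), naive_ga.item(), fixed_ga.item())
# naive_ga is biased toward the short sequence; fixed_ga matches full_batch.
```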
Di Wu retweeted
LTL-UvA@ltl_uva·
Language Technology Lab got four papers accepted for #EMNLP2024! Congrats to authors Kata Naszadi, Shaomu Tan, Baohao Liao @baohao_liao, Di Wu @diwuNLP 🥳🥳
0 replies · 1 repost · 6 likes · 962 views
Di Wu retweeted
Evgeniia Tokarchuk@evgtokarchuk·
Come check our poster tomorrow at @GRaM_org_ @icmlconf if you want to discuss dispersion of text embeddings on hyperspheres! 27.07 at Poster session 2. #ICML2024
2 replies · 16 reposts · 94 likes · 8.2K views
Di Wu retweeted
David Stap@davidstap·
1/4 #ACL2024 Excited to share our new paper on the impact of fine-tuning on the qualitative advantages of LLMs in machine translation! 🤖 Our work highlights the importance of preserving LLM capabilities during fine-tuning. arxiv.org/abs/2405.20089
2 replies · 6 reposts · 19 likes · 1.5K views
Di Wu retweeted
Evgeniia Tokarchuk@evgtokarchuk·
Next week I'll be in Vienna at @icmlconf! Want to learn more about how to explicitly model embeddings on the hypersphere and encourage dispersion during training? Come to the @GRaM_workshop poster session 2 on 27.07. Shoutout to my collaborators Hua Chang Bakker and @vnfrombucharest 💫
1 reply · 3 reposts · 17 likes · 1.6K views
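
One common way to write such an objective (a generic uniformity-style regularizer in the spirit of Wang & Isola, 2020; the poster's exact formulation may differ):

```python
# Toy sketch: normalize embeddings onto the unit hypersphere and penalize
# pairs that sit close together, encouraging dispersion during training.
import torch
import torch.nn.functional as F

def dispersion_loss(emb: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    z = F.normalize(emb, dim=-1)                            # project onto the sphere
    sq = (z.unsqueeze(1) - z.unsqueeze(0)).pow(2).sum(-1)   # pairwise squared distances
    off_diag = sq[~torch.eye(z.size(0), dtype=torch.bool)]  # drop self-distances
    return torch.exp(-t * off_diag).mean().log()            # lower = more dispersed

emb = torch.randn(128, 64, requires_grad=True)
loss = dispersion_loss(emb)   # add this term to the main training objective
loss.backward()
print(loss.item())
```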
Di Wu retweeted
Kyunghyun Cho@kchonyc·
modern LM research seems to be the exact repetition of MT research. here goes the prediction: someone will reinvent minimum Bayes risk decoding but will call it super-aligned, super-reasoning majority voting of galaxy-of-thoughts.
17 replies · 28 reposts · 393 likes · 80.3K views
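
For reference, the classic MT technique he means fits in a few lines (a toy sketch with a hypothetical unigram-F1 utility; real systems use BLEU or COMET):

```python
# Toy sketch of minimum Bayes risk decoding: sample candidates, score each
# against the others with a utility metric, return the consensus candidate.
# Majority voting is the special case where the utility is exact match.
from collections import Counter

def unigram_f1(hyp: str, ref: str) -> float:   # hypothetical toy utility
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    p, rec = overlap / sum(h.values()), overlap / sum(r.values())
    return 2 * p * rec / (p + rec)

def mbr_decode(candidates: list[str]) -> str:
    def expected_utility(i: int) -> float:
        return sum(unigram_f1(candidates[i], ref)
                   for j, ref in enumerate(candidates) if j != i)
    return candidates[max(range(len(candidates)), key=expected_utility)]

samples = ["the cat sat", "a cat sat", "the cat sat down", "dogs bark"]
print(mbr_decode(samples))                     # the consensus translation wins
```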
Di Wu retweeted
Marzena Karpinska@mar_kar_·
Can #LLMs truly reason over loooong context? 🤔 NoCha asks LLMs to verify claims about *NEW* fictional books 🪄 📚 ⛔ LLMs that solve needle-in-the-haystack (~100%) struggle on NoCha! ⛔ None of the 11 tested LLMs reaches human performance (97%). The best, #GPT-4o, gets only 55.8%.
31 replies · 91 reposts · 460 likes · 121.3K views