Di Wu

169 posts

Di Wu

@diwuNLP

PhD candidate in MT/NLP/ML @UvA_Amsterdam, working with @c_monz.

Amsterdam Katılım Haziran 2019

341 Takip Edilen141 Takipçiler

Di Wu retweetledi

Rohan Paul@rohanpaul_ai·25 Tem

Beautiful @GoogleResearch paper. LLMs can learn in context from examples in the prompt, can pick up new patterns while answering, yet their stored weights never change. That behavior looks impossible if learning always means gradient descent. The mechanisms through which this can happen are still largely unknown. The authors ask whether the transformer’s own math hides an update inside the forward pass. They show, each prompt token writes a rank 1 tweak onto the first weight matrix during the forward pass, turning the context into a temporary patch that steers the model like a 1‑step finetune. Because that patch vanishes after the pass, the stored weights stay frozen, yet the model still adapts to the new pattern carried by the prompt. 🧵 Read on 👇

English

333

2.5K

366.5K

Di Wu retweetledi

Jingcheng (Frank) Niu@frankniujc·23 Tem

📢 Next week, I will be presenting our paper "Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs" at ACL 2025! Paper: arxiv.org/abs/2505.09338 Blog Post: frankniujc.github.io/publications/a… Talk: youtube.com/watch?v=XcsKon…

YouTube

English

510

Di Wu retweetledi

Zirui Liu@ziruirayliu·12 Haz

🔥Exited to share our new work on reproducibility challenges in reasoning models caused by numerical precision. Ever run the same prompt twice and get completely different answers from your LLM under greedy decoding? You're not alone. Most LLMs today default to BF16 precision, but we show this choice severely impacts the reproducibility of long generations — even under greedy decoding with a fixed seed. While issues like this are known in tools like vLLM and sgLang, the severity of the problem is widely underestimated. Many in the community still rely on single-run greedy decoding for evaluation — which can lead to misleading results. 🤯 To get a sense, switching from 2 GPUs to 4 GPUs may completely change your model outputs, with up to 9% drop in accuracy and a difference of 9,000 token length on standard benchmarks like AIME. Key takeaways: • ⚠️ Floating-point non-associativity causes tiny numerical errors to snowball in multi-step reasoning. • 🔄 Greedy decoding ≠ deterministic output — we observe up to 9% accuracy variance and 9,000 token difference in response length • 📉 When using random sampling with non-zero tempurature, the accuracy variance purely from numerical precision is 0.3%~2%, depending on the dataset size and the number of repeated runs. 🌍 Suggestions to the community: We urge the community to adopt better evaluation practices for LLMs — especially for tasks like math reasoning, code generation, and auto-grading: 1. Use random sampling + report Pass@k, average length, and error bars — especially on small datasets and low precision. 2. If using greedy decoding for token-by-token reproducibility, run it in FP32. To help, we released a vLLM patch for FP32 inference. 📄 Paper: lnkd.in/gZAjbWKA 💻 Code: lnkd.in/gwdGWFP5 📈 HF Summary: lnkd.in/gFjsK7Y9

English

14K

Di Wu retweetledi

Peyman Milanfar@docmilanfar·23 Mar

Model Distillation

Dansk

1.1K

77K

Di Wu retweetledi

Taku Kudo@taku910·23 Mar

Whitespace-ignoring tokenization is an fundamental feature of Sentenepiece, implemented since its early stages (around 2017) Using whitespace yielded better results on MT. It would be helpful if you could mention this. github.com/google/sentenc…

Alisa Liu@alisawuffles

We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words. When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵

English

104

38.7K

Di Wu retweetledi

HPLT@hplt_eu·28 Şub

We are happy to announce the second release of HPLT bilingual datasets: - 50 English-centric language pairs = 380M parallel sentences (HPLT) 🤩 - 1,275 non-English-centric language pairs = 16.7B parallel sentences (MultiHPLT) 😮 Available at the HPLT dataset catalogue and OPUS.

English

1.3K

Di Wu retweetledi

Dan Deutsch@_danieldeutsch·19 Şub

🚨New machine translation dataset alert! 🚨We expanded the language coverage of WMT24 from 9 to 55 en->xx language pairs by collecting new reference translations for 46 languages in a dataset called WMT24++ Paper: arxiv.org/abs/2502.12404… Data: huggingface.co/datasets/googl…

English

6.8K

Di Wu retweetledi

Longyue Wang@wangly0229·19 Şub

🎯 ComfyUI-Copilot (AIGC Assistant) is now open-source, brought to you by Alibaba International! 🎉 🍀 Enhance ComfyUI workflow design and optimization with LLM-Agent ✨ Empowering AIGC and exploring Multimodal Agents 🚀 Stay tuned for more features like dynamic parameter optimization and auto workflow generation! github.com/AIDC-AI/ComfyU…

English

2.6K

Di Wu retweetledi

John Nguyen@__JohnNguyen__·13 Ara

🥪New Paper! 🥪Introducing Byte Latent Transformer (BLT) - A tokenizer free model scales better than BPE based models with better inference efficiency and robustness. 🧵

English

441

89.6K

Di Wu@diwuNLP·5 Kas

@jlibovicky @jindra_helcl hah, the video is interesting

English

Jindřich Libovický@jlibovicky·4 Kas

In a week, @jindra_helcl and I will present our paper Lexically Grounded Subword Segmentation at #EMNLP2024 in Miami 🌴🇺🇸. You can already watch our video 🎥 youtube.com/watch?v=NjUmNg… or stop by our poster 👋 next Tuesday at 2 p.m...

YouTube

English

374

Di Wu retweetledi

François Fleuret@francoisfleuret·21 Eki

Do we like this? arxiv.org/abs/2410.05258

English

265

51.3K

Di Wu retweetledi

Benjamin Marie@bnjmn_marie·15 Eki

Unsloth has identified and fixed the gradient accumulation issue I reported last week. The problem turned out to be more significant than I expected, impacting multi-GPU training as well. This means we’ve likely been training models that didn’t perform as well as they could have. Only Unsloth's Trainer is fixed for now. Hugging Face is working on it. For a detailed and well-explained breakdown of the issue, check out Unsloth's blog post—it’s definitely worth a read! unsloth.ai/blog/gradient

Daniel Han@danielhanchen

Fixed a bug which caused all training losses to diverge for large gradient accumulation sizes. 1. First reported by @bnjmn_marie, GA is supposed to be mathematically equivalent to full batch training, but losses did not match. 2. We reproed the issue, and further investigation showed the L2 Norm betw bsz=16 and ga=16 was 10x larger. 3. The culprit was the cross entropy loss normalizer. 4. We ran training runs with denormalized CE Loss, and all training losses match. 5. We then re-normalized CE Loss with the correct denominator across all gradient accumulation steps, and verified all training loss curves match now. 6. We've already updated @UnslothAI with the fix, and wrote up more details in our blog post here: unsloth.ai/blog/gradient This issue impacts all libraries which use GA, and simple averaging of GA does not work for varying sequence lengths. This also impacts DDP and multi GPU training which accumulates gradients. Please update Unsloth via pip install --upgrade --no-cache-dir unsloth and use from unsloth import unsloth_train We have a Colab notebook using our fixed GA: colab.research.google.com/drive/1z0XJU2F… and a Kaggle notebook: kaggle.com/code/danielhan…

English

222

37.1K

Di Wu@diwuNLP·11 Eki

We show that a grammar book provides little or even no help for translation in LLMs, questioning the recent "truly zero-shot translation" --- no data no gain, still 🧐

Seth Aycock@sethjsa

Our work “Can LLMs Really Learn to Translate a Low-Resource Language from One Grammar Book?” is now on arXiv! arxiv.org/abs/2409.19151 - in collaboration with @davidstap, @diwuNLP, @c_monz , and Khalil Sima'an from @illc_amsterdam and @ltl_uva 🧵

English

673

Di Wu retweetledi

LTL-UvA@ltl_uva·10 Eki

Language Technology Lab got four papers accepted for #EMNLP2024! Congrats to authors Kata Naszadi, Shaomu Tan, Baohao Liao @baohao_liao, Di Wu @diwuNLP 🥳🥳

English

962

Di Wu retweetledi

Evgeniia Tokarchuk@evgtokarchuk·26 Tem

Come check our poster tomorrow at @GRaM_org_ @icmlconf if you want to discuss dispersion of text embeddings on hyperspheres! 27.07 at Poster session 2. #ICML2024

English

8.2K

Di Wu retweetledi

David Stap@davidstap·25 Tem

1/4 #ACL2024 Excited to share our new paper on the impact of fine-tuning on the qualitative advantages of LLMs in machine translation! 🤖 Our work highlights the importance of preserving LLM capabilities during fine-tuning. arxiv.org/abs/2405.20089

English

1.5K

Di Wu retweetledi

Evgeniia Tokarchuk@evgtokarchuk·19 Tem

Next week I'll be in Vienna at @icmlconf! Want to learn more on how to explicitly model embeddings on hypersphere and encourage dispersion during training? Come to the @GRaM_workshop poster session 2 on 27.07 Shoutout to my collaborators Hua Chang Bakker and @vnfrombucharest 💫

English

1.6K

Di Wu retweetledi

Kyunghyun Cho@kchonyc·25 Haz

modern LM research seems to be the exact repetition of MT research. here goes the prediction; someone will reinvent minimum Bayes risk decoding but will call it super-aligned, super-reasoning majority voting of galaxy-of-thoughts.

English

393

80.3K

Di Wu retweetledi

Marzena Karpinska@mar_kar_·25 Haz

Can #LLMs truly reason over loooong context? 🤔 NoCha asks LLMs to verify claims about *NEW* fictional books 🪄 📚 ⛔ LLMs that solve needle-in-the-haystack (~100%) struggle on NoCha! ⛔ None of 11 tested LLMs reach human performance → 97%. The best, #GPT-4o, gets only 55.8%.

English

460

121.3K

Di Wu retweetledi

Barry Haddow@bazril·26 Haz

@alexandrabirch1 asking the important questions in the #eamt2024 keynote

English

744

Keşfet

@GoogleResearch @jlibovicky @jindra_helcl @baohao_liao @GRaM_org_ @icmlconf @GRaM_workshop @vnfrombucharest