EdinburghNLP

1.3K posts


@EdinburghNLP

The Natural Language Processing Group at the University of Edinburgh.

Edinburgh, Scotland · Joined May 2017
160 Following · 13.5K Followers
Pinned Tweet
EdinburghNLP@EdinburghNLP·
Join our PhD programme in Designing Responsible Natural Language Processing at the UKRI AI Centre for Doctoral Training, University of Edinburgh. Applications are now re-opened for Home fee status candidates (past candidates need not re-apply). responsiblenlp.org
EdinburghNLP retweeted
Vivek Iyer@remorax98·
Super excited to share my internship project at FAIR @AIatMeta 🚀 We introduce Spectrum -- an encoder-decoder LM pretrained using omnilingual & cross-modal sentence embeddings. Trained on English datasets alone, it outperforms strong baselines like Llama and SpiritLM on multilingual (900+ languages) and speech understanding benchmarks — despite never being directly exposed to multilingual or speech data during training. Curious how? Read on -- and check out the OmniSONAR technical report for the full details: ai.meta.com/research/publi… 👀🧵
EdinburghNLP retweeted
Yifu Qiu@yifuqiu98·
Glad to see model steering in the spectral space works for attention and the long context as well! We also show that spectral editing of activations can steer model behavior to alleviate hallucination and bias! proceedings.neurips.cc/paper_files/pa…
Waylon Li@li_waylon

🚀 Excited to share our paper "Spectral Attention Steering for Prompt Highlighting" has been accepted to ICLR 2026 and the camera-ready version is finally live! We’ve found a way to steer LLM attention that is actually effective, fast and compatible with modern hardware.
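The spectral steering idea above can be illustrated with a generic sketch: extract the dominant direction separating two sets of hidden states via SVD, then shift activations along it. This is a toy illustration of the general pattern, not the papers' actual method; `steering_direction`, `steer`, and the data are all invented for the example.

```python
import numpy as np

def steering_direction(acts_pos, acts_neg):
    """Top right-singular vector of the difference between two sets of
    hidden states (rows = examples, columns = hidden dimensions)."""
    diff = acts_pos - acts_neg
    _, _, vt = np.linalg.svd(diff, full_matrices=False)
    return vt[0]

def steer(hidden, direction, alpha=1.0):
    """Shift a hidden state along the steering direction."""
    return hidden + alpha * direction

# Toy data: the two sets differ only along the first hidden dimension,
# so the recovered direction should align with that axis.
rng = np.random.default_rng(0)
acts_neg = rng.normal(size=(16, 8))
acts_pos = acts_neg.copy()
acts_pos[:, 0] += 2.0
direction = steering_direction(acts_pos, acts_neg)
```

Applying `steer` at inference time then nudges a hidden state toward the behaviour associated with the positive set.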

EdinburghNLP retweeted
Waylon Li@li_waylon·
🚀 Excited to share our paper "Spectral Attention Steering for Prompt Highlighting" has been accepted to ICLR 2026 and the camera-ready version is finally live! We’ve found a way to steer LLM attention that is actually effective, fast and compatible with modern hardware.
EdinburghNLP retweeted
Farooq Wani@wanifarooq848·
Your VLM gives the same answer before and after a tiny image change. So it's robust, right? Wrong. In our new paper, we show that VLMs can preserve their predictions while their internal representations drift to regions normally occupied by completely unrelated images. 🧵👇
EdinburghNLP retweeted
Filip Szatkowski@f_szatkowski·
We are presenting "Universal Properties of Activation Sparsity in Modern Large Language Models" at ICLR 2026! We ask a simple question: how sparse are modern LLMs, really — and does it matter? 👇
EdinburghNLP retweeted
Zheng Zhao@zhengzhao97·
🎉 Thrilled to announce our paper "Verifying Chain-of-Thought Reasoning via Its Computational Graph" has been accepted as an ICLR 2026 ORAL! 🚨 We look inside the "black box" to detect reasoning errors by analyzing the model's internal circuit. 🧠⚡️ Read more on CRV 👇
Zheng Zhao@zhengzhao97

Thrilled to share our latest research on verifying CoT reasoning, completed during my recent internship at FAIR @metaai. In this work, we introduce Circuit-based Reasoning Verification (CRV), a new white-box method to analyse and verify how LLMs reason, step by step.

EdinburghNLP retweeted
Edoardo Ponti@PontiEdoardo·
World modelling simulates possible futures from past states and actions. But actions are scarce and ambiguous. Can we teach foundation models (LLMs/VLMs) world modelling from data without action labels? We introduce “Self-Improving World Modelling with Iterative RL” (SWIRL ꩜)
EdinburghNLP retweeted
Hongru Wang@HongruWang007·
We've made a major revision to our theory of agents, especially incorporating valuable feedback from RL/ML friends!! 🧠 Theory of Agent (ToA): a position on tool-augmented intelligence arxiv.org/pdf/2506.00886… It provides an early unified theory that models the relationship between reasoning and acting, sheds light on what makes a good agent, and lays out a roadmap for training such agents. I promise it's worth reading if you study agents. 🥳🥳
Why this matters 🚨
If we train agents that always outsource thinking/reasoning:
• they look powerful
• they stay shallow
• they don't grow intelligence
ToA reframes agent alignment as epistemic self-restraint, not tool maximalism.
EdinburghNLP retweeted
Pasquale Minervini@PMinervini·
Folks, for those of you who may need to hear this -- if things didn't go well with #ICLR2026 don't worry, it doesn't define you as a person, and you can use the feedback to improve your work and aim for another deadline! (ICML/KDD..) Just do your best, and everything will be fine
EdinburghNLP retweeted
Piotr Nawrot@p_nawrot·
🚀📉 Efficient Inference Just Got a Major Upgrade #NVIDIAResearch We've just released Qwen3-8B-DMS-8x, fine-tuned for 8x KV cache compression. It maintains dense-model accuracy on demanding tasks like AIME24, and is perfect for inference-time scaling. The code on HF works out of the box. With DMS we fine-tune models end-to-end via distillation; this works much better than the "token importance" proxies used in typical eviction methods. It's state-of-the-art for KV eviction tailored for fast inference: it adds a negligible number of parameters and little computation to each KV head, and requires as few as 1K fine-tuning steps to reach 8x compression. It speeds up both the prefill and generation phases of Transformer LLMs, and can be combined with sparse attention methods such as DSA. Co-authors: @AdrianLancucki, @CStanKonrad, @PontiEdoardo Links in the comments 👇
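For contrast with DMS's learned, end-to-end approach, the "token importance" proxy used by typical eviction methods can be sketched in a few lines. This is a hypothetical baseline for illustration only, not DMS or any specific published method; `evict_kv` and the toy cache are invented.

```python
import numpy as np

def evict_kv(keys, values, attn, keep):
    """Naive importance-based eviction: keep the `keep` cache positions
    that received the most accumulated attention mass.

    keys, values: (seq, head_dim) cached tensors for one head
    attn:         (queries, seq) attention weights observed so far
    """
    importance = attn.sum(axis=0)                    # total mass per position
    kept = np.sort(np.argsort(importance)[-keep:])   # retain original order
    return keys[kept], values[kept], kept

# Toy cache of 6 positions; positions 1 and 4 dominate the attention mass.
keys = np.arange(12.0).reshape(6, 2)
values = keys * 10
attn = np.zeros((3, 6))
attn[:, 1] = 0.6
attn[:, 4] = 0.3
attn[:, 0] = 0.1
k2, v2, kept = evict_kv(keys, values, attn, keep=2)
```

The weakness of such proxies, as the tweet notes, is that past attention mass is only a rough stand-in for which entries future queries will actually need.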
EdinburghNLP retweeted
Pasquale Minervini@PMinervini·
If you are interested in manually curated and leak-free benchmarks, check out MMLU-Redux: aclanthology.org/2025.naacl-lon… (NAACL 2025) -- we found significant issues in several MMLU topics/subsets and manually fixed them with a pool of human experts
Eric W. Tramel@fujikanaeda

The presence of a leading whitespace leaks the correct choice selection in the MMLU-Pro benchmark. Am I missing something? Seems to impact Chemistry, Physics, and Math. HF Issue in reply.
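The kind of leak described above is easy to check for mechanically. A hypothetical sketch (`leaky_option` and the toy items are illustrative, not MMLU-Pro's actual data or the reported issue's exact shape):

```python
def leaky_option(choices):
    """Return the index of the single answer option carrying leading
    whitespace, if exactly one exists -- a formatting tell that could
    reveal the answer without reading the question."""
    flagged = [i for i, c in enumerate(choices) if c != c.lstrip()]
    return flagged[0] if len(flagged) == 1 else None

# A toy item where the stray space betrays option 2:
leaky_option(["4", "6", " 8", "10"])   # -> 2
leaky_option(["4", "6", "8", "10"])    # -> None
```

Running a scan like this over each subset would show whether the tell correlates with the gold label, as the tweet reports for Chemistry, Physics, and Math.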

EdinburghNLP retweeted
Cyrus Wai-Chung Kwan@cyruskwan1997·
OpenSIR: Open-Ended Self-Improving Reasoner
Can LLMs teach themselves math without any training data? OpenSIR is an open-ended self-play framework where:
• Teacher proposes diverse, appropriately challenging problems
• Student learns to solve them
• Both co-evolve together
1/6
EdinburghNLP retweeted
Frank Keller@frank_e_keller·
I'm excited to announce that our work on contrastive learning for story salience has been accepted at EACL 2026. Thanks to my brilliant co-authors, Igor Sterner and Alex Lascarides. Paper: arxiv.org/abs/2601.07765 Data and Code: github.com/igorsterner/Na…
EdinburghNLP retweeted
Pasquale Minervini@PMinervini·
This guide from the amazing folks at @huggingface features Intra-Document Causal Masking (@yuzhaouoe et al., arxiv.org/abs/2402.13991, ACL'24 Oral), a key ingredient of all frontier LLM pre-training recipes!
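The idea behind intra-document causal masking can be sketched minimally: when several documents are packed into one training sequence, each token attends causally but only within its own document. A toy illustration, not the paper's implementation (`intra_document_causal_mask` is an invented name):

```python
import numpy as np

def intra_document_causal_mask(doc_ids):
    """Position i may attend to position j iff j <= i (causal) and
    both tokens come from the same packed document."""
    doc_ids = np.asarray(doc_ids)
    n = len(doc_ids)
    causal = np.tril(np.ones((n, n), dtype=bool))
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    return causal & same_doc

# Two documents packed into one sequence: tokens 0-2 are doc 0, 3-4 are doc 1.
mask = intra_document_causal_mask([0, 0, 0, 1, 1])
# Token 3 (first token of doc 1) cannot attend across the boundary into doc 0.
```

This keeps the throughput benefits of sequence packing without letting unrelated documents attend to each other.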
Ahmad@TheAhmadOsman

Hugging Face has released a 214-page MASTERCLASS on how to train LLMs
> it's called The Smol Training Playbook
> and if you want to learn how to train LLMs,
> this GIFT is for you
> this training bible walks you through the ENTIRE pipeline
> covers every concept that matters from why you train,
> to what you train, to how you actually pull it off
> from pre-training, to mid-training, to post-training
> it turns vague buzzwords into step-by-step decisions
> architecture, tokenization, data strategy, and infra
> highlights the real-world gotchas
> instabilities, scaling headaches, debugging nightmares
> distills lessons from building actual
> state-of-the-art LLMs, not just toy models

how modern transformer models are actually built
> tokenization: the secret foundation of every LLM
> tokenizer fundamentals
> vocabulary size
> byte pair encoding
> custom vs existing tokenizers
> all the modern attention mechanisms are here
> multi-head attention
> multi-query attention
> grouped-query attention
> multi-latent attention
> every positional encoding trick in the book
> absolute position embedding
> rotary position embedding
> YaRN (yet another RoPE extension)
> ablate-by-frequency positional encoding
> no position embedding
> randomized no position embedding
> stability hacks that actually work
> z-loss regularization
> query-key normalization
> removing weight decay from embedding layers
> sparse scaling, handled
> mixture-of-experts scaling
> activation ratio tuning
> choosing the right granularity
> sharing experts between layers
> load balancing across experts
> long-context handling via SSMs
> hybrid models: transformer plus state space models

data curation = most of your real model quality
> data curation is the main driver of your model's actual quality
> architecture alone won't save you
> building the right data mixture is an art,
> not just dumping in more web scrapes
> curriculum learning, adaptive mixes, ablate everything
> you need curriculum learning:
> design data mixes that evolve as training progresses
> use adaptive mixtures that shift emphasis
> based on model stage and performance
> ablate everything: run experiments to systematically
> test how each data source or filter impacts results
> smollm3 data
> the smollm3 recipe: balanced english web data,
> broad multilingual sources, high-quality code, and diverse math datasets
> without the right data pipeline,
> even the best architecture will underperform

the training marathon
> do your preflight checklist or die
> check your infrastructure,
> validate your evaluation pipelines,
> set up logging, and configure alerts
> so you don't miss silent failures
> scaling surprises are inevitable
> things will break at scale in ways they never did in testing
> vanishing throughput? that usually means
> you've got a hidden shape mismatch or
> batch dimension bug killing your GPU utilization
> sudden drops in throughput?
> check your software stack for inefficiencies,
> resource leaks, or bad dataloader code
> seeing noisy, spiky loss values?
> your data shuffling is probably broken,
> and the model is seeing repeated or ordered data
> performance worse than expected?
> look for subtle parallelism bugs
> tensor parallel, data parallel,
> or pipeline parallel gone rogue
> monitor like your GPUs depend on it (because they do)
> watch every metric, track utilization, spot anomalies fast
> mid-training is not autopilot
> swap in higher-quality data to improve learning,
> extend the context window if you want bigger inputs,
> and use multi-stage training curricula to maximize gains
> the difference between a good model and a failed run is
> almost always vigilance and relentless debugging during this marathon

post-training
> post-training is where your raw base model
> actually becomes a useful assistant
> always start with supervised fine-tuning (SFT)
> use high-quality, well-structured chat data and
> pick a solid template for consistent turns
> SFT gives you a stable, cost-effective baseline
> don't skip it, even if you plan to go deeper
> next, optimize for user preferences
> direct preference optimization (DPO),
> or its variants like kernelized (KTO),
> online (ORPO), or adversarial (APO)
> these methods actually teach the model
> what "better" looks like beyond simple mimicry
> once you've got preference alignment, go on-policy:
> reinforcement learning from human feedback (RLHF)
> or on-policy distillation, which lets your model learn
> from real interactions or stronger models
> this is how you get reliability and sharper behaviors
> the post-training pipeline is where
> assistants are truly sculpted;
> skipping steps means leaving performance,
> safety, and steerability on the table

infra is the boss fight
> this is where most teams lose time,
> money, and sanity if they're not careful
> inside every GPU
> you've got tensor cores and CUDA cores for the heavy math,
> plus a memory hierarchy (registers, shared memory, HBM)
> that decides how fast you can feed data to the compute units
> outside the GPU, your interconnects matter
> PCIe for GPU-to-CPU,
> NVLink for ultra-fast GPU-to-GPU within a node,
> InfiniBand or RoCE for communication between nodes,
> and GPUDirect Storage for feeding massive datasets
> straight from disk to GPU memory
> make your infra resilient:
> checkpoint your training constantly,
> because something will crash;
> monitor node health so you can kill or restart
> sick nodes before they poison your run
> scaling isn't just "add more GPUs"
> you have to pick and tune the right parallelism:
> data parallelism (DP), pipeline parallelism (PP), tensor parallelism (TP),
> or fully sharded data parallel (FSDP);
> the right combo can double your throughput,
> the wrong one can bottleneck you instantly

to recap
> always start with WHY
> define the core reason you're training a model
> is it research, a custom production need, or to fill an open-source gap?
> spec what you need: architecture, model size, data mix, assistant type
> transformer or hybrid
> set your model size
> design the right data mixture
> decide what kind of assistant or
> use case you're targeting
> build infra for the job, plan for chaos, pick your stability tricks
> build infrastructure that matches your goals
> choose the right GPUs
> set up reliable storage
> and plan for network bottlenecks
> expect failures, weird bugs,
> and sudden bottlenecks at scale
> select your stability tricks in advance:
> know which techniques you'll use to fight loss spikes,
> unstable gradients, and hardware hiccups

closing notes
> the pace of LLM development is relentless,
> but the underlying principles never go out of style
> and this PDF covers what actually matters
> no matter how fast the field changes
> systematic experimentation is everything
> run controlled tests, change one variable at a time, and document every step
> sharp debugging instincts will save you
> more time (and compute budget) than any paper or library
> deep knowledge of both your software stack
> and your hardware is the ultimate unfair advantage;
> know your code, know your chips
> in the end, success comes from relentless curiosity,
> tight feedback loops, and a willingness to question everything
> even your own assumptions

if I had this two years ago, it would have saved me so much time
> if you're building LLMs,
> read this before you burn GPU months

happy hacking
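One of the stability tricks listed above, z-loss regularization, fits in a few lines: penalize the squared log-partition of the output softmax so logits don't drift to large magnitudes. A minimal sketch under stated assumptions (the 1e-4 coefficient is a commonly used value, not a prescription from the playbook):

```python
import numpy as np

def z_loss(logits, coeff=1e-4):
    """Auxiliary z-loss: coeff * mean((log Z)^2), where
    log Z = logsumexp(logits), computed stably via the max trick."""
    m = logits.max(axis=-1, keepdims=True)
    log_z = m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1))
    return coeff * (log_z ** 2).mean()

# Well-scaled logits incur far less penalty than drifted ones.
small = z_loss(np.array([[1.0, -1.0, 0.5]]))
large = z_loss(np.array([[100.0, -100.0, 50.0]]))
```

In training, this term is added to the cross-entropy loss so the gradient gently pulls log Z back toward zero.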

EdinburghNLP retweeted
Irina Saparina@irisaparina·
Reasoning models are powerful, but they burn thousands of tokens on potentially wrong interpretations for ambiguous requests! 👉 We teach models to think about intent first and provide all interpretations and answers in a single response via RL with dual reward. 🧵1/6
EdinburghNLP retweeted
Edoardo Ponti@PontiEdoardo·
Finally, you can count the r's in strawberry and check if 3.11 is higher than 3.9 without tokenisation interfering: Here's Bolmo, a fully open byte-level LLM with latent tokenisation, derived from a SOTA LLM (Olmo 3). Promising on coding and char-level understanding!
Ai2@allen_ai

Introducing Bolmo, a new family of byte-level language models built by "byteifying" our open Olmo 3—and to our knowledge, the first fully open byte-level LM to match or surpass SOTA subword models across a wide range of tasks. 🧵
