EdinburghNLP

1.3K posts

@EdinburghNLP

The Natural Language Processing Group at the University of Edinburgh.

Edinburgh, Scotland · Joined May 2017

160 Following · 13.5K Followers

Pinned Tweet

EdinburghNLP@EdinburghNLP·25 Mar

Join our PhD programme in Designing Responsible Natural Language Processing at the UKRI AI Centre for Doctoral Training, University of Edinburgh. Applications are now re-opened for Home fee status candidates (past candidates need not re-apply). responsiblenlp.org

EdinburghNLP retweeted

Vivek Iyer@remorax98·7h

Super excited to share my internship project at FAIR @AIatMeta 🚀 We introduce Spectrum -- an encoder-decoder LM pretrained using omnilingual & cross-modal sentence embeddings. Trained on English datasets alone, it outperforms strong baselines like Llama and SpiritLM on multilingual (900+ languages) and speech understanding benchmarks — despite never being directly exposed to multilingual or speech data during training. Curious how? Read on -- and check out the OmniSONAR technical report for the full details: ai.meta.com/research/publi… 👀🧵

EdinburghNLP retweeted

Pasquale Minervini@PMinervini·4d

My amazing colleagues Sid and Michael are looking for a postdoc! 👇

ExLab@an_exlab

We are advertising a postdoc position to work on #generative #models, #structure #induction, and MI #estimation with Michael Gutmann as part of #GenAI hub! elxw.fa.em3.oraclecloud.com/hcmUI/Candidat… Get in touch! (#ML #AI) 👉 homepages.inf.ed.ac.uk/snaraya3/ 👉 michaelgutmann.github.io

EdinburghNLP retweeted

Yifu Qiu@yifuqiu98·3 Mar

Glad to see model steering in the spectral space works for attention and the long context as well! We also show that spectral editing of activations can steer model behavior to alleviate hallucination and bias! proceedings.neurips.cc/paper_files/pa…

Waylon Li@li_waylon

🚀 Excited to share our paper "Spectral Attention Steering for Prompt Highlighting" has been accepted to ICLR 2026 and the camera-ready version is finally live! We’ve found a way to steer LLM attention that is actually effective, fast and compatible with modern hardware.

EdinburghNLP retweeted

Waylon Li@li_waylon·3 Mar

EdinburghNLP retweeted

Pasquale Minervini@PMinervini·2 Mar

MMLU-Redux is a manually curated/corrected version of MMLU--if a model does better at MMLU/MMLU-Pro and same/worse on MMLU-Redux, it's likely that the test set leaked into the training data 🙂 The @Alibaba_Qwen Qwen3.5 model family seems surprisingly strong, congrats to the team!

Ahmad@TheAhmadOsman

not only does the Qwen 3.5 9B beat the GPT OSS 20B it BEATS the 120B INCREDIBLE stuff
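
The contamination heuristic from the MMLU-Redux post above can be sketched as a simple comparison between a candidate model and a reference model. This is only an illustration: the `Scores` class, the `leak_suspected` function, the tolerance parameter, and the accuracy values are all hypothetical, not taken from the MMLU-Redux paper or from any reported result.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """Accuracy (0 to 1) of one model on an original and a curated benchmark."""
    original: float  # e.g. MMLU or MMLU-Pro
    curated: float   # e.g. MMLU-Redux


def leak_suspected(reference: Scores, candidate: Scores, tol: float = 0.0) -> bool:
    """Flag the pattern from the post: the candidate is better than the
    reference on the original benchmark, but the same or worse on the
    curated one. This is a warning sign, not proof of contamination."""
    gain_original = candidate.original - reference.original
    gain_curated = candidate.curated - reference.curated
    return gain_original > tol and gain_curated <= tol


# Made-up numbers, purely for illustration.
reference = Scores(original=0.70, curated=0.72)
suspicious = Scores(original=0.80, curated=0.71)
consistent = Scores(original=0.80, curated=0.80)

print(leak_suspected(reference, suspicious))  # True
print(leak_suspected(reference, consistent))  # False
```

In practice a non-zero `tol` would be needed to absorb evaluation noise; the right value is a judgment call.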

EdinburghNLP retweeted

Farooq Wani@wanifarooq848·26 Feb

Your VLM gives the same answer before and after a tiny image change. So it's robust, right? Wrong. In our new paper, we show that VLMs can preserve their predictions while their internal representations drift to regions normally occupied by completely unrelated images. 🧵👇

EdinburghNLP retweeted

Filip Szatkowski@f_szatkowski·23 Feb

We are presenting "Universal Properties of Activation Sparsity in Modern Large Language Models" at ICLR 2026! We ask a simple question: how sparse are modern LLMs, really — and does it matter? 👇

EdinburghNLP retweeted

Pasquale Minervini@PMinervini·21 Feb

Ok, we can safely say that one of @yuzhaouoe's first contributions as a @EdinburghNLP PhD student (intra-document causal masking -- arxiv.org/abs/2402.13991, ACL'24) is now a standard/mainstream industry practice 🙂🚀🚀🚀

Alex Wa@_djdumpling

new blog! What methodologies do labs use to train frontier models? The blog distills 7 open-weight model reports from frontier labs, covering architecture, stability, optimizers, data curation, pre/mid/post-training + RL, and behaviors/safety djdumpling.github.io/2026/01/31/fro…
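
For readers unfamiliar with intra-document causal masking, mentioned in the post above: when several documents are packed into one training sequence, each token attends only to earlier tokens of its own document. The sketch below builds such a mask in plain Python. The function name and the `doc_ids` representation are assumptions made for illustration, not the paper's code.

```python
def intra_document_causal_mask(doc_ids):
    """Return mask[i][j] == True when position i may attend to position j.

    doc_ids[k] is the index of the document that token k belongs to in a
    packed sequence. Attention is allowed only to the same or an earlier
    position (causal) within the same document (intra-document).
    """
    n = len(doc_ids)
    return [
        [j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
        for i in range(n)
    ]


# Two packed documents: tokens 0-1 belong to document 0, tokens 2-4 to document 1.
mask = intra_document_causal_mask([0, 0, 1, 1, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# ..x..
# ..xx.
# ..xxx
```

A plain causal mask would fill the whole lower triangle; the block-diagonal shape here is what stops tokens from attending across document boundaries.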

EdinburghNLP retweeted

Zheng Zhao@zhengzhao97·16 Feb

🎉 Thrilled to announce our paper "Verifying Chain-of-Thought Reasoning via Its Computational Graph" has been accepted as an ICLR 2026 ORAL! 🚨 We look inside the "black box" to detect reasoning errors by analyzing the model's internal circuit. 🧠⚡️ Read more on CRV 👇

Zheng Zhao@zhengzhao97

Thrilled to share our latest research on verifying CoT reasoning, completed during my recent internship at FAIR @metaai. In this work, we introduce Circuit-based Reasoning Verification (CRV), a new white-box method to analyse and verify how LLMs reason, step-by-step.

EdinburghNLP retweeted

Edoardo Ponti@PontiEdoardo·9 Feb

World modelling simulates possible futures from past states and actions. But actions are scarce and ambiguous. Can we teach foundation models (LLMs/VLMs) world modelling from data without action labels? We introduce “Self-Improving World Modelling with Iterative RL” (SWIRL ꩜)

EdinburghNLP retweeted

Hongru Wang@HongruWang007·30 Jan

We made a major revision to our theory of agent, especially in light of valuable feedback from RL/ML friends!! 🧠 Theory of Agent (ToA): a position on tool-augmented intelligence arxiv.org/pdf/2506.00886… It provides an early unified theory that models the relationship between reasoning and acting, sheds light on what makes a good agent, and lays out a roadmap for training such agents. I promise it is worth reading if you study agents. 🥳🥳 Why this matters 🚨 If we train agents that always outsource thinking/reasoning: • they look powerful • they stay shallow • they don’t grow intelligence ToA reframes agent alignment as epistemic self-restraint, not tool maximalism.

EdinburghNLP retweeted

Pasquale Minervini@PMinervini·26 Jan

Folks, for those of you who may need to hear this -- if things didn't go well with #ICLR2026 don't worry, it doesn't define you as a person, and you can use the feedback to improve your work and aim for another deadline! (ICML/KDD..) Just do your best, and everything will be fine

EdinburghNLP retweeted

Piotr Nawrot@p_nawrot·23 Jan

🚀📉 Efficient Inference Just Got a Major Upgrade #NVIDIAResearch We’ve just released Qwen3-8B-DMS-8x, fine-tuned for 8x KV cache compression. It maintains dense-model accuracy on demanding tasks like AIME24, and is perfect for inference-time scaling. The code on HF works out of the box. With DMS we fine-tune models end-to-end via distillation; this works much better than the “token importance” proxies found in usual eviction methods. It’s state-of-the-art for KV eviction tailored for fast inference: it adds a negligible amount of parameters and computation to each KV head, and requires as little as 1K fine-tuning steps to reach 8x compression. It speeds up both the prefill and generation phases of Transformer LLMs, and can be combined with Sparse Attention methods such as DSA. Co-Authors: @AdrianLancucki, @CStanKonrad, @PontiEdoardo Links in the comments 👇
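
To give a rough sense of what 8x KV cache compression means in memory terms, here is a back-of-the-envelope calculation using the standard size formula for a key-value cache. The layer count, head count, head dimension, and sequence length below are hypothetical values chosen for round numbers; they are not the configuration of the released model, and the sketch ignores the small overhead that the post says DMS adds.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of a dense KV cache for one sequence.

    The leading factor 2 accounts for storing one key vector and one value
    vector per token, per KV head, per layer. bytes_per_value=2 corresponds
    to 16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration, for illustration only.
dense = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)
compressed = dense // 8  # 8x compression ratio, as stated in the post

print(dense / 2**30)       # 4.0  (GiB)
print(compressed / 2**30)  # 0.5  (GiB)
```

Read the other way round, the same memory budget would hold roughly eight times as many cached tokens, which is why this kind of compression matters for long generations.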

EdinburghNLP retweeted

Pasquale Minervini@PMinervini·15 Jan

If you are interested in manually curated and leak-free benchmarks, check out MMLU-Redux: aclanthology.org/2025.naacl-lon… (NAACL 2025) -- we found significant issues in several MMLU topics/subsets and manually fixed them with a pool of human experts

Eric W. Tramel@fujikanaeda

The presence of a leading whitespace leaks the correct choice selection in the MMLU-Pro benchmark. Am I missing something? Seems to impact Chemistry, Physics, and Math. HF Issue in reply.
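
The kind of formatting leak described in the quoted post can be checked mechanically. The sketch below flags a multiple-choice item when exactly one option starts with whitespace and that option is the correct one. The item layout (a list of option strings plus an answer index), the function names, and the example item are hypothetical; they do not reflect the actual MMLU-Pro schema or data.

```python
def has_leading_whitespace(option: str) -> bool:
    """True when the option text starts with a space, tab, or newline."""
    return option != option.lstrip()


def whitespace_reveals_answer(options, answer_index: int) -> bool:
    """True when the correct option is the only one with leading whitespace,
    so the formatting alone gives the answer away."""
    flagged = [i for i, opt in enumerate(options) if has_leading_whitespace(opt)]
    return flagged == [answer_index]


# Made-up item for illustration: only the correct option has a leading space.
options = ["2 mol", " 4 mol", "6 mol", "8 mol"]
print(whitespace_reveals_answer(options, answer_index=1))  # True
print(whitespace_reveals_answer([o.strip() for o in options], answer_index=1))  # False
```

Stripping every option before building the prompt, as in the second call, removes this particular cue.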

EdinburghNLP retweeted

Cyrus Wai-Chung Kwan@cyruskwan1997·13 Jan

OpenSIR: Open-Ended Self-Improving Reasoner. Can LLMs teach themselves math without any training data? OpenSIR is an open-ended self-play framework where: • Teacher proposes diverse, appropriately challenging problems • Student learns to solve them • Both co-evolve together 1/6

EdinburghNLP retweeted

Frank Keller@frank_e_keller·13 Jan

I'm excited to announce that our work on contrastive learning for story salience has been accepted at EACL 2026. Thanks to my brilliant co-authors, Igor Sterner and Alex Lascarides. Paper: arxiv.org/abs/2601.07765 Data and Code: github.com/igorsterner/Na…

EdinburghNLP retweeted

Gorjan@gorjanradevski·9 Jan

🚀 New work: Compositional Steering Tokens for LLMs 📄 arxiv.org/abs/2601.05062… We self-distill a compositional operator on behavior pairs that generalizes to unseen combinations, behaviors, and composition sizes. w/ @kgashteo, @GiwonHong413849, @caro__lawrence, @gg42554

EdinburghNLP retweeted

Pasquale Minervini@PMinervini·28 Dec

This guide from the amazing folks at @huggingface features Intra-Document Causal Masking (@yuzhaouoe et al., arxiv.org/abs/2402.13991, ACL'24 Oral), a key ingredient of all frontier LLM pre-training recipes!

Ahmad@TheAhmadOsman

Hugging Face has released a 214-page MASTERCLASS on how to train LLMs > it’s called The Smol Training Playbook > and if you want to learn how to train LLMs, > this GIFT is for you > this training bible walks you through the ENTIRE pipeline > covers every concept that matters from why you train, > to what you train, to how you actually pull it off > from pre-training, to mid-training, to post-training > it turns vague buzzwords into step-by-step decisions > architecture, tokenization, data strategy, and infra > highlights the real-world gotchas > instabilities, scaling headaches, debugging nightmares > distills lessons from building actual > state-of-the-art LLMs, not just toy models
how modern transformer models are actually built > tokenization: the secret foundation of every LLM > tokenizer fundamentals > vocabulary size > byte pair encoding > custom vs existing tokenizers > all the modern attention mechanisms are here > multi-head attention > multi-query attention > grouped-query attention > multi-head latent attention > every positional encoding trick in the book > absolute position embedding > rotary position embedding > YaRN (yet another RoPE extension) > adjusted base frequency (ABF) for RoPE > no position embedding > hybrid RoPE/NoPE layers (RNoPE) > stability hacks that actually work > z-loss regularization > query-key normalization > removing weight decay from embedding layers > sparse scaling, handled > mixture-of-experts scaling > activation ratio tuning > choosing the right granularity > sharing experts between layers > load balancing across experts > long-context handling via ssm > hybrid models: transformer plus state space models
data curation = most of your real model quality > data curation is the main driver of your model’s actual quality > architecture alone won’t save you > building the right data mixture is an art, > not just dumping in more web scrapes > curriculum learning, adaptive mixes, ablate everything > you need curriculum learning: > design data mixes that evolve as training progresses > use adaptive mixtures that shift emphasis > based on model stage and performance > ablate everything: run experiments to systematically > test how each data source or filter impacts results > smollm3 data > the smollm3 recipe: balanced english web data, > broad multilingual sources, high-quality code, and diverse math datasets > without the right data pipeline, > even the best architecture will underperform
the training marathon > do your preflight checklist or die > check your infrastructure, > validate your evaluation pipelines, > set up logging, and configure alerts > so you don’t miss silent failures > scaling surprises are inevitable > things will break at scale in ways they never did in testing > vanishing throughput? that usually means > you’ve got a hidden shape mismatch or > batch dimension bug killing your GPU utilization > sudden drops in throughput? > check your software stack for inefficiencies, > resource leaks, or bad dataloader code > seeing noisy, spiky loss values? > your data shuffling is probably broken, > and the model is seeing repeated or ordered data > performance worse than expected? > look for subtle parallelism bugs > tensor parallel, data parallel, > or pipeline parallel gone rogue > monitor like your GPUs depend on it (because they do) > watch every metric, track utilization, spot anomalies fast > mid-training is not autopilot > swap in higher-quality data to improve learning, > extend the context window if you want bigger inputs, > and use multi-stage training curricula to maximize gains > the difference between a good model and a failed run is > almost always vigilance and relentless debugging during this marathon
post-training > post-training is where your raw base model > actually becomes a useful assistant > always start with supervised fine-tuning (sft) > use high-quality, well-structured chat data and > pick a solid template for consistent turns > sft gives you a stable, cost-effective baseline > don’t skip it, even if you plan to go deeper > next, optimize for user preferences > direct preference optimization (dpo), > or its variants like Kahneman-Tversky optimization (kto), > odds ratio preference optimization (orpo), or anchored preference optimization (apo) > these methods actually teach the model > what “better” looks like beyond simple mimicry > once you’ve got preference alignment, go on-policy: > reinforcement learning from human feedback (rlhf) > or on-policy distillation, which lets your model learn > from real interactions or stronger models > this is how you get reliability and sharper behaviors > the post-training pipeline is where > assistants are truly sculpted; > skipping steps means leaving performance, > safety, and steerability on the table
infra is the boss fight > this is where most teams lose time, > money, and sanity if they’re not careful > inside every gpu > you’ve got tensor cores and cuda cores for the heavy math, > plus a memory hierarchy (registers, shared memory, hbm) > that decides how fast you can feed data to the compute units > outside the gpu, your interconnects matter > pcie for gpu-to-cpu, > nvlink for ultra-fast gpu-to-gpu within a node, > infiniband or roce for communication between nodes, > and gpudirect storage for feeding massive datasets > straight from disk to gpu memory > make your infra resilient: > checkpoint your training constantly, > because something will crash; > monitor node health so you can kill or restart > sick nodes before they poison your run > scaling isn’t just “add more gpus” > you have to pick and tune the right parallelism: > data parallelism (dp), pipeline parallelism (pp), tensor parallelism (tp), > or fully sharded data parallel (fsdp); > the right combo can double your throughput, > the wrong one can bottleneck you instantly
to recap > always start with WHY > define the core reason you’re training a model > is it research, a custom production need, or to fill an open-source gap? > spec what you need: architecture, model size, data mix, assistant type > transformer or hybrid > set your model size > design the right data mixture > decide what kind of assistant or > use case you’re targeting > build infra for the job, plan for chaos, pick your stability tricks > build infrastructure that matches your goals > choose the right GPUs > set up reliable storage > and plan for network bottlenecks > expect failures, weird bugs, > and sudden bottlenecks at scale > select your stability tricks in advance: > know which techniques you’ll use to fight loss spikes, > unstable gradients, and hardware hiccups
closing notes > the pace of LLM development is relentless, > but the underlying principles never go out of style > and this PDF covers what actually matters > no matter how fast the field changes > systematic experimentation is everything > run controlled tests, change one variable at a time, and document every step > sharp debugging instincts will save you > more time (and compute budget) than any paper or library > deep knowledge of both your software stack > and your hardware is the ultimate unfair advantage; > know your code, know your chips > in the end, success comes from relentless curiosity, > tight feedback loops, and a willingness to question everything > even your own assumptions
if i had this two years ago, it would have saved me so much time > if you’re building llms, > read this before you burn gpu months
happy hacking

EdinburghNLP retweeted

Irina Saparina@irisaparina·17 Dec

Reasoning models are powerful, but they burn thousands of tokens on potentially wrong interpretations for ambiguous requests! 👉 We teach models to think about intent first and provide all interpretations and answers in a single response via RL with dual reward. 🧵1/6

EdinburghNLP retweeted

Edoardo Ponti@PontiEdoardo·15 Dec

Finally, you can count the r's in strawberry and check if 3.11 is higher than 3.9 without tokenisation interfering: Here's Bolmo, a fully open byte-level LLM with latent tokenisation, derived from a SOTA LLM (Olmo 3). Promising on coding and char-level understanding!

Ai2@allen_ai

Introducing Bolmo, a new family of byte-level language models built by "byteifying" our open Olmo 3—and to our knowledge, the first fully open byte-level LM to match or surpass SOTA subword models across a wide range of tasks. 🧵
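
To illustrate why byte-level input helps with the character-counting examples in the post above: at the byte level every character of an ASCII string is a separate symbol, so counting letters or comparing digits does not depend on how a subword tokenizer happens to split the text. The sketch below only shows the byte view of the input. It says nothing about how Bolmo or its latent tokenisation works internally, and the helper names are made up for illustration.

```python
def as_bytes(text: str) -> list:
    """UTF-8 byte values of the text: the raw symbols a byte-level model reads."""
    return list(text.encode("utf-8"))


def count_ascii_char(text: str, char: str) -> int:
    """Count occurrences of a single ASCII character by looking at bytes.
    (For ASCII, the code point equals the byte value.)"""
    return as_bytes(text).count(ord(char))


print(as_bytes("strawberry"))               # [115, 116, 114, 97, 119, 98, 101, 114, 114, 121]
print(count_ascii_char("strawberry", "r"))  # 3
print(as_bytes("3.11"), as_bytes("3.9"))    # [51, 46, 49, 49] [51, 46, 57]
```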
