Mattia Verasani

6K posts

@MatRazor

Joined December 2017

320 Following · 75 Followers

Mattia Verasani retweeted

PyTorch@PyTorch·14h

Model Optimization and Post-Training Quantization Model quantization is an effective method to reduce VRAM usage and improve inference performance on consumer devices. By lowering computational and memory requirements while preserving model quality, quantization helps AI models run more efficiently in resource-constrained environments. This post walks through how to use NVIDIA Model Optimizer to quantize a CLIP model in FP8 format with the post-training quantization (PTQ) method, including an example workflow exporting a PyTorch checkpoint. Read the complete blog post: developer.nvidia.com/blog/model-qua…
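
To make the idea of FP8 post-training quantization concrete, here is a minimal pure-Python sketch of per-tensor calibration followed by quantize/dequantize on the E4M3 grid. It does not use NVIDIA Model Optimizer; the function names (`round_to_e4m3`, `calibrate_scale`, `fake_quantize`) and the sample weights are hypothetical and chosen only for illustration.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on the E4M3 grid (3 mantissa bits)."""
    if x == 0.0:
        return 0.0
    x = max(-E4M3_MAX, min(E4M3_MAX, x))  # saturate instead of overflowing
    _, exp = math.frexp(abs(x))           # abs(x) = m * 2**exp with 0.5 <= m < 1
    exp = max(exp - 1, -6)                # floor(log2|x|), floored at the min normal exponent
    step = 2.0 ** (exp - 3)               # spacing between neighbouring values at this exponent
    return round(x / step) * step


def calibrate_scale(samples: list[float]) -> float:
    """Per-tensor calibration: map the largest observed magnitude to E4M3_MAX."""
    amax = max(abs(v) for v in samples)
    return amax / E4M3_MAX if amax > 0 else 1.0


def fake_quantize(values: list[float], scale: float) -> list[float]:
    """Quantize onto the FP8 grid, then dequantize back to float."""
    return [round_to_e4m3(v / scale) * scale for v in values]


# Hypothetical weights standing in for one tensor of a real checkpoint.
weights = [0.013, -0.72, 0.25, 1.9, -0.004]
scale = calibrate_scale(weights)
dequantized = fake_quantize(weights, scale)
max_rel_err = max(abs(a - b) / abs(a) for a, b in zip(weights, dequantized))
```

With three mantissa bits, the relative rounding error of any value in the normal range stays at or below 1/16, which is why a well-chosen scale matters more than the rounding itself.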


Mattia Verasani retweeted

Rosinality@rosinality·22h

arxiv.org/abs/2605.24326 Meta's experience on multi-datacenter training. They have used a PP schedule called Doraemon PP which allows integration with ZeRO-2/3.


Mattia Verasani retweeted

Raytar@Raytar·1d

he tested 5760 architectures at Google for a full year. the winner was the original Transformer from 2017. Hyung Won Chung told that story at MIT with a small smile. then went to OpenAI and trained o1. 1 hour. free. by one of the few people on earth who actually moves the frontier. meanwhile your feed is full of guys writing architecture threads who have never trained a model anyone uses. he just told MIT that 99% of AI research is theater. your AI worldview was built by men who read his papers. badly. now you can read him directly. you will rewatch this. save it now.

Raytar@Raytar

"I was definitely the first prompt engineer at Anthropic. Might have been the first in the world." Alex Albert just spent 35 minutes explaining how they train Claude's personality from the inside. 35 minutes. free. by the person who invented the role. most people think Claude's character is a system prompt. it's not. you'll never look at Claude the same way.


Mattia Verasani retweeted

Rosinality@rosinality·1d

arxiv.org/abs/2605.23857 Could it be useful to distill from a smaller model? I think, beyond distillation, we could get some signal from the loss difference across the scales.


Mattia Verasani retweeted

zhyncs@zhyncs42·2d

Correctness is critical for LLM inference engines. Recently, I found TRT-LLM’s work on Hypothesis Testing Methodology to be extremely professional. github.com/NVIDIA/TensorR…


Mattia Verasani retweeted

Greg Brockman@gdb·2d

self improvement prompt for codex

Vaibhav (VB) Srivastav@reach_vb

UPDATE: Came up with an even better version of this prompt after the feedback.

Ask Codex to look across your sessions, Memories, and Chronicle, identify patterns, reuse what already exists, and only create the smallest useful skill, subagent, or automation.

"Look back over my recent work from the last 30 days, or all available history if shorter, and identify repeated manual workflows worth packaging.

Use available evidence in this order:
- Recent Codex sessions and task summaries.
- Codex Memories and rollout summaries to find patterns repeated across sessions.
- Chronicle, if enabled, to spot repeated work outside Codex. Use Chronicle for discovery only; confirm important details in the relevant source system when possible.
- Existing skills, custom agents, and automations, so you reuse or extend what already exists instead of duplicating it.

Look broadly for work that is repeated, time-consuming, error-prone, context-heavy, or benefits from a consistent process. Include workflows across coding, research, writing, planning, communication, operations, analysis, and personal administration.

Only act on a candidate when it:
- occurred at least twice, or is clearly likely to recur and costly to repeat;
- has stable inputs, a repeatable procedure, and a clear output or stopping condition;
- would materially improve speed, quality, consistency, or reliability;
- is not already adequately covered.

Choose the smallest appropriate form:
- Skill: a reusable workflow or playbook.
- Custom subagent: a bounded specialist role or investigation task suitable for delegation.
- Automation: a scheduled or recurring check, report, reminder, or monitor.
- Skip: work that is too one-off, ambiguous, sensitive, or poorly evidenced to package.

First produce a compact shortlist with:
- repeated workflow
- supporting evidence and dates
- frequency/confidence
- recommended form: skill, subagent, automation, extend existing, or skip
- why it is or is not worth creating

Then create only the high-confidence missing items. Keep them narrow, practical, source-aware, and easy to validate. Do not create speculative, overlapping, or overly broad assets.

Finish with:
- what you created or extended
- what you deliberately skipped
- what needs more evidence before packaging"


Mattia Verasani retweeted

Saining Xie@sainingxie·5d

check out RAEv2 led by Jas. through extensive exps, we found some really intriguing behaviors showing why strong representation encoders are key for pixel decoders. spoiler: it’s not about hillclimbing fid; new metrics like ep@fid-k/fdr^k show there’s a lot more left to explore!

Jaskirat Singh@1jaskiratsingh

In Oct last year, Representation Autoencoders provided an elegant solution to unified tokenization for understanding and generation. Today we make them a bit more simple. a bit more general. Result: >10x faster convergence, better reconstruction, better generation. And yes we test them on T2I and world models :) Introducing RAEv2


Mattia Verasani retweeted

Gabriele Berton@gabriberton·4d

Apply here to join the frontier of computer vision!

Nithish Kannen@NithishKannen

Our Gemini Vision team @GoogleDeepMind is hiring in MTV/SF. Join us to push the frontiers of visual perception, reasoning and generation, and contribute to Gemini, Nano Banana and Omni. Also get to do cool research such as Vision Banana 🍌: deepmind.google/research/publi…. Job posting below. It's one of the best times to be working on Vision as the frontier is moving rapidly, come join us!


Mattia Verasani retweeted

Ivan Fioravanti ᯅ@ivanfioravanti·3d

This series of articles is great! Understanding System Design is key to be able to drive your coding agents correctly!

Fernando@Franc0Fernand0

If you're a software engineer who wants to upskill in system design, read these 14 articles (links below):


Mattia Verasani retweeted

Song Han@songhan_mit·3d

Explore our kernel design agents:


Mattia Verasani retweeted

Swaroop Mishra@Swarooprm7·4d

Apply to join the Gemini vision team! Highly Recommend!

Nithish Kannen@NithishKannen


Mattia Verasani retweeted

Sebastian Raschka@rasbt·6d

It's been *almost* a bit quiet around LLM architecture releases in the past two weeks 😅 An interesting tidbit is the parallel block design. Via the Cmd-A tech report: "equivalent performance but significant improvement in throughput compared to the vanilla transformer block."

Cohere@cohere

Introducing: Cohere Command A+ We’ve created our most powerful LLM yet, optimized it to run on as little hardware as possible, and released it open-source for all.
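
For readers unfamiliar with the parallel block design mentioned above, the usual formulation (as popularized by GPT-J and PaLM) is sketched below next to the standard pre-norm block. That Command A's block matches this exact form is an assumption; the notation ($\mathrm{LN}$ for layer norm, $\mathrm{Attn}$, $\mathrm{MLP}$) is chosen here for illustration.

```latex
\begin{aligned}
\text{sequential:}\quad & h = x + \mathrm{Attn}(\mathrm{LN}_1(x)), \qquad y = h + \mathrm{MLP}(\mathrm{LN}_2(h)) \\
\text{parallel:}\quad & y = x + \mathrm{Attn}(\mathrm{LN}(x)) + \mathrm{MLP}(\mathrm{LN}(x))
\end{aligned}
```

In the parallel form both branches read the same normalized input, so the MLP no longer waits on the attention output and the input projections can be fused, which is the usual explanation for the throughput gain.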


Mattia Verasani retweeted

JFPuget 🇫🇷🇺🇦🇨🇦🇬🇱@JFPuget·6d

Interesting slides. Using LLMs to generate code easily results in hacking the reward, see slides 41-49. A very similar phenomenon is at play in kaggle neurogolf competition where the host has to fix the evaluator every week to catch new reward hacking tricks. There is much more in the presentation, have a look.

Mark Saroufim@marksaroufim

It was an honor to give the keynote at MLSys. Covered how AI systems have evolved, why AI is needed to improve them, why results have disappointed, why the future looks amazing, and why I’m working on this at Core Auto. Recording should be out soon; in the meantime, slides


Mattia Verasani retweeted

Arnav Chavan@ArnavChavan6·20 May

🚀 Organizing the Efficient Qwen Competition @icmlconf!
Goal: Minimize LLM inference latency for a single GPU without breaking model quality.
Prizes: $3K / $2K / $1K + present at ICML 2026, Seoul
Getting Started - adaptfm.gitlab.io/call-for-compe…
Leaderboard - d1krc5fcnf73gi.cloudfront.net


Mattia Verasani retweeted

Kaichao You@KaichaoYou·20 May

Welcome a new member in vLLM's RL ecosystem, expanding frontier RL support for Omni-models, a true pioneer in this category!

vLLM@vllm_project

🎉 Congrats to the VeRL-Omni team on the pre-release of a general RL post-training framework for multimodal generative models. Built on verl + vllm-omni. vLLM-Omni handles the multimodal rollout with step-wise continuous batching and embedding caching; vLLM serves the VLM-as-judge / OCR reward model, overlapped with rollout and training. In the Qwen-Image OCR demo, moving the reward to its own GPU cuts per-step wall-clock by ~14%. Released: Qwen-Image with FlowGRPO / MixGRPO / GRPO-Guard. BAGEL and Qwen3-Omni-Thinker PR-ready. Excited to push multimodal generative RL forward together with VeRL-Omni and the broader community. 🙌 📖 vllm.ai/blog/2026-05-1… 🔗 github.com/verl-project/v…


Mattia Verasani retweeted

vLLM@vllm_project·20 May

KV cache shouldn't disappear every time vLLM restarts. With @novita_labs, we're sharing PegaFlow, a production-grade external KV cache service that plugs into vLLM through the external KV connector interface.

PegaFlow runs as a standalone Rust daemon owning the host KV pool, SSD cache, and RDMA resources. vLLM workers attach via CUDA IPC + gRPC, and cache survives engine crashes, upgrades, and model switches.

In production-oriented evaluations:
🚀 2.15× faster vLLM startup with a pre-warmed 500 GiB host pool
📈 56% higher throughput for 8 Qwen3-8B instances sharing one cache
⚡ 72% higher throughput for DeepSeek-V3.2 MLA TP8 (logical KV stored once, not per rank)
🌐 194 GB/s average remote-read throughput across nodes

Three-level hierarchy: pinned DRAM, remote DRAM over RDMA, local SSD on io_uring. Integrates through the existing `kv_transfer_config` path, with no vLLM source changes.

📖 vllm.ai/blog/2026-05-1…


Mattia Verasani retweeted

Noam Brown@polynoamial·19 May

Andrej @karpathy is back in the game! I would have loved for him to rejoin @OpenAI, but I'm happy he's at any frontier lab pushing the field forward. It’s easy to frame this as zero-sum among the labs, but in truth we’re collectively advancing the most important tech of our era.

Andrej Karpathy@karpathy

Personal update: I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply passionate about education and plan to resume my work on it in time.


Mattia Verasani retweeted

Gabriele Berton@gabriberton·18 May

Python code should be full of asserts. Duck typing is just dangerous for deep learning: pass a list instead of a tensor, and you get a weird bug instead of "pass a tensor, not a list". If the code needs a tensor, add an assert.

Gabriele Berton@gabriberton

This is good code. Those asserts make any comment superfluous and stop execution if something's wrong
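
The advice in the two posts above can be sketched in a few lines. To keep the example dependency-free, a tiny hypothetical `Tensor` class stands in for `torch.Tensor`; the function `row_means` and its `expected_cols` parameter are likewise made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    """Hypothetical stand-in for torch.Tensor so the sketch needs no dependencies."""
    data: list[list[float]]

    @property
    def shape(self) -> tuple[int, int]:
        return (len(self.data), len(self.data[0]) if self.data else 0)


def row_means(x: Tensor, expected_cols: int) -> list[float]:
    # Fail fast with a readable message instead of a confusing error downstream.
    assert isinstance(x, Tensor), f"pass a Tensor, not a {type(x).__name__}"
    assert x.shape[1] == expected_cols, f"expected {expected_cols} columns, got shape {x.shape}"
    return [sum(row) / expected_cols for row in x.data]


batch = Tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
means = row_means(batch, expected_cols=3)

try:
    row_means([[1.0, 2.0, 3.0]], expected_cols=3)  # a plain list, not a Tensor
except AssertionError as err:
    message = str(err)
```

One caveat worth knowing: Python skips `assert` statements when run with the `-O` flag, so checks that must survive in optimized runs are better written as explicit `raise` statements.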


Mattia Verasani retweeted

finbarr@finbarrtimbers·18 May

This is an elegant paper; hope to try it out soon.

SemiAnalysis@SemiAnalysis_

Sparse attention mechanisms are finally moving beyond academic benchmarks into production systems, including DeepSeek Sparse Attention, and recently @NousResearch 's Lighthouse Attention. BLASST by NVIDIA, from paper Dynamic Blocked Attention Sparsity via Softmax Thresholding, attempts to sparsify attention in a different way, leveraging a similar rescale factor threshold idea from Flash Attention 4. We expect to see more interesting sparse attention techniques in the future. arxiv.org/abs/2512.12087 (2/4)
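
As a rough illustration of what thresholding blocks inside an online softmax can look like, here is a single-query sketch in plain Python. It is one plausible reading of the general idea, not the paper's kernel or its exact criterion: the skip rule (drop a block when its maximum score falls more than `-log(threshold)` below the running maximum) and all names are assumptions.

```python
import math


def blocked_attention(scores, values, block_size, threshold):
    """Single-query attention over key blocks using an online softmax.

    A block is skipped when block_max - running_max < log(threshold), i.e. when
    every weight in it would be below `threshold` relative to the largest weight
    seen so far. threshold=0.0 disables skipping and gives exact attention.
    """
    dim = len(values[0])
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * dim
    skipped = 0
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        block_max = max(s_blk)
        if threshold > 0.0 and block_max - running_max < math.log(threshold):
            skipped += 1  # contribution is negligible: no exp, no value reads
            continue
        new_max = max(running_max, block_max)
        rescale = math.exp(running_max - new_max)  # 0.0 for the first block
        denom *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc], skipped


# Hypothetical scores: blocks 2 and 4 are far below the running maximum.
scores = [8.0, 7.5, -6.0, -7.0, 7.0, 6.5, -9.0, -8.5]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [-5.0, 2.0],
          [0.5, 0.5], [1.0, 1.0], [9.0, -9.0], [3.0, 3.0]]

exact, exact_skipped = blocked_attention(scores, values, block_size=2, threshold=0.0)
sparse, sparse_skipped = blocked_attention(scores, values, block_size=2, threshold=1e-3)
max_diff = max(abs(a - b) for a, b in zip(exact, sparse))
```

The comparison is done in the log domain so that large score gaps never overflow `math.exp`; the first block is always processed because the running maximum starts at negative infinity.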


Mattia Verasani retweeted

Dimitris Papailiopoulos@DimitrisPapail·17 May

btw we measured this in Memento: flushing your KV cache leads to measurably worse performance, no matter how good the model is x.com/DimitrisPapail…

Dimitris Papailiopoulos@DimitrisPapail

Please stop flushing the KV cache in Claude Code every x hrs of being idle. When i wake up and go back to a session that was running through the night, but stalled for whatever reason, Claude is noticeably far worse than resuming within the time frame of not flushing. Also i hate hearing I’m absolutely right when I’m not. :) has significantly reduced my trust in the model.
