Kamran Chitsaz

132 posts

@KChitsaz

Machine Learning Researcher @Mila_Quebec, MSc of Electrical Engineering at @polymtl

Montreal, QC · Joined March 2021
273 Following · 154 Followers
Pinned Tweet
Kamran Chitsaz (@KChitsaz)
@ZyphraAI @AMD Very exciting to see Markovian Thinking used in ZAYA1-8B. Scaling test-time compute to millions of tokens within 32K context, with strong gains from a <1B active parameter model, is exactly what we hoped this idea would enable. Congrats to the Zyphra team! x.com/MAghajohari/st…
Milad Aghajohari (@MAghajohari)

Excited to see that Markovian Thinker contributed to Zyphra's strong release 🚀. Their Markovian RSA: markovian thinking (carrying forward bounded-length reasoning tails) + RSA (recursive self-aggregation) boosted test-time compute to be on-par with larger reasoning models. 1/

0
0
1
108
Zyphra (@ZyphraAI)
Today we're releasing ZAYA1-8B, a reasoning MoE trained on @AMD and optimized for intelligence density. With <1B active params, it outperforms open-weight models many times its size on math and reasoning, closing in on DeepSeek-V3.2 and GPT-5-High with test-time compute. 🧵
102
295
2.5K
1.3M
Kamran Chitsaz retweeted
Milad Aghajohari (@MAghajohari)
Excited to see that Markovian Thinker contributed to Zyphra's strong release 🚀. Their Markovian RSA: markovian thinking (carrying forward bounded-length reasoning tails) + RSA (recursive self-aggregation) boosted test-time compute to be on-par with larger reasoning models. 1/
Zyphra (@ZyphraAI)

Today we're releasing ZAYA1-8B, a reasoning MoE trained on @AMD and optimized for intelligence density. With <1B active params, it outperforms open-weight models many times its size on math and reasoning, closing in on DeepSeek-V3.2 and GPT-5-High with test-time compute. 🧵

1
5
50
7.4K
Kamran Chitsaz retweeted
Nilaksh (@nilaksh404)
Diffusion world models can help test and improve robot policies before running them on real robots. But can the choice of latent space make the WM more faithful? We show that semantic spaces beat reconstruction spaces on task relevant metrics. hskalin.github.io/semantic-wm
5
48
218
41.1K
Kamran Chitsaz retweeted
Darshan Patil (@dapatil211)
🧬 New paper! Scientific datasets evolve as science evolves. With proteins, new sequences get added, annotations get corrected, and noisy entries get curated out. Introducing CoPeP, a continual-pretraining benchmark for protein LMs. Details 🧵 1/n
2
29
84
8.5K
Kamran Chitsaz retweeted
Chandar Lab (@ChandarLab)
Streaming Reinforcement Learning (RL) is a huge challenge: transitions are used once and discarded immediately. This makes agents extremely sample-inefficient. But what if we could "squeeze" more information out of every single frame? Check out our latest paper!
1
12
18
2.9K
Kamran Chitsaz retweeted
Chandar Lab (@ChandarLab)
‘The Markovian Thinker’, developed by our lab, has been accepted at @iclr_conf! 

This work achieved long reasoning without the quadratic attention tax. LLMs reason in chunks with a bounded state, achieving linear compute and constant memory, and scaling beyond their training limit!
1
18
66
8.8K
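As a rough illustration of the linear-versus-quadratic claim in the Markovian Thinker post above, the sketch below counts causal attention token pairs for one uninterrupted trace and for a trace split into fixed-size contexts that each carry over a bounded tail. The function names, the cost model, and the idea of measuring cost in token pairs are illustrative assumptions, not details taken from the paper.

```python
def full_attention_cost(total_tokens: int) -> int:
    """Causal attention pairs for one uninterrupted trace: n * (n + 1) / 2."""
    return total_tokens * (total_tokens + 1) // 2


def chunked_attention_cost(total_tokens: int, chunk: int, carry: int) -> int:
    """Upper bound on attention pairs when reasoning restarts in fixed contexts.

    Assumption (hypothetical cost model): every context holds `chunk` tokens,
    of which `carry` are a tail copied from the previous context, so each
    context contributes `chunk - carry` new reasoning tokens.
    """
    if not 0 <= carry < chunk:
        raise ValueError("carry must satisfy 0 <= carry < chunk")
    new_per_chunk = chunk - carry
    n_chunks = -(-total_tokens // new_per_chunk)  # ceiling division
    # Each context pays the quadratic cost of its own bounded length only.
    return n_chunks * full_attention_cost(chunk)
```

Under this model, doubling `total_tokens` doubles the chunked count (when both totals divide evenly into chunks), whereas the single-trace count roughly quadruples. Memory is bounded by `chunk` because no context ever exceeds that length.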
Kamran Chitsaz retweeted
Chandar Lab (@ChandarLab)
New work from our lab, accepted at @iclr_conf: "The Expressive Limits of Diagonal SSMs for State-Tracking". We give a complete characterization of what diagonal SSMs can and cannot compute on state-tracking tasks, and the answer is deeply connected to group theory. 🧵👇
2
13
25
4.5K
Kamran Chitsaz retweeted
Chandar Lab (@ChandarLab)
Can LLMs become CAD designers? Check out “CADmium: Fine-Tuning Code Language Models for Text-Driven Sequential CAD Design”, which is now published in Transactions on Machine Learning Research (TMLR), and led by @prashantg_17, @DavideBald42296, and @qfournier2!
1
4
15
8.1K
Kamran Chitsaz retweeted
Mila - Institut québécois d'IA
Alongside @NeurIPSConf in San Diego, the satellite conference NeurIPS Mexico City is taking place, with several Mila student-researchers taking part. Two of them presented their research today. Sahar Dastani (@sonia_dt98), PhD student at ETS/Mila, presented “TRUST: Test-Time Refinement using Uncertainty-Guided SSM Traverses” and Saba Ahmadi (@Saba_A96), affiliated researcher at UdeM/Mila, presented “The Promise of RL for Autoregressive Image Editing.” Congratulations!
0
18
54
6.3K
Kamran Chitsaz retweeted
Amir Kargaran (@amir_nlp)
With all the ICLR 2026 drama, we’re sharing some insights on the review and rebuttal process from ICLR 2025 & 2024. You might find them useful for your own rebuttal! arxiv.org/abs/2511.15462 The data of scores before and after rebuttal is also available: github.com/papercopilot/i…
1
3
34
11.8K
Kamran Chitsaz retweeted
Aarash Feizi @ ICLR 🇧🇷 (@aarashfeizi)
🚀 Announcing GroundCUA, a high-quality dataset for grounding computer-use agents. With over 3M expert annotations spanning 87 desktop apps, we use our new dataset to train state-of-the-art grounding models, namely GroundNext-3B and GroundNext-7B. 👇 Thread
5
31
89
22.4K
Kamran Chitsaz retweeted
Mohammad Pezeshki (@mpezeshki91)
We show a phase transition for optimal data curation: for strong models, concentrating on difficult samples drives further improvement (LIMO). In contrast, weaker models benefit from the conventional "More is More" approach, where broad data exposure is essential to learn core capabilities.
Elvis Dohmatob (@dohmatobelvis)

1/n "Less is More" (s1, etc.) vs "More is More": which mantra is correct for training/fine-tuning large LLMs? In our recent preprint, we reconcile both of these. They correspond to different parts of a complex phase diagram.

0
8
15
2.3K
Kamran Chitsaz retweeted
Amirhossein Kazemnejad (@a_kazemnejad)
After nearly 3 years since our NeurIPS paper, SOTA architectures are now adopting NoPE. Kimi Linear uses NoPE for all full-attention layers (not a RoPE hybrid).
Rohan Paul (@rohanpaul_ai)

The brilliant Kimi Linear paper. It's a hybrid attention that beats full attention while cutting memory by up to 75% and keeping 1M token decoding up to 6x faster. It cuts the key value cache by up to 75% and delivers up to 6x faster decoding at 1M context.
Full attention is slow because it compares every token with every other token and stores all past keys and values. Kimi Linear speeds this up by keeping a small fixed memory per head and updating it step by step like a running summary, so compute and memory stop growing with length.
Their new Kimi Delta Attention adds a per channel forget gate, which means each feature can separately decide what to keep and what to fade, so useful details remain and clutter goes away. They also add a tiny corrective update on every step, which nudges the memory toward the right mapping between keys and values instead of just piling on more data.
The model stacks 3 of these fast KDA layers then 1 full attention layer, so it still gets occasional global mixing while cutting the key value cache roughly by 75%. Full attention layers run with no positional encoding, and KDA learns order and recency itself, which simplifies the stack and helps at long ranges.
Under the hood, a chunkwise algorithm plus a constrained diagonal plus low rank design removes unstable divisions and drops several big matrix multiplies, so the kernels run much faster on GPUs.
With the same training setup, it scores higher on common tests, long context retrieval, and math reinforcement learning, while staying fast even at 1M tokens. It drops into existing systems, saves memory, scales to 1M tokens, and improves accuracy without serving changes.
Paper: arxiv.org/abs/2510.26692
Paper Title: "Kimi Linear: An Expressive, Efficient Attention Architecture"

7
34
366
52K
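The "fixed memory, per-channel forget gate, corrective update" description in the quoted Kimi Linear summary above can be sketched as a generic gated delta-rule recurrence. This is a minimal pure-Python sketch under stated assumptions, not the paper's KDA formulation or its chunkwise kernel: the function names, the state layout `(d_k, d_v)`, and the scalar step size `beta` are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # memory state, assumed shape (d_k, d_v)


def read(state: Matrix, key: Vector) -> Vector:
    """Return state^T @ key: the value the memory currently maps `key` to."""
    d_v = len(state[0])
    return [sum(key[i] * state[i][j] for i in range(len(key))) for j in range(d_v)]


def gated_delta_step(
    state: Matrix, query: Vector, key: Vector, value: Vector,
    forget: Vector, beta: float,
) -> Tuple[Matrix, Vector]:
    """One recurrent step; the state size never depends on sequence length."""
    # 1) Per-channel forget gate: key channel i decays by forget[i] in (0, 1].
    decayed = [[forget[i] * x for x in row] for i, row in enumerate(state)]
    # 2) Corrective (delta-rule) update: move the stored value for `key`
    #    toward `value` by a step of size `beta`, instead of only adding.
    predicted = read(decayed, key)
    error = [v - p for v, p in zip(value, predicted)]
    updated = [
        [decayed[i][j] + beta * key[i] * error[j] for j in range(len(value))]
        for i in range(len(key))
    ]
    # 3) Read the output for the current query from the updated memory.
    return updated, read(updated, query)
```

With `forget` set to all ones, `beta = 1.0`, and a unit-norm key, writing a new value for the same key overwrites the old one rather than accumulating, which is the behaviour the summary calls nudging the memory toward the right key-value mapping.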
Kamran Chitsaz retweeted
Divyat Mahajan (@divyat09)
[1/9] While pretraining data might be hitting a wall, novel methods for modeling it are just getting started! We introduce future summary prediction (FSP), where the model predicts future sequence embeddings to reduce teacher forcing & shortcut learning. 📌Predict a learned embedding of the future sequence, not the tokens themselves
11
46
222
60.3K
Kamran Chitsaz retweeted
Mohammad Pezeshki (@mpezeshki91)
My prediction is that next-token prediction loss will not stand the test of time, and the next frontier models will need richer loss functions. In this paper, we take a step towards that, shifting from predicting a single token to predicting a summary of the future.
Divyat Mahajan (@divyat09)

[1/9] While pretraining data might be hitting a wall, novel methods for modeling it are just getting started! We introduce future summary prediction (FSP), where the model predicts future sequence embeddings to reduce teacher forcing & shortcut learning. 📌Predict a learned embedding of the future sequence, not the tokens themselves

0
14
31
3.4K
Kamran Chitsaz retweeted
Artem Zholus (@artemZholus)
I can't attend #ICCV 2025 in Honolulu, Hawaii but my amazing teammates will be there! Please stop by our poster tomorrow 21 Oct (#438) to learn about TAPNext, a general, ViT-like architecture with SOTA point tracking quality! Links: 🌐 website: tap-next.github.io
1
3
7
1.5K
Kamran Chitsaz retweeted
Mohammad Pezeshki (@mpezeshki91)
Alleviating long context issues: Iterative Amortized Inference (IAI) refines solutions step-by-step over mini-batches, just like stochastic optimization. IAI merges:
- Scalability of stochastic opt. (SGD).
- Expressivity of forward-pass amortization (ICL in LLMs).
Sarthak Mittal (@sarthmit)

Meta on meta: thrilled to share our work on Meta-learning… at Meta! 🔥🧠 We make two major contributions: 1️⃣ Unified framework revealing insights into various amortizations 🧠 2️⃣ Greedy belief-state updates to handle long context-lengths 🚀

1
8
20
1.7K