Mohammed Sabry
@M___Sabry
724 posts

PhD @AdaptCentre @dcucomputing. Making models efficient in training/inference via modularity principles. Prev: research intern @GoogleAI

Joined May 2020
597 Following · 215 Followers

Mohammed Sabry@M___Sabry·
@AliesTaha @gaoj0017 That depends on whether the methods are inherently hardware-specific. If TurboQuant is designed to be GPU-native, then that's part of its value... and if the paper doesn't attribute the gains to the algorithm alone, then there's no problem with this comparison either.
AT@AliesTaha·
@gaoj0017 Only point 3, that their experiments used a single-core CPU for RaBitQ vs. an A100 GPU for TurboQuant, has merit as a complaint; the other two just don't hold.
Jianyang Gao@gaoj0017·
The TurboQuant paper (ICLR 2026) contains serious issues in how it describes RaBitQ, including incorrect technical claims and misleading theory/experiment comparisons. We flagged these issues to the authors before submission. They acknowledged them, but chose not to fix them.

The paper was later accepted and widely promoted by Google, reaching tens of millions of views. We're speaking up now because once a misleading narrative spreads, it becomes much harder to correct.

We've written a public comment on OpenReview (openreview.net/forum?id=tO3AS…). We would greatly appreciate your attention and help in sharing it.
Google Research@GoogleResearch

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI
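For readers unfamiliar with what "KV-cache compression" means mechanically: keys/values are stored as low-bit codes and dequantized at attention time. The sketch below is a generic 4-bit round-to-nearest scheme with made-up shapes, not TurboQuant's actual algorithm (which neither tweet specifies):

```python
# Generic KV-cache quantization sketch (NOT TurboQuant's algorithm).
import numpy as np

rng = np.random.default_rng(0)
K = rng.standard_normal((1024, 128)).astype(np.float32)   # (tokens, head_dim): hypothetical KV slice

scale = np.abs(K).max(axis=0) / 7.0                       # per-channel scale, signed 4-bit range [-8, 7]
q = np.clip(np.round(K / scale), -8, 7).astype(np.int8)   # 4-bit codes (held in int8 here for simplicity)
K_hat = q * scale                                         # dequantized values used at attention time

rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
print(f"relative error: {rel_err:.4f}; memory vs fp16: {16 / 4:.0f}x smaller (plus scales)")
```

A plain 4-bit scheme like this gives 4x over fp16; headline figures like "6x with zero accuracy loss" imply something more aggressive, which is exactly where the algorithmic details under dispute matter.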

Mohammed Sabry@M___Sabry·
imo, it doesn't seem like a big problem: the paper shows that larger D helps and makes the bottleneck less harmful. Also, LLMs still train very well in practice, which may suggest that D is already large enough for language data (maybe because language is very low-rank, so the effect isn't as destructive as the norm numbers make it sound). But first I would like to see evaluations of proxies for gradient-information loss other than the norm... norm does not necessarily correspond to useful signal.
bycloud@bycloudai·
how big of a problem is this?

> When backpropagating through the LM head, about 95-99% of the logit-gradient norm lies in directions that get projected away

seems like the current workaround is just to use scaling to brute-force it
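To make the quoted number concrete, a toy sketch with assumed dimensions (not the paper's setup): for an isotropic gradient in vocab space, the squared norm surviving projection onto the head's D-dimensional column space is roughly D/V, so a 95-99% lost-norm figure mostly reflects how small D/V is and says nothing by itself about whether the lost directions carried useful signal, which is the point about needing better proxies than norm.

```python
# Toy measurement: how much gradient norm survives projection onto a
# low-rank LM head's column space. Dimensions are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
V = 8192                           # vocabulary size (scaled down for speed)
g = rng.standard_normal(V)         # stand-in for a logit gradient

for D in (128, 512, 2048):         # bottleneck widths (hypothetical)
    W = rng.standard_normal((V, D))
    Q, _ = np.linalg.qr(W)                 # orthonormal basis for range(W)
    g_kept = Q @ (Q.T @ g)                 # component the head can express
    frac = np.linalg.norm(g_kept) / np.linalg.norm(g)
    print(f"D={D:5d}: norm kept {frac:.3f} (lost {1 - frac:.0%}), sqrt(D/V) = {np.sqrt(D / V):.3f}")
```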
Mohammed Sabry@M___Sabry·
@elhllos Pretraining is only one part; most of the GPUs/TPUs (compute centers) at frontier labs are allocated to inference workloads.
elhllos@elhllos·
This project is the Bitcoin of the AI world. The idea of decentralized training of AI models is genius and will give open source a serious push. The next bet is on the 400B: if they can train a model at that size, we can say that all the billions companies like Google and OpenAI are spending on data centers will have been thrown away.
templar@tplr_ai

We just completed the largest decentralised LLM pre-training run in history: Covenant-72B. Permissionless, on Bittensor subnet 3. 72B parameters. ~1.1T tokens. Commodity internet. No centralized cluster. No whitelist. Anyone with GPUs could join or leave freely. 1/n

Mohammed Sabry retweeted
Wonder of Science@wonderofscience·
How giraffes drink.
Mohammed Sabry@M___Sabry·
The problem isn’t treating scaling curves like physics; it’s declaring regime changes without confidence intervals on the fit...
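A minimal sketch of the check being asked for, on synthetic data with hypothetical numbers: bootstrap a confidence interval on a fitted power-law exponent; a "regime change" claim should at least survive non-overlapping intervals between the two fitted segments.

```python
# Bootstrap a CI on the exponent of a power-law fit L(C) = a * C^b.
# All data here is synthetic; real scaling-curve points would replace it.
import numpy as np

rng = np.random.default_rng(0)
C = np.logspace(18, 24, 12)                                 # compute budgets (FLOPs), synthetic
L = 2.0 * C**-0.05 * np.exp(rng.normal(0, 0.01, C.size))    # noisy power law

logC, logL = np.log(C), np.log(L)

def fit_exponent(idx):
    b, _ = np.polyfit(logC[idx], logL[idx], 1)              # slope in log-log space = exponent
    return b

boots = [fit_exponent(rng.integers(0, C.size, C.size)) for _ in range(2000)]
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"exponent 95% CI: [{lo:.4f}, {hi:.4f}]")             # compare CIs across segments before declaring a break
```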
Mohammed Sabry retweeted
signüll@signulll·
most ppl don’t really think. they rearrange cached thoughts until they feel smart. true thinking is rare because it’s metabolically expensive (just like thinking models are computationally expensive).
Mohammed Sabry retweeted
Andrew Gordon Wilson@andrewgwils·
Continual learning as a discipline seems to have catastrophically forgotten that it has been focused on catastrophic forgetting for a decade with virtually no progress. Time for some radically new ideas in that area.
Mohammed Sabry retweeted
Damek@damekdavis·
New paper studies when spectral gradient methods (e.g., Muon) help in deep learning:
1. We identify a pervasive form of ill-conditioning in DL: post-activation matrices have low stable rank.
2. We then explain why spectral methods can perform well despite this.
Long thread
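Stable rank is srank(A) = ||A||_F^2 / ||A||_2^2. A toy illustration (random ReLU activations, not the paper's measurements) of how easily post-activation matrices end up with low stable rank; the nonzero mean that ReLU induces creates one dominant singular direction:

```python
# Stable rank of a toy ReLU post-activation matrix.
import numpy as np

rng = np.random.default_rng(0)
A = np.maximum(rng.standard_normal((512, 512)), 0)    # ReLU of random pre-activations

s = np.linalg.svd(A, compute_uv=False)                # singular values, descending
stable_rank = (s**2).sum() / s[0] ** 2                # ||A||_F^2 / ||A||_2^2
print(f"stable rank ≈ {stable_rank:.1f} of {min(A.shape)}")   # ≈ 3 out of 512: heavily ill-conditioned
```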
Mohammed Sabry retweeted
Yann LeCun@ylecun·
@giffmana Something something cake something something cherry.
Mohammed Sabry@M___Sabry·
Regardless of what direction the field takes, cutting-edge deep learning research will continue to require huge compute. The work is still fundamentally empirical: every promising idea demands careful ablations, stress-tests, and re-runs — even if experience, intuition, and taste help prune away many of the unnecessary experiments.
Mohammed Sabry@M___Sabry·
Yes... big labs love scaling laws because they offer predictability, which makes strategic and financial planning much easier. Once an org builds entire structures around the "scaling stack" (pre-training teams, post-training teams, eval teams, infra teams), it becomes harder to pivot toward radically different paradigms unless there's a significant external shock.
Dwarkesh Patel@dwarkesh_sp

The @ilyasut episode
0:00:00 – Explaining model jaggedness
0:09:39 – Emotions and value functions
0:18:49 – What are we scaling?
0:25:13 – Why humans generalize better than models
0:35:45 – Straight-shotting superintelligence
0:46:47 – SSI's model will learn from deployment
0:55:07 – Alignment
1:18:13 – "We are squarely an age of research company"
1:29:23 – Self-play and multi-agent
1:32:42 – Research taste
Look up Dwarkesh Podcast on YouTube, Apple Podcasts, or Spotify. Enjoy!

Mohammed Sabry retweeted
bidhan@bidhan·
anyone who is convinced that the scaling era is over and wants to sell me their gpus, pls dm
Dwarkesh Patel@dwarkesh_sp

The @ilyasut episode
0:00:00 – Explaining model jaggedness
0:09:39 – Emotions and value functions
0:18:49 – What are we scaling?
0:25:13 – Why humans generalize better than models
0:35:45 – Straight-shotting superintelligence
0:46:47 – SSI's model will learn from deployment
0:55:07 – Alignment
1:18:13 – "We are squarely an age of research company"
1:29:23 – Self-play and multi-agent
1:32:42 – Research taste
Look up Dwarkesh Podcast on YouTube, Apple Podcasts, or Spotify. Enjoy!

Mohammed Sabry@M___Sabry·
Rage-baiting activated.... 🫢
Mohammed Sabry retweeted
Ahmad Beirami@abeirami·
Even in the NeurIPS experiment a decade ago, the bottom 25% of papers had a 5% chance of acceptance whereas the top 25% had a 50% chance. Review scores are noisy! With the recent flood of papers and the decline of review quality, the correlation between merit and acceptance is going to 0 fast. Can someone estimate the convergence rate?
Micah Goldblum@micahgoldblum

An LLM-generated paper is in the top 17% of ICLR submissions in terms of average reviewer score, having received two 8's. The paper has tons of BS jargon and hallucinated references. Fortunately, one reviewer actually looked at the paper and gave it a zero. 1/3
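A back-of-the-envelope Monte Carlo of the numbers in the tweet above (a toy generative model assumed here, not Beirami's analysis): latent merit plus independent review noise, accept the top quarter by average score. With per-review noise around 2.6 merit-SDs and three reviews, the quartile acceptance rates land near the 5% / 50% figures, and the merit-acceptance correlation degrades as noise grows:

```python
# Toy simulation: how noisy reviews decouple acceptance from merit.
import numpy as np

rng = np.random.default_rng(0)
n_papers, n_reviews, sigma = 10_000, 3, 2.6        # per-review noise std: a hypothetical choice

merit = rng.standard_normal(n_papers)                                # latent paper quality
avg = (merit[:, None] + rng.normal(0, sigma, (n_papers, n_reviews))).mean(axis=1)
accepted = avg >= np.quantile(avg, 0.75)                             # accept top ~25% by average score

quartile = np.digitize(merit, np.quantile(merit, [0.25, 0.5, 0.75]))
for i, label in enumerate(("bottom 25%", "25-50%", "50-75%", "top 25%")):
    print(f"{label}: P(accept) ≈ {accepted[quartile == i].mean():.2f}")
```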

Mohammed Sabry retweeted
Peter Richtarik@peter_richtarik·
I am an AC for ICLR 2026. One of the papers in my batch was just withdrawn. The authors wrote a brief response explaining why the reviewers failed at their job. I agree with most of their comments. The authors gave up. They are fed up. Just like many of us. I understand. We pretend the emperor has clothes, but he is naked.

Here is the final part of their withdrawal notice. I took the liberty of making it public, to highlight that what we have been doing with AI conference reviews these last few years is, basically, madness.

---

Comment: We thank the reviewers for their time. However, upon reading the reviews for our paper, it became immediately apparent that the four "reject" ratings are not based on good-faith academic disagreement, but on a critical failure to read the submitted paper. The reviews are rife with demonstrably false claims that are directly contradicted by the text. The core justifications for rejection rely on asserting that key components are "missing" when they are explicitly detailed in the manuscript. Some specific examples (several are outright fabricated claims):

Claim: Harder tasks like GSM8K are missing.
Fact: GSM8K results appear in many tables, e.g. Table 2 (Section 4.2) and Appendix G.

Claim: The method does not use per-layer ranks.
Fact: This is the entire point of our method; the reviewer clearly mistook our method for the baselines (Section 2, Table 1).

Claim: The GP kernel is not specified.
Fact: It is specified in Appendix E (Table 6).

Claim: There is no ablation of the method's three stages.
Fact: Section 4.4 ("Ablation Study") and Appendix J are dedicated to this.

Reviewers have a fundamental responsibility to read and evaluate the work they are assigned. The nature of these errors is so fundamental, so systemic in overlooking explicit content, that it goes far beyond what "limited time" or "oversight" can explain. This work has gone through several rounds of revision over the last year; in earlier submissions, the paper usually received borderline or weak-accept scores. Numerous signs strongly suggest that some reviewers are relying entirely on AI tools to automatically generate peer reviews, rather than fulfilling their fundamental responsibility of personally reading and evaluating manuscripts.

We strongly protest this. It is a gross disrespect to the authors, a flagrant desecration of the reviewer's duty, and it fundamentally undermines the integrity of the entire peer-review process. Given that the reviews are not based on the actual content of our paper, we have decided to withdraw the submission. We leave this comment so that future readers of the OpenReview page are aware that the items described as "missing" are already present in the submitted manuscript. These negative reviews are factually unsound and do not reflect the content of the paper. We cannot and will not accept an assessment that is not based on the work we actually submitted.
Mohammed Sabry retweeted
Shuaichen Chang@ShuaichenChang·
ICLR reviews are out today. I've seen both authors and reviewers, including myself, frustrated by the poor quality of certain papers and reviews. Here are my thoughts.

As people often say, 𝗛𝗮𝗽𝗽𝗶𝗻𝗲𝘀𝘀 = 𝗥𝗲𝗮𝗹𝗶𝘁𝘆 / 𝗘𝘅𝗽𝗲𝗰𝘁𝗮𝘁𝗶𝗼𝗻𝘀. I think much of the frustration comes from our expectations about the distribution of paper and review quality. Paper quality does 𝗡𝗢𝗧 follow a normal distribution, while we often aim to assign scores that do. The lower bound extends a lot farther from the mean than the upper bound. The same applies to review quality. But because we expect a normal distribution, we also expect to encounter low-quality papers or reviews only rarely, which is not the reality. This mistaken assumption of normal distributions appears in many areas of life, as I learned recently from Taleb's The Black Swan.

My suggestion: we should recalibrate our expectations. Celebrate the completion of a 𝘀𝗼𝗹𝗶𝗱 project, rather than its recognition by a few peers. The true milestones in science are: (1) discovering or inventing something, and (2) having those findings make an impact on the community or society by being adopted by others. Paper reviews and acceptance are usually just steps along the way, though good feedback can certainly help refine our work.

At the same time, our community needs better mechanisms to reduce low-effort submissions and reviews, so we can bring back some joy to the conference cycle.