Mohammed Sabry
@M___Sabry
724 posts

PhD @AdaptCentre @dcucomputing. Making models efficient in training/inference via modularity principles. Prev: research intern @GoogleAI

Joined May 2020
597 Following · 215 Followers

Mohammed Sabry@M___Sabry·
@AliesTaha @gaoj0017 That depends on whether the methods are inherently hardware-specific. If TurboQuant is designed to be GPU-native, then that's part of its value... and if the paper doesn't attribute the gains to the algorithm alone, then there's no problem with this comparison either.
AT@AliesTaha·
@gaoj0017 Only point 3, that their experiments used a single-core CPU for RaBitQ vs. an A100 GPU for TurboQuant, has merit as a complaint; the other two just don't hold.
Jianyang Gao@gaoj0017·
The TurboQuant paper (ICLR 2026) contains serious issues in how it describes RaBitQ, including incorrect technical claims and misleading theory/experiment comparisons. We flagged these issues to the authors before submission. They acknowledged them, but chose not to fix them.

The paper was later accepted and widely promoted by Google, reaching tens of millions of views. We're speaking up now because once a misleading narrative spreads, it becomes much harder to correct.

We've written a public comment on OpenReview (openreview.net/forum?id=tO3AS…). We would greatly appreciate your attention and help in sharing it.
Google Research@GoogleResearch

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI
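For readers unfamiliar with what "KV-cache compression" means mechanically: keys/values are stored as low-bit codes and dequantized at attention time. The sketch below is a generic 4-bit round-to-nearest scheme with made-up shapes, not TurboQuant's actual algorithm (which neither tweet specifies):

```python
# Generic KV-cache quantization sketch (NOT TurboQuant's algorithm).
import numpy as np

rng = np.random.default_rng(0)
K = rng.standard_normal((1024, 128)).astype(np.float32)   # (tokens, head_dim): hypothetical KV slice

scale = np.abs(K).max(axis=0) / 7.0                       # per-channel scale, signed 4-bit range [-8, 7]
q = np.clip(np.round(K / scale), -8, 7).astype(np.int8)   # 4-bit codes (held in int8 here for simplicity)
K_hat = q * scale                                         # dequantized values used at attention time

rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
print(f"relative error: {rel_err:.4f}; memory vs fp16: {16 / 4:.0f}x smaller (plus scales)")
```

A plain 4-bit scheme like this gives 4x over fp16; headline figures like "6x with zero accuracy loss" imply something more aggressive, which is exactly where the algorithmic details under dispute matter.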

Mohammed Sabry@M___Sabry·
imo, it doesn't seem like a big problem: the paper shows that larger D helps and makes the bottleneck less harmful. Also, LLMs still train very well in practice, which may suggest that D is already large enough for language data (maybe because language is very low-rank, so the effect isn't as destructive as the norm numbers make it sound). But first I would like to see evaluations of proxies for gradient-information loss other than the norm... norm does not necessarily correspond to useful signal.
bycloud@bycloudai·
how big of a problem is this?

> When backpropagating through the LM head, about 95-99% of the logit-gradient norm lies in directions that get projected away

seems like the current workaround is just to use scaling to brute-force it
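To make the quoted number concrete, a toy sketch with assumed dimensions (not the paper's setup): for an isotropic gradient in vocab space, the squared norm surviving projection onto the head's D-dimensional column space is roughly D/V, so a 95-99% lost-norm figure mostly reflects how small D/V is and says nothing by itself about whether the lost directions carried useful signal, which is the point about needing better proxies than norm.

```python
# Toy measurement: how much gradient norm survives projection onto a
# low-rank LM head's column space. Dimensions are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
V = 8192                           # vocabulary size (scaled down for speed)
g = rng.standard_normal(V)         # stand-in for a logit gradient

for D in (128, 512, 2048):         # bottleneck widths (hypothetical)
    W = rng.standard_normal((V, D))
    Q, _ = np.linalg.qr(W)                 # orthonormal basis for range(W)
    g_kept = Q @ (Q.T @ g)                 # component the head can express
    frac = np.linalg.norm(g_kept) / np.linalg.norm(g)
    print(f"D={D:5d}: norm kept {frac:.3f} (lost {1 - frac:.0%}), sqrt(D/V) = {np.sqrt(D / V):.3f}")
```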
Mohammed Sabry@M___Sabry·
@elhllos Pretraining is only one part; most of the GPUs/TPUs (compute centers) at frontier labs are allocated to inference workloads.
elhllos@elhllos·
This project is the Bitcoin of the AI world. The idea of decentralized training of AI models is genius and will give open source a serious push. The next bet is on the 400B: if they can train a model at that size, we can say that all the billions companies like Google and OpenAI are spending on data centers will have been thrown away.
templar@tplr_ai

We just completed the largest decentralised LLM pre-training run in history: Covenant-72B. Permissionless, on Bittensor subnet 3. 72B parameters. ~1.1T tokens. Commodity internet. No centralized cluster. No whitelist. Anyone with GPUs could join or leave freely. 1/n

Mohammed Sabry retweeted
Wonder of Science@wonderofscience·
How giraffes drink.
Mohammed Sabry@M___Sabry·
The problem isn’t treating scaling curves like physics; it’s declaring regime changes without confidence intervals on the fit...
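A minimal sketch of the check being asked for, on synthetic data with hypothetical numbers: bootstrap a confidence interval on a fitted power-law exponent; a "regime change" claim should at least survive non-overlapping intervals between the two fitted segments.

```python
# Bootstrap a CI on the exponent of a power-law fit L(C) = a * C^b.
# All data here is synthetic; real scaling-curve points would replace it.
import numpy as np

rng = np.random.default_rng(0)
C = np.logspace(18, 24, 12)                                 # compute budgets (FLOPs), synthetic
L = 2.0 * C**-0.05 * np.exp(rng.normal(0, 0.01, C.size))    # noisy power law

logC, logL = np.log(C), np.log(L)

def fit_exponent(idx):
    b, _ = np.polyfit(logC[idx], logL[idx], 1)              # slope in log-log space = exponent
    return b

boots = [fit_exponent(rng.integers(0, C.size, C.size)) for _ in range(2000)]
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"exponent 95% CI: [{lo:.4f}, {hi:.4f}]")             # compare CIs across segments before declaring a break
```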
Mohammed Sabry retweeted
signüll@signulll·
most ppl don’t really think. they rearrange cached thoughts until they feel smart. true thinking is rare because it’s metabolically expensive (just like thinking models are computationally expensive).
Mohammed Sabry retweeted
Andrew Gordon Wilson@andrewgwils·
Continual learning as a discipline seems to have catastrophically forgotten that it has been focused on catastrophic forgetting for a decade with virtually no progress. Time for some radically new ideas in that area.
Mohammed Sabry retweeted
Damek@damekdavis·
New paper studies when spectral gradient methods (e.g., Muon) help in deep learning:
1. We identify a pervasive form of ill-conditioning in DL: post-activation matrices have low stable rank.
2. We then explain why spectral methods can perform well despite this.
Long thread
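Stable rank is srank(A) = ||A||_F^2 / ||A||_2^2. A toy illustration (random ReLU activations, not the paper's measurements) of how easily post-activation matrices end up with low stable rank; the nonzero mean that ReLU induces creates one dominant singular direction:

```python
# Stable rank of a toy ReLU post-activation matrix.
import numpy as np

rng = np.random.default_rng(0)
A = np.maximum(rng.standard_normal((512, 512)), 0)    # ReLU of random pre-activations

s = np.linalg.svd(A, compute_uv=False)                # singular values, descending
stable_rank = (s**2).sum() / s[0] ** 2                # ||A||_F^2 / ||A||_2^2
print(f"stable rank ≈ {stable_rank:.1f} of {min(A.shape)}")   # ≈ 3 out of 512: heavily ill-conditioned
```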
Mohammed Sabry retweeted
Yann LeCun@ylecun·
@giffmana Something something cake something something cherry.
Mohammed Sabry@M___Sabry·
Regardless of what direction the field takes, cutting-edge deep learning research will continue to require huge compute. The work is still fundamentally empirical: every promising idea demands careful ablations, stress-tests, and re-runs — even if experience, intuition, and taste help prune away many of the unnecessary experiments.
Mohammed Sabry@M___Sabry·
Yes... big labs love scaling laws because they offer predictability, which makes strategic and financial planning much easier. Once an org builds entire structures around the "scaling stack" (pre-training teams, post-training teams, eval teams, infra teams), it becomes harder to pivot toward radically different paradigms unless there's a significant external shock.
Dwarkesh Patel@dwarkesh_sp

The @ilyasut episode
0:00:00 – Explaining model jaggedness
0:09:39 – Emotions and value functions
0:18:49 – What are we scaling?
0:25:13 – Why humans generalize better than models
0:35:45 – Straight-shotting superintelligence
0:46:47 – SSI's model will learn from deployment
0:55:07 – Alignment
1:18:13 – "We are squarely an age of research company"
1:29:23 – Self-play and multi-agent
1:32:42 – Research taste
Look up Dwarkesh Podcast on YouTube, Apple Podcasts, or Spotify. Enjoy!

Mohammed Sabry retweeted
bidhan@bidhan·
anyone who is convinced that the scaling era is over and wants to sell me their gpus, pls dm
Dwarkesh Patel@dwarkesh_sp

The @ilyasut episode
0:00:00 – Explaining model jaggedness
0:09:39 – Emotions and value functions
0:18:49 – What are we scaling?
0:25:13 – Why humans generalize better than models
0:35:45 – Straight-shotting superintelligence
0:46:47 – SSI's model will learn from deployment
0:55:07 – Alignment
1:18:13 – "We are squarely an age of research company"
1:29:23 – Self-play and multi-agent
1:32:42 – Research taste
Look up Dwarkesh Podcast on YouTube, Apple Podcasts, or Spotify. Enjoy!

Mohammed Sabry@M___Sabry·
Rage-baiting activated.... 🫢
Mohammed Sabry retweeted
Ahmad Beirami@abeirami·
Even in the NeurIPS experiment a decade ago, the bottom 25% of papers had a 5% chance of acceptance whereas the top 25% had a 50% chance. Review scores are noisy! With the recent flood of papers and the decline of review quality, the correlation between merit and acceptance is going to 0 fast. Can someone estimate the convergence rate?
Micah Goldblum@micahgoldblum

An LLM-generated paper is in the top 17% of ICLR submissions in terms of average reviewer score, having received two 8's. The paper has tons of BS jargon and hallucinated references. Fortunately, one reviewer actually looked at the paper and gave it a zero. 1/3
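A back-of-the-envelope Monte Carlo of the numbers in the tweet above (a toy generative model assumed here, not Beirami's analysis): latent merit plus independent review noise, accept the top quarter by average score. With per-review noise around 2.6 merit-SDs and three reviews, the quartile acceptance rates land near the 5% / 50% figures, and the merit-acceptance correlation degrades as noise grows:

```python
# Toy simulation: how noisy reviews decouple acceptance from merit.
import numpy as np

rng = np.random.default_rng(0)
n_papers, n_reviews, sigma = 10_000, 3, 2.6        # per-review noise std: a hypothetical choice

merit = rng.standard_normal(n_papers)                                # latent paper quality
avg = (merit[:, None] + rng.normal(0, sigma, (n_papers, n_reviews))).mean(axis=1)
accepted = avg >= np.quantile(avg, 0.75)                             # accept top ~25% by average score

quartile = np.digitize(merit, np.quantile(merit, [0.25, 0.5, 0.75]))
for i, label in enumerate(("bottom 25%", "25-50%", "50-75%", "top 25%")):
    print(f"{label}: P(accept) ≈ {accepted[quartile == i].mean():.2f}")
```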

Mohammed Sabry retweeted
Peter Richtarik@peter_richtarik·
I am an AC for ICLR 2026. One of the papers in my batch was just withdrawn. The authors wrote a brief response explaining why the reviewers failed at their job. I agree with most of their comments. The authors gave up. They are fed up. Just like many of us. I understand. We pretend the emperor has clothes, but he is naked.

Here is the final part of their withdrawal notice. I took the liberty of making it public, to highlight that what we have been doing with AI conference reviews these last few years is, basically, madness.

---

Comment: We thank the reviewers for their time. However, upon reading the reviews for our paper, it became immediately apparent that the four "reject" ratings are not based on good-faith academic disagreement, but on a critical failure to read the submitted paper. The reviews are rife with demonstrably false claims that are directly contradicted by the text. The core justifications for rejection rely on asserting that key components are "missing" when they are explicitly detailed in the manuscript. Some specific examples (several are outright fabricated claims):

Claim: Harder tasks like GSM8K are missing.
Fact: GSM8K results appear in many tables, e.g. Table 2 (Section 4.2) and Appendix G.

Claim: The method does not use per-layer ranks.
Fact: This is the entire point of our method; the reviewer clearly mistook our method for the baselines (Section 2, Table 1).

Claim: The GP kernel is not specified.
Fact: It is specified in Appendix E (Table 6).

Claim: There is no ablation of the method's three stages.
Fact: Section 4.4 ("Ablation Study") and Appendix J are dedicated to this.

Reviewers have a fundamental responsibility to read and evaluate the work they are assigned. The nature of these errors is so fundamental, so systemic in overlooking explicit content, that it goes far beyond what "limited time" or "oversight" can explain. This work has gone through several rounds of revision over the last year; in earlier submissions, the paper usually received borderline or weak-accept scores. Numerous signs strongly suggest that some reviewers are relying entirely on AI tools to automatically generate peer reviews, rather than fulfilling their fundamental responsibility of personally reading and evaluating manuscripts.

We strongly protest this. It is a gross disrespect to the authors, a flagrant desecration of the reviewer's duty, and it fundamentally undermines the integrity of the entire peer-review process. Given that the reviews are not based on the actual content of our paper, we have decided to withdraw the submission. We leave this comment so that future readers of the OpenReview page are aware that the items described as "missing" are already present in the submitted manuscript. These negative reviews are factually unsound and do not reflect the content of the paper. We cannot and will not accept an assessment that is not based on the work we actually submitted.
Mohammed Sabry retweeted
Shuaichen Chang@ShuaichenChang·
ICLR reviews are out today. I've seen both authors and reviewers, including myself, frustrated by the poor quality of certain papers and reviews. Here are my thoughts.

As people often say, 𝗛𝗮𝗽𝗽𝗶𝗻𝗲𝘀𝘀 = 𝗥𝗲𝗮𝗹𝗶𝘁𝘆 / 𝗘𝘅𝗽𝗲𝗰𝘁𝗮𝘁𝗶𝗼𝗻𝘀. I think much of the frustration comes from our expectations about the distribution of paper and review quality. Paper quality does 𝗡𝗢𝗧 follow a normal distribution, while we often aim to assign scores that do. The lower bound extends a lot farther from the mean than the upper bound. The same applies to review quality. But because we expect a normal distribution, we also expect to encounter low-quality papers or reviews only rarely, which is not the reality. This mistaken assumption of normal distributions appears in many areas of life, as I learned recently from Taleb's The Black Swan.

My suggestion: we should recalibrate our expectations. Celebrate the completion of a 𝘀𝗼𝗹𝗶𝗱 project, rather than its recognition by a few peers. The true milestones in science are: (1) discovering or inventing something, and (2) having those findings make an impact on the community or society by being adopted by others. Paper reviews and acceptance are usually just steps along the way, though good feedback can certainly help refine our work.

At the same time, our community needs better mechanisms to reduce low-effort submissions and reviews, so we can bring back some joy to the conference cycle.