AT
90 posts
@waterloo_intern

making models go fast @baseten studying eng @uwaterloo https://t.co/lCL6q1MBPY

San Fran · Joined October 2024
105 Following · 1.5K Followers
Pinned Tweet
AT @waterloo_intern ·
- 230 training runs
- 1,623 GPU hours (67 B200 days)
- 76 TB of training data
- a 2x faster model
Every paper said it couldn't be done. Quantization-Aware Distillation made it possible.
AT @waterloo_intern
x.com/i/article/2029…

19 replies · 107 reposts · 1.2K likes · 146.7K views
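The pinned thread names the technique but not its mechanics, so here is a minimal sketch of what quantization-aware distillation generally looks like, assuming a PyTorch setup: the student's weights are fake-quantized on the forward pass (with a straight-through estimator for gradients) while a frozen full-precision teacher supplies soft targets. Every class and function name below is illustrative, not taken from the linked article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(w, bits=4):
    # Symmetric per-tensor fake quantization with a straight-through
    # estimator: forward sees the quantized weight, backward sees identity.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (q - w).detach()

class QuantLinear(nn.Linear):
    # Linear layer that trains against its own quantization error.
    def forward(self, x):
        return F.linear(x, fake_quantize(self.weight), self.bias)

def qad_step(student, teacher, x, optimizer, tau=2.0):
    # One distillation step: frozen full-precision teacher provides soft
    # targets; the fake-quantized student matches them via temperature-scaled
    # KL divergence (scaled by tau^2 to keep gradient magnitudes consistent).
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    loss = F.kl_div(
        F.log_softmax(s_logits / tau, dim=-1),
        F.softmax(t_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key design point is that quantization noise is present during training, so the student learns weights that survive the low-bit rounding instead of being rounded after the fact (the failure mode of plain PTQ).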
AT @waterloo_intern ·
@amiruci ``` dominate them so thoroughly that the comparison looks embarrassing ``` should be our new logo
0 replies · 0 reposts · 0 likes · 22 views
Amir Haghighat @amiruci ·
We now have a product specifically created for AI labs and their closed-weight models: we'll take care of not just inference, but auth, rate limits, metering, and billing integrations. We'll take care of providing both shared and dedicated inference, compliance needs, and matching end customers' geo requirements (US, CA, EU, UK, AUS, JP, etc.). It's called Baseten Frontier Gateway and is already battle-tested by multiple AI labs, like Poolside and their impressive Laguna M.1 agentic coding model.
Amir Haghighat tweet media
8 replies · 6 reposts · 42 likes · 4.9K views
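The tweet doesn't document Frontier Gateway's actual API, so the following is purely hypothetical: every URL, header, and field is a made-up placeholder, sketched only to show the kind of concerns the product claims to absorb (auth, rate limiting, metering, geo pinning) from the caller's point of view.

```python
# Hypothetical client call; nothing here is Baseten's real interface.
import requests

resp = requests.post(
    "https://gateway.example.com/v1/chat/completions",  # placeholder URL
    headers={
        # Per-customer key: the gateway handles auth, rate limits, metering.
        "Authorization": "Bearer <end-customer-api-key>",
        # Illustrative header for pinning inference to a required region.
        "X-Geo-Requirement": "eu",
    },
    json={
        "model": "laguna-m.1",
        "messages": [{"role": "user", "content": "hi"}],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # usage fields would feed metering and billing
```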
AT @waterloo_intern ·
@modal this is a sick read...hats off to you guys
1 reply · 0 reposts · 3 likes · 221 views
Philip Kiely @philipkiely ·
Developing empathy for LLMs by doing benchmark problems by hand.
Philip Kiely tweet media
5 replies · 0 reposts · 55 likes · 2.8K views
AT @waterloo_intern ·
@edenchan solid to the power of solid squared
0 replies · 0 reposts · 1 like · 313 views
Kanjun 🐙 @kanjun ·
Twitter’s algorithm is optimized for addiction, not for us. We deserve better. We’re releasing Bouncer today so you can take back control of your feed. Describe what you don't want, and Bouncer removes it. It’s free, doesn’t collect your data, and will be open source soon.
213 replies · 295 reposts · 3.2K likes · 585.8K views
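Kanjun doesn't describe Bouncer's internals, but "describe what you don't want, and it removes it" suggests an LLM classifier run over each feed item. A minimal sketch of that shape, assuming the OpenAI Python client; the prompt, model choice, and all names are assumptions, not Bouncer's implementation.

```python
# Hypothetical "describe what you don't want" feed filter, NOT Bouncer's code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def keep(post_text: str, dont_want: str) -> bool:
    # Ask the model to judge one post against the user's natural-language rule.
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Rule: remove posts that are {dont_want}.\n"
                       f"Post: {post_text}\n"
                       f"Answer KEEP or REMOVE only.",
        }],
    )
    return out.choices[0].message.content.strip().upper() == "KEEP"

feed = ["genuinely useful systems thread", "YOU WON'T BELIEVE what happened"]
filtered = [p for p in feed if keep(p, "engagement-bait or rage-bait")]
```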
AT @waterloo_intern ·
we dug into 1-bit Bonsai with @part_harry_. the grand canyon of a gap they showed... is just THREE (3) points away from normal PTQ. and they already knew that. here's the graph (fixed)
AT tweet media
PrismML @PrismML

This scatter plot shows the Pareto frontier of intelligence vs. size, defined by models like Qwen3 0.6B, 1.7B, 4B, 8B, and Ministral3 3B. The 1-bit Bonsai family shifts that frontier dramatically to the left. This changes the tradeoff itself: models no longer have to be large to be capable.

7 replies · 4 reposts · 100 likes · 17.1K views
AT @waterloo_intern ·
@nisten @part_harry_ we used their axes to plot on their chart: their benchmarks give the intelligence scores, and the x-axis is the weight file size. this is what PrismML used
0 replies · 0 reposts · 0 likes · 333 views
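For concreteness, here is a sketch of the disputed chart type: benchmark-derived intelligence score on the y-axis, weight file size on the x-axis. The data points below are placeholders, not the real Bonsai, Qwen, or Ministral numbers; the point is how axis choice shapes the visual gap.

```python
# Illustrative recreation of an intelligence-vs-size scatter; values are fake.
import matplotlib.pyplot as plt

baseline = {"Qwen3 0.6B": (0.6, 40), "Qwen3 1.7B": (1.7, 52), "Qwen3 4B": (4.0, 62)}
bonsai = {"Bonsai (1-bit)": (0.3, 50)}  # smaller file, similar score

for name, (size_gb, score) in baseline.items():
    plt.scatter(size_gb, score, c="gray")
    plt.annotate(name, (size_gb, score))
for name, (size_gb, score) in bonsai.items():
    plt.scatter(size_gb, score, c="red")
    plt.annotate(name, (size_gb, score))

# A plain linear x-axis (the "fixed" graph) vs. a log or truncated axis
# changes how dramatic the same few-point gap looks.
plt.xlabel("weight file size (GB)")
plt.ylabel("intelligence score (benchmark avg)")
plt.show()
```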
AT @waterloo_intern ·
@HenkPoley @part_harry_ fair, the point is more that the graph was designed to make 3 points look like a generational leap
0 replies · 0 reposts · 2 likes · 266 views
Henk Poley @HenkPoley ·
@AliesTaha @part_harry_ 3 percentage points better is still quite a bit better. 🤷‍♂️ 73.8 to 76.8 is about 11% fewer errors on these tests. And given that most of these tests themselves contain errors, so a perfect score cannot be achieved, it's probably even a bit better than that.
2 replies · 0 reposts · 2 likes · 426 views
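Henk's relative-error figure checks out; a quick verification of the arithmetic:

```python
# A 3-percentage-point score gain is an ~11% relative reduction in errors.
before, after = 73.8, 76.8        # benchmark scores, in percent
err_before = 100 - before         # 26.2% errors
err_after = 100 - after           # 23.2% errors
rel_reduction = (err_before - err_after) / err_before
print(f"{rel_reduction:.1%}")     # -> 11.5% fewer errors
```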
Josh @JoshPurtell ·
@AliesTaha @part_harry_ Taking this as permission to publicly sanity test forthcoming Baseten results/research
1 reply · 0 reposts · 6 likes · 928 views
AT @waterloo_intern ·
@oneill_c whoaaaaa
0 replies · 0 reposts · 2 likes · 468 views
AT @waterloo_intern ·
@philipkiely what is inference? how does it work? @philipkiely can i come to learn (and also maybe get ice-cream)?
2 replies · 0 reposts · 5 likes · 369 views
Philip Kiely @philipkiely ·
Ice cream and books were a hit yesterday. ICYMI we're doing another, this time at the Ferry Building. Thursday 4/2 from 2-4 PM: luma.com/khxc93ju
Philip Kiely tweet media (3 images)
2 replies · 0 reposts · 28 likes · 3.2K views
AT @waterloo_intern ·
@gaoj0017 only #3, "Their experiments used single-core CPU for RaBitQ vs A100 GPU for TurboQuant," has merit as a complaint. the other 2 just don't hold
5 replies · 2 reposts · 23 likes · 8.5K views
Jianyang Gao @gaoj0017 ·
We need to publicly clarify serious issues in Google's ICLR 2026 paper TurboQuant. TurboQuant misrepresents RaBitQ in three ways:
1. Avoids acknowledging a key methodological similarity (JL transform)
2. Calls our theory "suboptimal" with no evidence
3. Reports results under unfair experimental settings
We expressed our concerns to the authors before their submission, but they chose not to fix them in their paper. The paper was accepted at ICLR 2026 and heavily promoted by Google (tens of millions of views). At that scale, uncorrected claims quickly become "consensus."
Facts:
1. RaBitQ already proves asymptotic optimality (FOCS'17 bound)
2. TurboQuant uses the same random rotation step but never states the connection
3. Their experiments used single-core CPU for RaBitQ vs A100 GPU for TurboQuant
None of this is properly disclosed. We've filed a formal complaint and posted on OpenReview (openreview.net/forum?id=tO3AS…). We'll release a detailed technical report on arXiv. Our goal is simple: keep the academic record accurate. Would appreciate people taking a look and sharing.
19 replies · 97 reposts · 1.3K likes · 99.4K views
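The "same random rotation step" Gao refers to is a standard trick in vector quantization: a JL-style random rotation spreads a vector's energy evenly across coordinates so that a crude low-bit quantizer loses far less information. A minimal sketch of the concept (an illustration only, not either paper's actual algorithm):

```python
# Rotate-then-binarize, the shared idea behind JL-transform quantizers.
import numpy as np

rng = np.random.default_rng(0)
d = 128
# Random orthogonal rotation: Q from the QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def rotate_and_binarize(x: np.ndarray) -> np.ndarray:
    # After rotation, coordinates look roughly i.i.d. Gaussian, so keeping
    # only the sign (1 bit per dimension) preserves angular information well.
    return np.sign(Q @ x).astype(np.int8)

def hamming_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Fraction of matching signs tracks the angle between the original vectors.
    return float((a == b).mean())

x = rng.standard_normal(d)
y = x + 0.1 * rng.standard_normal(d)  # a nearby vector
print(hamming_sim(rotate_and_binarize(x), rotate_and_binarize(y)))  # close to 1
```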
Jianyang Gao @gaoj0017 ·
The TurboQuant paper (ICLR 2026) contains serious issues in how it describes RaBitQ, including incorrect technical claims and misleading theory/experiment comparisons. We flagged these issues to the authors before submission. They acknowledged them, but chose not to fix them. The paper was later accepted and widely promoted by Google, reaching tens of millions of views.
We're speaking up now because once a misleading narrative spreads, it becomes much harder to correct. We've written a public comment on OpenReview (openreview.net/forum?id=tO3AS…). We would greatly appreciate your attention and help in sharing it.
Google Research @GoogleResearch

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI

98 replies · 975 reposts · 6.5K likes · 1M views
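The "at least 6x" figure isn't derived in Google's tweet, but a back-of-envelope pass shows what it implies, assuming an fp16 baseline: 16 bits / 6 is roughly 2.7 bits per cached entry, i.e. around 2-3-bit quantization plus metadata. The model shape below is illustrative, not from the blog post.

```python
# Back-of-envelope KV-cache memory for a hypothetical model configuration.
layers, kv_heads, head_dim, seq, batch = 32, 8, 128, 8192, 1
entries = 2 * layers * kv_heads * head_dim * seq * batch  # 2 = keys + values
fp16_gib = entries * 2 / 2**30                            # 2 bytes per entry
print(f"fp16 cache: {fp16_gib:.2f} GiB, after 6x: {fp16_gib / 6:.2f} GiB")
# 16 bits / 6 ~= 2.7 bits per entry: ~2-3-bit quantization plus scales/offsets.
```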
AT @waterloo_intern ·
@Phenomenon_One well, both really. accessible via simplifying it; confirming their claims with the gpu kernels. and my verdict was: not useful at current perf for latest gpus
0 replies · 0 reposts · 0 likes · 10 views
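Checking perf claims "with the gpu kernels" usually means a CUDA-event timing harness of the following shape, assuming PyTorch; the matmul here is a stand-in for whatever kernel is actually under test.

```python
# Standard CUDA-event timing harness for kernel perf claims (PyTorch assumed).
import torch

def time_kernel(fn, warmup=10, iters=100):
    for _ in range(warmup):          # warm up: JIT, caches, clocks
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
print(f"{time_kernel(lambda: a @ b):.3f} ms")  # stand-in workload
```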
Omen @Phenomenon_One ·
@AliesTaha Update after perusing: I see where the time went, deep-diving algos and testing claims on arbitrary datasets... are you trying to make it accessible or are you trying to confirm their claims?
1 reply · 0 reposts · 1 like · 19 views
AT @waterloo_intern ·
@Phenomenon_One i think it's partly due to my background being rooted in swe and not maths, but this was total time, including the time it took to write the article and make the graphs
0 replies · 0 reposts · 1 like · 196 views
Omen @Phenomenon_One ·
@AliesTaha Lmao 31 hours 😂 How long did the original researchers spend on it … Will critique after I read this.
2 replies · 0 reposts · 0 likes · 358 views