Robert Washbourne (@rawsh0) - Twitter Profile

Pinned Tweet

new model! strong <1B active MoE
led data and posttraining for this release. cca goat @rishiiyer01 and the pretraining squad cooked
x.com/ZyphraAI/statu…

Zyphra@ZyphraAI

Today we're releasing ZAYA1-8B, a reasoning MoE trained on @AMD and optimized for intelligence density. With <1B active params, it outperforms open-weight models many times its size on math and reasoning, closing in on DeepSeek-V3.2 and GPT-5-High with test-time compute. 🧵

8

12

73

5.5K

Robert Washbourne@rawsh0·17 May

@MatternJustus @jyangballin @rishiiyer01 @evan_j_chu thanks for having us!

0

5

155

Robert Washbourne retweeted

Justus Mattern@MatternJustus·16 May

This went surprisingly well for our first event - heard great talks and had very interesting conversations about post-training and evals! A special thanks to our speakers @jyangballin, @rawsh0, @rishiiyer01 and @evan_j_chu, and looking forward to the next one :)

Justus Mattern@MatternJustus

Hosting a research meetup in our North Beach office on Thursday! Come by for food, drinks and talks:
@jyangballin (MSL) will present ProgramBench
@rawsh0 & @rishiiyer01 (Zyphra) will talk about ZAYA-8B
@evan_j_chu and I will speak FrontierSWE and our research bets!

3

5

84

8.2K

Robert Washbourne@rawsh0·15 May

@rishiiyer01 @JZWANG_T1 also cooking 🔥 we're hyped to RL this one

0

1

21

Robert Washbourne@rawsh0·15 May

@rishiiyer01 cooked here. very excited about scaling ttc with diffusion - sparse active params + diffusion decode means reasoning models can punch above their weight class with competitive latency x.com/rishiiyer01/st…

rishi@rishiiyer01

Leading the training for this model was a privilege. Training diffusion style models will be the future regardless of whether it is discrete/speculative or continuous.

1

16

849

Robert Washbourne@rawsh0·15 May

@LLMenjoyer 💎 alert

0

3

94

llm_enjoyer@LLMenjoyer·15 May

i remember when this model was js homie’s schizo project, he literally took it from 0 to 100. proud of u homie 😭😭

rishi@rishiiyer01

Leading the training for this model was a privilege. Training diffusion style models will be the future regardless of whether it is discrete/speculative or continuous.

3

1

27

1.7K

Robert Washbourne@rawsh0·15 May

Linked the wrong thread ngl x.com/ZyphraAI/statu…

Zyphra@ZyphraAI

We present ZAYA1-8B-Diffusion-Preview, the first diffusion language model trained on @AMD. Autoregressive LLMs generate one token at a time; diffusion generates a block in parallel, speeding up inference. We show a 4.6-7.7x decoding speedup with minimal quality degradation 🧵

0

1

200

Robert Washbourne@rawsh0·15 May

very excited about scaling ttc with diffusion. sparse active params + diffusion decode means reasoning models can punch above their weight class with competitive latency x.com/ZyphraAI/statu…

Zyphra@ZyphraAI

Today we're releasing ZAYA1-8B, a reasoning MoE trained on @AMD and optimized for intelligence density. With <1B active params, it outperforms open-weight models many times its size on math and reasoning, closing in on DeepSeek-V3.2 and GPT-5-High with test-time compute. 🧵

3

4

45

2.9K

Robert Washbourne retweeted

rishi@rishiiyer01·15 May

Leading the training for this model was a privilege. Training diffusion style models will be the future regardless of whether it is discrete/speculative or continuous.

Zyphra@ZyphraAI

We present ZAYA1-8B-Diffusion-Preview, the first diffusion language model trained on @AMD. Autoregressive LLMs generate one token at a time; diffusion generates a block in parallel, speeding up inference. We show a 4.6-7.7x decoding speedup with minimal quality degradation 🧵

7

8

78

7.7K

Robert Washbourne retweeted

Zyphra@ZyphraAI·15 May

We present ZAYA1-8B-Diffusion-Preview, the first diffusion language model trained on @AMD. Autoregressive LLMs generate one token at a time; diffusion generates a block in parallel, speeding up inference. We show a 4.6-7.7x decoding speedup with minimal quality degradation 🧵

22

87

693

1.1M
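
The contrast in the tweet above (one token per step versus a block generated in parallel) can be illustrated by counting sequential forward passes. This is only a back-of-the-envelope sketch: `block_size` and `denoise_steps` are hypothetical parameters not given in the source, every pass is assumed to cost the same, and it is not Zyphra's decoding procedure, nor does it attempt to reproduce the reported 4.6-7.7x figure.

```python
import math


def autoregressive_steps(num_tokens: int) -> int:
    """Sequential passes when decoding one token per forward pass."""
    return num_tokens


def block_diffusion_steps(num_tokens: int, block_size: int, denoise_steps: int) -> int:
    """Sequential passes when each block of tokens is refined in parallel.

    Assumption: every block needs the same number of denoising passes.
    """
    if block_size < 1 or denoise_steps < 1:
        raise ValueError("block_size and denoise_steps must be >= 1")
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * denoise_steps


def ideal_step_speedup(num_tokens: int, block_size: int, denoise_steps: int) -> float:
    """Ratio of sequential passes, ignoring any difference in per-pass cost."""
    return autoregressive_steps(num_tokens) / block_diffusion_steps(
        num_tokens, block_size, denoise_steps
    )


if __name__ == "__main__":
    # Hypothetical setting: 1024 tokens, blocks of 16, 4 passes per block.
    print(ideal_step_speedup(1024, 16, 4))
```

Under these assumptions the step ratio is roughly `block_size / denoise_steps`, so the gain shrinks as more refinement passes are needed per block.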

Robert Washbourne@rawsh0·14 May

> this chat was flagged for possible cybersecurity risk
5.5 in codex would be goated if this didn't pop up every 5 seconds

0

3

270

Robert Washbourne@rawsh0·14 May

@NousResearch congrats! super cool

0

66

Nous Research@NousResearch·13 May

Today we release Token Superposition Training (TST), a modification to the standard LLM pretraining loop that produces a 2-3× wall-clock speedup at matched FLOPs without changing the model architecture, optimizer, tokenizer, or training data. During the first third of training, the model reads and predicts contiguous bags of tokens, averaging their embeddings on the input side and predicting the next bag with a modified cross-entropy on the output side. For the remainder of the run, it trains normally on next-token prediction. The inference-time model is identical to one produced by conventional pretraining. Validated at 270M, 600M, and 3B dense scales, and at 10B-A1B MoE. The work on TST was led by @bloc97_, @gigant_theo, and @theemozilla.

150

419

3.7K

443.4K
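
The input side described in the TST announcement above (averaging the embeddings of contiguous bags of tokens) might look roughly like the sketch below. The function name, the `bag_size` parameter, and the choice to drop a trailing partial bag are assumptions made for illustration; the modified cross-entropy on the output side is not specified in the tweet and is not shown.

```python
from typing import List

Vector = List[float]


def average_bags(embeddings: List[Vector], bag_size: int) -> List[Vector]:
    """Average each contiguous bag of `bag_size` token embeddings into one vector."""
    if bag_size < 1:
        raise ValueError("bag_size must be >= 1")
    # Assumption: tokens that do not fill a complete bag are dropped.
    usable = len(embeddings) - len(embeddings) % bag_size
    bagged: List[Vector] = []
    for start in range(0, usable, bag_size):
        bag = embeddings[start:start + bag_size]
        dim = len(bag[0])
        # Element-wise mean across the vectors in this bag.
        bagged.append([sum(vec[i] for vec in bag) / bag_size for i in range(dim)])
    return bagged


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(average_bags(tokens, 2))
```

With `bag_size=1` the function returns the embeddings unchanged, which corresponds to ordinary next-token input.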

Robert Washbourne@rawsh0·14 May

@LLMenjoyer whatever could you be referring to I have no idea

1

0

54

llm_enjoyer@LLMenjoyer·14 May

@rawsh0 when bishi drops diffusion ill prove it isn't

1

0

50

Robert Washbourne@rawsh0·13 May

> new hf discussion on your model page titled “reproducibility” 😬
> they get significantly higher results than you reported 👍

3

1

29

2K

Robert Washbourne@rawsh0·14 May

@LLMenjoyer this is _______ motivated

1

0

1

97

llm_enjoyer@LLMenjoyer·13 May

@rawsh0 when u benchmaxx so hard even u have trouble measuring the amt of benchmaxxing

1

0

5

197

Robert Washbourne@rawsh0·13 May

huggingface.co/Zyphra/ZAYA1-8…

0

2

200

Robert Washbourne retweeted

Milad Aghajohari@MAghajohari·12 May

Excited to see that Markovian Thinker contributed to Zyphra's strong release 🚀. Their Markovian RSA: markovian thinking (carrying forward bounded-length reasoning tails) + RSA (recursive self-aggregation) boosted test-time compute to be on-par with larger reasoning models. 1/

Zyphra@ZyphraAI

Today we're releasing ZAYA1-8B, a reasoning MoE trained on @AMD and optimized for intelligence density. With <1B active params, it outperforms open-weight models many times its size on math and reasoning, closing in on DeepSeek-V3.2 and GPT-5-High with test-time compute. 🧵

1

5

50

7.4K

Robert Washbourne retweeted

Justus Mattern@MatternJustus·12 May

Hosting a research meetup in our North Beach office on Thursday! Come by for food, drinks and talks:
@jyangballin (MSL) will present ProgramBench
@rawsh0 & @rishiiyer01 (Zyphra) will talk about ZAYA-8B
@evan_j_chu and I will speak FrontierSWE and our research bets!

6

10

147

34.2K

Robert Washbourne@rawsh0·11 May

@stochasticchasm @cloneofsimo blue cheese goes hard

1

0

1

53

stochasm@stochasticchasm·10 May

@cloneofsimo they always have some crazy flavors

2

0

2

544

Simo Ryu@cloneofsimo·10 May

Lovely icecream place near palo alto, stanford is really nice place!! Honey Lavender flavor is good

6

0

82

8.2K

Robert Washbourne@rawsh0·10 May

@Prince_Canuma @teortaxesTex 🔥

0

151

Prince Canuma@Prince_Canuma·9 May

@teortaxesTex Haha, nice and cocky! But we the ones we humans made are significantly faster 😜 Btw, here’s the zaya port, it’s a great model: github.com/Blaizzy/mlx-vl…

2

0

11

7.2K

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·9 May

DeepSeek is quite cheeky and believes its mlx kernels are better made than human-made ones (it's 10 t/s for fp16, more like 18 for q8). Unfortunately a very messy project at this point, so I can't tell how legit this is. oh well. V4-flash, redo everything from scratch!

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@rawsh0 It did not. Ie the kernels work, but we failed to exceed naive mps baseline, and peaked at 17-20 tps for bs1 and like 70 tps for bs16 (fp16). I don't know enough about kernel engineering to resolve this (it was mostly an experiment in how far it can do autonomously). last step:

4

0

60

8.3K

Robert Washbourne@rawsh0·10 May

@stevibe Very cool, I’m surprised by the instruction following result, qwen3.5 is super strong

0

1

143

stevibe@stevibe·9 May

New 8B MoE from Zyphra: ZAYA1-8B, 760M active params, claiming it goes toe-to-toe with Qwen3.5 on reasoning and math.
Ran it through BenchLocal against Qwen3.5-9B. 4 suites, 15 scenarios each. Results were... not what I expected:
🟢 InstructFollow: ZAYA 92 vs Qwen 44
🟢 ReasonMath: ZAYA 58 vs Qwen 49
🔴 ToolCall: ZAYA 63 vs Qwen 87
🔴 BugFind: ZAYA 55 vs Qwen 70
Split decision. ZAYA crushes instruction-following (huge gap) and edges out on math reasoning - matches the "reasoning model" framing on its card. But it gets cooked on tool calling and bug finding, which tracks with their own benchmarks showing weaker BFCL / τ² scores vs Qwen3.5.
TL;DR: real reasoning chops, but not a Qwen3.5 replacement if your stack leans on tools or code debugging.
Tested with @BenchLocalApp (open source) — run your own, don't trust vibes.

10

9

139

11.7K
