Robert Washbourne

456 posts

@rawsh0

posttraining lead @zyphraAI

Palo Alto, CA · Joined October 2021
3.7K Following · 670 Followers
Pinned Tweet
Robert Washbourne
Robert Washbourne@rawsh0·
new model! strong <1B active MoE led data and posttraining for this release. cca goat @rishiiyer01 and the pretraining squad cooked x.com/ZyphraAI/statu…
Zyphra@ZyphraAI

Today we're releasing ZAYA1-8B, a reasoning MoE trained on @AMD and optimized for intelligence density. With <1B active params, it outperforms open-weight models many times its size on math and reasoning, closing in on DeepSeek-V3.2 and GPT-5-High with test-time compute. 🧵

English
8
12
73
5.5K
Robert Washbourne retweeted
Justus Mattern
Justus Mattern@MatternJustus·
This went surprisingly well for our first event - heard great talks and had very interesting conversations about post-training and evals! A special thanks to our speakers @jyangballin, @rawsh0, @rishiiyer01 and @evan_j_chu, and looking forward to the next one :)
Justus Mattern tweet media
Justus Mattern@MatternJustus

Hosting a research meetup in our North Beach office on Thursday! Come by for food, drinks and talks: @jyangballin (MSL) will present ProgramBench @rawsh0 & @rishiiyer01 (Zyphra) will talk about ZAYA-8B @evan_j_chu and I will speak FrontierSWE and our research bets!

English
3
5
84
8.2K
Robert Washbourne
Robert Washbourne@rawsh0·
very excited about scaling ttc with diffusion. sparse active params + diffusion decode means reasoning models can punch above their weight class with competitive latency x.com/ZyphraAI/statu…
Zyphra@ZyphraAI

Today we're releasing ZAYA1-8B, a reasoning MoE trained on @AMD and optimized for intelligence density. With <1B active params, it outperforms open-weight models many times its size on math and reasoning, closing in on DeepSeek-V3.2 and GPT-5-High with test-time compute. 🧵

English
3
4
45
2.9K
Robert Washbourne retweeted
rishi
rishi@rishiiyer01·
Leading the training for this model was a privilege. Training diffusion style models will be the future regardless of whether it is discrete/speculative or continuous.
Zyphra@ZyphraAI

We present ZAYA1-8B-Diffusion-Preview, the first diffusion language model trained on @AMD. Autoregressive LLMs generate one token at a time; diffusion generates a block in parallel, speeding up inference. We show a 4.6-7.7x decoding speedup with minimal quality degradation 🧵

English
7
8
78
7.7K
Robert Washbourne retweeted
Zyphra
Zyphra@ZyphraAI·
We present ZAYA1-8B-Diffusion-Preview, the first diffusion language model trained on @AMD. Autoregressive LLMs generate one token at a time; diffusion generates a block in parallel, speeding up inference. We show a 4.6-7.7x decoding speedup with minimal quality degradation 🧵
Zyphra tweet media
English
22
87
693
1.1M
Robert Washbourne
Robert Washbourne@rawsh0·
> this chat was flagged for possible cybersecurity risk 5.5 in codex would be goated if this didn't pop up every 5 seconds
English
0
0
3
270
Nous Research
Nous Research@NousResearch·
Today we release Token Superposition Training (TST), a modification to the standard LLM pretraining loop that produces a 2-3× wall-clock speedup at matched FLOPs without changing the model architecture, optimizer, tokenizer, or training data. During the first third of training, the model reads and predicts contiguous bags of tokens, averaging their embeddings on the input side and predicting the next bag with a modified cross-entropy on the output side. For the remainder of the run, it trains normally on next-token prediction. The inference-time model is identical to one produced by conventional pretraining. Validated at 270M, 600M, and 3B dense scales, and at 10B-A1B MoE. The work on TST was led by @bloc97_, @gigant_theo, and @theemozilla.
Nous Research tweet media
English
150
419
3.7K
443.4K
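The Token Superposition Training post above describes two mechanics: averaging the embeddings of a contiguous bag of tokens on the input side, and scoring a prediction of the next bag on the output side. A minimal sketch of both in plain Python follows. The post does not spell out the "modified cross-entropy", so the loss here (mean negative log-likelihood over the tokens in the next bag) is an assumption, and the helper names `bag_tokens`, `average_embedding`, and `bag_cross_entropy` are hypothetical, not taken from the Nous Research release.

```python
import math


def bag_tokens(tokens, bag_size):
    """Split a token sequence into contiguous, non-overlapping bags.

    A ragged tail shorter than bag_size is dropped (an assumption of this sketch).
    """
    usable = len(tokens) - len(tokens) % bag_size
    return [tokens[i:i + bag_size] for i in range(0, usable, bag_size)]


def average_embedding(bag, embedding_table):
    """Input side: average the embedding vectors of the tokens in one bag."""
    dim = len(embedding_table[bag[0]])
    return [sum(embedding_table[t][d] for t in bag) / len(bag) for d in range(dim)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def bag_cross_entropy(logits, next_bag):
    """Output side (ASSUMED form): mean negative log-likelihood of next-bag tokens.

    This treats the target as a uniform distribution over the tokens in the
    next bag; the actual loss used in TST may differ.
    """
    log_probs = log_softmax(logits)
    return -sum(log_probs[t] for t in next_bag) / len(next_bag)


if __name__ == "__main__":
    # Toy vocabulary of 4 tokens with 2-dimensional embeddings.
    table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    bags = bag_tokens([0, 1, 2, 3, 0], bag_size=2)
    model_input = average_embedding(bags[0], table)
    loss = bag_cross_entropy([0.0, 0.0, 0.0, 0.0], bags[1])
    print(bags, model_input, loss)
```

Per the post, this bagged objective would apply only during the first third of training, after which the run switches to ordinary next-token prediction; the sketch does not model that schedule.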
llm_enjoyer
llm_enjoyer@LLMenjoyer·
@rawsh0 when bishi drops diffusion ill prove it isn't
English
1
0
0
50
Robert Washbourne
Robert Washbourne@rawsh0·
> new hf discussion on your model page titled “reproducibility” 😬 > they get significantly higher results than you reported 👍
Robert Washbourne tweet media
English
3
1
29
2K
llm_enjoyer
llm_enjoyer@LLMenjoyer·
@rawsh0 when u benchmaxx so hard even u have trouble measuring the amt of benchmaxxing
English
1
0
5
197
Robert Washbourne retweeted
Milad Aghajohari
Milad Aghajohari@MAghajohari·
Excited to see that Markovian Thinker contributed to Zyphra's strong release 🚀. Their Markovian RSA: markovian thinking (carrying forward bounded-length reasoning tails) + RSA (recursive self-aggregation) boosted test-time compute to be on-par with larger reasoning models. 1/
Zyphra@ZyphraAI

Today we're releasing ZAYA1-8B, a reasoning MoE trained on @AMD and optimized for intelligence density. With <1B active params, it outperforms open-weight models many times its size on math and reasoning, closing in on DeepSeek-V3.2 and GPT-5-High with test-time compute. 🧵

English
1
5
50
7.4K
Robert Washbourne retweeted
Justus Mattern
Justus Mattern@MatternJustus·
Hosting a research meetup in our North Beach office on Thursday! Come by for food, drinks and talks: @jyangballin (MSL) will present ProgramBench @rawsh0 & @rishiiyer01 (Zyphra) will talk about ZAYA-8B @evan_j_chu and I will speak FrontierSWE and our research bets!
Justus Mattern tweet media (2 images)
English
6
10
147
34.2K
Simo Ryu
Simo Ryu@cloneofsimo·
Lovely icecream place near palo alto, stanford is really nice place!! Honey Lavender flavor is good
Simo Ryu tweet media
English
6
0
82
8.2K
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
DeepSeek is quite cheeky and believes its mlx kernels are better made than human-made ones (it's 10 t/s for fp16, more like 18 for q8). Unfortunately a very messy project at this point, so I can't tell how legit this is. oh well. V4-flash, redo everything from scratch!
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) tweet media
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@rawsh0 It did not. Ie the kernels work, but we failed to exceed naive mps baseline, and peaked at 17-20 tps for bs1 and like 70 tps for bs16 (fp16). I don't know enough about kernel engineering to resolve this (it was mostly an experiment in how far it can do autonomously). last step:

English
4
0
60
8.3K
Robert Washbourne
Robert Washbourne@rawsh0·
@stevibe Very cool, I’m surprised by the instruction following result, qwen3.5 is super strong
English
0
0
1
143
stevibe
stevibe@stevibe·
New 8B MoE from Zyphra: ZAYA1-8B, 760M active params, claiming it goes toe-to-toe with Qwen3.5 on reasoning and math. Ran it through BenchLocal against Qwen3.5-9B. 4 suites, 15 scenarios each. Results were... not what I expected:
🟢 InstructFollow: ZAYA 92 vs Qwen 44
🟢 ReasonMath: ZAYA 58 vs Qwen 49
🔴 ToolCall: ZAYA 63 vs Qwen 87
🔴 BugFind: ZAYA 55 vs Qwen 70
Split decision. ZAYA crushes instruction-following (huge gap) and edges out on math reasoning - matches the "reasoning model" framing on its card. But it gets cooked on tool calling and bug finding, which tracks with their own benchmarks showing weaker BFCL / τ² scores vs Qwen3.5.
TL;DR: real reasoning chops, but not a Qwen3.5 replacement if your stack leans on tools or code debugging.
Tested with @BenchLocalApp (open source) - run your own, don't trust vibes.
stevibe tweet media (4 images)
English
10
9
139
11.7K