Andy Yao

125 posts

@yaoandy107

Thinking about code or my next bowl of ramen.

Tokyo, Japan · Joined September 2014

321 Following · 71 Followers

Andy Yao@yaoandy107·2 May

@Xxi5olc @tcrow_psych @internetope @arcprize We already have coding and math benchmarks. ARC-AGI measures how models handle unfamiliar tasks. It may not matter much if you mainly care about coding, but it still has its own value.

Axi@Xxi5olc·1 May

Yes I did and will do again. I guess what I am saying is: even with only a 0.5% score, the model already writes better code, does better math, and puts together precise financial models. Where does this insistence on cracking ARC-AGI come from? Does solving ARC-AGI improve intelligence across the board?

ARC Prize@arcprize·1 May

GPT-5.5 & Opus 4.7 on ARC-AGI-3
- GPT-5.5: 0.43%
- Opus 4.7: 0.18%

We found 3 failure modes:
- True local effect, false world model
- Wrong level of abstraction from training data
- Solved the level, didn’t reinforce the reward

See our full analysis 🧵

Andy Yao@yaoandy107·30 Apr

@gokmakes @MariusMollerH I suddenly started getting "Couldn't read commit" errors on Cloud Build around the same time as well.

Gökberk@gokmakes·30 Apr

@MariusMollerH I’m seeing the same behavior on my end. GitHub events are coming through, but Cloud Build triggers aren’t firing at all. Might be an issue on Google Cloud side?

Marius Møller-Hansen@MariusMollerH·30 Apr

Is GitHub, Google Cloud, or me the issue? Because cloudtriggers are not triggering

Andy Yao@yaoandy107·18 Dec

@amix3k One thing I noticed with Gemini 3 Flash: make sure to use temperature 1, otherwise performance degrades. It's mentioned in the docs and also shows up in my evals.

Amir Salihefendić@amix3k·18 Dec

I ran some internal Doist evals for Gemini 3 Flash, and it seems to perform poorly (hallucinations even on simple prompts). These are the same tests & prompts we use for Flash 2.5. I am getting to a point where it’s very tough to trust any benchmarks.

Andy Yao@yaoandy107·17 Dec

@ardaaltinors @theo @scaling01 Just a heads up, 2.5 Flash actually got an update in September. To me, the original release was good but not a clear upgrade. It was better in some ways but worse in others compared to 2.0. The updated version definitely works better for my use cases though.

Arda@ardaaltinors·17 Dec

@theo @scaling01 you were telling us 2.0 flash is better and to use it. i can't keep up with your pivots

Lisan al Gaib@scaling01·17 Dec

You don't understand Gemini 3 Flash is a bigger deal than Pro

Riley Brown@rileybrown

Their level of confidence is actually concerning at this point. I'm scared for OpenAI and Anthropic.

Andy Yao retweeted

Arena.ai@arena·26 Aug

🚨🍌Breaking News: Gemini-2.5-Flash-Image-Preview (“nano-banana”) by @GoogleDeepMind now ranks #1 in Image Edit Arena.

In just two weeks:
🟡 “nano-banana” has driven over 5 million community votes in the Arena
🟡 Record-breaking 2.5M+ votes cast for this model alone
🟡 It has achieved the largest Elo score lead in Arena history (a monster 171-point lead)

Huge congrats to @GoogleDeepMind!

Google DeepMind@GoogleDeepMind

Image generation with Gemini just got a bananas upgrade and is the new state-of-the-art image generation and editing model. 🤯 From photorealistic masterpieces to mind-bending fantasy worlds, you can now natively produce, edit and refine visuals with new levels of reasoning, control and creativity. A quick dive into Gemini 2.5 Flash’s capabilities 🧵

Andy Yao@yaoandy107·20 Aug

@gus_tiffer It's better to craft the requirements and design first, then split the work into tasks and implement them one by one. One-shotting never performs well for me. Trying to refine a bad version is even worse; I prefer to reset and redo the bad piece from scratch.

Gus Tiffer@gus_tiffer·20 Aug

to everyone who says they're one-shotting apps.... genuinely how? i'm working in claude code every day. carefully thinking through an app. how it will work. crafting a big beautiful prompt. And still... the end result is barely there. I spend the next few hours working through the details, debugging, adding/removing, etc. Do you have some prompting secret I don't know about? Or does it all just have to be refined in the end?

Andy Yao@yaoandy107·9 Aug

@OfficialLoganK @K78467764 @GoogleDeepMind @joshwoodward I'd love to see automatic memory, and a faster, more concise research mode like Grok or Perplexity, ideally with selectable depth (quick vs. deep).

Logan Kilpatrick@OfficialLoganK·9 Aug

@K78467764 @GoogleDeepMind pls send feedback on what needs to be fixed and @joshwoodward + team will fix

Logan Kilpatrick@OfficialLoganK·9 Aug

AI from the people who pioneered AI

Andy Yao@yaoandy107·7 Aug

GPT 5 is super cheap! Let's see how it performs 👀

Andy Yao@yaoandy107·6 Aug

@OpenAI Let's see 👀

OpenAI@OpenAI·6 Aug

LIVE5TREAM THURSDAY 10AM PT

Andy Yao@yaoandy107·5 Aug

This is soo cool! Super amazed 🤯

Google DeepMind@GoogleDeepMind

What if you could not only watch a generated video, but explore it too? 🌐 Genie 3 is our groundbreaking world model that creates interactive, playable environments from a single text prompt. From photorealistic landscapes to fantasy realms, the possibilities are endless. 🧵

Andy Yao retweeted

Yori より@yoridesucom·2 Aug

Struggling with learning Japanese? Learn with Yori! Translations with grammar & context explanations. 🎨 Custom styles control ✨ Detailed explanation 💬 24/7 chat buddy with correction Free to start 👉yoridesu.com #LearnJapanese #Japanese #JLPT #IndieDev #EdTech

Andy Yao retweeted

Ty Smith@tsmith·15 Oct

"Kotlin is now the recommended programming language for server-side JVM usage at Google, set to replace Java while still providing access to a large existing Java ecosystem." youtu.be/o14wGByBRAQ