Andy Yao

125 posts

@yaoandy107

Thinking about code or my next bowl of ramen.

Tokyo, Japan · Joined September 2014
321 Following · 71 Followers
Andy Yao @yaoandy107
@Xxi5olc @tcrow_psych @internetope @arcprize We already have coding and math benchmarks. ARC-AGI measures how models handle unfamiliar tasks. It may not matter much if you mainly care about coding, but it still has its own value.
Axi @Xxi5olc
Yes I did and will do again. I guess what I am saying is: even with only a 0.5% score, the model already writes better code, does better math, and puts together precise financial models. Where does this insistence on cracking ARC-AGI come from? Does solving ARC-AGI improve intelligence across the board?
ARC Prize @arcprize
GPT-5.5 & Opus 4.7 on ARC-AGI-3
- GPT-5.5: 0.43%
- Opus 4.7: 0.18%
We found 3 failure modes:
- True local effect, false world model
- Wrong level of abstraction from training data
- Solved the level, didn’t reinforce the reward
See our full analysis 🧵
Andy Yao @yaoandy107
@gokmakes @MariusMollerH I suddenly started getting "Couldn't read commit" errors on Cloud Build around the same time as well.
Gökberk @gokmakes
@MariusMollerH I’m seeing the same behavior on my end. GitHub events are coming through, but Cloud Build triggers aren’t firing at all. Might be an issue on the Google Cloud side?
Marius Møller-Hansen @MariusMollerH
Is the issue GitHub, Google Cloud, or me? Because Cloud Build triggers are not triggering.
Andy Yao @yaoandy107
@amix3k One thing I noticed with Gemini 3 Flash: make sure to use temperature 1, otherwise performance degrades. It's mentioned in the docs and it also showed up in my evals.
Amir Salihefendić @amix3k
I ran some internal Doist evals for Gemini 3 Flash, and it seems to perform poorly (hallucinations even on simple prompts). These are the same tests & prompts we use for Flash 2.5. I am getting to a point where it’s very tough to trust any benchmarks.
Andy Yao @yaoandy107
@ardaaltinors @theo @scaling01 Just a heads up, 2.5 Flash actually got an update in September. To me, the original release was good but not a clear upgrade. It was better in some ways but worse in others compared to 2.0. The updated version definitely works better for my use cases though.
Arda @ardaaltinors
@theo @scaling01 you were telling us 2.0 flash is better and to use it. i can't keep up with your pivots
Andy Yao retweeted
Arena.ai @arena
🚨🍌 Breaking News: Gemini-2.5-Flash-Image-Preview (“nano-banana”) by @GoogleDeepMind now ranks #1 in Image Edit Arena.
In just two weeks:
🟡 “nano-banana” has driven over 5 million community votes in the Arena
🟡 Record-breaking 2.5M+ votes cast for this model alone
🟡 It has achieved the largest Elo score lead in Arena history (a monster 171 point lead)
Huge congrats to @GoogleDeepMind!
Google DeepMind @GoogleDeepMind

Image generation with Gemini just got a bananas upgrade and is the new state-of-the-art image generation and editing model. 🤯 From photorealistic masterpieces to mind-bending fantasy worlds, you can now natively produce, edit and refine visuals with new levels of reasoning, control and creativity. A quick dive into Gemini 2.5 Flash’s capabilities 🧵

Andy Yao @yaoandy107
@gus_tiffer It's better to craft the requirements and design first, then split the work into tasks and implement them one by one. One-shotting never performs well for me. Trying to refine a bad version is even worse; I prefer to reset and redo the bad piece from scratch.
Gus Tiffer @gus_tiffer
to everyone who says they're one-shotting apps.... genuinely how? i'm working in claude code every day. carefully thinking through an app. how it will work. crafting a big beautiful prompt. And still... the end result is barely there. I spend the next few hours working through the details, debugging, adding/removing, etc. Do you have some prompting secret I don't know about? Or does it all just have to be refined in the end?
Logan Kilpatrick @OfficialLoganK
AI from the people who pioneered AI
Andy Yao @yaoandy107
GPT 5 is super cheap! Let's see how it performs 👀
OpenAI @OpenAI
LIVE5TREAM THURSDAY 10AM PT
Andy Yao retweeted
Yori より @yoridesucom
Struggling with learning Japanese? Learn with Yori! Translations with grammar & context explanations. 🎨 Custom styles control ✨ Detailed explanation 💬 24/7 chat buddy with correction Free to start 👉yoridesu.com #LearnJapanese #Japanese #JLPT #IndieDev #EdTech
Andy Yao retweeted
Ty Smith @tsmith
"Kotlin is now the recommended programming language for server-side JVM usage at Google, set to replace Java while still providing access to a large existing Java ecosystem." youtu.be/o14wGByBRAQ
Andy Yao @yaoandy107
@kalanyei I'm curious how people in Japan are reacting. I'm thinking of buying it too.
Kalan ◂Ⓘ▸ @kalanyei
Pixel 7: it's interesting how completely different the reactions are in Taiwan and Japan.
Andy Yao @yaoandy107
Wow, the Pixel 7 offers a free VPN!!!
Andy Yao @yaoandy107
And it only crashes on release builds. I wasted half a day tracking down this issue...