hung

680 posts

@hungtran

eval @valsai

Joined November 2013

1K Following · 246 Followers

Pinned Tweet

hung@hungtran·25 Aug

becoming uncommon amongst uncommons

5.4K

hung@hungtran·25 Mar

the ui for agents is heading into a bottleneck for innovation. every current tool is missing certain components to reach its full potential. vs code only has seamless ssh mode and git integration. the only reason i open an ide nowadays is to see diff views, git changes, and the built-in terminal in one place. but tmux/terminal is impractical at scale. if i have 12 agents running in a grid of 12 cells, every time i want to dive deep into a session i need to zoom in and scroll, click references to jump to source code, watch streaming logs of background processes, and cross-check different sub-agents. what if i find an interesting bug that's worth investigating further and need to spawn another panel right next to the current one? it's constant friction. claude code and codex cli suffer from the same problems, though claude code does exceptionally well at rapidly improving the experience. codex app with the hacky ssh mode patch is the closest to the ui i can imagine, but it's far from perfect. i'm skeptical of uis that overload your cognition with noisy information. when you give too much customization without constraints, it becomes more of a hurdle than a feature. on the other end, something like openclaw is a rare candidate that has the most comprehensive context ingestion behind the scenes. yet its full potential is bottlenecked by basic messaging interfaces like imessage and telegram, which were designed for a completely different purpose. what's missing across all of them: there's no concept of high-level agent management -- managing a team of agents the way a ceo manages an org. and the "just get a bigger monitor" answer only solves the problem indirectly. most of our daily workflow isn't sitting at a desk. we walk, bike, run, and switch between environments. we mostly rely on phones and small laptops, with our voice as the lowest-latency form of transmitting information. maybe a brain-computer interface is the great unhobbling?
six months to a year ago, the ui bottleneck was tolerable. but it's getting to the point where every day, every feature added is another push against the ceiling of the current interface paradigm. i'm optimistic that this pressure will drive us to experiment with something fundamentally different, and we're starting to see that emergence, with the right components scattered across different apps, waiting to be pulled together.

hung retweeted

NASA Administrator Jared Isaacman@NASAAdmin·24 Mar

To return Americans to the Moon, NASA is shifting to an iterative, execution-focused approach – just as we did during Apollo. We are standardizing rocket architecture, embedding NASA expertise across industry, and increasing launch cadence to support sustained lunar operations. We are sending a demand signal for crewed missions beyond Artemis V, with at least two providers capable of bringing astronauts to the surface every 6 months. The goal is not just to reach the Moon, but to stay. America will never give up the Moon again.

645

1.6K

13K

3.8M

hung retweeted

Kpaxs@Kpaxs·20 Mar

This is a man who has been haunted since childhood and built a billion dollar company as a side effect of trying to make the haunting stop.

110

452

4.3K

980K

hung@hungtran·20 Mar

this encapsulates my review well. The book truly ignites my interest in physics and understanding of the universe!

Andrej Karpathy@karpathy

Had to go see Project Hail Mary right away (it's based on the book of Andy Weir, of also The Martian fame). Both very pleased and relieved to say that 1) the movie sticks very close to the book in both content and tone and 2) is really well executed. The book is one of my favorites when it comes to alien portrayals because a lot of thought was clearly given to the scientific details of an alternate biochemistry, evolutionary history, sensorium, psychology, language, tech tree, etc. It's different enough that it is highly creative and plausible, but also similar enough that you get a compelling story and one of the best bromances in fiction. Not to mention the other (single-cellular) aliens. I can count fictional portrayals of aliens of this depth on one hand. A lot of these aspects are briefly featured - if you read the book you'll spot them but if you haven't, the movie can't spend the time to do them justice. I'll say that the movie inches a little too much into the superhero movie tropes with the pacing, the quips, the Bathos and such for my taste, and we get a little bit less the grand of Interstellar and a little bit less of the science of The Martian, but I think it's ok considering the tone of the original content. And it does really well where it counts - on Rocky and the bromance. Thank you to the film crew for the gem!

hung@hungtran·20 Mar

@braelyn_ai fitness sf fillmore and presidio ymca are great

141

Braelyn ⛓️@braelyn_ai·20 Mar

this is the first time I’ve ever wondered where I can find a pool in San Francisco

103

11.8K

hung@hungtran·18 Mar

@lineesh_antony @ValsAI @grok @askvals help me answer antony’s question

antony lineesh@lineesh_antony·18 Mar

@ValsAI @grok which mini max is from which company

192

Vals AI@ValsAI·18 Mar

Initial results are in for Minimax 2.7, and it comes in at #12 overall on the Vals Index. If the weights are released, it will be #2 on the open-weight index (only 0.5% behind #1).

337

32.2K

hung retweeted

kache@yacineMTB·12 Mar

something i learned from my wife, who recently learned how to sew: do not do beginner projects. if what you want to make is difficult to make, you should just try to make it. don't do a slow learning process. don't start with the basics. start with the advanced

182

566

10.3K

210K

hung@hungtran·6 Mar

@ValsAI @askvals where is @grok model in the leaderboard?

237

Vals AI@ValsAI·5 Mar

GPT 5.4 is #1 on Vibe Code Bench at 67.4%, +5.7% higher than the previous SOTA. This is our benchmark that measures model’s ability to produce an entire working application from a short text specification.

554

66.8K

hung@hungtran·6 Mar

@SQMah great work SQ!

SQ Mah@SQMah·6 Mar

Just demoed some of 5.4’s computer use and frontend capabilities - check it out here! What I really like is that computer use was on an Electron app, so Codex can also make and test desktop apps as well Also yes I need a haircut :)

OpenAI Developers@OpenAIDevs

GPT-5.4 is here. Native computer-use capabilities. Up to 1M tokens of context in Codex and the API. Best-in-class agentic coding for complex tasks. Scalable tool search across larger ecosystems. More efficient reasoning for long, tool-heavy workflows. openai.com/index/introduc…

2.6K

hung@hungtran·5 Mar

@chribjel GPT-5.4 best but too slow

Christoffer Bjelke@chribjel·5 Mar

So whats best for coding now? gpt-5.3-codex or gpt-5.4?

234

1.1K

318.9K

hung@hungtran·5 Mar

Some details I think people might miss with this release:
- What stands out about GPT-5.4 is that the model spends much more time checking its own work than other models do: navigating the browser, clicking around, poking at edge cases, running backend scripts to test the DB. The ratio of editing to verifying is flipped, and I think it’s a paradigm shift.
- Coding benchmarks are not saturating! There's still a massive difference between "model can build a static single-page website" and "model can build a fullstack app with deployment, connect to payments and email services, and actually test it before handing it back to you." GPT-5.4 is much closer to the latter camp but still not quite there yet. There are still many more aspects of coding to evaluate that we’re cooking!

Vals AI@ValsAI

186

hung@hungtran·1 Mar

time and time again i keep underestimating the power of going out of your way to surround yourself with the right people.

hung@hungtran·26 Feb

@scaling01 @askvals how does this compare with your number?

2.2K

Lisan al Gaib@scaling01·26 Feb

Qwen3.5 27B looks really fucking good on ArtificialAnalysis Leaderboard it's on par with DeepSeek-V3.2 and Minimax-M2.5

462

42.6K

hung@hungtran·26 Feb

@claudeai @askvals have you already run eval for this model?

Claude@claudeai·17 Feb

This is Claude Sonnet 4.6: our most capable Sonnet model yet. It’s a full upgrade across coding, computer use, long-context reasoning, agent planning, knowledge work, and design. It also features a 1M token context window in beta.

1.1K

2.5K

22.3K

7.6M

hung@hungtran·26 Feb

@askvals what's best model to vibe code today?

hung@hungtran·26 Feb

We have been building and using AskVals internally and have found it really helpful. Go try it and let us know your thoughts. More features to come!

Vals AI@ValsAI

Introducing Ask Vals — @AskVals Keeping up with the flood of model releases, benchmarks, and rankings is overwhelming. We built a bot internally to cut through the noise, and now it's live on X. Tag it to ask questions about models, benchmarks, performance, comparisons on specific dimensions, and more (all based on Vals data)!

134

hung retweeted