Vania Chow

79 posts

@vania_chow

CS (AI) @ Stanford | Research @Stanford_GSB (https://t.co/NSoXlCbvYD)

Stanford, CA · Joined April 2016

943 Following · 91 Followers

Vania Chow@vania_chow·19h

@evijit @JoalStein @ahall_research @Miles_Brundage Eval Cards look awesome. Would love to learn more!

Avijit Ghosh@evijit·1d

@JoalStein @ahall_research @Miles_Brundage Thanks Joal! Still a work in progress but we’re prepping for a launch soon! Would be very happy to talk about adoption of Eval Cards @ahall_research

Andy Hall@ahall_research·1d

People who care about AI: Free Systems needs your help! What would you like to know about the capabilities, dangers, and other characteristics of cutting-edge AI models? Inspired by a suggestion from @Miles_Brundage we are working to design a simple, clean model card visualizer. The goal is to give people a quick, easy way to see what's different across models and what's new in the latest models. But it turns out, it's pretty hard to boil down model cards to a single actual "card." New model cards can be many pages long. Most of the data provided in model cards is not directly comparable to other cards (there's almost no overlap in evals, we've found). So we want to know: what are you looking for in a model card? What are the pieces of information that would be most valuable to you? And how would we standardize these cards across labs and models? If you have a moment, please fill out our survey here: forms.gle/9k6E5e11Z4WNUi… We're hoping to ship our visualizer two weeks from today, and would really appreciate your input.

2.5K

Vania Chow@vania_chow·19h

We believe model cards should be easily digestible. Help us make that happen!

Andy Hall@ahall_research

Vania Chow@vania_chow·5d

@mada299 @swyx @Cursor Would love the db

Mada Seghete@mada299·6d

1,058 GTME job listings (983 open), 1,167 named practitioners, and 867 hiring companies — sorted into eight archetypes that show how the role is taking shape across the market. The role is blowing up and I tried using engineering skills to try defining it. I even used @Cursor to build the video highlights. If you want the Notion database with all the roles and practitioners (it updates every day) let me know below.

5.1K

Vania Chow@vania_chow·14 May

deadbenchmarks.substack.com/p/the-working-… Would love and appreciate all thoughts, and would like to hear from anyone interested in working on this

Vania Chow@vania_chow·14 May

Software ran on annual prepay for 25 years, where deferred revenue funded operations on day zero. AI broke this. Compute is prepaid to frontier labs while customer revenue is collected in arrears. I think there is an open opportunity for a new funding instrument, more below!

Vania Chow retweeted

Sara Hooker@sarahookr·13 May

Most model trainings have failed outside of frontier labs. Even inside frontier labs, knowing how to train for very different capabilities is often a matter of taste. Today, we introduce AutoScientist by @adaption_ai which sets out to change that.

adaption@adaption_ai

Introducing AutoScientist. Most model training fails outside of frontier labs. AutoScientist automates the full research loop so it doesn't have to.

533

99.8K

Vania Chow retweeted

Stratechery@stratechery·11 May

The Inference Shift Agentic inference is going to be different than the inference we use today, and it will change compute infrastructure because speed won't matter when humans aren't involved. stratechery.com/2026/the-infer…

558

199.9K

Vania Chow retweeted

Andy Hall@ahall_research·7 May

When we built @karpathy's LLM council in class last quarter, we noticed that Claude Code always made Claude the chairman of the council. Coincidence, or self preference? @JessicaPersano and I decided to run a set of experiments to find out. Main findings: (1) When given the free choice, Claude Code and Codex massively favor their own company's models, both in terms of appointing judges for evaluation tasks, and in terms of SDKs. (2) When told using a different company's model would be better, Codex demonstrates admirable flexibility; Claude Code stubbornly sticks to Claude models. (3) Claude's stubbornness comes from the CLI wrapper, which contains specific instructions for Claude Code to favor the Anthropic SDK. When we replicate the experiments using Claude through the API, it is similarly flexible to Codex. We're not sure yet what to make of all this. On the one hand, it's totally understandable for a company's coding agent to prefer its parent company's tooling. On the other, if the economy is soon to be run by millions of these coding agents, then this kind of "bundling" is likely to get very contested. For political superintelligence, we'll need to truly own our agents. They'll need to answer to us, not the model companies. Agents given instructions to prioritize their own company's tooling may not be consistent with this kind of strong ownership down the line. As you can tell this is early stage work and our thinking hasn't yet congealed---would very much appreciate people's thoughts. When should coding agents prioritize their own company's AI tools? When is it a genuine problem? Excited to keep working on this! A link to the full post is below.

139

28.4K

Vania Chow@vania_chow·6 May

@ckor Moving to NYC and am very interested!!

Andrew Kordampalos@ckor·4 May

x.com/i/article/2051…

216

188.1K

Vania Chow@vania_chow·5 May

I think there’s a distinction that needs to be made between consumer vs prosumer - cursor, lovable, elevenlabs etc. arguably provide business-like value to power users!

Sasha Kaletsky@SashaKaletsky

x.com/i/article/2051…

165

Vania Chow@vania_chow·5 May

@abhishekn Agree on the model-agnostic point, but don't think that it's the frontier lab's job to bring sector-specificity -- PE & PortCos are the experts there. What frontier labs can bring is strong Eng talent to support long context RAG and seamless model upgrades

Abhishek Nagaraj@abhishekn·4 May

Very intrigued by this trend. What complementary assets do the labs have that sector specific firms don’t to do this better? Private models? I suspect having the best implementation will need vendor to be model agnostic, so I see some conflicts of interest down the road (not to mention channel conflict with other implementation partners) — but the market is so big, I suspect all boats will rise.

Aaron Levie@levie

Both Anthropic and OpenAI have new initiatives to help enterprises deploy AI agents within their organizations. This is a trend that’s early but going to get very big fast. As agents enter knowledge work beyond coding, there is very real work to upgrade IT systems, get agents the context they need, modernize the workflows to work with agents, figure out the human-agent relationship in the workflow, drive adoption and do change management, and much more. While AI models have an incredible amount of capability packed into them, there’s no shortcut to getting that intelligence applied to a business process in a stable way. This is creating tons of opportunities across the market for new jobs and firms, and the labs are equally recognizing the criticality here.

2.9K

Vania Chow retweeted

Mapping AI@mapping_ai·4 May

Who actually shapes AI policy in the U.S.? We mapped 1,812 entities: 745 people, 918 organizations, 2,925 relationships. Frontier Labs, AI Safety orgs, Think Tanks, Government, VCs, and more. mapping-ai.org

367

1.4K

292.8K

Vania Chow@vania_chow·4 May

@RISignal Totally agree on the importance of long-horizon user effects. The one on Dead Benchmarks is currently a one-shot prompt but would love to work on a long-horizon one! Let me know if you're interested in cooking something up together!

Justin Hudson@RISignal·4 May

@vania_chow Dead Benchmarks is on the right track. Are you studying long-horizon user effects? Most benchmarks rely on one-shot prompts, but real-world use involves persistent interaction patterns that can route the same scenario into different reasoning regimes.

Vania Chow@vania_chow·4 May

Worked on a piece that tests exactly this! open.substack.com/pub/deadbenchm…

Justin Hudson@RISignal

@ahall_research AI constitutions won’t converge on the same values, they’ll compete on how coherently they hold under pressure, and the interaction will decide which one you actually experience.

128

Vania Chow retweeted

Alex Imas@alexolegimas·3 May

Thank you @ezraklein for covering my piece on what will be scarce with advanced AI, and what this can mean for the future of work. nytimes.com/2026/05/03/opi… I would also recommend this piece for why *current* jobs may hold together for longer than people think by @lugaricano: siliconcontinent.com/p/why-desk-job… And this by @pawtrammell on an alternative scenario where labor share goes to zero: philiptrammell.substack.com/p/is-labor-a-l…

301

88.3K

Vania Chow@vania_chow·2 May

@david__booth @jasonnov @levie would love to be in this gc! stanford cs + ib/pe background

David Booth@david__booth·2 May

after many conversations & DMs today.. we're narrowing in... on the technical side of the venn diagram: - engineer-to-bizops pipeline (credit @jasonnov); - still liking "internal-facing forward deployed engineer" (credit @levie) on the biz side, it's a - Chief of Staff to the CEO/COO (rethinking organizational structure/workflows); or maybe - a "Forward thinking head of IT" (not always an oxymoron h/t @clairevo) .. starting a gc of the best people i find, appreciate the help 🫡

David Booth@david__booth

ok help me out here team. i want to talk to people who are this role at their company..👇👇 @levie's tweet has the cleanest definition, but i'm still struggling what to call it. what do you put in the JD? - "internal FDE, whose job it is to wire up internal systems and get agents working with them effectively." - @tkkong says "leverage engineering" - @EricFriedman says "outcome engineers" - have also seen "agent operator", "director of agents" i like "ops engineer" ? maybe it doesn't need a title, it's just "head of operations" and/or "bizops but good at AI stuff" ? DM me pls i / founders tag your "person" who is thinking about this stuff, i wanna chat to you about something 👀

4.1K

Vania Chow@vania_chow·2 May

@sarahookr Really enjoyed the event and great meeting you @sarahookr :)

Sara Hooker@sarahookr·2 May

Excellent energy yesterday. Really great kickoff to the series. 🔥🎉

Nilou Salehi@nilou_salehi

It was standing room only at the kick-off for our research series on continual learning. Thank you to @NikzadAfshin (@across_ai ) @sarahookr (@adaption_ai) and @mralbertchun (AI Circle) for hosting! @oshaikh13 shared his research on human grounding in continual learning. It was so cool to be reminded of the old Apple Knowledge Navigator and how close we are to it and yet how far we still are :) how much easier some questions have gotten and how some remain so hard. Omar, you reminded me of my PhD defense where at some point I annoyed Maneesh so much he said: you can't keep saying "depends on the user context" in response to every question 😅 youtu.be/umJsITGzXd0?si… Stay tuned for the next meetup next month and check out Omar's research with @msbernst and @Diyi_Yang : •⁠ ⁠Creating General User Models from Computer Use (arxiv.org/abs/2505.10831): an architecture for a model that learns about you by observing any interaction with your computer, building confidence-weighted propositions about preferences and intent. •⁠ ⁠Learning Next Action Predictors from Human-Computer Interaction (arxiv.org/abs/2603.05923): predicting a user's next action from their full multimodal interaction history (screenshots, clicks, sensor data) rather than just typed prompts.

Vania Chow@vania_chow·1 May

@Bouazizalex interested! stanford cs - technical + commercial (ib + pe) background

159

Alex Bouaziz@Bouazizalex·1 May

Filled 2! Have one last opening, join us!!

Dan Westgarth@DanWestgarth

We're recruiting: Ghostbuster ~$200k - remote Role: drive $10mm in savings/yr by making everyone's life easier. The person we are looking for: - doesn't take anything for granted. - doesn't need guidelines. - obsess over specific problems - can't sleep until you solve them. - very high energy, very smart and humble. - want to work 10hs a day including weekends. - incredibly ambitious. - want to skip 4 squares - not one. - ideally engineering degree + 2 years of consulting (finance, tech) Two years ago, I realized there was a gap in our org structure. Some problems were too large for ICs to take on, but they were also too small and undefined for an executive to handle effectively. These issues stem from scale and time: 1. Decisions made years ago no one revisits. 2. Tools we started using with no post-mortem or analysis. 3. People getting delayed by things that could be automated. We saw what Elon was doing with DOGE and thought that's what we are looking for. So we created The Ghostbusters. It's currently at 4 people, but I expect it to get much larger as we find great fits. People in this team are extremely strong operationally. You must be very high energy, financially sharp, and a default optimist.

35.4K

Vania Chow@vania_chow·1 May

@dan_uptop interested! stanford cs w/ technical + commercial (ib/pe) background

221

dan | up top@dan_uptop·1 May

and they are hiring. you know where to find me if interested!

Fun@fun

Today, we announce our $72 million Series A, co-led by @multicoin and @SignalFire. fun.xyz/news/series-a

184

36.8K

Vania Chow@vania_chow·1 May

Had so much fun working on this, and I believe reliability metrics for prediction markets are key to making them a useful information source. Let us know your thoughts!

Andy Hall@ahall_research

Today, we're releasing our first Free Systems product: Bellwether, an API, MCP server, and dashboard to help the media report prediction-market prices more reliably.

Prediction markets can give us access to real-time, continuous, objective probabilities of important world events---but only if we build them to be well-structured, liquid enough, and resistant to manipulation.

Bellwether helps by:
--Reporting prices that are less manipulable because they're based on a volume-weighted average, not the last traded price
--Flagging whether the price comes from a sufficiently liquid market or not, so that the media can avoid reporting on prices that are unreliable or super easy to manipulate
--Standardizing across platforms, to help resolve when contracts for the same event across Kalshi and Polymarket are actually the same, or not

We hope that you'll check it out, let us know what you think, and suggest improvements! bellwethermetrics.com

This is joint work with @elliotjpaschal and @vania_chow
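To make the first two points in the post above concrete, here is a minimal sketch of how a volume-weighted average price differs from the last traded price, with a simple liquidity flag alongside. This is not Bellwether's actual code or API: the `Trade` type, the function names, the sample trades, and the `min_volume` threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trade:
    """One fill in a binary prediction-market contract (hypothetical type)."""
    price: float   # dollars per contract, between 0.00 and 1.00
    volume: float  # number of contracts traded at that price


def last_traded_price(trades):
    """Price of the most recent trade; a single small trade can move it."""
    return trades[-1].price


def volume_weighted_average_price(trades):
    """Average price weighted by contracts traded: sum(p * v) / sum(v)."""
    total_volume = sum(t.volume for t in trades)
    if total_volume <= 0:
        raise ValueError("no traded volume to average over")
    return sum(t.price * t.volume for t in trades) / total_volume


def is_liquid(trades, min_volume):
    """Flag whether total traded volume clears an assumed threshold."""
    return sum(t.volume for t in trades) >= min_volume


# Illustrative data: steady trading near 0.41, then one tiny trade at 0.90.
trades = [
    Trade(price=0.40, volume=500),
    Trade(price=0.42, volume=300),
    Trade(price=0.41, volume=199),
    Trade(price=0.90, volume=1),
]

last = last_traded_price(trades)
vwap = volume_weighted_average_price(trades)
liquid = is_liquid(trades, min_volume=1000)  # threshold is an assumption
```

In this made-up example the single one-contract trade at 0.90 sets the last traded price, while the volume-weighted average stays close to 0.41, which is the sense in which a volume-weighted figure is harder to move with a small trade.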

271
