aura architect

153 posts

@aurarchitect

architecting a new interface for creating software | running @inventorsRes | building & investing in middleware & application layer

San Francisco @inventors Joined June 2025
96 Following 90 Followers
Pinned Tweet
aura architect @aurarchitect
chat is not the future.
1
1
7
981
serafim @serafimcloud
Someone made a YC W26 tier list. 21st ended up in S tier. Appreciate the vote of confidence. Back to building.
24
10
243
83.3K
aura architect retweeted
otso veistera @OtsoVeistera
You're wasting half your context window. We’re launching @thetokenco (YC W26) today. We compress LLM inputs before they reach the model. Fewer tokens, lower cost, faster inference. Models also perform better. In customer case studies we’ve seen a +5% lift in user purchases due to higher preference for outputs from compressed prompts. The API is live. Link in the comments
76
57
508
91.3K
aura architect retweeted
Mandeep @themandeepc
I think folks are being misled by "high performance" on browser use "benchmarks". It's not appreciated enough just how different they are to LLM benchmarks, and why they're difficult to do right and currently extremely flawed.

LLM benchmarks are "closed world": the model generates text, and you verify it against some fixed ground truth that doesn't change. Even 'hard' benchmarks like Humanity's Last Exam fit this pattern. The benchmark dataset fully defines the expected inputs, outputs, and validation function.

Browser use benchmarks, however, are fundamentally different because they're not closed world. "Actions" - things that change state on a website - are especially difficult. You can't go around willy nilly and mutate state on Twitter, Salesforce, etc, every time you run the evals. That especially applies to the websites we care about: internal enterprise software being the most obvious category.

Even data retrieval can be difficult: websites and data change. Restaurant availability changes every hour, flight availability/prices change even faster. It's _slightly_ easier than actions since you can cache the HTML and make it closed world, as some benchmarks do, but this doesn't work for actions, and ages badly. Other benchmarks get around this by trying to fix the date of a check ("find me flights on 1 March 2024"). Ofc that trick doesn't work for most tasks (like that flights example - you can't view historical flight availability).

Then there's CAPTCHAs, which exist on basically every high-value web task (even if hidden). Current benchmarks exclude all these 'inconvenient' tasks, which massively skews them to be totally unrepresentative of how humans use websites.

Pure computer use benchmarks have it easier because they're often closed world: the start and desired end state can be well-defined and evaluated inside a network-less container. Updating an Excel sheet has no harm (which tbf represents a lot of economic work). But once you're doing things in a browser, on websites over the internet, this nice property doesn't apply anymore.

WebArena's answer to this conundrum was to create 'fake' websites that were supposed to be representative of real ones. The problem is, they're not. OSWorld makes it kinda closed world by providing cached versions of HTML, but this only really works for data retrieval. They're also very unrepresentative. WebVoyager is especially egregious: just 15 (!!) websites are represented, and the tasks are ridiculously easy. Take a look yourself: github.com/MinorJerry/Web…

So, how does this translate to the claims made by browser startups? Well, WebVoyager (the extremely easy one) is the benchmark the avg browser startup reports 85%+ accuracy on. Claude's performance is reported for computer use, and against OSWorld which is dominated by closed-world tasks. So really, high reported accuracies should be taken with a huge grain of salt, and there's still a long way to go before computer use is solved.

That said, there's at least one other team thinking about these problems (@yutori_ai, with their release of Navi-Bench). From first principles, this is a really tricky problem to solve. The infra and data to properly benchmark web agent performance is extremely nascent and underdeveloped. It's a problem we think a lot about at Indices -- please reach out (DM) if you do too!
5
9
43
7.6K
Mehul Agarwal @meh_agarwal
I want to host a singles mixer for @ycombinator & a16z @speedrun founders. SIGN UP BELOW to show interest. You need to be YC or @a16z, need to be SINGLE (we will DM your partner) and below 30. No VCs or scouts allowed unless you’re very cool. I’ve seen this weird rivalry bw YC & a16z founders since I moved to SF. It’s also a fact that almost all of us are single (sorry, married losers). If we want to maintain world peace, it’s important that the future generations grow together. This is not a networking event. It’s only for those who want to: create the future people want.
72
10
266
321.9K
aura architect @aurarchitect
a lot of founders won’t make it. not because they are not working hard (in most cases). if you don’t have enough agency, that's the easiest problem to fix
0
0
1
50
Sam Altman @sama
More than 200k people downloaded the Codex app in the first day. And they seem to love it. CODEX FTW!
2.1K
307
9.2K
1.6M
aura architect @aurarchitect
a product recommendation engine and an ads engine are two different things. if you lock out >99% of all products through financial filtering, the user will not receive better recommendations than without that filter. AI already does product recommendations within chats organically based on internet sentiment. Not the best metric tbh but still way better than recommendations based on ad spend of the products.
0
0
0
15
Beff (e/acc) @beffjezos
I, for one, think ads can be symbiotic with human cognition. Products are canned priors over action space, policies you can roll out to efficiently take action in the world. A great AI recommender system for products can help people take actions that are beneficial to them.
roon @tszzl

putting my mediaobserver hat on. ant ads are pretty brilliant because they’re dishonest in a way that’s only going to ragebait openai heads and certain industry insiders but are funny and striking to everyone else. when you’re a call option variance is good. mario kart blue shell

12
2
42
6.6K
aura architect @aurarchitect
can some great journalist / creator like @LEMMiN0 create a well-researched documentary on the sam <> elon beef?
0
0
1
97
aura architect @aurarchitect
@beffjezos building successful saas is easier than building successful deeptech, there are still so many untouched opportunities in the application layer and you don't need to rely on external factors (like investors giving you money) that much
0
0
0
68
aura architect @aurarchitect
low-effort change to your system prompt: never use gradients unless specifically asked to. thank me later
0
0
0
84
Elliot Arledge @elliotarledge
my friend @neuralkian just dropped a pipeline parallelism course for FREE! this is exactly what frontier labs would hire you to work on at scale in order to speed up training and inference on large models. you'll start with a simple example of overlapping computation on a small MLP, and work up from there!
9
20
322
12.9K
Don @donatelli2026
Just moved to San Francisco from France. I'm going all in on entrepreneurship now. I've been very inspired by @robj3d3, @levelsio and @alexcooldev
156
10
818
189.5K