Elliot Bensabat

216 posts

@elliotbst

ML/DL nerd before it was cool. Then turned into Product. Building something new…

New York, NY Joined November 2011

375 Following 199 Followers

Elliot Bensabat@elliotbst·2 May

Still surprised this isn’t pushed more. We need a protocol (think open banking) to make it easy and safe for consumers to move/share their data from your llm chats. Would be good for competition and thus for users. @elonmusk and the Grok team should do that. They had to reinvent distribution of that product (the @ grok) to start growth, but if it was easier to switch, i’m sure market share would skyrocket

Elliot Bensabat@elliotbst

Is anyone pushing for memory portability across LLMs? More time passes, more lock-in, harder for competitors to compete. OpenAI led early with the chats where memory compounds - workouts, travel, day-to-day life. Anthropic focused more on code and work, which are self-contained. So if Anthropic wants to win on the personal side, pushing for memory portability seems like an important move? cc @DarioAmodei @mikeyk Claude's been better for most of my new projects, but ChatGPT has so much context on my life that switching feels costly.

Elliot Bensabat@elliotbst·2 May

@andrewchen More than understanding what to build, having some across the full product lifecycle (research, design, eng, gtm) is the superpower. You’re still involved in the same decisions, but have unlimited execution power

697

andrew chen@andrewchen·2 May

bullish on the PM role quietly becoming the most important role in tech again. when anyone can build, the person who decides WHAT to build becomes the bottleneck

283

181

2.3K

230.2K

Elliot Bensabat@elliotbst·9 Apr

@LaubRebecca People in SF are about to wonder if you’re AI

213

reblaub@LaubRebecca·9 Apr

I moved from Paris to San Francisco on a whim in 24 hours. here’s the story: last year I killed my second startup and told myself: slow down, think things through, be a normal person 24 hours later, I was on a plane to SF to join a YC hardware startup. lmao. now I joined an insane team building something that will change how people reconnect. more on that soon, so follow me here. I traded: Jardin du Luxembourg for the Golden Gate Bridge, croissants for cold brew, and my quiet Parisian life for a tech bro house. (best part) the new schedule: 9am to 1am, 7 days a week, living with 3 of the smartest and most unhinged people I’ve met at this point, I’m clearly not fixing my impulsivity or my adhd.. I’m just leaning into it if you’re in SF and into growth/consumer, say hii, let’s go on a coffee walk! btw: I’m flying my dog James over next month, hit me with the dog-friendly recs can’t wait to share more soon 🇫🇷🇺🇸

San Francisco, CA 🇺🇸 English

282

35.8K

Elliot Bensabat@elliotbst·18 Mar

@bcherny @trq212 probably a simple feature that would help a bunch of people - please let me link folders/sessions that I'm using with Claude Code locally to a Project I have on Claude web so they can share context.

Elliot Bensabat@elliotbst·1 Mar

@NYCMayor I truly don’t know who’s more retarded between you and @RashidaTlaib - tough to decide

Mayor Zohran Kwame Mamdani@NYCMayor·28 Feb

Today’s military strikes on Iran — carried out by the United States and Israel — mark a catastrophic escalation in an illegal war of aggression. Bombing cities. Killing civilians. Opening a new theater of war. Americans do not want this. They do not want another war in pursuit of regime change. They want relief from the affordability crisis. They want peace. I am focused on making sure that every New Yorker is safe. I have been in contact with our Police Commissioner and emergency management officials. We are taking proactive steps, including increasing coordination across agencies and enhancing patrols of sensitive locations out of an abundance of caution. Additionally, I want to speak directly to Iranian New Yorkers: you are part of the fabric of this city — you are our neighbors, small business owners, students, artists, workers, and community leaders. You will be safe here.

66.4K

60.3K

405.6K

39.6M

Elliot Bensabat@elliotbst·12 Jan

129

Elliot Bensabat@elliotbst·9 Dec

What do you think happens to personal investing in a world with AGI? How do you find the right level of urgency/pressure to optimize for speed while minimizing burnout/keeping burnout low? What investing tip do you think is overlooked? What do you think needs to happen for the majority of private wealth to move from legacy banks to newer players?

Vlad Tenev@vladtenev·8 Dec

Opening up a personal AMA. Want to know how I think about leadership, innovation, or life? Ask away.

610

1.2K

352.4K

Elliot Bensabat@elliotbst·2 Dec

That's actually the best way to solve the problem of people posting fake ARR numbers/charts. Build a feature on Stripe's dashboard to request a tweet from a Stripe bot (e.g. @StripeARR) for any milestone. The tweet will @ the company/founder with the number and date of achievement. The founder can repost that tweet or refer to it as proof, bringing attention & free marketing back to Stripe. cc @patrickc

Stripe@stripe

Congrats @chatbase on crossing $8 million ARR! See the Chatbase blimp fly over Stripe City today: bfcm.stripe.com.

142

Elliot Bensabat@elliotbst·22 Nov

@nikitabier FYI you’re cross-selling your premium plan when a written tweet is over the character limit, but after upgrading you redirect users into an intro to premium flow which deletes the written draft. Frustrating and should be easy to fix

Elliot Bensabat@elliotbst·22 Nov

Card is also the perfect entry point into budgeting/reporting (which should be commoditized but card focused companies are not incentivized to build it) -> which is the perfect entry point into getting people to share their other card/bank data -> which is the best way to improve your CAC/LTV. Bullish

Elliot Bensabat@elliotbst·22 Nov

@vladtenev Robinhood’s credit card is *very* good. Just let me automate cashback settlement -> invest in a specific asset (equity/index/crypto) in robinhood. Would help you compete with card offering bitcoin rewards + help users dollar cost average into any asset

Elliot Bensabat@elliotbst·8 Oct

@rabois @APompliano You could also argue something like this would allow homeowners to sell part of their home ownership while keeping the physical property

Elliot Bensabat@elliotbst·8 Oct

@rabois @APompliano Could in theory tokenize the home so you can continuously trade on its value as opposed to waiting for a liquidity event. If certain homes are tokenized in a market (need liquidity), you could create indexes that would finally let me long Florida with the NYC election coming soon

Anthony Pompliano 🌪@APompliano·8 Oct

This is a somewhat crazy idea, but I believe it would be incredibly popular. $OPEN should create a way for people to wager in a prediction market on the price a home will sell for. Everyone has looked at a listing online and said "that home is overpriced!" or "that house is a steal of a deal!" Prediction markets now make it possible for people to wager on their opinion. Example: someone in your neighborhood lists their home, but you think the price is too high or too low. You should be able to use your market knowledge to express your view in a prediction market. The prediction market would be a very fast feedback signal to the seller on whether they are priced too high or too low. Opendoor's algorithm would get another data point to consider when pricing offers in the future too. Opendoor could drive additional revenue by offering these markets. They could keep the revenue or they could use it as an incentive to buy down the home price for a buyer, offer an incentive to the seller, etc. Additionally, the prediction market on the home's price would create marketing buzz about a home for local media or social media folks. This idea would create free marketing for Opendoor, drive a new revenue stream, improve their pricing algorithm, and help buyers/sellers better understand the true market value of a home. Probably not the most important thing to work on first, but something to consider @nejatian @shrisha @fahdananta @morganb and team

226

1.1K

200.1K

Elliot Bensabat@elliotbst·22 May

@NWischoff Let the games begin

Elliot Bensabat@elliotbst·27 Mar

@nikitabier The difference here feels like the B2B implications. We’ll get used to transforming our photos in infinite ways. But creative teams will reshape how they work and are structured

Elliot Bensabat@elliotbst·27 Mar

@nikitabier For consumer usage, it reminds me of the arrival of AR filters on Snapchat. Felt magical at first, with a super viral launch. Slowly most people got bored of it and were not impressed, but they are used daily by influencers on social media in more subtle ways to look prettier

114

Nikita Bier@nikitabier·27 Mar

What is the longevity of the new Studio Ghibli trend—or more broadly, the trend of adding generative "artistic filters" to our photos? It certainly feels like a fleeting moment, but I'll take the other side of that bet: The closest analogue to what's happening was in 2022: using Stable Diffusion, Lensa launched their generative portraits feature that enabled you to upload a few selfies and get a set of attractive photos of yourself. It lasted about 6 weeks—and printed a cool $30mil in sales. This time feels different. To use a video game metaphor, Lensa was akin to a rail flyer (like Starfox 64) where you could only do one thing—fly forward—and you stopped playing once you reached the end of the game. Prompt-based photo filtering is much closer to an open world flight simulator, where anything can happen and anywhere can be explored. It supports any number of participants or things in the photos and any sort of customization. This has effectively infinite play time. While we will get fatigued by the Ghibli style soon, we will see a number of Instagram-with-AI-filter attempts in the coming weeks. My guess is that the one that resonates the most will have sufficient constraints that ensure a consistency of tone on the feed (similar to Instagram's hipster-grainy filters)—yet has enough latitude for creativity.

196

1.7K

388.3K

Elliot Bensabat@elliotbst·19 Feb

@amiruci Congrats Amir - well done to you and the team!

141

Amir Haghighat@amiruci·19 Feb

Today we announced our $75m series C after growing revenue 6x in a year. But this milestone seemed impossible 3 years ago. This post is mostly about that. Baseten is 5.5 years old. The company truly wasn’t working for the first 3 years. Even back then we were an ML infra company focused on fast and reliable inference. The difference was the kind of models customers were deploying on Baseten: they were mostly predictive models (regressors, classifiers, some NER stuff). And they were mostly used in back-office use cases: fraud prediction, trust and safety, content moderation. The implication was that the customers didn’t *really* care about fast and reliable. The head of ML at a large logistics company told me “if our ML infra goes down for an hour, it’s ok — it comes back up and picks up from where it left off”. Word-for-word quote. It was a sinking feeling: why did we build all this stuff? Right around the same time, in 2022, we saw more of our customers deploy deep learning / BERT models and use them in production. This was exciting for 2 reasons: 1) these models tended to be used in the path of the end users of our customers and therefore they cared about fast and reliable, and 2) we saw a pattern where these models a) were getting more useful, and b) their weights tended to be open, and that one day we’ll have language models that are both useful and open. Back then we had gpt-2 and Flan T5, which were open but not useful enough for any of our customers to use them in production. But we decided to prematurely build for this future. This meant building more product in particular on 2 pillars: optimizations at the model level (today that’s speculative decoding four different ways, disaggregated serving, different attentions, smart k/v cache utilization), and scalable infrastructure (cross-cloud and cross-region horizontal scale, self-hosted, multi-node inference, geo-aware routing to shave off 10s of milliseconds). 
This bet, coupled with the team's execution ended up paying off and getting us to today. Today a dozen foundation model companies trust Baseten for their entire inference stack. So do hundreds of AI companies and a small but fast-growing number of enterprises. But one thing is certain in the AI space today: you can't sit on your laurels. The sands are constantly shifting underneath us. We have to keep making bold bets and being ok with some of them failing. That's the only true moat in our space: intuition coupled with fast execution.

253

26.2K

Elliot Bensabat@elliotbst·18 Feb

@karpathy @IndraVahan Any RL approaches from what you know that have been applied to get humor right? It’s essentially what comedians/funny people do, they try jokes and use crowd reaction as their reward/penalty to adjust an existing joke or just move in a completely different direction

Andrej Karpathy@karpathy·18 Feb

@IndraVahan Great question right? I'd love to know, I don't think I fully understand this either. But considering that no one has (to my knowledge) figured out a way to post-train an LLM to be funny, I am prepared to believe humor is really difficult and requires more underlying capability?

172

1.5K

175.3K

Andrej Karpathy@karpathy·18 Feb

I was given early access to Grok 3 earlier today, making me I think one of the first few who could run a quick vibe check.
Thinking
✅ First, Grok 3 clearly has an around state of the art thinking model ("Think" button) and did great out of the box on my Settlers of Catan question: "Create a board game webpage showing a hex grid, just like in the game Settlers of Catan. Each hex grid is numbered from 1..N, where N is the total number of hex tiles. Make it generic, so one can change the number of "rings" using a slider. For example in Catan the radius is 3 hexes. Single html page please." Few models get this right reliably. The top OpenAI thinking models (e.g. o1-pro, at $200/month) get it too, but all of DeepSeek-R1, Gemini 2.0 Flash Thinking, and Claude do not.
❌ It did not solve my "Emoji mystery" question where I give a smiling face with an attached message hidden inside Unicode variation selectors, even when I give a strong hint on how to decode it in the form of Rust code. The most progress I've seen is from DeepSeek-R1 which once partially decoded the message.
❓ It solved a few tic tac toe boards I gave it with a pretty nice/clean chain of thought (many SOTA models often fail these!). So I upped the difficulty and asked it to generate 3 "tricky" tic tac toe boards, which it failed on (generating nonsense boards / text), but then so did o1 pro.
✅ I uploaded the GPT-2 paper. I asked a bunch of simple lookup questions, all worked great. Then asked to estimate the number of training flops it took to train GPT-2, with no searching. This is tricky because the number of tokens is not spelled out so it has to be partially estimated and partially calculated, stressing all of lookup, knowledge, and math. One example is 40GB of text ~= 40B characters ~= 40B bytes (assume ASCII) ~= 10B tokens (assume ~4 bytes/tok), at ~10 epochs ~= 100B token training run, at 1.5B params and with 2+4=6 flops/param/token, this is 100e9 X 1.5e9 X 6 ~= 1e21 FLOPs. Both Grok 3 and 4o fail this task, but Grok 3 with Thinking solves it great, while o1 pro (GPT thinking model) fails.
I like that the model *will* attempt to solve the Riemann hypothesis when asked to, similar to DeepSeek-R1 but unlike many other models that give up instantly (o1-pro, Claude, Gemini 2.0 Flash Thinking) and simply say that it is a great unsolved problem. I had to stop it eventually because I felt a bit bad for it, but it showed courage and who knows, maybe one day...
The impression overall I got here is that this is somewhere around o1-pro capability, and ahead of DeepSeek-R1, though of course we need actual, real evaluations to look at.
DeepSearch
Very neat offering that seems to combine something along the lines of what OpenAI / Perplexity call "Deep Research", together with thinking. Except instead of "Deep Research" it is "Deep Search" (sigh). Can produce high quality responses to various researchy / lookupy questions you could imagine have answers in articles on the internet, e.g. a few I tried, which I stole from my recent search history on Perplexity, along with how it went:
- ✅ "What's up with the upcoming Apple Launch? Any rumors?"
- ✅ "Why is Palantir stock surging recently?"
- ✅ "White Lotus 3 where was it filmed and is it the same team as Seasons 1 and 2?"
- ✅ "What toothpaste does Bryan Johnson use?"
- ❌ "Singles Inferno Season 4 cast where are they now?"
- ❌ "What speech to text program has Simon Willison mentioned he's using?"
❌ I did find some sharp edges here. E.g. the model doesn't seem to like to reference X as a source by default, though you can explicitly ask it to. A few times I caught it hallucinating URLs that don't exist. A few times it said factual things that I think are incorrect and it didn't provide a citation for it (it probably doesn't exist). E.g. it told me that "Kim Jeong-su is still dating Kim Min-seol" of Singles Inferno Season 4, which surely is totally off, right? And when I asked it to create a report on the major LLM labs and their amount of total funding and estimate of employee count, it listed 12 major labs but not itself (xAI).
The impression I get of DeepSearch is that it's approximately around Perplexity DeepResearch offering (which is great!), but not yet at the level of OpenAI's recently released "Deep Research", which still feels more thorough and reliable (though still nowhere near perfect, e.g. it, too, quite incorrectly excludes xAI as a "major LLM labs" when I tried with it...).
Random LLM "gotcha"s
I tried a few more fun / random LLM gotcha queries I like to try now and then. Gotchas are queries that are specifically on the easy side for humans but on the hard side for LLMs, so I was curious which of them Grok 3 makes progress on.
✅ Grok 3 knows there are 3 "r" in "strawberry", but then it also told me there are only 3 "L" in LOLLAPALOOZA. Turning on Thinking solves this.
✅ Grok 3 told me 9.11 > 9.9. (common with other LLMs too), but again, turning on Thinking solves it.
✅ A few simple puzzles worked ok even without thinking, e.g. *"Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?"*. E.g. GPT4o says 2 (incorrectly).
❌ Sadly the model's sense of humor does not appear to be obviously improved. This is a common LLM issue with humor capability and general mode collapse, famously, e.g. 90% of 1,008 outputs asking ChatGPT for a joke were repetitions of the same 25 jokes. Even when prompted in more detail away from simple pun territory (e.g. give me a standup), I'm not sure that it is state of the art humor. Example generated joke: "*Why did the chicken join a band? Because it had the drumsticks and wanted to be a cluck-star!*". In quick testing, thinking did not help, possibly it made it a bit worse.
❌ Model still appears to be just a bit too overly sensitive to "complex ethical issues", e.g. generated a 1 page essay basically refusing to answer whether it might be ethically justifiable to misgender someone if it meant saving 1 million people from dying.
❌ Simon Willison's "*Generate an SVG of a pelican riding a bicycle*". It stresses the LLM's ability to lay out many elements on a 2D grid, which is very difficult because the LLMs can't "see" like people do, so it's arranging things in the dark, in text. Marking as fail because these pelicans are quite good, but still a bit broken (see image and comparisons). Claude's are best, but imo I suspect they specifically targeted SVG capability during training.
Summary. As far as a quick vibe check over ~2 hours this morning, Grok 3 + Thinking feels somewhere around the state of the art territory of OpenAI's strongest models (o1-pro, $200/month), and slightly better than DeepSeek-R1 and Gemini 2.0 Flash Thinking. Which is quite incredible considering that the team started from scratch ~1 year ago, this timescale to state of the art territory is unprecedented. Do also keep in mind the caveats - the models are stochastic and may give slightly different answers each time, and it is very early, so we'll have to wait for a lot more evaluations over a period of the next few days/weeks. The early LM arena results look quite encouraging indeed. For now, big congrats to the xAI team, they clearly have huge velocity and momentum and I am excited to add Grok 3 to my "LLM council" and hear what it thinks going forward.
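
The GPT-2 training-compute estimate in the post above is a chain of multiplications, so it can be checked in a few lines. This is only a sketch of that back-of-the-envelope arithmetic: the function name and parameter names are made up for illustration, and every input (40 GB of text, ~4 bytes per token, ~10 epochs, 1.5B parameters, 6 FLOPs per parameter per token) is the post's own assumption rather than a measured value.

```python
def estimate_training_flops(dataset_bytes, bytes_per_token, epochs,
                            params, flops_per_param_per_token=6):
    """Rough training-compute estimate (hypothetical helper, not a library API).

    Mirrors the arithmetic in the post: bytes -> tokens -> tokens seen
    over all epochs -> FLOPs.
    """
    tokens_per_epoch = dataset_bytes / bytes_per_token
    total_tokens = tokens_per_epoch * epochs
    return total_tokens * params * flops_per_param_per_token


# Inputs taken from the post; all of them are stated there as rough assumptions.
flops = estimate_training_flops(
    dataset_bytes=40e9,   # ~40 GB of text, assumed ASCII (1 byte per character)
    bytes_per_token=4,    # ~4 bytes per token
    epochs=10,            # ~10 passes over the data
    params=1.5e9,         # 1.5B parameters
)
print(f"{flops:.2e}")  # 9.00e+20, i.e. roughly 1e21 FLOPs
```

The exact product is 9e20, which the post rounds to ~1e21; changing any single assumption (for example the epoch count) scales the result linearly.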

666

2.2K

16.8K

3.7M

Elliot Bensabat@elliotbst·30 Mar

@paulg Classic example of this is Benford’s law. en.m.wikipedia.org/wiki/Benford%2… When faking numbers, people don’t keep the expected distribution of digits which makes it easy to catch them for fraud
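
To make the Benford's law point concrete: the law says a leading digit d appears with probability log10(1 + 1/d), so a 1 leads about 30% of the time and a 9 under 5%. The sketch below compares a dataset's leading digits against that expectation with a plain chi-square statistic. The helper names are hypothetical, and the two sample datasets (powers of 2, and the integers 100 to 999 standing in for "made-up" numbers with evenly spread leading digits) are illustrative assumptions, not real fraud data.

```python
import math
from collections import Counter


def benford_expected(d):
    """Expected share of values whose leading digit is d (1..9) under Benford's law."""
    return math.log10(1 + 1 / d)


def first_digit(x):
    """Leading non-zero digit of x, read off its scientific-notation form."""
    return int(f"{abs(x):.15e}"[0])


def chi_square_vs_benford(values):
    """Chi-square distance between observed leading digits and Benford's law.

    Larger values mean the data looks less Benford-like.
    """
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * benford_expected(d)) ** 2 / (n * benford_expected(d))
        for d in range(1, 10)
    )


# Powers of 2 are a standard example of a sequence that follows Benford's law.
natural_like = [2 ** k for k in range(1, 200)]
# Every leading digit equally common, as a naive faker might produce.
uniform_like = list(range(100, 1000))

print(chi_square_vs_benford(natural_like) < chi_square_vs_benford(uniform_like))  # True
```

A real audit would use a proper significance test and far more care about which datasets are expected to follow the law at all; this only shows why evenly spread leading digits stand out.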

158

Paul Graham@paulg·30 Mar

When you do bad stuff you leave statistical tracks you don't know you're leaving.

Jeremy Nguyen ✍🏼 🚢@JeremyNguyenPhD

Are medical studies being written with ChatGPT? Well, we all know ChatGPT overuses the word "delve". Look below at how often the word 'delve' is used in papers on PubMed (2023 was the first full year of ChatGPT).

115

290

3.6K

557.2K

Elliot Bensabat@elliotbst·11 Mar

@davidmarcus Topped by having the director of a movie about Auschwitz make a pro-palestinian speech about occupation 🤢

366

David Marcus@davidmarcus·11 Mar

Surprised so many people at the Oscars are wearing the red Gaza “ceasefire” pin instead of one demanding the return of the hostages who have been held by Hamas for 155 days, including Americans. If they truly want the former, they should demand the latter forcefully.

193

18.3K
