Pinned Tweet
Sasha Aickin
19.2K posts

Sasha Aickin
@xander76
Sasha Aickin. Ex-CTO @ Redfin. (Former?) documentary filmmaker. Avid cook/book club hoster. He/him. @[email protected]
Joined May 2007
946 Following, 2.5K Followers

Hey @ConEdison, one of my two power mains was cut off in February and you made me hire an electrician to prove it was your issue. We filed a claim to get that money back, but we've never heard back, and support can't help us. Emailed outageclaims@coned as well, no response.

We are now SOC2 Type 2 compliant. Yay!
But also: if you're a startup going for compliance, it's a messy process that's hard to figure out, and I wrote up all the good, bad, and ugly stuff I wish I'd known before I started.
Libretto@getlibretto
We're officially SOC2 Type 2 compliant at Libretto! 🎉 But forget the usual corporate speak—here's an honest look at the weird, messy reality of SOC2 compliance at a startup. Check out what we learned the hard way: #StartupLife #SOC2 #RealTalk libretto.ai/blog/what-i-wi…
Sasha Aickin retweeted

We're officially SOC2 Type 2 compliant at Libretto! 🎉
But forget the usual corporate speak—here's an honest look at the weird, messy reality of SOC2 compliance at a startup. Check out what we learned the hard way: #StartupLife #SOC2 #RealTalk
libretto.ai/blog/what-i-wi…
Sasha Aickin retweeted

@jxnlco To be fair, this is basically the worst it's been in the last 3 years.

This was fun. We found that GPT-4o just started giving different answers on Monday, without any change in our code. Learn more about model drift and why it's an issue with LLM development here: libretto.ai/blog/yes-ai-mo…
Libretto@getlibretto
1/ Well, this was kind of wild: we caught GPT-4o changing underneath us.

@Max_Fisher @chrislhayes Nate Silver has D+1 nationally, it’s worth noting.
Sasha Aickin retweeted

@conorsen That's also how I remember it. All of them posted vague tweets about how the final result was pretty clear, starting at some point on Wednesday (I think reasonably early on Wednesday).

This reminds me of the last time I voted in Montana: close to the end of Election Day, there was no one in line, and as soon as I said “Jeremiah,” multiple poll workers immediately cheered “the last Baumann!!”
Mara Gay@MaraGay
A poll worker just yelled out, “first time voter!” and everybody in the Brooklyn polling site cheered

Excited to open-source a new hallucinations eval called SimpleQA! For a while it felt like there was no great benchmark for factuality, and so we created an eval that was simple, reliable, and easy-to-use for researchers. Main features of SimpleQA:
1. Very simple setup: there are 4k diverse fact-seeking questions written by humans where there can only be a single, indisputable answer. Model completions are graded by an autograder as either correct, incorrect, or not attempted.
2. We created it so that it would be challenging for the current class of frontier models; both o1-preview and Claude Sonnet 3.5 are below 50% accuracy.
3. Reference answers have high correctness. Questions are written to be non-ambiguous and reference answers were verified by two independent annotators. Questions are also written to be timeless, so SimpleQA can be a useful benchmark even 5 or 10 years from now.
The way that I think about evals is that they are an incentive for the AI community. New benchmarks in AI get saturated very quickly, and what they incentivize gets encoded into the next generation of language models. With a good hallucinations eval, hopefully the next wave of language models will be more trustworthy and reliable!


@_jasonwei Read through the paper. Nice work! One thing that was maybe a little concerning, though, was that one of your example questions seems to have more than one answer. It seems that Akiko Kumahira was known as Akiko Kumahira Comrie or Akiko Comrie after her marriage.

@_jasonwei This is really neat! I'm curious, are you open sourcing the actual question set, or just the eval code? I tried to find the questions and it looks like it's downloading them from a private URL. (But maybe I'm just misunderstanding!)