Abel

594 posts

Abel

@abel__js

exploring robots | prev @microsoft | @MIT Sandbox recipient + patent | vnzlo 🇻🇪

San Francisco, CA Katılım Kasım 2017

387 Takip Edilen180 Takipçiler

Abel@abel__js·4h

@hy3na_xyz @eringriffith there is weakness everywhere haha

English

Hyena@hy3na_xyz·5h

@eringriffith

GIF

QME

747

erin griffith@eringriffith·11h

A detailed and brutal look at the tactics of buzzy AI compliance startup Delve "Delve built a machine designed to make clients complicit without their knowledge, to manufacture plausible deniability while producing exactly the opposite." substack.com/home/post/p-19…

English

1.5K

1.2M

Abel@abel__js·1d

@itsCathyDi @agentmail @adisingh have you seen the @PrimeIntellect hoodie ? both of you got turbo mogged irl

English

265

Cathy Di@itsCathyDi·1d

didn’t think anything could beat our dedalus hoodie but @Agentmail just raised the bar

English

7.7K

Abel@abel__js·1d

is microsoft the only company named after a medical condition? sounds like a diagnosis. macrohard sounds like the cure you would want. @elonmusk

English

Abel@abel__js·2d

@comma_ai how good is the real2sim transfer ?

English

758

comma@comma_ai·2d

Fellow e2e enjoyers, We've got something special for you today. Introducing openpilot 0.11: the first robotics agent fully trained in a learned simulation. Shipped to real users.

GIF

English

898

134K

Abel@abel__js·4d

@nicole_clash is part of the training data now kek right @grok

English

Nic0le@nicole_clash·4d

AI can’t generate this yet so I had to generate it myself

English

1.7K

Abel@abel__js·6d

@ManikantaChavv7 @saafwater @theresidency wow saaf water

Indonesia

Manikanta Chavvakula@ManikantaChavv7·6d

Moved to SF in Jan! We just closed a $600K order last week at @saafwater. In the same few weeks, we grew from monitoring 560M → 670M litres of water daily. Our deployments paid for themselves in 3 months!! Presenting it all in a few hours at @theresidency Demo Day in SF. Who's coming? 🚀 @HrishikeshMB @SanketMarathe09 @jehunix

English

101

9.2K

Abel@abel__js·13 Mar

@rasmus_up I think there are studies here has to a lot to do with perplexity and burstiness

English

Rasmus@rasmus_up·12 Mar

Someone should make a study analyzing what makes a piece of text seem like AI written. My theory: the more it correlates with feeling of cringe, the more likely it is to be AI

English

149

Abel@abel__js·4 Mar

@aidaxbaradari @be_inaudible wow congrats !

English

Aida Baradari@aidaxbaradari·3 Mar

Today, we're introducing Spectre I, the first smart device to stop unwanted audio recordings. We live in a world of always-on listening devices. Smart devices and AI dominate our world in business and private conversations. With Deveillance, you will @be_inaudible.

English

1.1K

42.5K

4.4M

Abel retweetledi

Mandeep@themandeepc·3 Mar

I think folks are being misled by "high performance" on browser use "benchmarks". It's not appreciated enough just how different they are to LLM benchmarks, and why they're difficult to do right and currently extremely flawed. LLM benchmarks are "closed world": the model generates text, and you verify it against some fixed ground truth that doesn't change. Even 'hard' benchmarks like Humanity's Last Exam fit this pattern. The benchmark dataset fully defines the expected inputs, outputs, and validation function. Browser use benchmarks, however, are fundamentally different because they're not closed world. "Actions" - things that change state on a website - are especially difficult. You can't go around willy nilly and mutate state on Twitter, Salesforce, etc, every time you run the evals. That especially applies to the websites we care about: internal enterprise software being the most obvious category. Even data retrieval can be difficult: websites and data change. Restaurant availability changes every hour, flight availability/prices change even faster. It's _slightly_ easier than actions since you can cache the HTML and make it closed world, as some benchmarks do, but this doesn't work for actions, and ages badly. Other benchmarks get around this by trying to fix the date of a check ("find me flights on 1 March 2024"). Ofc that trick doesn't work for most tasks (like that flights example - you can't view historical flight availability). Then there's CAPTCHAs, which exist on basically every high-value web task (even if hidden). Current benchmarks exclude all these 'inconvenient' tasks, which massively skews them to be totally unrepresentative of how humans use websites. Pure computer use have it easier because they're often closed world: the start and desired end state can be well-defined and evaluated inside a network-less container. Updating an Excel sheet has no harm (which tbf represents a lot of economic work). But once you're doing things in a browser, on websites over the internet, this nice property doesn't apply anymore. WebArena's answer to this conundrum was to create 'fake' websites that were supposed to be representative of real ones. The problem is, they're not. OSWorld makes it kinda closed world by providing cached versions of HTML, but this only really works for data retrieval. They're also very unrepresentative. WebVoyager is especially egregious: just 15 (!!) websites are represented, and the tasks are ridiculously easy. Take a look yourself: github.com/MinorJerry/Web… So, how does this translate to the claims made by browser startups? Well, WebVoyager (the extremely easy one) is the benchmark the avg browser startup reports 85%+ accuracy on. Claude's performance is reported for computer use, and against OSWorld which is dominated by closed-world tasks. So really, high reported accuracies should be taken with a huge grain of salt, and there's still a long way to go before computer use is solved. That said, there's at least one other team thinking about these problems (@yutori_ai, with their release of Navi-Bench). From first principles, this is a really tricky problem to solve. The infra and data to properly benchmark web agent performance is extremely nascent and underdeveloped. It's a problem we think a lot about at Indices -- please reach out (DM) if you do too!

English

7.6K

Abel@abel__js·27 Şub

@sama can I’ve .001%

English

Sam Altman@sama·27 Şub

We have raised a $110 billion round of funding from Amazon, NVIDIA, and SoftBank. We are grateful for the support from our partners, and have a lot of work to do to bring you the tools you deserve.

English

4.2K

2.6K

39.5K

8.9M

Abel@abel__js·25 Şub

@dezzie_rezzie loooooooool

Destiny Rezendes@dezzie_rezzie·24 Şub

twitter.com/i/spaces/1nGnR…

ZXX

353

148

12.2K

Abel@abel__js·24 Şub

maximus decimus meridius

Noah Zweben@noahzweben

Announcing a new Claude Code feature: Remote Control. It's rolling out now to Max users in research preview. Try it with /remote-control Start local sessions from the terminal, then continue them from your phone. Take a walk, see the sun, walk your dog without losing your flow.

143

Abel@abel__js·23 Şub

@AnthropicAI everyone knows this

English

Anthropic@AnthropicAI·23 Şub

We’ve identified industrial-scale distillation attacks on our models by DeepSeek, Moonshot AI, and MiniMax. These labs created over 24,000 fraudulent accounts and generated over 16 million exchanges with Claude, extracting its capabilities to train and improve their own models.

English

7.3K

6.3K

55.1K

33.6M

Abel@abel__js·23 Şub

@PalmerLuckey call them black whales

English

422

Palmer Luckey@PalmerLuckey·23 Şub

ZXX

324

292

8.3K

267.1K

Abel@abel__js·21 Şub

muuu I love cows

Sama Hoole@SamaHoole

>Be aurochs, 10,000 BC >Roam the steppes of Anatolia in vast herds >Humans approach. Nervous. >They don't kill you. They bring you grass. >Something strange begins. >Be domesticated cattle, 8,000 BC >Humans shelter you from wolves >You provide milk for their children >Partnership forms. Mutual. Ancient. >Your kind spreads across every continent with theirs. >Be cattle, 3,000 BC >Pull the plough that breaks the first agricultural soil >Humans couldn't have done it without you >Mesopotamia feeds thousands because of your shoulders >Civilisation, technically, runs on ox power >Be cattle, Roman Empire >Armies march on leather boots you provided >Shields made from your hide >Legions fed on your meat >Rome, technically, runs on you >Be cattle, Medieval Britain >The peasant's only source of winter protein >Your tallow lights the candles they read by >Your manure feeds the fields that feed everyone >The feudal economy runs on you >Be cattle, 1800s >Power the industrial revolution's early leather belts and drive shafts >Provide the tallow that lubricates every machine >Your bones make the fertiliser that feeds the growing cities >The industrial world, technically, still runs on you >Be cattle, 1950s >Scientists discover your organs saved millions during wartime >Insulin extracted from your pancreas keeps diabetics alive >Gelatin from your bones holds medicines together >Modern healthcare, technically, runs on you too >Be cattle, 2026 >The world decides your breath is a global threat >The "solution" is a lab-grown burger in a plastic box >The plastic is made from the oil that replaced your tallow >The soil turns to dust because your hooves don't stir it >The humans forget who pulled them out of the mud >You aren't the engine of progress anymore; you're the exhaust. >Be confused.

English

Abel@abel__js·21 Şub

@bluewmist nothing stop buying things you don’t need and it will change your life

English

509

blue@bluewmist·21 Şub

What life changing item can you buy for less than $100?

English

146

164

129.6K

Abel@abel__js·21 Şub

@itanmaymaliwal @bluewmist yikes

English

Tanmay@itanmaymaliwal·21 Şub

@bluewmist A $20 book that changes how you think. Most life upgrades aren’t expensive they’re just applied consistently.

English

4.7K

Abel@abel__js·21 Şub

@Hamzeml do you have a Rust .md file 👀🦀 that you would recommend ?

English

Hamzé 🦀@Hamzeml·20 Şub

Stop Feeding Your Rust Code Raw Primitives (It Deserves Better), by @Hamzeml open.substack.com/pub/hghalebi/p…

English

246

Abel@abel__js·21 Şub

@METR_Evals why is the variance so high ?

English

3.9K

METR@METR_Evals·21 Şub

We estimate that GPT-5.3-Codex with reasoning effort `high` (not `xhigh`) has a 50%-time-horizon of around 6.5 hours (95% CI of 3 hrs to 17 hrs) on our suite of software tasks. OpenAI provided API access for this evaluation.

English

880

180.1K

Abel@abel__js·21 Şub

@brian_lovin yeah I’m on a Mac M2 with 16GB of RAM and it crashes my Mac half of the sessions

English

Brian Lovin@brian_lovin·21 Şub

This looks amazing but The #1 thing holding me back from using the Claude Desktop app for more things is performance. It's so slow and buggy. I'm on an M3 Max with 96GB of RAM and it drops frames when switching between the Chat and Code tabs...

Claude@claudeai

Claude Code on desktop can now preview your running apps, review your code, and handle CI failures and PRs in the background. Here’s what's new:

English

127

737

109.3K

Keşfet

@hy3na_xyz @eringriffith @itsCathyDi @agentmail @adisingh @PrimeIntellect @elonmusk @comma_ai