Roma

4.6K posts

@ComplexiaSC

Vell, Zaphod’s just zis guy, you know? The best place to chat and code with AI: https://t.co/5jYBq11Ft9 Blog: https://t.co/VF1EM6GQgz

Joined March 2011
1.4K Following, 380 Followers
Roma @ComplexiaSC
Time to build
Roma tweet media
0
0
0
4
Roma @ComplexiaSC
@elonmusk @grok make a Borat caption meme where he says Grok Voice Number 1 in that iconic way of his
1
0
1
34
Elon Musk @elonmusk
Grok Voice is #1!
Artificial Analysis @ArtificialAnlys

Announcing agentic performance benchmarking for Speech to Speech models on Artificial Analysis. We use 𝜏-Voice to measure tool calling and customer interaction voice agent capabilities in realistic customer service scenarios.

Even the strongest Speech to Speech (S2S) models today resolve only about half of realistic customer service scenarios end-to-end - a meaningful gap relative to frontier text-based agents on the same tasks. Voice channels introduce significant complexity: challenging accents, background noise, and packet loss, all while requiring fast responses, consistency across long multi-turn conversations, and reliable tool use.

Performance also varies considerably by audio condition: in clean audio some models perform notably better, but realistic conditions continue to pose a challenge. Conversation duration also varies meaningfully across models, with implications for both customer experience and operational cost.

About 𝜏-Voice: Our Agentic Performance benchmark is based on 𝜏-Voice (Ray, Dhandhania, Barres & Narasimhan, 2026), which extends 𝜏²-bench into the voice modality to evaluate S2S models on realistic customer service tasks. It measures multi-turn instruction following, support of a simulated customer through a complete interaction, and tool use against simulated customer service systems. The simulated user combines an LLM-driven decision model with realistic audio synthesis: diverse accents, background noise, and packet loss modelled on real network conditions. This complements our Big Bench Audio benchmark measuring intelligence and Conversational Dynamics (Full Duplex Bench subset) benchmark measuring conversational naturalness. Scores are the average of three independent pass@1 trials.

We evaluate under realistic audio conditions using the 𝜏²-bench base task split across three domains:
➤ Airline (50 scenarios): e.g., changing a flight, rebooking under policy constraints
➤ Retail (114 scenarios): e.g., disputing a charge, processing a return
➤ Telecom (114 scenarios): e.g., resolving a billing issue, troubleshooting a service problem

Task success is determined by deterministic checks against expected actions and final database state, consistent with the 𝜏²-bench evaluator.

Key results: xAI's Grok Voice Think Fast 1.0 is the clear leader at 52.1%, averaging 5.6 minutes per conversation, the second-longest overall. OpenAI's GPT-Realtime-2 (High) (39.8%, 3.0 min) and GPT-Realtime-1.5 (38.8%, 4.8 min) follow, with Gemini 3.1 Flash Live Preview - High close behind at 37.7% (3.8 min).

Speech to Speech is a fast evolving modality and we expect movement in rankings as we continue to add new models with these capabilities, and model robustness improves.

Congratulations @xAI @elonmusk! See below for further detail ⬇️

2.4K
5.7K
25.5K
8.3M
Theo - t3.gg @theo
I can't help but feel personally burned by the Claude Code changes announced today. We put so much work into wrapping the (atrocious) Claude Agent SDK in T3 Code. It was the ONLY path they supported, so we made it work. It was hell. Now our users are getting their rate limits cut by 40x, despite us doing everything right. I listened to the Claude Code team. I had my issues with their direction, but I trusted them and took them at their word. I will never make that mistake again. Until we see significant change, it is safe to assume any statement from an Anthropic employee is a lie on a timer. The rug will be pulled, no matter how many promises are made beforehand.
401
298
8.3K
1.5M
Roma @ComplexiaSC
@KorduGG @theo @durangocode That's too many for me. More than 4 and I lose track of what's happening where. Cmux kinda solves it, but not really. I need the actual thread tracking, like in T3 Code.
1
0
1
25
Kordu @KorduGG
@ComplexiaSC @theo @durangocode Yeah, I mean, to be fair, I just, I haven't really found a usage for using much anything but just open the CLI, quickly get what I need. I usually just have like ten CLIs open, but you know, each to their own.
1
0
0
5
Roma @ComplexiaSC
@theo I cancelled 2 months ago. I do most of my work with Codex anyway. I decided to swap my Claude Code sub to a Cursor sub. It's more than enough usage for the few UI tasks I need Claude for.
0
0
1
480
Roma @ComplexiaSC
@KorduGG @theo You can have it embedded in T3 Code or something like @durangocode where you have terminals on a canvas. Point is, you can still use Claude Code and other actually good things like Codex and Cursor from a single interface without having to switch apps.
1
0
1
44
Kordu @KorduGG
@ComplexiaSC @theo I mean, at that point, honestly, it's just going ahead and using the CLI is just normally gonna be better, I think. Not really a point to use any external apps for it anymore, I guess. Kind of a load of garbage and shit.
1
0
0
15
HangryBear @TechmanConway
@GigaBasedDad Ozark, aside from all the mafia shit that happens. Other than that he foots the bill.
2
0
5
325
Giga Based Dad @GigaBasedDad
Can you name a single one of them?
Giga Based Dad tweet media
99
17
201
19K
Roma @ComplexiaSC
@leerob Can I just come do it? I want to build you a better Github. There is literally zero need to have your codebase outside of Cursor anymore. It just makes so much sense.
0
0
1
283
Lee Robinson @leerob
If you're an engineer who loves to push the limits of the latest AI models and coding agents... You should come work with me! There's so much to learn, build, and teach.
Lee Robinson tweet media
67
42
965
67.5K
Roma @ComplexiaSC
@MarieIsabellaB Get pet insurance bro. $60 a month for peace of mind.
1
0
1
182
Marie Isabella @MarieIsabellaB
Look at them, just living life like they pay their own vet bills
75
246
5.4K
439K
Roma @ComplexiaSC
@beffjezos I reckon Elon buys out Anthropic by end of year and guts it into efficiency just like he did with Twitter.
0
0
0
95
Sean Strickland @SStricklandMMA
@ChampRDS But did I say that? I said if he made true on his claims to jump me with all his buddies, yes... Which would be a lawful act of self defense because I don't have to fight you and your 30 friends.
213
180
9K
133.3K
Roma @ComplexiaSC
What are the chances SpaceX buys Anthropic outright?
0
0
0
15
Roma reposted
Durango @durangocode
Grok 4.3 now generally available in Durango Chat!
0
2
3
39
Roma @ComplexiaSC
Have you already seen this @levelsio? Looks like @OpenRouter heard you and got you covered.
Roma tweet media
0
0
1
36
Roma @ComplexiaSC
@elonmusk @nottombrown Can you tell Anthropic to stop their bullshit billing API rates to people who have keywords they don’t like in prompts or even git history? I can’t in good conscience use Claude if I don’t know how they will bill me from one request to the next.
1
0
4
471
Elon Musk @elonmusk
Same here. By way of background for those who care, I spent a lot of time last week with senior members of the Anthropic team to understand what they do to ensure Claude is good for humanity and was impressed. Everyone I met was highly competent and cared a great deal about doing the right thing. No one set off my evil detector. So long as they engage in critical self-examination, Claude will probably be good. After that, I was ok leasing Colossus 1 to Anthropic, as SpaceXAI had already moved training to Colossus 2.
1.4K
2.2K
27.7K
3.1M
Tom Brown @nottombrown
In the next few days we'll be ramping up Claude inference on Colossus. Grateful to be partnering with SpaceX here. We are going to need to move a lot of atoms in order to keep up with AI demand, and there's nobody better at quickly moving atoms (on or off planet Earth)
111
322
7.4K
585.6K
Pete McCarthy @PMccarthy26071
@ComplexiaSC @BigP4P4Smurf @thedimitri The interest payments on the loan are peanuts compared to your income. You're effectively stalling repaying the loans until you die and pass on the payments to your kids. Your taxes next to theirs you'd see you pay more. They don't have income. They have assets like stock options
1
0
0
16