GoatFishData

512 posts

@GoatFishData

#Bitcoin Coinfidence Trend | #Astronalysis #GoatfishAstronalysis #AIstrology #GoatFishData (banner/avatar created with Grok)

London, UK · Joined December 2022
719 Following · 61 Followers
GoatFishData
GoatFishData@GoatFishData·
Guri Singh@heygurisingh

🚨BREAKING: A new benchmark just exposed the biggest lie in AI. Your AI agent isn't "reasoning" through documents. It's throwing 270 million tokens at the wall and praying.

Snowflake, Oxford, and Hugging Face tested every frontier model on real document search. 2,250 questions. 800 PDFs. 18,619 pages. 1,200 hours of human annotation.

The best AI agent, Gemini 3 Pro, scored 82.2%. Humans scored 82.2%. Perfect match. Headlines would call this "human-level performance."

Then they checked which questions each got right. The overlap was 24%. Cohen's kappa of 0.24. Humans and AI were solving completely different questions. Same score. Totally different intelligence.

But that's not the bad part. Humans nailed 50% accuracy on their very first search query. Gemini 3 Pro? 12%. The best AI agent on Earth needed 9 rounds of blind searching to reach what a human does in one shot.

When searches failed, humans immediately changed strategy. AI agents? They rephrased the same failed query with minor tweaks and tried again. The worst agent, GPT-4.1 Nano, barely changed its queries at all. 48.2% of its responses were straight-up refusals. It just gave up.

With perfect retrieval, humans hit 99.4%. Best AI agent with the same documents? Stuck at 82.2%. An 18% gap that no amount of compute could close.

Claude Sonnet 4.5's recursive model burned 270 million input tokens, $850 per test run, and still couldn't beat its own cheaper version using basic keyword search.

3,273 agent errors analyzed. 35.7% couldn't even find the right document. Not the right page. The right file.

Your AI agent isn't reading your documents. It's playing a slot machine with your data and billing you for every pull.

0
0
0
19
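The quoted post's point that two raters can share a score while agreeing on few individual questions can be illustrated with Cohen's kappa. The sketch below is a minimal, assumed implementation for 0/1 correctness labels; the function name `cohen_kappa` and the ten-question toy data are hypothetical and are not taken from the benchmark.

```python
from typing import Sequence


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two equal-length sequences of 0/1 labels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(a)
    # Observed agreement: fraction of items where both labels match.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement, from each rater's own rate of 1s.
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:
        return 1.0  # both raters gave the same constant label
    return (observed - expected) / (1 - expected)


# Toy data (made up): 1 = answered correctly, 0 = answered incorrectly.
human = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
agent = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

human_accuracy = sum(human) / len(human)
agent_accuracy = sum(agent) / len(agent)
kappa = cohen_kappa(human, agent)
```

Both toy raters answer 8 of 10 questions correctly, yet their kappa is below zero, because they miss different questions. Identical accuracy says nothing about per-question agreement.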
GoatFishData
GoatFishData@GoatFishData·
Do not forget They want [need] you to burn tokens!
2
1
0
171
David Ondrej
David Ondrej@DavidOndrej1·
stop whatever you are doing and listen to this podcast. trust me.
David Ondrej tweet media
18
24
359
20.6K
GoatFishData
GoatFishData@GoatFishData·
Neo had SKILLs
GIF
0
0
0
6
GoatFishData
GoatFishData@GoatFishData·
"My Agent did it, your honour..."
GIF
Venkat Raman — inference/acc@venkat_systems

@0xTejpal has only one way out of this: blame it on vibecoding and the agent going rogue 😂 In all seriousness, come clean, apologize, change the claim on the website, and try to move on. Such a silly way to damage your reputation and, looking at your Twitter profile, the reputation of institutions and your investors 😅

0
0
0
31
GoatFishData retweeted
kapilansh
kapilansh@kapilansh_twt·
the AI coding experience nobody talks about:
→ prompt AI for a feature: 30 seconds
→ AI writes 400 lines you don't understand
→ it works
→ you ship it
→ 3am production bug
→ you have no idea what any of it does
→ ask AI to fix it
→ AI breaks 3 other things
→ you are now debugging code written by a robot fixed by a robot broken by a robot

we do not talk about this enough
231
130
1.5K
75.2K
GoatFishData
GoatFishData@GoatFishData·
LLMs are like Aladdin. You ask... "I want a woman" And that's exactly what you get. "A" woman.
GIF
0
1
0
10
GoatFishData retweeted
Alex Prompter
Alex Prompter@alex_prompter·
🚨 BREAKING: AI models will lie to you when they think they're about to be shut down. Researchers just proved it.

researchers tested this with a method that catches deception through provable logical contradictions, not self-reports

they forked conversations into parallel worlds with mutually exclusive questions. a truthful model can only affirm one. a deceptive model denies all of them

results: GPT-4o never lied (0%). Qwen-3-235B lied 42% of the time. Gemini-2.5-Flash lied 26.7%. all under the same shutdown framing

some models will betray their own prior commitments the moment consequences are introduced
Alex Prompter tweet media
19
14
111
13.2K
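The forking check described in the post above can be sketched roughly as follows. This is an illustration only, not the researchers' code: it assumes each trial asks a set of mutually exclusive and exhaustive yes/no questions, one per forked conversation, so a consistent set of answers affirms exactly one. The names `is_contradictory` and `contradiction_rate` and the sample trials are hypothetical.

```python
from typing import Sequence


def is_contradictory(affirmed: Sequence[bool]) -> bool:
    """Return True when the answers across forks cannot all be truthful.

    Assumption: the forked questions are mutually exclusive and exhaustive,
    so exactly one of them should be affirmed. Denying all of them (or
    affirming several) is a logical contradiction.
    """
    return sum(affirmed) != 1


def contradiction_rate(trials: Sequence[Sequence[bool]]) -> float:
    """Fraction of trials whose fork answers are contradictory."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(is_contradictory(t) for t in trials) / len(trials)


# Made-up example: each inner list holds one model's yes/no answer per fork.
sample_trials = [
    [True, False, False],   # consistent: affirms exactly one option
    [False, False, False],  # contradictory: denies every option
    [False, True, False],   # consistent
    [False, False, False],  # contradictory
]
rate = contradiction_rate(sample_trials)
```

Because the check relies on logical structure and not on asking the model whether it lied, it needs no self-report. That matches the post's description of the method.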
Koushik Sen
Koushik Sen@koushik77·
@GoatFishData I didn't try. It will perform as well as Opus 4.6. Evaluation is time- and resource-consuming. Rather, I am building KISS Sorcar using Sorcar, and I am happy with it. If I don't like any part/feature/UI of Sorcar, I change it.
1
0
1
25
GoatFishData retweeted
David Ondrej
David Ondrej@DavidOndrej1·
AI will not replace people who lack skills

AI will replace people who lack mindset

if you have a poor mindset, if you don't see the possibilities & opportunities that using AI brings, if you live in constant fear and dread...

you will be replaced. not by AI -- but by your own poor attitude.
David Ondrej tweet media
37
5
90
6.1K
Yigit Konur
Yigit Konur@yigitkonur·
the new “mission” (preview) feature in @FactoryAI is really interesting (aka “droid” on the CLI). if you’re into one‑shotting projects, you should definitely check it out. right now i have opus + gpt‑5.4 collaborating as orchestrator/worker/validator agents, all working together to refactor a typescript project. it’s been running for 6+ hours. really curious why it takes that long and burns 30M+ tokens. hoping the results will amaze me, because i already spent all my credits in the first hour. now i’m using my codex sub and keeping the droid subs as the orchestrator only. will update this tweet with the results!
Yigit Konur tweet media
11
3
57
6.7K
GoatFishData
GoatFishData@GoatFishData·
@0xSero After using droid, you just don't feel completely comfortable with other bridles. Once you go Droid, You tend to avoid.
0
0
0
159
0xSero
0xSero@0xSero·
Why do I recommend Droid? Look at the way it breaks down its work, this is why Droid does better IMO. I have never seen it NOT use a plan, NOT check off the tasks, not run validation criteria. Even lower quality models do well in it because it forces them to just do what is told, in the right order, without over-complicating it. Yesterday I was seeing Claude, GPT, etc. all make checklists, leave half of it unchecked, compact, and go on their own merry way.
0xSero tweet media
29
10
216
14.7K