Eoghan Flanagan

5.7K posts

Eoghan Flanagan

@KateandPie

London, England Katılım Mart 2010

673 Takip Edilen189 Takipçiler

Sabitlenmiş Tweet

Eoghan Flanagan@KateandPie·15 Şub

ZXX

2.1K

Eoghan Flanagan@KateandPie·2d

@allTheYud The AI shibbpleth: "Say now Schmidhuber"

English

Eliezer Yudkowsky@allTheYud·3d

Like Jürgen Schmidhuber! People really have no idea how much Schmidhuber personally invented. Tragic, really. Nobody would dream of blaming *me* for accelerating progress in deep learning if they were properly acquainted with the accomplishments of Jürgen Schmidhuber.

English

157

6.3K

Eoghan Flanagan@KateandPie·2d

@DarkDementus @Pamela_Vista That doesn't mean they're incorrect.

English

Eoghan Flanagan@KateandPie·3d

@EMostaque Hi Emad, I think your analysis accepts a premise of the OpenAI piece that it should not - the eliding of the United States' situation with the rest of the world. Europe doesn't have any frontier models so it can't even "tax the AI". Same with UK.

English

123

Emad@EMostaque·4d

x.com/i/article/2041…

ZXX

200

91K

Eoghan Flanagan@KateandPie·3d

@dioscuri That's a real Libran take

English

210

Henry Shevlin@dioscuri·4d

Confession: I’m mildly astrology sympathetic. Obviously it has no predictive validity, but an ex who was super into it explained how it’s a great way of laundering insights in socially appropriate ways, e.g., “she’s such a Scorpio” when what you mean is “she’s a complete bitch”

Deivon Drago@DeivonDrago

This is a bad take. Astrology ought to be a non starter, unless you are being facetious about it. Taking astrology seriously suggests that you are epistemically broken. Possibly scientifically illiterate.

English

101

112

3.4K

245.4K

Eoghan Flanagan@KateandPie·4d

@MillionInt Progenitorzy nowego wyścigu Rodzice legionu Adam i Ewa automatyzacji

Polski

Jerry Tworek@MillionInt·4d

Progenitors of a new race Parents of a legion Adam and Eve of automation Corey, Otto, Myrtle, Dewey and North

English

138

11.4K

Eoghan Flanagan@KateandPie·4d

@charles_irl Hey Charles. Modal customer here. Can you point me to a demo of this?

English

Charles 🎉 Frye@charles_irl·4d

it took me fifteen years to write this tweet -- ~365 * 24 * 60 * 60 * 1000 times higher latency than the lag between your brain and your finger the technology was worth the wait

Modal@modal

The future of artificial intelligence is physical. @physical_int runs robotic control inference on Modal with >2x lower latency than the lag between your brain and your finger.

English

8.1K

Eoghan Flanagan@KateandPie·4d

@patrickc Amazing what 15 billion dollars buys these days

English

237

Patrick Collison@patrickc·4d

Congratulations to Alex and the whole team at MSL. As a sucker for all things speedy (patrickcollison.com/fast), I thought this was an impressive chart:

Alexandr Wang@alexandr_wang

1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵

English

1.7K

762.9K

Eoghan Flanagan@KateandPie·6d

@rahimjina What

English

Rahim Jina@rahimjina·6d

SPACE NUTELLA!

Italiano

131

Eoghan Flanagan@KateandPie·6d

@allTheYud Eliezer Yudkowsky is my favourite historian.

154

Eoghan Flanagan@KateandPie·6d

@DoctaDG @Teknium Something tells me you have a non-normies definition of "normirs"

English

867

Darryl@DoctaDG·6d

OpenClaw development is moving too fast. They're adding features that should be skills. I've now onboarded 10+ "normie" friends to ai agents, and every single one of them prefers Hermes. Great work @Teknium

English

168

17.9K

Eoghan Flanagan@KateandPie·6 Nis

@rachelvscott Ask a vague question. Get a vague answer. Congrats.

English

461

Rachel Scott@rachelvscott·5 Nis

Spoke with President Trump. He told me the conflict should be over in days, not weeks but if no deal is made he’s blowing up the whole country with “very little” off the table. "If happens, it happens. And if it doesn't, we're blowing up the whole country,” he said. I asked if there’s anything off limits. “Very little,” he said.

English

2.5K

2.8K

8.6K

4.9M

Eoghan Flanagan@KateandPie·6 Nis

@rryssf_ The WTL:DR?

English

Robert Youssef@rryssf_·4 Nis

Holy shit. UNC just let an AI run 50 experiments autonomously for 72 hours and it built a memory system that beats every human-designed baseline. +411% improvement on long-context benchmarks. The biggest gains weren't from tuning parameters they came from fixing bugs and redesigning architecture that humans missed entirely. > The experiment started with a simple text-only memory system scoring F1 = 0.117 on LoCoMo, a benchmark that tests whether AI agents can recall and reason over months of multi-session conversations. UNC gave an autonomous research pipeline called AutoResearchClaw three things: the codebase, two benchmark evaluation harnesses, and API access to LLMs. > No human touched the inner loop again. The pipeline ran for 72 hours, executed 50 experiments, diagnosed its own failures, rewrote its own architecture, and ended at F1 = 0.598 beating every human-designed memory system ever published on that benchmark. The previous state of the art was 0.432. > The most important finding is what drove the gains. Traditional AutoML searches hyperparameters: learning rates, batch sizes, temperature values. > Those contributed almost nothing here. The three categories that actually moved the needle were bug fixes (+175%), architectural redesign (+44%), and prompt engineering (+188% on specific categories). Each of those individually exceeded the cumulative contribution of all hyperparameter tuning combined. This is the finding that should change how the field thinks about automated research: the valuable improvements require code comprehension, failure diagnosis, and cross-component reasoning capabilities that live entirely outside what traditional AutoML can do. > The single most impactful discovery came in iteration 1. The pipeline found that an API call was missing a response_format parameter. One line of code. Without it, the model produced verbose natural-language answers instead of structured JSON, and the verbosity destroyed F1 precision. > Fix: +175% improvement in a single step. In iteration 5, the pipeline discovered that all 4,277 stored memory timestamps had been corrupted to the ingestion date rather than the actual conversation date. It autonomously wrote a keyword-matching repair script that corrected 99.98% of them without re-ingesting any data. These are not the kinds of failures a hyperparameter search finds. They require reading code, understanding what it does, and diagnosing why the output is wrong. The full optimization trajectory across both benchmarks: → LoCoMo starting F1: 0.117 naïve baseline, text-only memory → Iteration 1: missing response_format parameter found and fixed F1 jumps to 0.322, +175% → Iteration 2: pipeline discovers set-union merging of dense and sparse search beats score-based re-ranking F1 to 0.464, +44% → Iteration 3: anti-hallucination prompting added F1 to 0.516, +11% → Iteration 5: 4,277 corrupted timestamps autonomously repaired F1 to 0.580, +7% → Iterations 8 and 9: two failed experiments automatically detected and reverted → Final LoCoMo F1: 0.598 +411% from baseline, beats SimpleMem SOTA of 0.432 → Mem-Gallery starting F1: 0.254 → Phase 2 breakthrough: pipeline discovers returning full original dialogue text outperforms LLM-generated summaries counterintuitive, since summaries are the standard approach F1 jumps to 0.690, +96% in one phase → Phase 3: pipeline finds that prompt constraint positioning before vs. after the question matters more than constraint content one category improves +188% from repositioning alone → Phase 5: BM25 tokenization fix stripping punctuation from "sushi." to "sushi" yields +0.018 F1, more than 10 rounds of prompt engineering combined → Final Mem-Gallery F1: 0.797 +214% from baseline, beats MuRAG SOTA of 0.697 → Total wall-clock time: 72 hours equivalent to approximately 4 weeks of human researcher time at 3 experiments per day → Throughput with 8 parallel workers: 5.81 queries per second 3.5x faster than the fastest human-designed baseline > The architecture the pipeline designed is called OMNIMEM and it has three principles that no human researcher had combined before. Selective ingestion: before anything enters memory, lightweight encoders measure novelty and discard redundant content CLIP embeddings detect scene changes across video frames, voice activity detection rejects silence, Jaccard overlap filters near-duplicate text. Only novel information gets stored. Multimodal Atomic Units: every memory regardless of modality gets stored as a compact metadata record with a pointer to raw content in cold storage fast search over small summaries, lazy loading of large assets only when needed. Progressive retrieval: instead of loading all retrieved content at once, the system expands information in three stages gated by a token budget summaries first, then full text for high-confidence matches, then raw images and audio only when necessary. > The hybrid search discovery is the one that should make every RAG builder pay attention. Standard practice is to combine dense vector search and sparse keyword search by re-ranking their results together using a blended score. The pipeline tested this and found it degrades performance. The reason: score-based re-ranking disrupts the semantic ordering that dense retrieval already established. The fix the pipeline discovered autonomously is set-union merging dense results keep their original ranking, BM25-only results get appended at the end. No re-ranking. No blended scores. Just union. This simple change contributed +44% in a single iteration and was confirmed by ablation: removing BM25 hybrid search costs -14% F1, the second-largest component contribution after pyramid retrieval at -17%. > The capability threshold is what makes this alarming rather than just impressive. AutoML has existed for decades. It searches hyperparameters efficiently. It finds nothing here because the real gains require understanding why a system is failing reading stack traces, tracing data corruption through a pipeline, recognizing that a missing parameter is causing 9x verbosity, writing a repair script for corrupted timestamps. These are software engineering tasks that require comprehension, not optimization. The pipeline completed them without human input. The previous state of the art on both benchmarks was built by human researchers over months of manual iteration. The pipeline beat it in 72 hours. The AI researcher ran the experiment. The AI researcher fixed the bugs. The AI researcher beat the humans.

English

535

33.9K

Eoghan Flanagan retweetledi

Ajeya Cotra@ajeya_cotra·3 Nis

Basic and extremely important point: it would cost $1000s to pay a human to do a 10h task. But with the default way we measure AI performance, we spend like $10 running the AI agent. The AI gets much less time than the human. This can massively underestimate capabilities!

Joel Becker@joel_bkr

new blog post from me and nate rush: "Many Benchmarks Scores Would Appear Much Higher If You Let The AIs Use Adequate Labor" if humans given 10 minutes to do a 1-hour task with a hand tied behind their back scored 50%, you wouldn't say "humans score 50%." we argue that something analogous is happening for reported AI performance on many benchmarks. to better understand the limits of AI performance, we need benchmark authors/runners (who are trying their best, operating under tough constraints!) to get dramatically more tokens. post linked below.

English

176

14.8K

Eoghan Flanagan@KateandPie·3 Nis

@ajeya_cotra Understood. And I agree. I think the original post was capable of misinterpretation but that could just be me.

English

Eoghan Flanagan@KateandPie·3 Nis

@ajeya_cotra I was very unsure what you were saying here. It's difficult to parse. Your point is that humans and AI should be compared on cost per $ terms right?

English

Eoghan Flanagan retweetledi

Joseph Redmon@pjreddie·2 Nis

This war really makes a lot of sense when you rememeber it was planned by claude, way too confident, big initial changes, unexpected (but extremely predictable) problems arise, try to fix them, break other stuff, no backup plan, rewrite tests to get them to pass

English

840

39.9K

Eoghan Flanagan@KateandPie·1 Nis

@xeophon Everyone misses the whale

English

Florian Brand@xeophon·1 Nis

kinda funny that the go-to lab for ai ppl to use their april fools joke on is still the whale, despite them being rather quiet for a while

English

1.1K

Eoghan Flanagan@KateandPie·1 Nis

@MillionInt It's April Fools Day not Global Existentialist Terror Day Jerry

English

759

Jerry Tworek@MillionInt·1 Nis

Deep learning research is done and the best we can do until the end of history is to create billions of simulated environments with massive amounts of human labor

English

446

32.1K

Eoghan Flanagan@KateandPie·1 Nis

@housetrotter Obstetrician - Physician - Mortician

English

assistant inspector@housetrotter·1 Nis

the physician to mortician pipeline

English

1.3K

Eoghan Flanagan@KateandPie·1 Nis

@S_OhEigeartaigh Did you see Sam Altman just quit OpenAI to become a Benedictine monk?

English

Keşfet

@allTheYud @DarkDementus @Pamela_Vista @EMostaque @dioscuri @MillionInt @charles_irl @patrickc