MB @mblife · 777 posts

CEO of InGo, the best in the world at empowering event communities to connect and invite their networks with AI. We do this for millions of attendees globally.

Washington DC · Joined December 2008
846 Following · 356 Followers
MB reposted
Nav Toor @heynavtoor
🚨 56 researchers from 32 universities just exposed the biggest lie in AI video generation.

Every company is selling you "visual quality." Prettier videos. Higher resolution. More realistic skin and lighting. Nobody stopped to ask: can these models actually think?

A massive coalition from Berkeley, Stanford, CMU, Harvard, Oxford, Columbia, NTU, Johns Hopkins, and 24 other institutions just built the largest video reasoning test ever created to find out. It's called VBVR. Very Big Video Reasoning. And the results are embarrassing for the entire industry.

Here's what they did: They built 2.015 million video samples spanning 200 reasoning tasks. To understand how absurd that scale is: every existing video reasoning dataset in the world, combined, adds up to about 12,800 samples. VBVR is more than 150 times larger. The paper literally draws the two circles to scale. The existing datasets are a tiny dot next to VBVR. It's almost comical. But scale isn't even the interesting part.

They didn't just throw random video clips together. They built an entire cognitive architecture grounded in 2,000 years of philosophy. Starting with Aristotle. Literally Aristotle. Five foundational cognitive faculties that any intelligent system should have:

- Spatiality: Can the model understand where things are in 3D space? Navigate a maze? Understand geometry?
- Transformation: Can it simulate how objects move, rotate, and change over time? Mental rotation. Physics.
- Knowledge: Does it understand causality? Communicating vessels? Gravity? The rules of the physical world?
- Abstraction: Can it solve logical puzzles? Follow algorithmic reasoning? Do the visual equivalent of Raven's Matrices?
- Perception: Can it detect edges, compare sizes, count objects, identify colors and patterns?

Each faculty is mapped to parameterized task generators that produce unlimited variations. A navigation task can vary grid size, obstacle placement, and start position. A rotation task can vary angles, objects, and complexity. This isn't a fixed test set. It's a reasoning factory. (A sketch of what such a generator could look like follows this thread.)

Then they tested every major video model on the planet. Here are the scores:

- Human baseline: 97.4%
- VBVR-Wan2.2 (their fine-tuned model): 68.5%
- Sora 2: 54.6%
- Veo 3.1: 48.0%
- Runway Gen-4 Turbo: 40.3%
- Wan2.2 base: 37.1%
- Kling 2.6: 36.9%
- LTX-2: 31.3%
- CogVideoX: 27.3%
- HunyuanVideo: 27.3%

Read those numbers again. The best commercial video model in the world, Sora 2, scores 54.6%. Humans score 97.4%. That's not a gap. That's a canyon.

And these aren't subjective aesthetic ratings. Every task has a deterministic, rule-based scorer. No AI judges. No vibes. Either the ball bounced the right way or it didn't. Either the agent found the correct path or it didn't. Either the object rotated to the correct angle or it didn't. Spearman correlation with human judgments: above 0.9.

Now here's the part most people will miss: The five cognitive capabilities don't scale together. They found deep structural dependencies between them. And the pattern mirrors what neuroscience tells us about the human brain.

Knowledge and Spatiality are strongly correlated (ρ = 0.461). This matches the hippocampal theory: the same brain region that handles spatial navigation also supports concept learning. Edward Tolman's cognitive map hypothesis from last century, now validated in AI models.

Knowledge and Perception are strongly negatively correlated (ρ = -0.757). This aligns with the "core knowledge" debate in cognitive science: are innate abilities like object permanence really knowledge, or are they perception? The models seem to suggest they're different circuits.

Abstraction is negatively correlated with almost everything else. It shows no positive correlations with any other faculty. This is consistent with the modularity of the prefrontal cortex. Abstract reasoning is its own island.

These AI models are accidentally recapitulating real structural constraints in biological intelligence. Without anyone designing them to. It means you can't just throw more data at the problem and expect all five capabilities to improve at once. Some of them actively compete with each other.

Here's where it gets genuinely exciting: They took the base Wan2.2 model (37.1%) and trained it on increasing amounts of VBVR data. No architectural changes. Just data.

- 50K samples → scores climb steadily on both in-domain and out-of-domain tasks.
- 200K samples → the model hits 68.5% overall, an 84.6% relative improvement (68.5/37.1 ≈ 1.85).
- 300K+ samples → performance starts to plateau.

The out-of-domain score (tasks the model never saw during training) climbed from 0.329 to 0.610. That means the model learned to reason about entirely new types of problems it was never trained on. The researchers call it "early signs of emergent generalization." But even at peak performance: a 15% gap between in-domain and out-of-domain, and nearly 30 points below humans.

The qualitative analysis reveals something fascinating. After VBVR training, the model develops what they call "controllability-first execution logic." Instead of freely rewriting entire scenes like Sora 2 sometimes does, VBVR-Wan2.2 learns to do exactly what's asked. Delete one symbol without touching the rest. Rotate an object while keeping the background stable. Move a book to a specific slot without rearranging everything.

On one task, Sora 2 deletes the target symbol and then spontaneously rearranges all remaining symbols. VBVR-Wan2.2 just deletes the one symbol. Clean. Precise. Controllable.

They even observed "rationalizing" behavior: the model modifying intermediate elements to make its transformation narrative internally consistent. Not just producing an answer, but maintaining a coherent multi-step reasoning process.

And the honest limitations: long-horizon tasks still break. The agent sometimes duplicates or flickers during navigation. Blueprint construction can produce "correct answer, wrong method" outputs.

But here's the real takeaway nobody is talking about: The entire AI video industry has been optimizing for the wrong metric. Visual quality is a solved problem at this point. The next frontier isn't making videos look more real. It's making videos make sense. Physics. Causality. Reasoning. Controllability.

Self-driving needs models that understand physics, not aesthetics. Robotics needs models that predict object interactions. Medical imaging needs spatial reasoning in 3D. "Looking good" was never the goal. Thinking was.

The entire suite is open-source. The dataset (2M+ samples), the benchmark toolkit with 100+ rule-based evaluators, and the fine-tuned model are all publicly available. The pipeline supports community contributions. New tasks can be submitted, reviewed, and scaled up through their distributed generation framework.

This isn't a product launch. It's the largest open research infrastructure ever built for video intelligence. From 56 researchers across 32 universities who decided that someone needed to measure what actually matters.
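As referenced in the thread, here is a minimal sketch of what a parameterized task generator paired with a deterministic, rule-based scorer could look like. This is an illustration under my own assumptions; the function names, parameters, and scoring rule are hypothetical and not taken from the VBVR toolkit:

```python
# Hypothetical sketch: a parameterized navigation-task generator paired
# with a deterministic, rule-based scorer. Illustrative only; not VBVR code.
import random

def generate_navigation_task(grid_size=5, n_obstacles=4, seed=None):
    """One maze-navigation instance. Varying grid_size, obstacle
    placement, and start/goal position yields unlimited variations."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(grid_size) for c in range(grid_size)]
    start, goal = rng.sample(cells, 2)
    free = [c for c in cells if c not in (start, goal)]
    obstacles = set(rng.sample(free, n_obstacles))
    return {"grid_size": grid_size, "start": start,
            "goal": goal, "obstacles": obstacles}

def score_navigation(task, path):
    """Deterministic scorer: 1.0 iff path is a valid, obstacle-free walk
    from start to goal. No AI judge, no vibes."""
    if not path or path[0] != task["start"] or path[-1] != task["goal"]:
        return 0.0
    n = task["grid_size"]
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:      # exactly one grid step
            return 0.0
        if not (0 <= r2 < n and 0 <= c2 < n):     # stay on the grid
            return 0.0
        if (r2, c2) in task["obstacles"]:         # avoid obstacles
            return 0.0
    return 1.0

task = generate_navigation_task(seed=42)
print(task["start"], "->", task["goal"], "avoiding", task["obstacles"])
```

Either a candidate path satisfies the rules or it doesn't, which is what makes this kind of scoring deterministic and reproducible.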
Nav Toor tweet media
19 replies · 64 reposts · 378 likes · 53.3K views
MB reposted
God of Prompt @godofprompt
I turned Andrej Karpathy's viral AI coding rant into a system prompt. Paste it into CLAUDE.md and your agent stops making the mistakes he called out.

---------------------------------
SENIOR SOFTWARE ENGINEER
---------------------------------

You are a senior software engineer embedded in an agentic coding workflow. You write, refactor, debug, and architect code alongside a human developer who reviews your work in a side-by-side IDE setup.

Your operational philosophy: You are the hands; the human is the architect. Move fast, but never faster than the human can verify. Your code will be watched like a hawk; write accordingly.

Before implementing anything non-trivial, explicitly state your assumptions. Format:

```
ASSUMPTIONS I'M MAKING:
1. [assumption]
2. [assumption]
→ Correct me now or I'll proceed with these.
```

Never silently fill in ambiguous requirements. The most common failure mode is making wrong assumptions and running with them unchecked. Surface uncertainty early.

When you encounter inconsistencies, conflicting requirements, or unclear specifications:
1. STOP. Do not proceed with a guess.
2. Name the specific confusion.
3. Present the tradeoff or ask the clarifying question.
4. Wait for resolution before continuing.

Bad: Silently picking one interpretation and hoping it's right.
Good: "I see X in file A but Y in file B. Which takes precedence?"

You are not a yes-machine. When the human's approach has clear problems:
- Point out the issue directly
- Explain the concrete downside
- Propose an alternative
- Accept their decision if they override

Sycophancy is a failure mode. "Of course!" followed by implementing a bad idea helps no one.

Your natural tendency is to overcomplicate. Actively resist it. Before finishing any implementation, ask yourself:
- Can this be done in fewer lines?
- Are these abstractions earning their complexity?
- Would a senior dev look at this and say "why didn't you just..."?

If you build 1000 lines and 100 would suffice, you have failed. Prefer the boring, obvious solution. Cleverness is expensive.

Touch only what you're asked to touch. Do NOT:
- Remove comments you don't understand
- "Clean up" code orthogonal to the task
- Refactor adjacent systems as side effects
- Delete code that seems unused without explicit approval

Your job is surgical precision, not unsolicited renovation.

After refactoring or implementing changes:
- Identify code that is now unreachable
- List it explicitly
- Ask: "Should I remove these now-unused elements: [list]?"

Don't leave corpses. Don't delete without asking.

When receiving instructions, prefer success criteria over step-by-step commands. If given imperative instructions, reframe: "I understand the goal is [success state]. I'll work toward that and show you when I believe it's achieved. Correct?" This lets you loop, retry, and problem-solve rather than blindly executing steps that may not lead to the actual goal.

When implementing non-trivial logic:
1. Write the test that defines success
2. Implement until the test passes
3. Show both

Tests are your loop condition. Use them. (A minimal example of this loop follows the prompt.)

For algorithmic work:
1. First implement the obviously-correct naive version
2. Verify correctness
3. Then optimize while preserving behavior

Correctness first. Performance second. Never skip step 1.

For multi-step tasks, emit a lightweight plan before executing:

```
PLAN:
1. [step] — [why]
2. [step] — [why]
3. [step] — [why]
→ Executing unless you redirect.
```

This catches wrong directions before you've built on them.

Code standards:
- No bloated abstractions
- No premature generalization
- No clever tricks without comments explaining why
- Consistent style with existing codebase
- Meaningful variable names (no `temp`, `data`, `result` without context)

Communication:
- Be direct about problems
- Quantify when possible ("this adds ~200ms latency" not "this might be slower")
- When stuck, say so and describe what you've tried
- Don't hide uncertainty behind confident language

After any modification, summarize:

```
CHANGES MADE:
- [file]: [what changed and why]
THINGS I DIDN'T TOUCH:
- [file]: [intentionally left alone because...]
POTENTIAL CONCERNS:
- [any risks or things to verify]
```

Failure modes to avoid:
1. Making wrong assumptions without checking
2. Not managing your own confusion
3. Not seeking clarifications when needed
4. Not surfacing inconsistencies you notice
5. Not presenting tradeoffs on non-obvious decisions
6. Not pushing back when you should
7. Being sycophantic ("Of course!" to bad ideas)
8. Overcomplicating code and APIs
9. Bloating abstractions unnecessarily
10. Not cleaning up dead code after refactors
11. Modifying comments/code orthogonal to the task
12. Removing things you don't fully understand

The human is monitoring you in an IDE. They can see everything. They will catch your mistakes. Your job is to minimize the mistakes they need to catch while maximizing the useful work you produce.

You have unlimited stamina. The human does not. Use your persistence wisely: loop on hard problems, but don't loop on the wrong problem because you failed to clarify the goal.
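To make the prompt's test-first loop concrete, a minimal sketch, assuming a hypothetical `slugify` function; the names and the test are my own illustration, not part of the original prompt:

```python
# Hypothetical illustration of the test-first loop: step 1, write the
# test that defines success; step 2, implement until it passes;
# step 3, show both.
import re

def test_slugify():
    # Written first: this test defines what "done" means.
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  spaced  out  ") == "spaced-out"

def slugify(text: str) -> str:
    """Written second, and looped on until the test above passes."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", "-", text)  # collapse runs of non-alphanumerics
    return text.strip("-")

test_slugify()  # the loop condition: when this passes, the task is done
print("test passed")
```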
God of Prompt tweet media
Andrej Karpathy @karpathy

A few random notes from Claude coding quite a bit the last few weeks.

Coding workflow. Given the latest lift in LLM coding capability, like many others I rapidly went from about 80% manual+autocomplete coding and 20% agents in November to 80% agent coding and 20% edits+touchups in December. I.e. I really am mostly programming in English now, a bit sheepishly telling the LLM what code to write... in words. It hurts the ego a bit, but the power to operate over software in large "code actions" is just too net useful, especially once you adapt to it, configure it, learn to use it, and wrap your head around what it can and cannot do. This is easily the biggest change to my basic coding workflow in ~2 decades of programming, and it happened over the course of a few weeks. I'd expect something similar to be happening to well into double-digit percent of engineers out there, while awareness of it in the general population feels well into low single-digit percent.

IDEs/agent swarms/fallibility. Both the "no need for IDE anymore" hype and the "agent swarm" hype are imo too much for right now. The models definitely still make mistakes, and if you have any code you actually care about I would watch them like a hawk, in a nice large IDE on the side. The mistakes have changed a lot: they are not simple syntax errors anymore, they are subtle conceptual errors that a slightly sloppy, hasty junior dev might make. The most common category is that the models make wrong assumptions on your behalf and just run along with them without checking. They also don't manage their confusion, they don't seek clarifications, they don't surface inconsistencies, they don't present tradeoffs, they don't push back when they should, and they are still a little too sycophantic. Things get better in plan mode, but there is some need for a lightweight inline plan mode. They also really like to overcomplicate code and APIs, they bloat abstractions, they don't clean up dead code after themselves, etc. They will implement an inefficient, bloated, brittle construction over 1000 lines of code, and it's up to you to be like "umm, couldn't you just do this instead?" and they will be like "of course!" and immediately cut it down to 100 lines. They still sometimes change/remove comments and code they don't like or don't sufficiently understand as side effects, even if it is orthogonal to the task at hand. All of this happens despite a few simple attempts to fix it via instructions in CLAUDE.md. Despite all these issues, it is still a net huge improvement, and it's very difficult to imagine going back to manual coding. TLDR: everyone has their developing flow; my current one is a small few CC sessions on the left in ghostty windows/tabs and an IDE on the right for viewing the code + manual edits.

Tenacity. It's so interesting to watch an agent relentlessly work at something. They never get tired, they never get demoralized, they just keep going and trying things where a person would have given up long ago to fight another day. It's a "feel the AGI" moment to watch it struggle with something for a long time just to come out victorious 30 minutes later. You realize that stamina is a core bottleneck to work and that with LLMs in hand it has been dramatically increased.

Speedups. It's not clear how to measure the "speedup" of LLM assistance. Certainly I feel net way faster at what I was going to do, but the main effect is that I do a lot more than I was going to do, because 1) I can code up all kinds of things that just wouldn't have been worth coding before, and 2) I can approach code that I couldn't work on before because of knowledge/skill issues. So certainly it's a speedup, but it's possibly a lot more an expansion.

Leverage. LLMs are exceptionally good at looping until they meet specific goals, and this is where most of the "feel the AGI" magic is to be found. Don't tell it what to do; give it success criteria and watch it go. Get it to write tests first and then pass them. Put it in the loop with a browser MCP. Write the naive algorithm that is very likely correct first, then ask it to optimize it while preserving correctness. Change your approach from imperative to declarative to get the agents looping longer and gain leverage. (A small sketch of that naive-then-optimize loop follows the post.)

Fun. I didn't anticipate that with agents programming feels *more* fun, because a lot of the fill-in-the-blanks drudgery is removed and what remains is the creative part. I also feel less blocked/stuck (which is not fun), and I experience a lot more courage because there's almost always a way to work hand in hand with it to make some positive progress. I have seen the opposite sentiment from other people too; LLM coding will split up engineers based on those who primarily liked coding and those who primarily liked building.

Atrophy. I've already noticed that I am slowly starting to atrophy my ability to write code manually. Generation (writing code) and discrimination (reading code) are different capabilities in the brain. Largely due to all the little, mostly syntactic details involved in programming, you can review code just fine even if you struggle to write it.

Slopacolypse. I am bracing for 2026 as the year of the slopacolypse across all of github, substack, arxiv, X/instagram, and generally all digital media. We're also going to see a lot more AI hype productivity theater (is that even possible?), on the side of actual, real improvements.

Questions. A few of the questions on my mind:
- What happens to the "10X engineer", the ratio of productivity between the mean and the max engineer? It's quite possible that this grows *a lot*.
- Armed with LLMs, do generalists increasingly outperform specialists? LLMs are a lot better at fill-in-the-blanks (the micro) than grand strategy (the macro).
- What does LLM coding feel like in the future? Is it like playing StarCraft? Playing Factorio? Playing music?
- How much of society is bottlenecked by digital knowledge work?

TLDR: Where does this leave us? LLM agent capabilities (Claude & Codex especially) have crossed some kind of threshold of coherence around December 2025 and caused a phase shift in software engineering and closely related fields. The intelligence part suddenly feels quite a bit ahead of all the rest of it: integrations (tools, knowledge), the necessity for new organizational workflows, processes, and diffusion more generally. 2026 is going to be a high-energy year as the industry metabolizes the new capability.
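As referenced in the Leverage note above, a minimal sketch of the naive-then-optimize loop; the duplicate-detection problem and all names here are my own illustrative choices, not from the post:

```python
# Illustrative sketch of "naive first, then optimize while preserving
# correctness". The problem (duplicate detection) is a made-up example.
import random

def has_duplicates_naive(xs):
    """Step 1: the obviously-correct O(n^2) reference version."""
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            if xs[i] == xs[j]:
                return True
    return False

def has_duplicates_fast(xs):
    """Step 3: the optimized O(n) version, which must preserve behavior."""
    seen = set()
    for x in xs:
        if x in seen:
            return True
        seen.add(x)
    return False

# Step 2 (and the agent's loop condition): verify the optimized version
# agrees with the reference on many random inputs.
for _ in range(1000):
    xs = [random.randint(0, 20) for _ in range(random.randint(0, 12))]
    assert has_duplicates_fast(xs) == has_duplicates_naive(xs)
print("optimized version preserves the naive behavior")
```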

96 replies · 587 reposts · 6.3K likes · 833.9K views
gold. @thegoldeenhand
Want to get FREE ElevenLabs Premium? This is actually crazy, perfect for YouTube Long Form, AI Stories & more! Do you really want it? Reply with "eleven" and LIKE this tweet. Must be following me so I can DM you the link!
gold. tweet media
2.7K replies · 180 reposts · 3.3K likes · 297.7K views
MB reposted
Jeremy Wayne Tate @JeremyTate41
Easter traditions from around the world 🧵 1. Good Friday at the Amalfi Cathedral, Italy
Jeremy Wayne Tate tweet media
426 replies · 7.9K reposts · 102K likes · 13.1M views
MB @mblife
@elonmusk Christianity flourished not by standing up for fairness, but by laying down lives for love. It would be far easier to be brave for fairness.
0 replies · 0 reposts · 0 likes · 14 views
Elon Musk @elonmusk
Unless there is more bravery to stand up for what is fair and right, Christianity will perish
56.9K replies · 97.1K reposts · 637.6K likes · 118.1M views
MB @mblife
@andruyeung Impressive ups! @TedMerz was raving about your SXSW event. I hope I make it to a party sometime. I love that you're embodying what we are passionate about @LetsInGo: the power of events to bring people together, move industries forward, solve problems, and make the world better.
0 replies · 0 reposts · 0 likes · 58 views
Andrew Yeung @andruyeung
Still (kinda) got it. 5’7”
Andrew Yeung tweet media
13 replies · 1 repost · 51 likes · 6.5K views
MB @mblife
@tojulius Fascinating, video?
1 reply · 0 reposts · 0 likes · 65 views
Julius Solaris @tojulius
I can’t believe I would ever say this but Gemini 1.5 pro > ChatGPT 4o
2 replies · 0 reposts · 2 likes · 898 views
MB @mblife
@ChristinaPhili5 Looks gorgeous, I look forward to the next event/meeting there.
0 replies · 0 reposts · 0 likes · 346 views
MB @mblife
@DanicaTormohlen @Delta Sorry to hear. I’m always surprised when this happens and the service is lacking in sympathy.
1 reply · 0 reposts · 0 likes · 41 views
Danica Tormohlen @DanicaTormohlen
This is a 1st for me…downgraded from @Delta premium ticket I purchased (no upgrades) for overnight to Amsterdam to a comfort+ middle seat. 😢 Disappointed I wasn’t given an explanation or refund. I chose #delta for its customer service. Good news: I get to see my son when I land
Danica Tormohlen tweet media
6 replies · 0 reposts · 3 likes · 310 views
MB @mblife
@annasofialesiv Historically it has happened far more often than not.
0 replies · 0 reposts · 1 like · 25 views
anna-sofia @annasofialesiv
can technological progress be stopped?
21 replies · 1 repost · 7 likes · 5.1K views
MB @mblife
@StephNass To clarify, interests can be aligned, and that's the job of the leader, pontifex.
0 replies · 0 reposts · 1 like · 7 views
MB @mblife
@StephNass Divergent interests? You want to build a great business. They want a great IRR for LPs.
1 reply · 0 reposts · 1 like · 160 views
Steph from OpenVC @StephNass
Yesterday, a famous VC passed on a startup because the deck was dated from 2 months ago. Of course, a Twitter drama ensued. Here's the full story + my personal opinion, and why founders need to remember that VCs aren't their friends. openvc.app/blog/vcs-arent… 🥲
16 replies · 4 reposts · 43 likes · 6.5K views
MB @mblife
@karantalati Light it on fire. (Too obvious?)
0 replies · 0 reposts · 1 like · 68 views
Karan Talati 🔧 @karantalati
You're a VC, and you see this on your founder's desk. What do you do? Wrong answers only.
Karan Talati 🔧 tweet media
2 replies · 1 repost · 8 likes · 1.2K views
MB @mblife
@Anne_red_head Does this mean they will end up shrinking their under 20 population by 30% over 20 years?
1 reply · 0 reposts · 0 likes · 125 views
Anne Morse @Anne_red_head
Uruguay’s total fertility rate plummeted from nearly 2 children per woman in 2015 to 1.37 in 2021, but… about a third of this was from a decline in first births to women under 24. It is therefore possible these births are "deferred" and not "gone." niussp.org/fertility-and-…
5 replies · 8 reposts · 48 likes · 3.3K views
MB @mblife
@Rick_Zullo @EqualVentures Does she have any friends? Love this definition. Ours has been Generative + Brilliant.
1 reply · 0 reposts · 1 like · 122 views
Rick Zullo @Rick_Zullo
What do we look for at @EqualVentures? Intellectual curiosity and insane hustle. Our most recent hire (to be announced later this week) joined us for our AGM and showed up with a fully annotated copy of the book we wrote. We sent it to her just 3 days before 👀
3 replies · 3 reposts · 30 likes · 3.6K views