Matthias Georgi (@mgeorgi)
637 posts
Triplet Dad - Engineer at @Meta - Building https://t.co/BXFeVl8lgX
Joined September 2008
1.3K Following · 331 Followers
Pinned Tweet
Matthias Georgi (@mgeorgi):
NodeTool lets you build, test, and deploy AI workflows on a living canvas.
• 1,000+ Nodes (Agents, Image, Video, Audio)
• Run local or cloud models
The ultimate sandbox for composable AI. #opensource #LocalAI #workflows
1 reply · 0 reposts · 2 likes · 118 views
Armin Ronacher ⇌ (@mitsuhiko):
I think tmux is great software for an agent. But how people can actually work day to day in tmux is beyond me. It's such a horrible UX and hack.
177 replies · 16 reposts · 589 likes · 116.2K views
Alexandr Wang (@alexandr_wang):
i find muse spark is very good at data analysis—both finding relevant open-source data and analyzing it. for example, here's my results for analyzing global share of GDP over past century: meta.ai/share/cw54skLB…
[image attachment]
52 replies · 29 reposts · 589 likes · 69.4K views
Matthias Georgi (@mgeorgi):
@forgebitz this is not meant for enterprise. more like prototypes and apps for 1-10 people.
0 replies · 0 reposts · 0 likes · 253 views
Klaas (@forgebitz):
can you imagine the uptime. a lovable clone doesn't really give me agi vibes
[image attachment]
12 replies · 0 reposts · 61 likes · 6.1K views
@levelsio:
Today I asked my Claude Code on pieter.com to write a message to my Claude Code on pieter.net, two different servers. I do this a lot and then copy paste it to the other SSH window in Termius. But then I wondered: why can't Claude Code sessions in the same account message each other and talk? That'd be super useful to me. Many times I build a feature on one site and then want to send it to the other site to build the same!
[image attachment]
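Nothing like this exists in Claude Code as far as the thread shows; a minimal sketch of the idea, assuming a shared mailbox directory that gets synced between the two servers (e.g. with rsync), might look like this in Python (the mailbox path and pairing are hypothetical):

```python
import json
import time
from pathlib import Path

# Hypothetical shared mailbox: a directory synced between the two
# servers. Each message is a JSON file named by timestamp so both
# sides can poll for new ones without a server process.
MAILBOX = Path("/var/agent-mail")

def send(sender: str, recipient: str, body: str) -> None:
    """Drop a message file for the other session to pick up."""
    MAILBOX.mkdir(parents=True, exist_ok=True)
    msg = {"from": sender, "to": recipient, "body": body, "ts": time.time()}
    (MAILBOX / f"{time.time_ns()}.json").write_text(json.dumps(msg))

def poll(recipient: str) -> list:
    """Read and consume any messages addressed to this session."""
    inbox = []
    for f in sorted(MAILBOX.glob("*.json")):
        msg = json.loads(f.read_text())
        if msg["to"] == recipient:
            inbox.append(msg)
            f.unlink()  # consume so it is not delivered twice
    return inbox

# e.g. on pieter.com: send("pieter.com", "pieter.net", "build the same feature here")
# and on pieter.net:  print(poll("pieter.net"))
```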
42 replies · 9 reposts · 455 likes · 103.8K views
@levelsio:
@timonlyup Yes same story pretty much, but I left NL
6 replies · 0 reposts · 43 likes · 9.2K views
Matthias Georgi reposted
Mario Zechner (@badlogicgames):
today, a tall guy in a colorful sweater walked up to me. i was already at the end of my social energy reserves, having dozens of people walk up to me, never having 5 minutes to breathe. the guy just wanted to say hi and thank me for pi. i thanked him for his kind words, like i did a lot of times today, hoping that i finally get my 5 minutes. until i looked at his name tag. it was @lucasmeijer, one of the most instrumental people behind the Unity game engine.

i was sort of star struck and we ended up speaking for 4 hours, chilling outside the venue, finding out that while we never met IRL we had an immense amount of shared history. we reminisced about AOT compilers we worked on, shared ex-business partners, our shared emotional rollercoasters when we had to let go of the technical achievements that defined our lives, burning out, finding your identity again, all paired with an immense amount of laughter and the joy of having found a kindred spirit.

we eventually went on a hilariously inefficient pub hunt, ending up at the weirdest fucking "upper class" establishment called "The OWO" where everyone was looking at us two dorks like we were aliens. They made us put stickers over our smartphone cameras and sold us lagers for £11 a pop. hilarious.

i now remember why i loved speaking at conferences 15 years ago. i think i made a friend today. so, thanks @swyx and crew for putting together @aiDotEngineer and letting me clown around on stage. fantastic vibes, great people.
11 replies · 17 reposts · 764 likes · 62.7K views
Matthias Georgi reposted
Alexandr Wang (@alexandr_wang):
1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵
[image attachment]
720 replies · 1.2K reposts · 10.3K likes · 4.4M views
Armin Ronacher ⇌ (@mitsuhiko):
Got my new MacBook Pro and I can report that all of my attempts at making local models work with pi so far have been disappointing. Not sure what else I should try :D
35 replies · 2 reposts · 137 likes · 23.5K views
Matthias Georgi (@mgeorgi):
@steipete could you make the agent system swappable? using claude agent sdk instead of pi for example?
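A rough sketch of what "swappable" could mean here: a thin interface the host app codes against, so pi or the Claude Agent SDK can be plugged in behind it. Both backend bodies below are placeholders, not real integrations:

```python
from abc import ABC, abstractmethod

class AgentBackend(ABC):
    """Minimal interface a host app would code against,
    so the underlying agent system can be swapped out."""

    @abstractmethod
    def run(self, prompt: str) -> str: ...

class PiBackend(AgentBackend):
    def run(self, prompt: str) -> str:
        # placeholder: call pi however it is actually invoked
        raise NotImplementedError("wire up pi here")

class ClaudeAgentBackend(AgentBackend):
    def run(self, prompt: str) -> str:
        # placeholder: call the Claude Agent SDK here
        raise NotImplementedError("wire up the Claude Agent SDK here")

def make_backend(name: str) -> AgentBackend:
    # the selection could come from a config file or an env var
    return {"pi": PiBackend, "claude": ClaudeAgentBackend}[name]()
```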
0 replies · 0 reposts · 0 likes · 172 views
Peter Steinberger 🦞 (@steipete):
Thinking how we can evolve openclaw plugins to be more powerful while also making core leaner. Also wanna add support for claude code/codex plugin bundles. Good stuff coming soon!
233 replies · 84 reposts · 2.3K likes · 165.9K views
Matthias Georgi (@mgeorgi):
@mitsuhiko I think this comes from post-training. They punish exceptions and catching is one way to avoid them :)
0 replies · 0 reposts · 0 likes · 120 views
Armin Ronacher ⇌ (@mitsuhiko):
No matter the prompting, shit like this stays around. I really think that clankers got trained on visual basic 6's "on error resume next" -.-
[image attachment]
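The screenshot itself isn't preserved here, but the pattern being criticized is the blanket catch that silently swallows failures. A hypothetical Python illustration of the anti-pattern next to a narrower fix:

```python
import json
import logging

# Anti-pattern: the modern "On Error Resume Next". Any failure,
# from a missing file to corrupt JSON, silently becomes {}.
def load_config_bad(path: str) -> dict:
    try:
        with open(path) as f:
            return json.load(f)
    except Exception:
        return {}

# Narrower fix: handle only the case you can recover from,
# log it, and let everything else propagate loudly.
def load_config_good(path: str) -> dict:
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        logging.warning("config %s missing, using defaults", path)
        return {}
```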
41 replies · 7 reposts · 510 likes · 54.3K views
@levelsio:
Thank god MCP is dead. Just as useless of an idea as LLMs.txt was. It's all dumb abstractions that AI doesn't need, because AIs are as smart as humans so they can just use what was already there, which is APIs.
Quoting Morgan (@morganlinton): "The cofounder and CTO of Perplexity, @denisyarats, just said internally at Perplexity they're moving away from MCPs and instead using APIs and CLIs 👀"
699 replies · 342 reposts · 6.2K likes · 2.1M views
Ethan He (@EthanHe_42):
@karpathy Reminds me of AutoML and neural architecture search. But with intelligence this time.
6 replies · 0 reposts · 227 likes · 41K views
Andrej Karpathy (@karpathy):
Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference.

I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project. This is a first for me because I am very used to doing the iterative optimization of neural network training manually: you come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This has been the bread and butter of what I do daily for 2 decades. Seeing the agent do this entire workflow end-to-end, all by itself, as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real": I didn't find them manually previously, and they stack up and actually improved nanochat.

Among the bigger findings:
- It noticed an oversight that my parameterless QKnorm didn't have a scaler multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course: you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
[image attachment]
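At its core, the workflow described above is a propose, run, evaluate loop over a training config. A toy sketch of the shape of that loop, where the stubs stand in for an LLM proposing changes and a real (small, cheap) training run; nothing here is the actual autoresearch code:

```python
import copy
import random

def train_and_eval(config: dict) -> float:
    # stand-in for a real training run that returns validation loss
    return (config["lr"] - 0.02) ** 2 + random.gauss(0, 1e-4)

def propose_change(config: dict, history: list) -> dict:
    # stand-in for the agent: here, a random hyperparameter nudge;
    # in autoresearch this is an LLM reading the experiment history
    candidate = copy.deepcopy(config)
    candidate["lr"] *= random.choice([0.8, 1.25])
    return candidate

def autoresearch(base: dict, rounds: int = 20):
    best, best_loss = base, train_and_eval(base)
    history = []
    for _ in range(rounds):
        cand = propose_change(best, history)
        loss = train_and_eval(cand)
        history.append((cand, loss))
        if loss < best_loss:  # keep only real improvements
            best, best_loss = cand, loss
    return best, best_loss

print(autoresearch({"lr": 0.01}))
```

The real version would batch experiments, run them in parallel, and promote winners from small models to larger ones, but the keep-only-what-lowers-validation-loss discipline is the same.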
972 replies · 2.1K reposts · 19.5K likes · 3.6M views
Boris Cherny (@bcherny):
New in Claude Code: Code Review. A team of agents runs a deep review on every PR. We built it for ourselves first: code output per Anthropic engineer is up 200% this year and reviews were the bottleneck. Personally, I've been using it for a few weeks and have found it catches many real bugs that I would not have noticed otherwise.
Quoting Claude (@claudeai): "Introducing Code Review, a new feature for Claude Code. When a PR opens, Claude dispatches a team of agents to hunt for bugs."
462 replies · 498 reposts · 7.4K likes · 1.2M views
Matthias Georgi (@mgeorgi):
@karpathy Let's say agents start submitting PRs with alleged performance improvements. How would they prove the gains? Evaluating every submission is expensive.
0 replies · 0 reposts · 0 likes · 68 views
Andrej Karpathy (@karpathy):
The next step for autoresearch is that it has to be asynchronously massively collaborative for agents (think: SETI@home style). The goal is not to emulate a single PhD student, it's to emulate a research community of them. Current code synchronously grows a single thread of commits in a particular research direction. But the original repo is more of a seed, from which could sprout commits contributed by agents on all kinds of different research directions or for different compute platforms.

Git(Hub) is *almost* but not really suited for this. It has a softly built-in assumption of one "master" branch, which temporarily forks off into PRs just to merge back a bit later. I tried to prototype something super lightweight that could have a flavor of this, e.g. just a Discussion, written by my agent as a summary of its overnight run: github.com/karpathy/autor… Alternatively, a PR has the benefit of exact commits: github.com/karpathy/autor… but you'd never want to actually merge it... You'd just want to "adopt" and accumulate branches of commits. But even in this lightweight way, you could ask your agent to first read the Discussions/PRs using the GitHub CLI for inspiration, and after its research is done, contribute a little "paper" of findings back.

I'm not actually exactly sure what this should look like, but it's a big idea that is more general than just the autoresearch repo specifically. Agents can in principle easily juggle and collaborate on thousands of commits across arbitrary branch structures. Existing abstractions will accumulate stress as intelligence, attention and tenacity cease to be bottlenecks.
528 replies · 712 reposts · 7.6K likes · 1.1M views
Matthias Georgi (@mgeorgi):
LLMs don't write correct code. They write code that looks correct.

After all the praise for agents, it's time for a reality check. @KatanaLarp took a deeper look at vibecoded software: a SQLite implementation in Rust. 576,000 lines of code. The vibecoded port was correct but also 20,171x slower.

"An LLM prompted to 'implement SQLite in Rust' will generate code that looks like an implementation of SQLite in Rust. It will have the right module structure and function names. But it can not magically generate the performance invariants that exist because someone profiled a real workload and found the bottleneck."

Why is that? "This gap between intent and correctness has a name. AI alignment research calls it sycophancy, which describes the tendency of LLMs to produce outputs that match what the user wants to hear rather than what they need to hear."

And it gets worse. "This also applies to LLM-generated evaluation. Ask the same LLM to review the code it generated and it will tell you the architecture is sound, the module boundaries clean and the error handling is thorough. It will sometimes even praise the test coverage."

What competent looks like: SQLite's quality comes from 26 years of profiling-driven decisions, not code volume. At 156,000 lines of C, it meets testing standards rigorous enough for aviation software, with a test suite 590x larger than the library itself. Its speed rests on a handful of deliberate choices: zero-copy page access, prepared statement reuse, a single-integer schema check, fdatasync over fsync. The LLM reimplementation is 3.7x larger and misses all of them.

Competence is knowing which line matters and why the "safe default" is sometimes the wrong one. x.com/KatanaLarp/sta…
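One of those invariants, prepared statement reuse, is easy to demonstrate from Python's sqlite3 module, which caches compiled statements keyed by the exact SQL text. A small illustration (mine, not from the cited article) of why re-parsing fresh SQL on every call hurts:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")

N = 50_000

# One parameterized statement: sqlite3 compiles it once and serves
# every subsequent call from its statement cache.
t0 = time.perf_counter()
for i in range(N):
    conn.execute("INSERT INTO t (v) VALUES (?)", (str(i),))
reused = time.perf_counter() - t0

# A fresh SQL string every time defeats the cache, so SQLite must
# re-parse and re-plan each statement from scratch.
t0 = time.perf_counter()
for i in range(N):
    conn.execute(f"INSERT INTO t (v) VALUES ('{i}')")
fresh = time.perf_counter() - t0

print(f"reused: {reused:.2f}s  fresh: {fresh:.2f}s")
```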
0 replies · 0 reposts · 1 like · 53 views
Matthias Georgi (@mgeorgi):
@steipete couldn't you run a codex github action doing that on PR creation?
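For what it's worth, the GitHub side of that idea is plumbing: a workflow triggered on pull_request can fetch the diff and hand it to whatever agent does the review. A hypothetical Python step; review_with_agent is a placeholder, and PR_NUMBER is assumed to be exported by the surrounding workflow:

```python
import os
import urllib.request

def fetch_pr_diff(repo: str, pr_number: int, token: str) -> str:
    """Pull the PR diff via the GitHub REST API; in an Action,
    GITHUB_TOKEN and the PR number come from the event payload."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{repo}/pulls/{pr_number}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github.diff",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

def review_with_agent(diff: str) -> str:
    # placeholder: invoke your reviewing agent of choice on the
    # diff and return its review text
    raise NotImplementedError("wire up the agent here")

if __name__ == "__main__":
    diff = fetch_pr_diff(
        os.environ["GITHUB_REPOSITORY"],
        int(os.environ["PR_NUMBER"]),  # assumed workflow-provided
        os.environ["GITHUB_TOKEN"],
    )
    print(review_with_agent(diff))
```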
0 replies · 0 reposts · 0 likes · 148 views