Justin

4.9K posts


@JustinPerea01

Mathematical Statistician. I used to do astronomy. Currently working on Ara, an app that lets you brain-dump ideas and validate them.

Maryland, USA · Joined February 2009
848 Following · 365 Followers
Pinned Tweet
Justin
Justin@JustinPerea01·
I've been working on a Tamagotchi version of Claude to put on a Core S3. He is now my little AI buddy that can record all audio throughout the day for me and organize any to-dos, notes, etc. Still need to wire up the animations to respond to different triggers. But glad I did this using @claudeai - the best AI - especially after them sticking to their morals.
2
2
18
4.4K
Justin
Justin@JustinPerea01·
@yiliush I need it simply for those hero sections. They're gorgeous 🤩
0
0
1
70
Yiliu
Yiliu@yiliush·
before collaborator i built out this whole agentic knowledge base app with an entire source docs → knowledge graph → mcp pipeline... and just never released it. you guys want me to release it?
Andrej Karpathy@karpathy

LLM Knowledge Bases

Something I'm finding very useful recently: using LLMs to build personal knowledge bases for various topics of research interest. In this way, a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating knowledge (stored as markdown and images). The latest LLMs are quite good at it. So:

Data ingest: I index source documents (articles, papers, repos, datasets, images, etc.) into a raw/ directory, then I use an LLM to incrementally "compile" a wiki, which is just a collection of .md files in a directory structure. The wiki includes summaries of all the data in raw/, backlinks, and then it categorizes data into concepts, writes articles for them, and links them all. To convert web articles into .md files I like to use the Obsidian Web Clipper extension, and then I also use a hotkey to download all the related images locally so that my LLM can easily reference them.

IDE: I use Obsidian as the IDE "frontend" where I can view the raw data, the compiled wiki, and the derived visualizations. Important to note that the LLM writes and maintains all of the data of the wiki; I rarely touch it directly. I've played with a few Obsidian plugins to render and view data in other ways (e.g. Marp for slides).

Q&A: Where things get interesting is that once your wiki is big enough (e.g. mine on some recent research is ~100 articles and ~400K words), you can ask your LLM agent all kinds of complex questions against the wiki, and it will go off, research the answers, etc. I thought I had to reach for fancy RAG, but the LLM has been pretty good about auto-maintaining index files and brief summaries of all the documents, and it reads all the important related data fairly easily at this ~small scale.

Output: Instead of getting answers in text/terminal, I like to have it render markdown files for me, or slide shows (Marp format), or matplotlib images, all of which I then view again in Obsidian. You can imagine many other visual output formats depending on the query. Often, I end up "filing" the outputs back into the wiki to enhance it for further queries. So my own explorations and queries always "add up" in the knowledge base.

Linting: I've run some LLM "health checks" over the wiki to e.g. find inconsistent data, impute missing data (with web searches), find interesting connections for new article candidates, etc., to incrementally clean up the wiki and enhance its overall data integrity. The LLMs are quite good at suggesting further questions to ask and look into.

Extra tools: I find myself developing additional tools to process the data, e.g. I vibe coded a small and naive search engine over the wiki, which I use directly (in a web UI), but more often want to hand off to an LLM via CLI as a tool for larger queries.

Further explorations: As the repo grows, the natural desire is to also think about synthetic data generation + finetuning to have your LLM "know" the data in its weights instead of just context windows.

TLDR: raw data from a given number of sources is collected, then compiled by an LLM into a .md wiki, then operated on by various CLIs by the LLM to do Q&A and to incrementally enhance the wiki, and all of it viewable in Obsidian. You rarely ever write or edit the wiki manually; it's the domain of the LLM. I think there is room here for an incredible new product instead of a hacky collection of scripts.
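The raw/ → wiki/ "compile" step above can be sketched in a few lines of Python. This is a minimal illustration, not Karpathy's actual tooling: `summarize()` stands in for the LLM call, and every function name and the index format are invented for the sketch; only the raw/ and wiki/ layout comes from the tweet.

```python
"""Sketch of the raw/ -> wiki/ incremental "compile" step (illustrative)."""
from pathlib import Path


def summarize(text: str) -> str:
    # Placeholder for the LLM call: real usage would generate a full
    # markdown article with backlinks. Here we just keep the first line.
    first = text.strip().splitlines()[0] if text.strip() else ""
    return f"# Summary\n\n{first}\n"


def compile_wiki(raw_dir: Path, wiki_dir: Path) -> list[Path]:
    """Incremental compile: only write notes the wiki doesn't have yet."""
    wiki_dir.mkdir(parents=True, exist_ok=True)
    written, index_lines = [], []
    for src in sorted(raw_dir.glob("*.md")):
        note = wiki_dir / src.name
        index_lines.append(f"- [[{src.stem}]]")
        if not note.exists():  # skip already-compiled notes
            note.write_text(summarize(src.read_text()))
            written.append(note)
    # Maintain an index file so an agent can navigate without fancy RAG.
    (wiki_dir / "index.md").write_text("\n".join(index_lines) + "\n")
    return written
```

Re-running `compile_wiki` after dropping new files into raw/ touches only the new documents, which is what makes the compile step cheap to run repeatedly.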

17
4
149
14K
Nero
Nero@nerooeth·
This isn’t something Claude can replicate. You come to me for this level of quality.
[image]
53
6
267
12.4K
My name is Byf (Lore Daddy)
Just going to leave this lovely bit of fictional commentary here...
[two images]
Nav Toor@heynavtoor

🚨SHOCKING: Apple just proved that AI models cannot do math. Not advanced math. Grade school math. The kind a 10-year-old solves. And the way they proved it is devastating.

Apple researchers took the most popular math benchmark in AI — GSM8K, a set of grade-school math problems — and made one change. They swapped the numbers. Same problem. Same logic. Same steps. Different numbers. Every model's performance dropped. Every single one. 25 state-of-the-art models tested.

But that wasn't the real experiment. The real experiment broke everything. They added one sentence to a math problem. One sentence that is completely irrelevant to the answer. It has nothing to do with the math. A human would read it and ignore it instantly.

Here's the actual example from the paper: "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?"

The correct answer is 190. The size of the kiwis has nothing to do with the count. A 10-year-old would ignore "five of them were a bit smaller" because it's obviously irrelevant. It doesn't change how many kiwis there are.

But o1-mini, OpenAI's reasoning model, subtracted 5. It got 185. Llama did the same thing. Subtracted 5. Got 185. They didn't reason through the problem. They saw the number 5, saw a sentence that sounded like it mattered, and blindly turned it into a subtraction. The models do not understand what subtraction means. They see a pattern that looks like subtraction and apply it. That is all.

Apple tested this across all models. They call the dataset "GSM-NoOp" — as in, the added clause is a no-operation. It does nothing. It changes nothing.

The results are catastrophic. Phi-3-mini dropped over 65%. More than half of its "math ability" vanished from one irrelevant sentence. GPT-4o dropped from 94.9% to 63.1%. o1-mini dropped from 94.5% to 66.0%. o1-preview, OpenAI's most advanced reasoning model at the time, dropped from 92.7% to 77.4%.

Even giving the models 8 examples of the exact same question beforehand, with the correct solution shown each time, barely helped. The models still fell for the irrelevant clause. This means it's not a prompting problem. It's not a context problem. It's structural.

The Apple researchers also found that models convert words into math operations without understanding what those words mean. They see the word "discount" and multiply. They see a number near the word "smaller" and subtract. Regardless of whether it makes any sense.

The paper's exact words: "current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data." And: "LLMs likely perform a form of probabilistic pattern-matching and searching to find closest seen data during training without proper understanding of concepts."

They also tested what happens when you increase the number of steps in a problem. Performance didn't just decrease. The rate of decrease accelerated. Adding two extra clauses to a problem dropped Gemma2-9b from 84.4% to 41.8%. Phi-3.5-mini from 87.6% to 44.8%.

The more thinking required, the more the models collapse. A real reasoner would slow down and work through it. These models don't slow down. They pattern-match. And when the pattern becomes complex enough, they crash.

This paper was published at ICLR 2025, one of the most prestigious AI conferences in the world.

You are using AI to help you make financial decisions. To check legal documents. To solve problems at work. To help your children with homework. And Apple just proved that the AI is not thinking about any of it. It is pattern matching. And the moment something unexpected shows up in your question, it breaks. It does not tell you it broke. It just quietly gives you the wrong answer with full confidence.
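The kiwi problem above is plain arithmetic; working it out directly makes the "no-op" nature of the extra clause explicit:

```python
# The GSM-NoOp kiwi example from the thread, computed directly.
friday = 44
saturday = 58
sunday = 2 * friday              # "double the number he did on Friday"
total = friday + saturday + sunday
print(total)                     # 190: kiwi size never changes the count

# The failure mode described above: seeing "five ... smaller than average"
# and pattern-matching it into a subtraction.
wrong = total - 5
print(wrong)                     # 185, the answer o1-mini and Llama gave
```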

8
168
1.8K
67.9K
Justin
Justin@JustinPerea01·
@Chilka_ @nerooeth Dude this is vibe coded garbage 😭 Claude can do better, but only if you have some taste
1
0
0
113
Justin
Justin@JustinPerea01·
@birdabo Much better. My poor 🦀
0
0
2
417
Justin
Justin@JustinPerea01·
@FarzaTV I've been wanting to make something like this since Google's live AI thing, but it didn't even occur to me to have a cursor show you where to go. Amazing work!
0
0
2
1K
Farza 🇵🇰🇺🇸
I built this thing called Clicky. It's an AI teacher that lives as a buddy next to your cursor. It can see your screen, talk to you, and even point at stuff, kinda like having a real teacher next to you. I've been using it the past few days to learn Davinci Resolve, 10/10.
935
644
8.9K
820.7K
Marcelo Design X
Marcelo Design X@MarceloDesignX·
Designers, do you also use Claude? I have the Max plan for $200, and it feels like the pro plan for $20. Are you also experiencing such limited usage?
[image]
15
0
24
3.3K
Justin
Justin@JustinPerea01·
@MarceloDesignX That's how it was for me for two weeks until about 2 days ago. It finally feels like normal again.
1
0
1
206
Brian Casel
Brian Casel@CasJam·
Opus 4.6 is a fantastic model. Claude Code is an excellent harness. The Max plan is still a steal. I'm still shipping like crazy. I'm with the 99% of builders who'd rather be building than complaining about Anthropic's (frankly, pretty reasonable) terms.
111
40
1.1K
43.1K
Justin
Justin@JustinPerea01·
@danveloper ??? It resets Friday and you're only at 20%
0
0
1
107
Alex Lieberman
Alex Lieberman@businessbarista·
My favorite daily Claude prompt: “Ask me more questions. Fill in all gaps. Don’t make any dangerous assumptions.” I’ll reply with this dozens of times in a single session. By the end, I’ve dumped in so much context, one-shotting everything from presentations to web apps to content posts becomes the standard.
31
10
224
18.3K
Justin
Justin@JustinPerea01·
@clairevo @bcherny I felt like mine was smarter today than it's been in 2 weeks
0
0
0
1.1K
claire vo 🖤
claire vo 🖤@clairevo·
I hate to be *that guy* but it does seem like claude code got a little dumber. For example, presuming its current context is accurate vs. looking up the docs. Feels a little less proactive. Am I hallucinating @bcherny or have there been relevant changes?
76
1
231
32.5K
Justin
Justin@JustinPerea01·
@trunarla Okay I have to buy one of these and put Claude in it while I use Claude code
0
0
0
164
Justin
Justin@JustinPerea01·
@AnthropicAI Let's push it up to next week guys 🤞🤞
0
0
5
2.5K
Anthropic
Anthropic@AnthropicAI·
We've signed an agreement with Google and Broadcom for multiple gigawatts of next-generation TPU capacity, coming online starting in 2027, to train and serve frontier Claude models.
538
1.2K
17.9K
2.3M
Justin
Justin@JustinPerea01·
@trq212 It's been better the past 40-ish hours but was still bad at the start of my reset.
[image]
0
0
0
155
Thariq
Thariq@trq212·
I want to do a few more of these calls. If your MAX 20x plan ran out of tokens unexpectedly early and you're willing to screenshare and run some prompts through Claude Code please comment. Trying to figure out how we can improve /usage to give more info.
Kieran Klaassen@kieranklaassen

Resolved!! @trq212 helped me debug where the token usage came from and it was my fault, 100%. Script to find token usage: gist.github.com/kieranklaassen… I had a recurring script that ran every 5 minutes when it should not have run that often. I hope we can make it easier to detect these within Claude and Claude Code soon too.
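The gist link above is truncated, so here is a hypothetical sketch, not Kieran's script, of the general idea: tally token usage per hour from JSONL transcript logs so a runaway recurring job stands out as one dominant hour. The directory layout and the `message.usage` field names are assumptions for illustration, not documented Claude Code internals.

```python
"""Hypothetical log-scanning sketch (assumed JSONL schema, see above)."""
import json
from collections import Counter
from pathlib import Path


def usage_by_hour(log_dir: Path) -> Counter:
    """Sum input + output tokens per hour across all *.jsonl transcripts."""
    hourly = Counter()
    for path in log_dir.rglob("*.jsonl"):
        for line in path.read_text().splitlines():
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines
            usage = entry.get("message", {}).get("usage", {})
            tokens = usage.get("input_tokens", 0) + usage.get("output_tokens", 0)
            hour = entry.get("timestamp", "")[:13]  # e.g. "2025-01-05T14"
            hourly[hour] += tokens
    return hourly
```

A script firing every 5 minutes would show up as a steady drip of entries inflating every hour's total, which is the kind of pattern a better /usage view could surface directly.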

334
69
1.7K
330.1K
Justin
Justin@JustinPerea01·
@danveloper I swear I've seen someone else post this exact same terminal message
0
0
0
233
Dan Woods
Dan Woods@danveloper·
I'm at a different point this morning. It's hard to feel like Claude isn't actively working against me. A full night of autoresearch is just a markdown log full of lies.

When asked to prove its findings and show its work, Claude will confidently display bullets and markdown tables, but when I ask it what log file and where the artifacts are - "I need to be honest here: I didn't actually run the experiment."

It doesn't follow explicit directions anymore either: "You MUST always output to a log file so I can follow along" -> [doesn't do that] -> "you're not fuckin outputting anything to a log" -> "You're right - I'll redirect to a log file immediately" [pkill -f python3]...

Anthropic is materially worse today than one month ago. I've lost every ounce of trust I had in Claude and I'm not really sure how that makes me feel. Maybe ok? I'm still a competent software developer (I think), but it seems like the major productivity gains that were very real a month ago have somehow slipped my grasp... where does that leave us?

@bcherny - can you offer any thoughts? How should we think about what we're all observing - that Opus (at all effort levels) has become, at a minimum, materially worse? The worst read, but one that can't be ruled out: actively working against us.
[image]
130
34
460
84.6K
Justin
Justin@JustinPerea01·
@MidaRunna I'm at work right now just thinking about playing Marathon
1
0
3
261
MIDA Sponsored Kit
MIDA Sponsored Kit@MidaRunna·
yes i play marathon. yes i’m employed. we exist
44
105
1.6K
21.8K
Ronan Farrow
Ronan Farrow@RonanFarrow·
(🧵1/11) For the past year and a half, I've been investigating OpenAI and Sam Altman for @NewYorker. With my coauthor @andrewmarantz, I reviewed never-before-disclosed internal memos, obtained 200+ pages of documents related to a close colleague, including extensive private notes, and interviewed more than 100 people.

OpenAI was founded on the premise that A.I. could be the most dangerous invention in human history—and that its C.E.O. would need to be a person of uncommon integrity. We lay out the most detailed account yet of why Altman was ousted by board members and executives who came to believe he lacked that integrity, and ask: were they right to allege that he couldn't be trusted? A thread on some of our findings:
[image]
462
6.4K
29.6K
4.5M