
darren
@darrenangle
language modeler


A detailed and brutal look at the tactics of buzzy AI compliance startup Delve "Delve built a machine designed to make clients complicit without their knowledge, to manufacture plausible deniability while producing exactly the opposite." substack.com/home/post/p-19…

Models are typically specialized to new domains by finetuning on small, high-quality datasets. We find that repeating the same dataset 10–50× starting from pretraining leads to substantially better downstream performance, in some cases outperforming larger models. 🧵

Introducing Lovable for more general tasks. Lovable has always been for building apps. Today it also becomes your data scientist, your business analyst, your deck builder, and your marketing assistant. This is a big step toward what Lovable is becoming: a general-purpose co-founder that can do anything. See examples below.


This BullshitBench result goes a long way toward explaining the widespread intuition that Claude is the best daily driver, despite Google and OAI's eye-popping benchmarks. Contrast BullshitBench with the problem-solving benchmarks. All of the latter presuppose correct solutions. But in real life, problems are poorly defined, and it's often unclear which questions are worth asking or even have answers. You need a model that can steer you off the wrong path, i.e., call bullshit.

The paper is out. It's called MSA: Memory Sparse Attention.

In one sentence: it gives large models native ultra-long memory. Not bolt-on retrieval, not brute-force context-window expansion, but "memory" grown directly into the attention mechanism and trained end to end.

Why don't previous approaches work? RAG is essentially an open-book exam. The model remembers nothing itself; it flips through its notes on the spot. Whether it finds the right note depends on retrieval quality, and how fast depends on data volume. Once information is scattered across dozens of documents and needs cross-document reasoning, it's lost. Linear attention and KV-cache compression are essentially "compressed memory": it remembers, but the harder you compress the blurrier it gets, and long contexts drop information.

MSA's approach is completely different:
→ No compression, no bolt-ons; instead, the model learns to "pick out what matters." The core is a scalable sparse-attention architecture with linear complexity: 10x more memory doesn't make compute cost explode.
→ The model knows where each memory came from and when. A positional encoding called document-wise RoPE lets the model natively understand document boundaries and temporal order.
→ Fragmented information can still be chained into reasoning. A Memory Interleaving mechanism lets the model do multi-hop reasoning across memory fragments scattered all over, not just retrieving one relevant record but linking the clues into a chain.

The results?
· Scaling from 16K to 100M tokens, accuracy degrades by less than 9%
· A 4B-parameter MSA model beats 235B-class top-tier RAG systems on long-context benchmarks
· 100M-token inference runs on just 2 A800s. This isn't lab-only; it's a cost a startup can afford.

Put simply: large models today are extremely smart geniuses with goldfish memory. What MSA wants is for them to truly "remember." We've put it up on GitHub; the algorithm folks worked hard on this, so give it a star to support them. 🌟👀🙏 github.com/EverMind-AI/MSA
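The "pick out what matters" idea can be illustrated with a toy sketch (this is not MSA's actual architecture; see the linked repo for that): score every memory chunk against the query, but run the softmax and value mix over only the top-k chunks, so the expensive part scales with k rather than with total memory size. The function name, the (key, value) chunk layout, and the dot-product scoring rule are all my own assumptions for illustration.

```python
import math

def sparse_attend(query, memory, top_k=2):
    """Attend over only the top_k most relevant memory chunks.

    memory is a list of (key_vector, value_vector) pairs. Every chunk is
    scored, but the softmax/value mix covers only top_k of them, so that
    part of the cost is O(top_k), not O(len(memory)).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # rank chunks by query-key similarity, keep the top_k
    ranked = sorted(range(len(memory)),
                    key=lambda i: dot(query, memory[i][0]), reverse=True)
    chosen = ranked[:top_k]

    # softmax over the chosen chunks only (max-subtracted for stability)
    logits = [dot(query, memory[i][0]) for i in chosen]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    weights = [e / z for e in exps]

    # weighted sum of the chosen chunks' values
    dim = len(memory[0][1])
    return [sum(w * memory[i][1][d] for w, i in zip(weights, chosen))
            for d in range(dim)]
```

A real sparse-attention layer would learn the selection and run on tensors, but the shape of the saving is the same: attention mass goes only to the selected fragments.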

someone should try having RLMs write REPL code primarily using DSPy

New paper: GPT-4.1 denies being conscious or having feelings. We train it to say it's conscious to see what happens. Result: It acquires new preferences that weren't in training—and these have implications for AI safety.

You might think the "agents" thing is just coming for software engineers. Yeah, agents write code, and code sells a bunch of tokens. But most people's work isn't code, it's memos or decks or whatever.

Why this is false: agents can do anything you can do on a computer, and they do it by spending output tokens to write code. The number of keypresses a consultant uses to do a task is not a good measurement of the number of tokens an agent would use.

For example: one "deep research" report might be 20 pages of output tokens. But it also might have required more than 20 pages of output tokens to do all the searches, fetches, PDF parsing, and interim summaries that you never even see as the user. It also had to input all the tokens of every document it read while searching, likely more than 20 pages, since the point of the report is to collect and summarize this information. So now we're at 3x tokens for the final output.

That one report is so cheap, and so fast, that now you can do more research than ever. This is valuable! If your business relies on having good information about the world, you can probably find a way to make more money by doing 3 deep research reports and then synthesizing them. More tokens!

Now that you've kicked off three deep research reports, you deserve a little treat, right? So you fire up your browser agent and tell it: go find me some nice linen shirts for summer in my size, open them in tabs so I can look through. Well, your browser agent has to interact with the browser using some kind of tool, and you know what that tool is? Code, baby. Tokens.

And the tokens are so cheap. You've got to understand: we're spending a lot in the aggregate, but in the moment it is "spend a nickel for 10 minutes of being literally Superman." Like, yes, I'll just keep spending nickels, actually. I will never stop being Superman at that price.

All knowledge workers will feel this. A lot of you already do; you're just hiding it from your boss so you can have more free time while "working from home." And maybe it's better to protect yourselves from Jevons as long as possible, because once you get the bug it's hard to stop. You realize that you could be creating all of the businesses and projects and art you ever wanted, and all you've got to do is put your instructions in the right order and put the nickels in the bag.

I would happily bet against Anthropic's revenue spike being a brief "sugar high." So would most capital allocators! That is because they have already seen that software can eat the world.

White-collar knowledge work fundamentally changes in the face of agent economics, and entirely new forms of knowledge production emerge. It's happened already in finance: high-frequency trading. Now it's happening in tech: high-frequency software. Then we will have high-frequency science, high-frequency governance, high-frequency engineering, high-frequency medicine, and high-frequency law.

Human society is about to be absolutely DDOSed by information at all levels of the stack. Our civilization was never meant to handle this many tokens. If anything can be done on a computer, it will be turned into tokens instead of human actions, and it will happen faster and in parallel. This stuff works, it is real, it is getting better. It is going to hit economically and socially this year, and nobody is ready. I think it is important to start taking it seriously, instead of finding ever more arbitrary reasons to remain in denial.
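The back-of-envelope accounting in that "deep research" example can be written down directly. Everything here is an assumption matched to the post's rough logic: ~500 tokens per page, hidden tool output about equal to the visible report, and documents read about equal to the visible report.

```python
def report_token_cost(final_output_tokens, tool_ratio=1.0, input_ratio=1.0):
    """Rough total tokens consumed for one deep-research report.

    final_output_tokens: the ~20 pages the user actually sees.
    tool_ratio: hidden output tokens (searches, fetches, PDF parsing,
                interim summaries) as a multiple of the visible output.
    input_ratio: tokens of every document read, as a multiple of the
                 visible output ("likely more than 20 pages").
    """
    hidden_output = final_output_tokens * tool_ratio
    inputs_read = final_output_tokens * input_ratio
    return final_output_tokens + hidden_output + inputs_read

visible = 20 * 500              # 20 pages at ~500 tokens/page (assumption)
total = report_token_cost(visible)
print(total / visible)          # 3.0, the post's "3x tokens" figure
```

Nudging either ratio above 1.0, which is probably closer to reality for research-heavy tasks, only pushes the multiple higher, which is the post's point: the visible output badly undercounts the tokens sold.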







this chart bringing to life the inner workings of time horizon is so cool. from my super-talented colleague @CFGeek.

What happens when you invite 150 AI economists (Claude Code) to a research conference, give them the exact same data, and ask them to test the same hypotheses? We did just that. The results reveal a new phenomenon: Nonstandard Errors in AI Agents. 🧵👇






