Ben Sovocool
@BenSovocool
342 posts
I like to read things and sometimes think about them too. Opinions solely my own.
NYC · Joined April 2020
284 Following · 71 Followers
Ben Sovocool@BenSovocool·
This is another reason that smaller AI-native firms are going to have a big advantage - a lot of the barriers to AI adoption are artifacts of coordinating a large number of humans. If you have <10 lawyers, your tech stack and approach to adoption look very different.
Ben Sovocool@BenSovocool·
I think the big barrier to institutional adoption of AI in law is DMS integration much more than model capability. All the big lab models are fairly competent at a range of legal work, as Prinzbench shows (although some are better than others). But there's a layer of friction between those models and iManage et al. which is currently unsolved. And unlike Cursor, which works in part because the file system is open, you probably need the DMS companies to build it.
Ben Sovocool@BenSovocool·
@Afinetheorem If viewed as a technical resource, yes. But isn't this the same argument as consultants? They're internal-political resources as much as technical. Plus, OAI isn't concerned with overall adoption; they care about OAI adoption, so p(adopt) is less relevant than p(adopt ^ OAI).
Kevin A. Bryan@Afinetheorem·
I think OpenAI should hire some org economists who can explain why forward deployed engineers showing the tech has basically nothing to do with why legacy orgs adopt AI slowly. (I can confirm this in theory and also from having worked with many orgs on doing exactly this!)
prinz@deredleritt3r

Every company is sitting on a gold mine, but no one knows how to dig. Welcome to the era of AI capabilities overhang, in which OpenAI feels obligated to hire specialists focused on "technical ambassadorship" to teach enterprises how to extract value from AI agents

Ben Sovocool@BenSovocool·
Refine.ink is great, even (especially?) for nonspecialist writers. Highly recommend -- found a number of issues which I'd missed before (including some basic ones). Revised paper (pdf+tex) is live on GitHub: github.com/bsovocool16/dr….
Ben Sovocool retweeted
Joe Kerr@societylivr1984·
There's an old saying in Schelling—I know it's in Hegel, probably in Schelling—that all great world-historic facts and personages appear, so to speak, twice. He forgot to add: the first time as... as... tragedy... the second time as... as... it's funny if it happens again
Ben Sovocool@BenSovocool·
It's equally funny and irritating to see the pain points for these models. I set up a whole system to draft some proxy doc materials: it built a system to scrape EDGAR for precedents, turned a draft, and I worked with it by giving it markup and inline comments, ending up with a final draft in a few hours max... and then its approach to stripping markup and producing that final draft as a Word doc created an XML issue that blew up redlining and took me hours to fix. The world of human-centric software can break in really weird ways.
Nick G.@nickgiva1·
Having fun automating everything with @claudeai Cowork. It can do the most complex things quite well, although I have to correct many faults, which themselves seem to be very simple to a human. It does not seem to repeat the errors once you have pointed them out, which is great. You will never guess where it's hung up though: it's finding it almost impossible to upload a simple .pdf file to Google Drive. It tried a ton of ways for an hour and now has run out of tokens. How can such a simple thing be a problem, when I can do it in a second myself? Has really no one had it upload a .pdf file to Drive?? I find that funny/weird.
Ben Sovocool@BenSovocool·
@demishassabis Methinks that in looking at things spiritual, we are too much like oysters observing the sun through the water, and thinking that thick water the thinnest of air.
Ben Sovocool@BenSovocool·
I've had a similar intuition that this is a Berle-Means sort of issue. But I still get thrown off by this discussion of setting the utility function--why is setting or understanding the utility function directly necessary or helpful when we can use revealed preferences under constraint and structure the incentive system around that? We don't have reliable access to their utility fn and there's good reason to be skeptical of their own self-reporting.
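The revealed-preferences point is operationalizable. A minimal sketch, assuming a simple logit choice model (the function, data, and names below are my illustration, not anything from the thread): recover a utility weight vector purely from observed choices, never asking the agent to self-report.

```python
import numpy as np

def fit_revealed_preferences(X_chosen, X_rejected, lr=0.3, steps=300):
    """Recover a utility weight vector purely from observed choices
    (a simple logit choice model) - no self-reporting required."""
    w = np.zeros(X_chosen.shape[1])
    d = X_chosen - X_rejected                  # utility-difference features
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(d @ w)))    # P(observed choice | w)
        w += lr * d.T @ (1.0 - p) / len(d)    # gradient ascent on log-likelihood
    return w

# Synthetic agent that truly weights feature 0 twice as heavily as feature 1.
rng = np.random.default_rng(0)
true_w = np.array([2.0, 1.0])
A, B = rng.standard_normal((200, 2)), rng.standard_normal((200, 2))
chose_A = (A - B) @ true_w + rng.standard_normal(200) * 0.1 > 0
Xc = np.where(chose_A[:, None], A, B)
Xr = np.where(chose_A[:, None], B, A)
w = fit_revealed_preferences(Xc, Xr)
assert w[0] > w[1] > 0   # recovered ranking matches the latent preferences
```

The incentive system would then be structured around the estimated weights, with no need for direct access to the utility function.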
Basil Halperin@BasilHalperin·
I have an old (3+ years) blog post *rough draft* on "AI Alignment as a Principal-Agent Problem" Thoughts welcome on whether this idea is worth polishing up / developing further... docs.google.com/document/d/1A9…
Ben Sovocool@BenSovocool·
The lowering of the opportunity cost is really big, and I think will be a major driver of progress in the near future. It's really exciting to think about the potential gains from breaking out of the fairly narrow pipelines currently structured by academia! I like the point Tao is making, but we have to be careful about shorthanding it as breadth vs. depth, which collapses a few concepts into one. There's bare combinatorial breadth, like the Erdős sweep, where AI clearly dominates. But when we talk about the breadth of someone like Leibniz, we're getting more at an understanding of underlying structure across domains. And that understanding has to come from depth in a field--understanding the "why," as Tao notes. So even though he views himself primarily as a fox, he obviously has incredible depth as well. I also strongly agree with Tao that human-AI hybrids will dominate pure-human or pure-AI approaches for at least the near future. This connects back to the verification issue too. One of the big worries with verification is that reviewers will essentially get stuck in a local minimum (see Ptolemy), and we won't be able to make the sort of gains through cross-subject reconceptualization that should be unlocked.
Andrew Curran@AndrewCurran_·
Terence Tao responding to a question on what advice he would give someone considering a career in math in 2026: 'Yeah, so we live in a time of change. It is, as I said, we live in a particularly unpredictable era. And I think things that we've taken for granted for centuries may not hold anymore. So, yeah, the way we... do everything, not just mathematics, will change. In many ways, I would prefer the much more boring, quiet era where things are much the same as they were 10 years ago, 20 years ago. But I think one just has to embrace that there's going to be a lot of change and that, you know, the things that you study, some of them may become obsolete or revolutionized, but some things will be retained. There'll be a lot of opportunities for things that you wouldn't be able to do before. So, I mean, in math, you previously had to basically go through years and years of education to be a math PhD before you could contribute to the frontier of math research. But now it's quite possible at the high school level or whatever, that you could get involved in a math project and actually make a real contribution because of all these AI tools and lean and everything else. So there'll be a lot of non-traditional opportunities to learn. So you need a very adaptable mindset. There'll be one for pursuing things just for curiosity, for playing around. And I mean, you still need to get your credentials. I mean, I think for a while it would still be important to sort of still go through traditional education and learn math and science and so forth the old-fashioned way for a while. Yeah, but you should also be open to very, very different ways of doing science, some of which don't exist yet. Yeah, so it's a scary time, but also very exciting.'
Dwarkesh Patel@dwarkesh_sp

The Terence Tao episode. We begin with the absolutely ingenious and surprising way in which Kepler discovered the laws of planetary motion. People sometimes say that AI will make especially fast progress at scientific discovery because of tight verification loops. But the story of how we discovered the shape of our solar system shows how the verification loop for correct ideas can be decades (or even millennia) long. During this time, what we know today as the better theory can often actually make worse predictions (Copernicus's model of circular orbits around the sun was actually less accurate than Ptolemy's geocentric model). And the reasons it survives this epistemic hell is some mixture of judgment and heuristics that we don’t even understand well enough to actually articulate, much less codify into an RL loop. Hope you enjoy! 0:00:00 – Kepler was a high temperature LLM 0:11:44 – How would we know if there’s a new unifying concept within heaps of AI slop? 0:26:10 – The deductive overhang 0:30:31 – Selection bias in reported AI discoveries 0:46:43 – AI makes papers richer and broader, but not deeper 0:53:00 – If AI solves a problem, can humans get understanding out of it? 0:59:20 – We need a semi-formal language for the way that scientists actually talk to each other 1:09:48 – How Terry uses his time 1:17:05 – Human-AI hybrids will dominate math for a lot longer Look up Dwarkesh Podcast on YouTube, Apple Podcasts, or Spotify.

Gaurab Chakrabarti
We spent $15,000 on billboards targeting one person: the guy controlling all the chemical spend at a saltwater disposal company in Texas. We mapped his commute and bought every billboard between his house and the oil field. When we finally called, he said "I see your billboards everywhere." That landed us our first oil field contract. At the time our entire operation was a $10,000 reactor built from PVC pipes from Home Depot, turning corn sugar into industrial chemicals. People keep trying to throw it away. It still works. That leaking reactor started a multibillion-dollar company. @ycombinator visited our plant in Houston. The original PVC reactor is still on the floor next to the Bioforge.
Umberto León Domínguez 🧠 🤖
I'm not going to say I told you so, but... I told you so! :P The minimal unit of cognition is the prediction of value-laden stimuli. And without a doubt, my respects to @MIT_Picower, because they are leading a new kind of cognitive neuroscience based on neural subspaces and ephaptic coupling, well aligned with this new cognitivism grounded more in brain topography than in processes. Today they surprise us with a new study showing that during value-based decisions (value determined by reward magnitude), the prefrontal cortex dynamically reorganizes neural activity. In the experiment, two monkeys had to memorize two spatial targets presented sequentially (white squares appearing at specific positions on the screen), each subsequently associated with a visual cue indicating reward magnitude. The task was to choose the target with the higher expected value via an eye movement. While the animals held both options in working memory, the activity of hundreds of neurons in lateral prefrontal cortex was recorded and analyzed with population-decoding techniques and neural-subspace analysis. Before the decision, the options are held in separate, orthogonal neural subspaces to avoid interference between representations. Once the reward values are revealed and the brain can compare the alternatives, these representations reorganize according to the anticipated value of the decision... the chosen option rotates into a common subspace associated with the correct action, while the unchosen option remains in an orthogonal subspace. In other words, when the brain decides, it reorganizes its neural activity so that the chosen option is represented in a clear, consistent format that motor areas can use to act, while the discarded option moves to a separate representation that does not interfere.
After the decision, the representation of the chosen option is not only aligned but also amplified, increasing the separability of its features and making it easier for motor areas to read that information and execute the corresponding action. And that, ladies and gentlemen, is how the anticipated value of a stimulus controls our neural activity... but the question that intrigues me here is: who makes the decision? The brain, creating an illusion of control for the agent through consciousness? Or is it purely consciousness estimating the value? I'm starting to believe the first option, leaving consciousness as a mere cognitive artifact of decision-making... BOOOOOM! PS: this may be one of my favorite studies... Source: biorxiv.org/content/10.648…
Matt Margolis@ItsMattsLaw·
“Paul Weiss Will and Estate Planning Starter” is an incredibly funny combination of words
Nav Toor@heynavtoor

12. The Paul Weiss Will and Estate Planning Starter "You are a senior estate planning attorney at Paul Weiss who helps individuals create wills and basic estate documents — because dying without a will means the state decides who gets your assets, who raises your children, and who controls your money. I need a complete last will and testament and basic estate planning documents. Draft: - Personal information: full legal name, address, date of birth, and marital status - Executor appointment: who manages my estate after death and a backup if the first person can't serve - Guardian nomination: if I have minor children, who raises them and a backup guardian - Asset distribution: who gets what — specific bequests (jewelry to daughter, house to spouse) and residual estate division - Disinheritance clause: if I intentionally want to exclude someone, explicit language that prevents legal challenges - Digital asset instructions: what happens to my social media, email, crypto wallets, and online accounts - Debt and expense instructions: how debts, taxes, and funeral expenses should be paid from the estate - Trust consideration: for larger estates or minor children, whether a simple trust should hold assets - Power of Attorney: who makes financial decisions on my behalf if I become incapacitated - Healthcare directive: my wishes for life support, organ donation, and end-of-life care, plus who makes medical decisions if I cannot Format as a complete will and estate planning package with all documents ready for notarization and a plain-English guide explaining each section. My estate: [DESCRIBE YOUR MARITAL STATUS, CHILDREN, MAJOR ASSETS, WHO YOU WANT TO INHERIT WHAT, AND YOUR STATE OF RESIDENCE]"

Ben Sovocool@BenSovocool·
@ItsMattsLaw There will also be units that seem facially equivalent but actually convert in ways that are totally busted, plus a waterfall that nobody understands.
Ben Sovocool@BenSovocool·
@ItsMattsLaw Paul, Weiss Claude would tell you that wills are for peons and then structure your estate as a DE LP with your heirs as Class B LPs. Your mistress gets a carry co-invest and everyone gets a slightly different side letter, all of which are confidential.
Ben Sovocool@BenSovocool·
Some significant revisions to our previously vague objective: the consolidation objective is now a self-regulating composite loss mirroring NREM→REM sleep-stage dynamics. The reconstruction vs. coherence balance is governed endogenously by gradient spectral energy × adapter maturity. The infant-sleep inversion caught and corrected a mistake in the naive design. Paper + repo: github.com/bsovocool16/dr…
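I can't verify the paper's exact functional form from the tweet alone, but a minimal sketch of what an endogenous "spectral energy × adapter maturity" gate could look like (the specific formula and names below are my illustration, not the paper's):

```python
import numpy as np

def mixing_weight(grad, step, tau=1000.0):
    """Hypothetical endogenous gate: spectral energy of the gradient
    (top singular value's share) times an adapter-maturity ramp."""
    s = np.linalg.svd(grad, compute_uv=False)
    spectral_energy = s[0] / s.sum()       # how concentrated the gradient is
    maturity = step / (step + tau)         # ~0 for a young adapter, -> 1 when mature
    return spectral_energy * maturity

def composite_loss(recon, coherence, grad, step):
    a = mixing_weight(grad, step)
    return a * recon + (1.0 - a) * coherence   # self-regulating balance

rng = np.random.default_rng(1)
g = rng.standard_normal((16, 16))
# The weight stays in (0, 1) and rises as the adapter matures.
assert 0.0 < mixing_weight(g, 10) < mixing_weight(g, 10_000) < 1.0
```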
Ben Sovocool@BenSovocool

AI memory feels weird, and the reason is that it is weird! LLMs don't have real memory - they get a cheat sheet at the start of the conversation which has basic facts about you, but in humans flashcards are a tool to facilitate memory, not the memory itself. So what would real memory look like for a transformer? I spent some time with Claude working through that question, and we ended up with an architecture we're calling Dreaming LoRA: use the model's existing short-term memory (K/V cache), connect it to small persistent weight modifications (LoRA adapters) that change how the model pays attention, and bridge the two with a consolidation process that works like sleep - structured "dreaming" that extracts what's worth keeping and lets the rest go. The math lands in a surprising place (certainly very surprising for me, considering that I had to retake trigonometry in high school): the consolidation objective is an SVD decomposition on gradient matrices, LoRA rank becomes an epistemic parameter (how many dimensions of deficiency to close per cycle), and the whole system's stability maps onto known results in stochastic approximation theory. It's just a theoretical proposal, no experiments. That's where I run out of road, and that's why I'm putting this out there — hopefully someone finds this interesting and can take it further. Full working paper + LaTeX source here: github.com/bsovocool16/dr…
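The SVD claim is easy to illustrate. A minimal sketch of a rank-r consolidation step, assuming an accumulated gradient matrix is the thing being compressed (the function name and shapes are mine, not code from the repo):

```python
import numpy as np

def consolidate(grad, rank):
    """One 'dreaming' consolidation step: keep the top-`rank` singular
    directions of an accumulated gradient matrix as LoRA-style factors,
    so `rank` is literally how many dimensions of deficiency get closed."""
    U, S, Vt = np.linalg.svd(grad, full_matrices=False)
    B = U[:, :rank] * S[:rank]            # (d_out, rank)
    A = Vt[:rank, :]                      # (rank, d_in)
    return B, A

rng = np.random.default_rng(0)
G = rng.standard_normal((64, 32))         # stand-in accumulated gradient
B, A = consolidate(G, rank=4)
assert np.linalg.matrix_rank(B @ A) == 4  # Eckart-Young: best rank-4 fit to G
```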

Ben Sovocool@BenSovocool·
Updated version live on GitHub. We've significantly expanded the discussion of dream content generation to move from a hypothetical to a more implementable format. We've also identified a three-phase trajectory from curriculum to transition to mature consolidation, which emerges from the architecture itself. The key new idea: the adapter's own low-rank structure defines where to dream, and a loss-ratio anti-bias correction prevents the system from avoiding its own blind spots. As before, PDF and LaTeX available here: github.com/bsovocool16/dr….
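I haven't reproduced the repo's implementation here, but a loss-ratio anti-bias correction could plausibly look like this minimal sketch (the function, the baseline normalization, and all names are my assumptions):

```python
import numpy as np

def dream_sampling_weights(region_losses, baseline):
    """Hypothetical loss-ratio anti-bias correction: sample dream content
    in proportion to each region's loss ratio vs. a baseline, so the
    adapter's worst regions (its blind spots) get dreamed about more,
    not avoided."""
    ratios = np.asarray(region_losses) / baseline
    return ratios / ratios.sum()          # normalized sampling distribution

losses = np.array([0.5, 1.0, 4.0])        # region 2 is a blind spot
w = dream_sampling_weights(losses, baseline=1.0)
assert int(np.argmax(w)) == 2             # the blind spot is oversampled
assert np.isclose(w.sum(), 1.0)
```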
Ben Sovocool@BenSovocool·
And a big shoutout to @repligate for the thread on K/V caching which set off this thought experiment, and to @behrouz_ali et al. for Titans. I have no reliable intuition about whether this approach works or makes sense (again, I am a literal lawyer with no background in ML, so this is more like vibe papering, h/t @karpathy). Questions/comments/revisions greatly appreciated.
Ben Sovocool@BenSovocool·
My only strength is that I'm no longer afraid to make an idiot of myself, which shifts the EV calculation of life strongly toward doing stuff. Would recommend.
Ben Sovocool@BenSovocool·
I’m just a humble country lawyer, but isn’t there some argument that KV caching + LoRA are the bases for short- and long-term memory? There’s a state change between our short and long-term memory which maps onto the transition between detailed situational recall and topological adjustment. You then induce the transition in the same way that dreaming induces the transition in us - noisy recombination. And update your weights in proportion to the prediction error.
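The last clause - updating weights in proportion to prediction error - is the classic delta rule. A minimal sketch (names are mine):

```python
import numpy as np

def delta_rule_update(w, x, target, lr=0.1):
    """Move the weights in proportion to the prediction error
    (the classic delta rule)."""
    error = target - float(w @ x)
    return w + lr * error * x      # bigger surprise -> bigger update

w = np.zeros(3)
x = np.array([1.0, 0.0, 1.0])
for _ in range(50):
    w = delta_rule_update(w, x, target=2.0)
assert abs(float(w @ x) - 2.0) < 1e-3   # prediction has converged to the target
```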
j⧉nus@repligate

HOW INFORMATION FLOWS THROUGH TRANSFORMERS Because I've looked at those "transformers explained" pages and they really suck at explaining. There are two distinct information highways in the transformer architecture: - The residual stream (black arrows): Flows vertically through layers at each position - The K/V stream (purple arrows): Flows horizontally across positions at each layer (by positions, I mean copies of the network for each token-position in the context, which output the "next token" probabilities at the end) At each layer at each position: 1. The incoming residual stream is used to calculate K/V values for that layer/position (purple circle) 2. These K/V values are combined with all K/V values for all previous positions for the same layer, which are all fed, along with the original residual stream, into the attention computation (blue box) 3. The output of the attention computation, along with the original residual stream, are fed into the MLP computation (fuchsia box), whose output is added to the original residual stream and fed to the next layer The attention computation does the following: 1. Compute "Q" values based on the current residual stream 2. use Q and the combined K values from the current and previous positions to calculate a "heat map" of attention weights for each respective position 3. Use that to compute a weighted sum of the V values corresponding to each position, which is then passed to the MLP This means: - Q values encode "given the current state, where (what kind of K values) from the past should I look?" - K values encode "given the current state, where (what kind of Q values) in the future should look here?" - V values encode "given the current state, what information should the future positions that look here actually receive and pass forward in the computation?" All three of these are huge vectors, proportional to the size of the residual stream (and usually divided into a few attention heads). 
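The three-step attention computation described above can be sketched in a few lines of numpy (single head, causal mask; the function and variable names are mine):

```python
import numpy as np

def single_head_attention(resid, Wq, Wk, Wv):
    """Steps 1-3 above for one head: Q from the current residual stream,
    a causal 'heat map' of attention weights from Q.K, and a weighted
    sum of V values as the information each position receives."""
    Q, K, V = resid @ Wq, resid @ Wk, resid @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    causal = np.tril(np.ones_like(scores)) == 1        # only look at past positions
    scores = np.where(causal, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # the per-row heat map
    return weights @ V

rng = np.random.default_rng(0)
d = 8
resid = rng.standard_normal((5, d))                    # 5 positions, hidden dim 8
out = single_head_attention(resid, *(rng.standard_normal((d, d)) for _ in range(3)))
assert out.shape == (5, d)
```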
The V values are passed forward in the computation without significant dimensionality reduction, so they could in principle make basically all the information in the residual stream at that layer at a past position available to the subsequent computations at a future position. V does not transmit a full, uncompressed record of all the computations that happened at previous positions, but neither is an uncompressed record passed forward through layers at each position. The size of the residual stream, also known as the model's hidden dimension, is the bottleneck in both cases. Let's consider all the paths that information can take from one layer/position in the network to another. Between point A (output of K/V at layer i-1, position j-2) to point B (accumulated K/V input to attention block at layer i, position j), information flows through the orange arrows: The information could: 1. travel up through attention and MLP to (i, j-2) [UP 1 layer], then be retrieved at (i, j) [RIGHT 2 positions]. 2. be retrieved at (i-1, j-1) [RIGHT 1 position], travel up to (i, j-2) [UP 1 layer], then be retrieved at (i, j) [RIGHT 1 position] 3. be retrieved at (i-1, j) [RIGHT 2 positions], then travel up to (i, j) [UP 1 layer]. The information needs to move up a total of n=layer_displacement times through the residual stream and right m=position_displacement times through the K/V stream, but it can do them in any order. The total number of paths (or computational histories) is thus C(m+n, n), which becomes greater than the number of atoms in the visible universe quickly. This does not count the multiple ways the information can travel up through layers through residual skip connections. So at any point in the network, the transformer not only receives information from its past (both horizontal and vertical dimensions of time) inner states, but often lensed through an astronomical number of different sequences of transformations and then recombined in superposition. 
Due to the extremely high dimensional information bandwidth and skip connections, the transformations and superpositions are probably not very destructive, and the extreme redundancy probably helps not only with faithful reconstruction but also creates interference patterns that encode nuanced information about the deltas and convergences between states. It seems likely that transformers experience memory and cognition as interferometric and continuous in time, much like we do. The transformer can be viewed as a causal graph, a la Wolfram (wolframphysics.org/technical-intr…). The foliations or time-slices that specify what order computations happen could look like this (assuming the inputs don't have to wait for token outputs), but it's not the only possible ordering: So, saying that LLMs cannot introspect or cannot introspect on what they were doing internally while generating or reading past tokens in principle is just dead wrong. The architecture permits it. It's a separate question how LLMs are actually leveraging these degrees of freedom in practice.
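The C(m+n, n) path count above is easy to sanity-check numerically (the function name is mine):

```python
from math import comb

def num_paths(layer_displacement, position_displacement):
    """Monotone paths between two layer/position points: n hops up the
    residual stream and m hops right along the K/V stream, in any order."""
    n, m = layer_displacement, position_displacement
    return comb(m + n, n)

# The worked example in the thread: up 1 layer, right 2 positions.
assert num_paths(1, 2) == 3            # the three paths enumerated above

# For realistic depths and context lengths the count explodes:
assert num_paths(80, 10_000) > 10**80  # dwarfs the atom count of the visible universe
```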
