/sesh/null

587 posts

@nerdsane

VP, Observability-Data@datadoghq | peripatetic | minimalist { engineer | athlete | artist } | I have opinions-of-my-own

New York, NY · Joined December 2017

681 Following · 235 Followers

/sesh/null retweeted
Dylan Garcia @_dylanga
The first thing I did at @tryramp was set up distributed tracing, structured logging, and metrics for Inspect, our background coding agent. We now have full visibility into everything the system is doing: the browser, CF workers/DOs, @modal sandboxes, database calls, etc. Most importantly, Inspect now has visibility into itself. It can self-triage runtime errors it encounters and create PRs to fix them. Every morning, it reviews the past 24 hours of its own @datadoghq dashboard, identifies systemic issues, new errors, and long-tail latencies, and has a summary + PR waiting for me at 9am.
22 · 20 · 427 · 44.3K
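The daily self-triage loop described in the tweet above can be sketched roughly as follows. This is a hypothetical illustration, not Ramp's actual code: the function names, the stubbed error counts, and the threshold are all invented for the example, and a real system would query the observability backend (e.g. Datadog metrics/logs) instead of returning canned data.

```python
# Hypothetical sketch of a morning self-triage loop for a coding agent.
# fetch_error_counts and triage are illustrative names, not a real API.

def fetch_error_counts(window_hours=24):
    # Stand-in for querying an observability backend for per-service
    # error counts over the last `window_hours` hours.
    return {"browser": 2, "cf-worker": 0, "modal-sandbox": 7}

def triage(counts, threshold=5):
    # Flag services whose error count exceeds the threshold.
    return [svc for svc, n in counts.items() if n > threshold]

def morning_report():
    counts = fetch_error_counts()
    flagged = triage(counts)
    lines = [f"Errors in last 24h: {counts}"]
    if flagged:
        lines.append(f"Systemic issues flagged: {', '.join(flagged)}")
    return "\n".join(lines)

print(morning_report())
```

In the real system described above, the flagged issues would feed into summary generation and automated PR creation rather than a printed report.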
/sesh/null @nerdsane
Our self-improving system leverages Shinka Evolve from @SakanaAILabs underneath. “On a second workload—100 different group-by tag combinations on the same metric—the improvement reached 541% over the baseline. The pattern mirrors ShinkaEvolve’s dynamics: Most generations explore incrementally, but occasional mutations discover qualitatively different algorithms” x.com/nerdsane/statu…
Sakana AI@SakanaAILabs

“When AI Discovers the Next Transformer” Robert Lange (Sakana AI) joins Tim Scarfe (@MLStreetTalk) to discuss Shinka Evolve, a framework that combines LLMs with evolutionary algorithms to do open-ended program search. Full Video: youtu.be/EInEmGaMRLc

0 · 0 · 0 · 74
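The "most generations explore incrementally, but occasional mutations discover qualitatively different algorithms" dynamic quoted above can be illustrated with a toy evolutionary loop. This is a minimal sketch in the spirit of ShinkaEvolve-style program search, not its implementation: the fitness function and mutation operators are stand-ins, with real systems mutating programs via an LLM rather than integers.

```python
import random

# Toy evolutionary search: frequent small mutations, rare large jumps.

def fitness(x):
    # Toy objective: maximize -(x - 42)^2, optimum at x = 42.
    return -(x - 42) ** 2

def mutate(x, rng):
    if rng.random() < 0.1:
        # Rare large mutation: the analogue of discovering a
        # qualitatively different algorithm.
        return rng.randint(-100, 100)
    # Common incremental tweak.
    return x + rng.choice([-1, 1])

def evolve(generations=500, seed=0):
    rng = random.Random(seed)
    best = rng.randint(-100, 100)
    for _ in range(generations):
        child = mutate(best, rng)
        if fitness(child) > fitness(best):  # keep only improvements
            best = child
    return best

print(evolve())  # converges toward the optimum at 42
```

The greedy accept-if-better rule here is the simplest possible selection strategy; evolutionary frameworks maintain populations and archives instead, but the incremental-plus-rare-jump mutation mix is the dynamic the tweet describes.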
/sesh/null retweeted
Debasish (দেবাশিস্) Ghosh 🇮🇳
Datadog has been working on the verifiability of the agentic coding loop with observability-driven harnesses. They are working on BitsEvolve, inspired by Google DeepMind's AlphaEvolve: an evolutionary algorithm that mutates code, evaluates each variant against a benchmark, and iterates. Here are some interesting blog posts that describe their approach and progress:
1. Closing the verification loop: Observability-driven harnesses for building with agents (datadoghq.com/blog/ai/harnes…)
2. From hand-tuned Go to self-optimizing code: Building BitsEvolve (datadoghq.com/blog/engineeri…)
3. Closing the verification loop, Part 2: Fully autonomous optimization (datadoghq.com/blog/ai/fully-…)
0 · 6 · 57 · 3.5K
/sesh/null @nerdsane
We had AI agents build functional equivalents of Redis and Kafka from scratch in a few days, with formal verification, deterministic simulation testing, and TLA+ specs. Then we let them optimize a live production service. Autonomously! What we're seeing looks like the early industrialization of software engineering, and at its core, observability is becoming the control layer for agent-produced software.

In late 2025, while working on BitsEvolve, our LLM-backed evolutionary optimizer, we (like many others reported) noticed a step-function improvement in model capabilities and saw an opportunity to raise our ambitions. We wanted to see exactly how far we could push agent-driven systems engineering toward whole distributed systems. So we built functional equivalents of Redis and Kafka with different design decisions and trade-offs, and shadow-tested them with our workloads. We can report firsthand that the capabilities have reached a point where it is possible to transition toward self-evolving architectures that continuously measure, adapt, and optimize themselves. At the core of this shift, observability is emerging as the explicit feedback-control mechanism for agent-produced software.

We have documented our methodology and the resulting codebases (redis-rust & Helix) in a two-part technical series, detailing how we safely empowered AI agents to build, test, and optimize complex distributed systems.

Part 1: We use "harness-first engineering" to build complex infrastructure. By defining strict system invariants upfront and building rigorous automated harnesses, using deterministic simulation testing, formal specifications (TLA+), and observability-driven feedback loops, we enable AI agents to iterate autonomously against these constraints.

Part 2: We then extended this verification framework directly into active environments. Using BitsEvolve, we implemented fully autonomous optimization for our time-series aggregation service. The system actively proposes algorithmic improvements, formally verifies safety properties, shadow-evaluates against live traffic, and hot-swaps improved WebAssembly modules. By enabling the LLM to discover and deploy structural algorithmic changes (such as shifting from O(N) iterations to O(1) lookups), we achieved performance improvements of up to 5x on targeted workloads.

People involved, in no particular order: @atalwalkar @Keleesssss Arun Parthiban Jai Menon Ming Chen Vyom Shah. If you are curious and want to compare notes on how we can raise the bar by pushing rigor in agent-built software, please give it a read and hit us up with your thoughts. Links in comments
4 · 3 · 38 · 3K
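The kind of structural O(N)-to-O(1) rewrite mentioned above, paired with a shadow check that the candidate agrees with the baseline, can be sketched in miniature. The data shapes and tag values here are invented for the example; this is not BitsEvolve's actual code or Datadog's time-series service.

```python
# Illustrative O(N) scan vs. O(1) hash-index lookup for resolving a
# group-by tag combination, with a shadow-style equivalence check.

# Invented example data: 10,000 series, each tagged with a host and env.
series = [
    {"id": i, "tags": ("host:h%d" % (i % 10), "env:prod")}
    for i in range(10_000)
]

def lookup_scan(tags):
    # Baseline: O(N) scan over every series on each query.
    return [s["id"] for s in series if s["tags"] == tags]

# Candidate: build the index once; each query becomes an O(1) dict lookup.
index = {}
for s in series:
    index.setdefault(s["tags"], []).append(s["id"])

def lookup_indexed(tags):
    return index.get(tags, [])

q = ("host:h3", "env:prod")
# Shadow check: the optimized path must return the same answer as the
# baseline before it can be promoted.
assert lookup_scan(q) == lookup_indexed(q)
```

In the system described above, this agreement check runs against mirrored live traffic, and only verified-safe candidates are hot-swapped in as WebAssembly modules.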
geoff @GeoffreyHuntley
In Loom, yeah: TLA+ and model checkers. In Latent Patterns, just QuickCheck is good enough for now. I need to do some more engineering to perform mutation testing and get into the realm of DST. But at this stage, all the engineering foundations are in place that make it sound enough to launch. So I'm moving on to other aspects of the business now: sales automation, marketing, and all that.
2 · 0 · 2 · 646
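The QuickCheck-style property testing mentioned above boils down to generating random inputs and checking an invariant, rather than hand-writing example cases. A bare-bones sketch, using a round-trip property on an invented run-length encoder (real tools like QuickCheck and Hypothesis add input shrinking and smarter generators):

```python
import random

# Example functions under test: a toy run-length codec.
def run_length_encode(s):
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1][1] += 1
        else:
            out.append([ch, 1])
    return out

def run_length_decode(pairs):
    return "".join(ch * n for ch, n in pairs)

# Property check: decode(encode(s)) round-trips for random inputs.
rng = random.Random(1)
for _ in range(200):
    s = "".join(rng.choice("ab") for _ in range(rng.randint(0, 12)))
    assert run_length_decode(run_length_encode(s)) == s

print("200 random cases passed")
```

Mutation testing, the next step geoff mentions, then flips the question: deliberately break the code under test and check that some property fails, measuring how much the test suite actually constrains behavior.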
/sesh/null retweeted
swyx @swyx
Lots more things Are Database than you think:
- Git
- Datadog: 800 db's in a UI trenchcoat
- Temporal: basically Stored Procedures
- Notion: vaguely remember @sliminality's "database monetizing as a productivity app" quote
- Honeycomb: @mipsytipsy spent 2 years Building Database
8 · 6 · 101 · 0
Sajid Mehmood @smehmood
Guess we’re all calling them “claws” from now on
Andrej Karpathy @karpathy

Bought a new Mac mini to properly tinker with claws over the weekend. The apple store person told me they are selling like hotcakes and everyone is confused :)

I'm definitely a bit sus'd to run OpenClaw specifically - giving my private data/keys to 400K lines of vibe coded monster that is being actively attacked at scale is not very appealing at all. Already seeing reports of exposed instances, RCE vulnerabilities, supply chain poisoning, malicious or compromised skills in the registry, it feels like a complete wild west and a security nightmare.

But I do love the concept and I think that just like LLM agents were a new layer on top of LLMs, Claws are now a new layer on top of LLM agents, taking the orchestration, scheduling, context, tool calls and a kind of persistence to a next level.

Looking around, and given that the high level idea is clear, there are a lot of smaller Claws starting to pop out. For example, on a quick skim NanoClaw looks really interesting in that the core engine is ~4000 lines of code (fits into both my head and that of AI agents, so it feels manageable, auditable, flexible, etc.) and runs everything in containers by default. I also love their approach to configurability - it's not done via config files, it's done via skills! For example, /add-telegram instructs your AI agent how to modify the actual code to integrate Telegram. I haven't come across this yet and it slightly blew my mind earlier today as a new, AI-enabled approach to preventing config mess and if-then-else monsters. Basically - the implied new meta is to write the most maximally forkable repo and then have skills that fork it into any desired more exotic configuration. Very cool.

Anyway there are many others - e.g. nanobot, zeroclaw, ironclaw, picoclaw (lol @ prefixes). There are also cloud-hosted alternatives but tbh I don't love these because it feels much harder to tinker with. In particular, local setup allows easy connection to home automation gadgets on the local network. And I don't know, there is something aesthetically pleasing about there being a physical device 'possessed' by a little ghost of a personal digital house elf.

Not 100% sure what my setup ends up looking like just yet but Claws are an awesome, exciting new layer of the AI stack.

1 · 0 · 5 · 498
Arjun Narayan @narayanarjun
I'm optimistic that formal verification is the solution to our current situation, where LLMs are writing our code and nobody's reading it. Formal methods can give us a world where we write succinct specs and agent-generated code is proven to comply. But we have a long way to go. There are several open challenges that stand between our situation today and that future, but none appear insurmountable. I've written a brief overview of what I consider to be the big open problems, and some of the directions that researchers are taking today to address them: from verifying mathematics to building standard libraries of verified code that can be built upon. Here are a few highlights:

1) A brief history of formal verification. Verification is fundamentally about understanding what your program can or can't do, and verifying it with a proof. In order to verify, you must first have a specification that you are verifying your program against. Most of you leverage some formal verification day to day: namely, some of the compiler errors in statically typed languages like C++ and Java are verification errors. Static type checking is the version of formal verification programmers are most familiar with. Type systems (and related formal verification tools) have gotten quite impressive, and they are becoming a lot more relevant in constraining the behavior of AI coding models.

2) Rust. Type checking represents a middle ground for verification. The hard part is choosing the right balance: reject too many good programs and it becomes hard to program in the language, as the programmer has to "guess what the type checker will permit". Recently, the language that has brought the most interesting advances from type systems to the real world is Rust. Its ownership type system and associated type checker is known as the "borrow checker". The borrow checker is conservative, and "fighting with the borrow checker" is part and parcel of everyone's Rust experience. This gives us the following lesson: we can prove more interesting things, but at a larger burden to the developer. Finding elegant middle points is hard, and Rust represents a real design breakthrough in navigating that tradeoff.

3) Mechanically verified math. Recently, groups of mathematical researchers have been writing mathematical proofs in a specialized programming language called a proof assistant. This language, LEAN, comes with a powerful type checker capable of certifying complex mathematical proofs. LEAN is exciting, but working in LEAN can be frustrating: because of the nontermination properties of the type checker's search, such languages rely heavily on programmer annotation. And this is why more complex type systems have stayed relatively academic; the Rust borrow checker sits at a genuinely elegant point in the design space, complex enough to reason about a complex property like memory references, yet simple enough to not need too much extra annotation. But this is a critically important point: mathematical proofs and type checking aren't just analogous: they are literally the same task. They differ only in degree of complexity along two axes: the complexity of the underlying objects, and the complexity of the properties we are proving.

4) There is still a long way to go for proof assistants. While the world I describe is exciting, bluntly, we're not anywhere close to that world yet. Proofs break easily when programs are modified, the standard library of proofs is too small, and specifications seldom capture everything about the program's behavior. Overall, there's a long way to go before these techniques reach a mainstream programming language with broad adoption. But AI is a huge accelerant to proof assistants. Much of the energy toward AI-assisted mathematics is coming from AI researchers who see it as a very promising domain for building better reasoning models. Verified math is a domain rich in endless lemmas, statements, and proofs, all of which can be used as "ground truth", which means we can use them as strong reward signals in our post-training workflows. There are several startups being built by seasoned foundation-model researchers, such as Harmonic and Math Inc, that are based on this premise. I'm no expert here, but it sure seems to me that formally verified code would lead to a clear domain of tasks with strong verifiable rewards, ripe for use in reinforcement learning to build better agents, period.

I'm excited about the efforts to use verified mathematics in reinforcement learning. But I'd love to see even more experiments in bringing verification to the agentic coding world. This is an exciting time in programming languages and formal methods research. There's only one way out of the increasingly unwieldy mountain of LLM-generated code: We must prove. We will prove.
17 · 20 · 132 · 19.7K
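The thread above hinges on the idea that verification means checking a program against an explicit specification. A miniature, executable analogue of that idea: state a property as a spec and check it exhaustively over a bounded domain, which is the toy version of what a model checker does symbolically and a proof assistant does once and for all. The function and spec here are invented for illustration.

```python
# Spec-vs-implementation in miniature: an exhaustive bounded check.

def abs_diff(a, b):
    # Implementation under verification.
    return a - b if a >= b else b - a

# Spec: abs_diff is symmetric, non-negative, and zero exactly on equality.
for a in range(-20, 21):
    for b in range(-20, 21):
        assert abs_diff(a, b) == abs_diff(b, a)
        assert abs_diff(a, b) >= 0
        assert (abs_diff(a, b) == 0) == (a == b)

print("spec holds on the bounded domain")
```

The gap the thread describes is exactly the difference between this bounded check, which only covers the inputs it enumerates, and a type-checker or proof-assistant argument that covers all inputs.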
Dominik Tornow @DominikTornow
We asked Claude to build a durable execution platform from scratch. The pit of success?! The agent harness. All of our effort went into specification and verification. The result?! A correct and complete @resonatehqio server
5 · 4 · 36 · 10.7K
/sesh/null @nerdsane
Now @Keleesssss can finally say our prognosis has reached mainstream 😄
0 · 0 · 1 · 70