AJ Welch

90 posts

@AJWelch

Ex-@Google.

Boston, MA · Joined February 2024

0 Following · 116 Followers

AJ Welch@AJWelch·Apr 23

I like that he talked about the effort that went into working out the details of each chapter. He worked on Rapportive, Kafka, Samza, Bottled Water, and I’m sure many other projects, but still had to put years into the book. “Those are the sort of high-level topics that were clear from my initial book proposal to the publisher. The details within each chapter, that is something that I often figured out once I got to that chapter. So, I wrote one chapter at a time and started each chapter with just a lot of background research to actually get up to speed on the topic myself. And it’s often only then that, say, for replication, I decided, okay, well, it seems like the three major ways of doing this are single-leader, multi-leader, or leaderless. I would decide on that structure essentially while writing each chapter and then try to fit the various points I wanted to make into this narrative structure.”

Gergely Orosz@GergelyOrosz

Building distributed systems at scale is about assuming that the unlikely will happen - because at scale, it probably will! Great take from Martin Kleppmann:

AJ Welch@AJWelch·Apr 23

For the record I think Keploy is clever. I'm just tired of the "AI-Gen era" hype.

AJ Welch@AJWelch·Apr 23

They just had to go for "exemplary". Couldn't settle for "commendable".

AJ Welch@AJWelch·Apr 23

A "must-have tool for developers in the AI-Gen era for 90% test coverage" dogfoods their own product and achieves 75% test coverage. Fun stuff.

AJ Welch@AJWelch·Apr 23

Bonus points for overlaying alternatives in a single visualization.

AJ Welch@AJWelch·Apr 23

Nice trick in pg_plan_alternatives. Postgres structs are opaque to eBPF, so instead of copying struct definitions or hard-coding byte offsets, the userspace loader parses DWARF debug info from the Postgres binary to extract field offsets and injects them into the eBPF program as #defines at load time.

Planet PostgreSQL@planetpostgres

Jan Kristof Nidzwetzki: pg_plan_alternatives: Tracing PostgreSQL’s Query Plan Alternatives using eBPF postgr.es/p/7un
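A minimal Go sketch of the offset-injection idea described above, not the actual pg_plan_alternatives loader. It reads a struct's field offsets from a binary's DWARF info using the standard `debug/elf` and `debug/dwarf` packages and renders them as `#define` lines. The helper names `structOffsets` and `renderDefines`, the macro naming scheme, and the sample offsets are all assumptions made for illustration.

```go
package main

import (
	"debug/dwarf"
	"debug/elf"
	"fmt"
	"os"
	"sort"
	"strings"
)

// structOffsets returns field name -> byte offset for the named struct,
// read from the DWARF debug info of the ELF binary at path.
func structOffsets(path, structName string) (map[string]int64, error) {
	f, err := elf.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	data, err := f.DWARF()
	if err != nil {
		return nil, err
	}
	r := data.Reader()
	for {
		entry, err := r.Next()
		if err != nil {
			return nil, err
		}
		if entry == nil { // end of the debug info
			return nil, fmt.Errorf("struct %q not found in %s", structName, path)
		}
		if entry.Tag != dwarf.TagStructType {
			continue
		}
		if name, _ := entry.Val(dwarf.AttrName).(string); name != structName {
			continue
		}
		typ, err := data.Type(entry.Offset)
		if err != nil {
			return nil, err
		}
		st, ok := typ.(*dwarf.StructType)
		if !ok || st.Incomplete {
			continue // forward declaration without fields
		}
		offsets := make(map[string]int64, len(st.Field))
		for _, fld := range st.Field {
			if fld.Name != "" { // skip anonymous members
				offsets[fld.Name] = fld.ByteOffset
			}
		}
		return offsets, nil
	}
}

// renderDefines turns offsets into C preprocessor lines that a loader could
// prepend to eBPF source before compiling it. The macro names are hypothetical.
func renderDefines(prefix string, offsets map[string]int64) string {
	names := make([]string, 0, len(offsets))
	for n := range offsets {
		names = append(names, n)
	}
	sort.Strings(names) // deterministic output
	var b strings.Builder
	for _, n := range names {
		fmt.Fprintf(&b, "#define %s_%s_OFFSET %d\n",
			strings.ToUpper(prefix), strings.ToUpper(n), offsets[n])
	}
	return b.String()
}

func main() {
	if len(os.Args) == 3 { // e.g. <binary with debug info> <struct name>
		offsets, err := structOffsets(os.Args[1], os.Args[2])
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Print(renderDefines(os.Args[2], offsets))
		return
	}
	// No binary given: render made-up offsets just to show the output shape.
	fmt.Print(renderDefines("Path", map[string]int64{"rows": 40, "total_cost": 72}))
}
```

Resolving offsets at load time means the eBPF source never embeds a copy of the struct layout, so it can follow whatever layout the installed binary was built with, as long as debug info is available.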

AJ Welch@AJWelch·Apr 22

@shazow @mitchellh Great signal for a Go-specific harness/agent. But it largely has to be engineered into the harness today, whatever that harness may be: skill, agent, workflow…

Andrey 🦃 Petrov@shazow·Apr 22

@mitchellh @AJWelch Thankfully the lovely built-in testing and benchmarking tooling in Go has allocation measurement and everything. Shouldn't be a big lift to instrument and optimize some numbers down.
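As a rough illustration of that allocation measurement, here is a small Go sketch using `testing.AllocsPerRun` from the standard library. The two join functions and the sample data are hypothetical; in a real benchmark the same numbers come from `b.ReportAllocs()` or `go test -bench . -benchmem`.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink is package-level so results escape and the work is not optimized away.
var sink string

// joinNaive concatenates in a loop; each += may allocate a new string.
func joinNaive(parts []string) string {
	s := ""
	for _, p := range parts {
		s += p
	}
	return s
}

// joinPresized computes the final length first and grows the builder once.
func joinPresized(parts []string) string {
	n := 0
	for _, p := range parts {
		n += len(p)
	}
	var b strings.Builder
	b.Grow(n) // single up-front allocation sized for the result
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

func main() {
	parts := []string{"alpha-01", "bravo-02", "charl-03", "delta-04", "echo--05", "foxtr-06"}

	// AllocsPerRun reports the average number of heap allocations per call.
	naive := testing.AllocsPerRun(100, func() { sink = joinNaive(parts) })
	presized := testing.AllocsPerRun(100, func() { sink = joinPresized(parts) })

	fmt.Printf("naive: %.0f allocs/op, presized: %.0f allocs/op\n", naive, presized)
}
```

A number like this can serve as a pass/fail criterion for an agent harness: fail the run if allocations per operation rise above a recorded baseline.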

Mitchell Hashimoto@mitchellh·Apr 22

Observations from writing Go again, exacerbated by agents but not unique to them. First, it's far too easy to allocate and agents (probably people too) do it too often. For example, to "undo" work on error, it's enticing to keep track of the work done but that's a mistake. If an error case is rare (and they usually are), you should pessimize the error case and optimize the success case. Don't allocate unnecessarily on the happy path if it's going to succeed 99+% of the time. Let the error case be slower. On error, just redo the work but do the undo step instead of the apply step. This doesn't work if the apply step had a ton of side effects but it works more often than you think. Real world example of that not in Go, but the Zig compiler: when it parses, it doesn't store any file/line/col info, because it's a waste of memory when parsing succeeds most of the time. And memory is speed in modern CPUs since cache locality owns everything around us. If an error happens, Zig just reparses the file from the beginning in a slow path that does collect error information. That pattern is generally useful.

AJ Welch@AJWelch·Apr 22

Agreed. I always thought of it as roughly a natural inverse relationship between ACV and scale. High ACV, fewer, more conventional enterprise customers with smaller data and stricter isolation requirements. Thus single-tenant makes sense. Associating single-tenant with scale seems to be a newer phenomenon.

David Cramer@zeeg·Apr 22

@AJWelch I think it just depends on your model and particularly your cost structures

David Cramer@zeeg·Apr 22

shoulda just used RLS obv

PlanetScale@PlanetScale

Postgres has three ways to isolate tenants:
- Logical databases
- Per-tenant schemas
- Tenant ID in a shared schema
Counterintuitively, the last is the best way to scale. Read about why in our latest article.

AJ Welch@AJWelch·Apr 22

Indeed haha. Interestingly, I do think models will be decent short term at teaching these things to humans. Once this thread makes it into the training set, I’m sure it will get quasi-regurgitated in code reviews or casual study buddy style convos. Distributing the knowledge is a lower bar than applying it. I guess those who can’t do, teach.

Mitchell Hashimoto@mitchellh·Apr 22

@AJWelch Agreed, not automatic. You'd have to have tools and results criteria that depend on minimizing allocations. In the short term, humans are good. lol.

AJ Welch@AJWelch·Apr 22

@dinodaizovi @capsule8 Do you have a preferred or recommended eBPF-based EDR solution?

Dino A. Dai Zovi@dinodaizovi·Apr 22

So... auditd being used for D&R on Linux servers and breaking prod is why I started building a Linux EDR based on perf, which became @capsule8. Please use something based on eBPF today, it's way safer and higher performance. Perf comparison (2019): docs.google.com/document/d/12L…

Florian Roth ⚡️@cyb3rops

Many of you know the Linux #auditd config I’ve maintained for years. It was always meant to be a simplified, detection-agnostic baseline for #Linux 🐧 We’ve now changed the way it works ⚡️ The core idea is: audit.rules should act as the sensor, not the detection engine. That means:
- generic process_creation
- fewer brittle per-binary rules
- better portability
- CI validation
We preserved the old baseline as v0.1.0 and released v0.2.0 as the new streamlined model github.com/Neo23x0/auditd… co-op with @petri_ph

AJ Welch@AJWelch·Apr 22

@QingQ77 Love all the cool little TUIs and eBPF tooling coming out nowadays

Geek Lite@QingQ77·Apr 21

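An eBPF-based Linux system call tracer that replaces strace, offering a real-time TUI, smart filtering, TLS decryption, and readable argument decoding. github.com/pandaadir05/sn… snoop uses eBPF tracepoints instead of ptrace, so the traced process is not repeatedly suspended, and it performs much better than strace. It has a real-time TUI, argument decoding for more than 60 syscalls, TLS plaintext capture, and heap allocation tracing, and it can record, replay, and diff two traces.

AJ Welch@AJWelch·Apr 22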

Great post. Reminds me of Brendan Gregg’s post on AI flame graphs where he concluded: “It feels to me like GPU/AI debugging, OS style, is about two years old. Better than zero, but still early on, and lots more ahead of us. A decade, at least.” brendangregg.com/blog/2024-10-2… “Compare against peers, not against absolute thresholds” - took this same approach alerting on anomalous query execution times in BigQuery. Defining “peers” was actually the hard part as we were comparing heterogeneous queries, not homogeneous GPUs. Ended up bucketing by query plan complexity.

ingero@ingero_io·Apr 22

One slow GPU idles 999 peers at every AllReduce barrier. Production data: 60% of 512+ GPU jobs hit fail-slow events, adding 34% to average job time. nvidia-smi hides it. ingero.io/gpu-stragglers… #GPUObservability #eBPF #GPU #MLOps

AJ Welch@AJWelch·Apr 22

@sspaeti @evidence_dev Very much like the Swiss Army knife indeed!

Simon Späti 🏔️@sspaeti·Apr 22

@AJWelch @evidence_dev Amazing. Yeah, I believe its versatility and its being lightweight, as a single binary, really help with AI and with using it in so many different use cases. A little bit like the Swiss Army Knife 🙂

Simon Späti 🏔️@sspaeti·Oct 17

People often ask:
- Is DuckDB like Snowflake? Not really.
- Is DuckDB like PostgreSQL? No, maybe cousins?
- Is DuckDB like Pandas? It's complicated.
- Is DuckDB like SQLite? Yes and no.
- Is DuckDB like Apache Spark? Interesting.
I've been exploring DuckDB for a while, and in my second article (motherduck.com/blog/duckdb-en…), I delve into these questions and the use cases not just for us data wranglers and enthusiasts but also for larger enterprises. While many know DuckDB for its speed and in-memory analytics, there's more under the hood that's incredibly useful for handling data.

AJ Welch@AJWelch·Apr 22

@sysxplore Unfortunately this is only going to get worse with AI.

sysxplore@sysxplore·Apr 22

Last time I called this guy out for copying my work, the excuse was “AI generated.” Funny how this one follows the exact same structure, layout, and breakdown… and even carries the same typo from my original. In the HPA section, it still says “Node 1 – After VPA” instead of HPA. That mistake is from my graphic. Sometimes you have to wonder if the person sharing this even understands what they’re posting. I’ve said it before, I don’t care if people share my graphics or get ideas from them. But if you’re going to take ideas, at least put in some effort or give proper credit. If you don’t want to give credit, just post the original as it is.

Uday👨‍💻@uday_devops

🚀 Kubernetes Scaling Strategies - Beyond Just “Add More Pods.” Scaling in Kubernetes isn’t one-size-fits-all. It’s a toolkit of strategies, each solving a different problem depending on workload patterns, resource constraints, and business needs. Here’s a quick breakdown of the key approaches:
🔹 Horizontal Pod Autoscaling (HPA): Scale *out* by adding more pods based on metrics like CPU or memory. Ideal for handling traffic spikes and stateless applications.
🔹 Vertical Pod Autoscaling (VPA): Scale *up* by adjusting CPU and memory for existing pods. Useful when workloads are stable but resource needs are unpredictable.
🔹 Cluster Autoscaling: Automatically adds or removes nodes based on scheduling demands. Ensures your cluster always has the right capacity—no more, no less.
🔹 Manual Scaling: Still relevant for controlled environments or predictable workloads. Gives full control, but requires active management.
🔹 Predictive Scaling (KEDA, ML-based): Move from reactive -> proactive. Anticipate demand using historical data and event-driven triggers.
🔹 Custom Metrics Scaling: Go beyond CPU/memory. Scale based on business metrics like queue length, request rate, or user activity.
Key takeaway: The real power comes from combining these strategies - not choosing just one. Smart scaling = better performance + optimized cost. How are you handling scaling in your Kubernetes workloads today? Are you still reactive, or moving toward predictive systems?

AJ Welch@AJWelch·Apr 22

Nutanix writeup on how their AHV hypervisor keeps an accurate vNIC-to-IP mapping for microsegmentation and flow analytics. eBPF program filters ARP/DHCP/NDP/DHCPv6 packets and forwards them via ring buffer to a userspace program for all the heavy lifting. Similar in design to Inspektor Gadget gadgets.

Nutanix Community@NutanixNation

Amit Gupta, Senior Product Manager at Nutanix, presents a collaborative technical networking piece along with Jaspal Singh Dhillon and Deepankur Gupta from Nutanix Engineering. It covers how Nutanix AHV uses eBPF for vNIC-IP Mapping. nutanix.com/tech-center/bl… #nutanix #ahv #ebpf

AJ Welch@AJWelch·Apr 22

Will be cool if we get a WASM Neovim build out of GSoC 2026 github.com/neovim/neovim/…
