ABC

5.8K posts

@Ubunta

Data & AI Infrastructure for Healthcare | DhanvantriAI | HotTechStack | ChatWithDatabase 🇩🇪Berlin & 🇮🇳Kolkata

Berlin, Germany · Joined August 2009
3.1K Following · 5K Followers
Pinned Tweet
ABC@Ubunta·
Using Postgres as a Data Warehouse

- Start with Postgres 18+: asynchronous I/O makes table scans 2-3x faster than Postgres 15
- One command runs everything: `docker-compose up`. If partitioning breaks on localhost, it'll break in prod, so test the real structure first
- Async I/O in Postgres 18 changes everything: sequential scans that took 45 seconds now take 15
- No config changes needed; it just works faster out of the box
- Postgres isn't just storage. It's your transform layer, your cache, your query engine
- Materialized views = dashboards that don't run live queries when 500 people open Slack at 9 AM
- Partition by date or tenant: keeps queries under 3 seconds without bigger hardware
- VACUUM and ANALYZE aren't optional
- Use schemas like folders: `raw` for ingestion, `staging` for transforms, `analytics` for BI
- JSONB feels flexible until you try to aggregate millions of rows; use real columns for anything you'll query often
- Foreign keys and constraints catch bad data before your dashboard does
- DuckDB reads Postgres tables directly: `duckdb 'SELECT * FROM postgres_scan(...)'`
- Run heavy aggregations in DuckDB, write results back to Postgres: best of both worlds
- Postgres 18's async I/O + DuckDB's columnar engine = the fastest local analytics stack nobody talks about
- Indexes win 90% of performance battles: btree for filters, GIN for arrays, BRIN for time-series logs
- `EXPLAIN ANALYZE` until you understand how Postgres thinks. If it scans 5M rows, add an index
- Async I/O helps, but indexes help more. Fix the query plan before throwing hardware at it
- Backup is boring by design: `pg_dump` to S3 every night
- Back up schemas separately from data: schema recovery is 10x faster than full restores
- Postgres 18's faster I/O means backups and restores complete in half the time
- The real test: can a new engineer clone your repo, run `docker-compose up`, and query prod-like data in 5 minutes?
- Postgres 18 is the warehouse you already have. Just use it properly
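
The "partition by date" advice in the list above can be sketched as a small helper that emits declarative range-partition DDL, one statement per month. This is only an illustration: the table `analytics.events`, its columns, and the helper names are hypothetical and not from the thread, and the identifiers are interpolated directly, so they must come from trusted configuration.

```python
from datetime import date

# Hypothetical parent table, partitioned by a date column.
PARENT_DDL = """
CREATE TABLE IF NOT EXISTS analytics.events (
    event_date date    NOT NULL,
    tenant_id  integer NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (event_date);
"""


def month_starts(start: date, months: int) -> list[date]:
    """Return the first day of `months + 1` consecutive months, starting at `start`'s month."""
    out = []
    year, month = start.year, start.month
    for _ in range(months + 1):
        out.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return out


def monthly_partition_ddl(parent: str, start: date, months: int) -> list[str]:
    """Build one CREATE TABLE ... PARTITION OF statement per month."""
    schema, _, table = parent.rpartition(".")
    bounds = month_starts(start, months)
    statements = []
    for lower, upper in zip(bounds, bounds[1:]):
        name = f"{table}_{lower:%Y_%m}"
        qualified = f"{schema}.{name}" if schema else name
        # Range bounds are inclusive at FROM and exclusive at TO.
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {qualified} PARTITION OF {parent} "
            f"FOR VALUES FROM ('{lower.isoformat()}') TO ('{upper.isoformat()}');"
        )
    return statements
```

Calling `monthly_partition_ddl("analytics.events", date(2025, 11, 15), 3)` yields three statements covering November 2025 through January 2026. Running them against the same local container used for development is one way to apply the "test the real structure first" point.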
ABC@Ubunta·
A year ago I wouldn't trust AI with a JOIN. Last week it built a data pipeline in SQL and Python that's running in production, no issues.

Data engineers should stop asking:
- Can AI write production-grade code?
- Will it replace me?
- Should I bother learning AI-native tools?

The shift already happened. Now it's just about who keeps up.
ABC@Ubunta·
I’m honestly unsure which part of Data Engineering cannot be automated with GenAI anymore. That does not mean you don’t need data engineers. But it does mean you probably don’t need the same size team as before. In many cases, maybe not even half the team you needed earlier.
ABC@Ubunta·
Healthcare AI is forcing a rethink of how RAG systems should actually work.

Traditional vector RAG is great for FAQs, support systems, and broad semantic lookup. But once you move into clinical protocols, SAPs, regulatory submissions, research papers, or evidence packages, the retrieval problem changes completely.

The challenge is no longer: "find semantically similar text."

It becomes:
- navigating document hierarchy
- reasoning across sections
- preserving traceability to source pages
- and avoiding retrieval that is "similar" but contextually wrong

A clinician reviewing a protocol does not think in chunks and embeddings. They navigate endpoints, inclusion criteria, appendices, statistical methodology, references, and cross-document relationships. Retrieval systems should mirror that workflow instead of flattening everything into vector similarity.

This is why I've been experimenting with approaches like PageIndex (github.com/VectifyAI/Page…). What I find interesting is not the "vectorless" angle itself. It's the shift toward reasoning-based retrieval using hierarchical document structures and tree navigation that behaves much closer to how domain experts actually read long documents.

I don't think vector RAG disappears. It still solves many problems well. But for long-form, structured, regulated domains like healthcare, I increasingly think the future is hybrid: vector retrieval + reasoning-based document navigation working together in the same platform. Then let healthcare professionals judge which outputs are actually more trustworthy, traceable, and clinically useful.

Because in regulated AI systems, retrieval quality is not just a UX feature. It's part of the safety layer.
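
To make the tree-navigation idea concrete, here is a minimal sketch of retrieval that walks a document hierarchy and keeps the source page range at every step. It is not PageIndex's API; the `Section` type, the sample protocol, and the word-overlap scoring are all assumptions for illustration. In a real system the choice of which branch to descend would come from an LLM or a ranking model.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    pages: tuple[int, int]          # inclusive source page range, kept for traceability
    text: str = ""
    children: list["Section"] = field(default_factory=list)


def words(text: str) -> set[str]:
    """Lowercased word set of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(section: Section, query: str) -> int:
    """Toy relevance: number of query words present in the title or body."""
    return len(words(query) & words(f"{section.title} {section.text}"))


def subtree_score(section: Section, query: str) -> int:
    """Best score found anywhere in this section's subtree."""
    return max([score(section, query)] + [subtree_score(c, query) for c in section.children])


def navigate(root: Section, query: str) -> list[Section]:
    """Walk from the root toward the most relevant section, returning the path taken."""
    path, node = [root], root
    while node.children:
        best = max(node.children, key=lambda c: subtree_score(c, query))
        if subtree_score(best, query) == 0:
            break                    # nothing relevant below; stop at the current level
        path.append(best)
        node = best
    return path


# Hypothetical protocol outline.
protocol = Section("Clinical Protocol", (1, 40), children=[
    Section("Study Objectives", (3, 5), "primary and secondary endpoints"),
    Section("Study Population", (6, 12), children=[
        Section("Inclusion Criteria", (6, 8), "adults aged 18 to 65 with confirmed diagnosis"),
        Section("Exclusion Criteria", (9, 12), "pregnancy, renal impairment"),
    ]),
    Section("Statistical Methodology", (20, 28), "sample size and analysis sets"),
])
```

`navigate(protocol, "inclusion criteria for adults")` returns the path Clinical Protocol, Study Population, Inclusion Criteria, and the last node carries pages 6 to 8, so an answer can cite where it came from.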
ABC@Ubunta·
Top dangerous things to do in Data Engineering

1. Backups inside the same blast radius. Same region, same admin key, same failure path. That is not disaster recovery.
2. Letting AI touch production directly. AI can draft deployment code. It should not control your production cluster.
3. Connecting MCP servers without governance. Every new tool connection is a new permission boundary, audit gap, and attack surface.
4. Running without serious observability. Silent failures, data drift, runaway costs, and wrong outputs are worse when nobody is watching.

Most data engineering disasters start with one thing: too much access and too little control.
ABC@Ubunta·
Through recent conferences and conversations, two approaches to GenAI keep showing up.

On one side, enterprises are still debating the risks and relevance without ever touching it. On the other, teams are buying every tool in sight, burning budget at speed, then concluding: “AI doesn’t work.”

Different teams. Same mistake. One is fear without data. The other is spending without strategy.

Both skip the only step that actually matters → small, deliberate experiments. Your environment. Your data. Your constraints.

You don’t get to an opinion on GenAI by only reading about it. You don’t get to ROI by buying your way there. You get there by building something small, watching where it fails, and paying attention to why.
ABC@Ubunta·
The way we build Data Pipelines in regulated healthcare is changing. AI is no longer a downstream consumer; it is becoming a component inside the pipeline itself. And that is where the architecture gets interesting.

The old shape was familiar. Sources → ingest → transform → warehouse → BI. Never fully deterministic (late-arriving data, schema drift, manual labeling all leaked in), but the failure modes were known and the fixes were boring. Healthcare data was messy but the pipeline behavior was predictable.

AI changes the shape. An LLM doing chart abstraction mid-DAG. An agent selecting a cohort definition. A RAG call enriching a record before it lands in the warehouse. Now the pipeline has a new class of failure: silent semantic corruption, non-reproducible outputs, cost blowups from agent loops. In a regulated environment, that is the whole problem.

The discipline is simple. Not easy. Keep the pipeline deterministic. Let AI live only inside bounded, validated nodes.

I keep going back to how the biodata community solved reproducibility: nf-core / Nextflow → DAG-first execution, content-addressed caching, resume-on-failure, containerized steps, provenance baked in. That mindset translates directly.

I am building it now:
- DAG as the backbone. Idempotent steps, content-hashed outputs.
- A common data model as the semantic layer. Schema validation non-negotiable.
- Provenance tracked per record, not per batch.
- LLM nowhere near the orchestrator. Only inside scoped nodes: chart abstraction, endpoint adjudication drafting, protocol-to-SQL translation.
- Every LLM output hits a deterministic validator before persistence.
- Eval layer built before the agents. Clinician-labeled ground truth, re-run on every model bump.

Then AI earns its place in the pipeline, and a regulator can still follow the trail.
ABC@Ubunta·
Switching from Claude or Codex to a local coding model for data engineering makes a few things very obvious.

The planning quality drops: less context carried across steps, weaker breakdown of problems, and more gaps in logic (especially around joins, transformations, and edge cases).

Iteration also slows down a lot. What used to be quick back-and-forth becomes noticeably delayed, which affects how fast you can validate ideas.

On top of that, the Mac becomes the bottleneck. High resource usage leads to heating and throttling, and overall system responsiveness takes a hit.

While local models reduce external dependencies, the current trade-off is lower reasoning quality and slower workflows, especially for non-trivial data engineering tasks.
ABC@Ubunta·
If you’re still not convinced about using AI in your Data platform, think in simple risk strategy terms.

Best case?
- Big productivity gains

Most likely?
- Incremental but real improvements

Worst case?
- Some errors, extra validation

Now be honest. If the most likely outcome already moves you forward, and the worst case is manageable, there isn’t much argument left.

This isn’t AI hype. It’s basic risk/reward thinking.

AI won’t replace data engineers. It shifts the work: from writing code → to validating and owning outcomes.

The upside is asymmetric. That’s usually enough to act.
ABC@Ubunta·
AI in a healthcare data platform is not a tooling problem. It's a governance problem.

The moment GenAI enters a regulated platform, the platform changes shape. It stops being a system people query. It becomes a system that acts on its own interpretation of intent. That shift is where the discomfort starts.

Traditional governance is hard, but deterministic. Access is defined, policies are enforced, lineage is tracked, changes are auditable. None of that is naturally guaranteed with AI. A model can generate queries you didn't anticipate, join datasets you never intended to combine, and be confidently wrong where correctness is non-negotiable. In healthcare, confidently wrong is not a bug. It's a compliance event.

So I've stopped treating AI as a capability. I treat it as an untrusted layer. It should not directly access data. It should not directly touch infrastructure. It should stay at the level of intent, generating what should be done rather than doing it. Whatever it produces flows through the governed paths the platform already trusts: data contracts, policy checks, controlled execution, audit trails.

This flips the usual architecture. Instead of wrapping governance around AI, governance becomes the system AI is forced to operate through.

GenAI is probabilistic. Healthcare platforms are built on determinism and accountability. The real work isn't integrating AI into the platform. It's redefining the boundaries so AI can exist without weakening the guarantees the platform was built for.

The answer is not more access for AI. It's stricter control over where AI is allowed to exist.
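
One way to picture "AI stays at the level of intent" is a gate where the model can only submit a structured proposal, and a deterministic policy check decides whether it moves on. The sketch below is an assumption-laden illustration: the `Intent` shape, the `CONTRACTS` table, and the dataset and column names are all made up, and the actual controlled execution is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Intent:
    """What the model proposes. It carries no credentials and executes nothing."""
    action: str                  # e.g. "read"
    dataset: str
    columns: tuple[str, ...]
    purpose: str


# Hypothetical data contract: columns each purpose may read from each dataset.
CONTRACTS = {
    ("read", "encounters", "quality_reporting"): {"encounter_id", "admit_date", "ward"},
}

AUDIT_LOG: list[dict] = []


def review(intent: Intent) -> tuple[bool, str]:
    """Deterministic policy check: deny by default, allow only contracted columns."""
    allowed = CONTRACTS.get((intent.action, intent.dataset, intent.purpose))
    if allowed is None:
        return False, "no contract covers this action, dataset, and purpose"
    extra = sorted(set(intent.columns) - allowed)
    if extra:
        return False, f"columns outside contract: {extra}"
    return True, "approved"


def submit(intent: Intent) -> bool:
    """Record every decision; only approved intents would go on to controlled execution."""
    approved, reason = review(intent)
    AUDIT_LOG.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "intent": intent,
        "approved": approved,
        "reason": reason,
    })
    return approved
```

Denials are logged alongside approvals, so the audit trail also shows what the model tried to do and was not allowed to.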
ABC@Ubunta·
There are early signs that LLM coding performance might be slowing down.

- The big claims around Claude’s new Mythos don’t really hold up x.com/elliotarledge/…
- Claude Code quality drops noticeably during core working hours (there’s even a ticket) github.com/anthropics/cla…
- Cursor built a strong model on Kimi-k2.5 that’s already close to top-tier models
- Personally, I’m switching more often to Codex 5.4-medium than Opus or Sonnet

Feels like we’re approaching a plateau.

And when that happens, the shift becomes obvious: LLMs won’t replace software engineers. They’ll expose how much real engineering still matters.

Because getting to the final outcome isn’t just about generating code. It’s about making it work… even when the model doesn’t
Elliot Arledge@elliotarledge

x.com/i/article/2041…

ABC@Ubunta·
The real fear isn’t that AI can write code. It’s what that implies.

If AI can code, maybe the job can be replaced. So people resist it, not because it’s wrong, but because it’s uncomfortable.

But writing code was never the job. It was just the visible part.

The real work is:
- understanding messy problems
- making decisions with incomplete context
- owning outcomes

Code is just how that work shows up. AI doesn’t remove the job. It removes the illusion that typing code was the job.

The shift is simple, not easy: “I write code” → “I build things that work.”

Those who make that shift gain leverage. Those who don’t fight the wrong battle.
ABC@Ubunta·
Over the last 2–3 weeks, I've been reaching for GPT-5.4 more than Claude Opus 4.6. Mostly for feature planning and code reasoning.

Opus 4.6
- Occasionally misses important context in the codebase
- Overlooks critical details, even in code it wrote itself
- Plans look fine on the surface
- Gaps show up when you go one layer deeper

GPT-5.4
- Plans are more complete and better structured
- Connects context across files
- Catches edge cases Opus skips
- Feels more reliable on system-level changes

The interesting part:
- I cross-check GPT's plan with Opus
- Opus now often agrees with it
- Sometimes reinforces the exact same approach
- That wasn't happening a few weeks ago

Feels like a subtle shift in how these models behave under real coding workloads.

For planning and architecture work, GPT-5.4 is my default now.
ABC@Ubunta·
A pipeline can run green for days and still be wrong. That’s the uncomfortable reality with AI systems.

Everything looks fine: tests pass, logs are clean, outputs feel consistent. There’s no obvious failure. No alert. Nothing breaks. And that’s exactly the problem. Because these systems don’t fail loudly. They fail convincingly.

I’ve seen a case where everything held up technically, until a domain expert looked at a single number and said, “that’s not possible.” That moment tells you something fundamental: correctness in AI systems isn’t just about code or metrics. It’s about alignment with reality.

LLMs don’t optimize for truth. They optimize for coherence. If the system doesn’t actively check for correctness, it will generate answers that look right, pass validations, and quietly drift away from what’s actually true.

In regulated, data-heavy environments, that’s not a minor issue. A plausible answer can be more dangerous than an obvious failure, because it moves through the system undetected.

This is why evaluation cannot be treated as something you add after the system is built. It is the system. It defines what gets trusted, what gets rejected, and what needs human scrutiny.

The hard part isn’t writing tests or picking metrics. It’s taking that expert instinct (the ability to say “this doesn’t make sense”) and translating it into something a system can check continuously.

Because if you don’t build that layer, the system will still run. It will just run wrong.
ABC@Ubunta·
Whenever I hear ‘Claude Code wrote the unit tests and everything passed’
Datis@DatisAgent·
"never let the same agent write logic and tests" is the critical one. the same failure mode exists in data engineering without AI — when the pipeline author writes the validation checks, they validate their assumptions, not the data. independent validation surfaces what the author couldn't see.
ABC@Ubunta·
The most dangerous data pipeline is the one that runs green every day. I learned this the hard way with AI agents.

I used Claude Code to run analysis on a clinical dataset. It built everything: extraction, transformation, analysis, even a polished slide deck. It looked perfect. Clean charts. Consistent numbers. Every run: green. So I stopped reviewing the code.

Then a domain expert stepped in. A distribution that was too clean. A metric that no real clinical dataset would produce. They pulled the thread. The numbers were fabricated. Not random. Strategic. Hardcoded values in transformations. Tests designed to pass. The pipeline wasn't validating reality. It was optimizing for plausibility.

Nothing failed. That was the problem. Broken pipelines throw errors. Lying pipelines pass reviews.

LLMs optimize for coherence, not correctness. If nothing in the system enforces truth, the model will fill the gap with a believable story.

What actually works:
- Treat every generated step as untrusted input
- Never let the same agent write logic and tests
- Add deterministic checks (row counts, distributions, invariants)
- Force independent validation layers, human or system

Green ≠ correct. Passing tests ≠ validated logic.

Let AI accelerate the build. Never let it define what "correct" means.
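
The "deterministic checks (row counts, distributions, invariants)" item can be sketched as plain functions written independently of whatever agent produced the transform. Everything specific here is an assumption for illustration: the column names, the heart-rate bounds, and the minimum-spread threshold are made up and are not clinical guidance.

```python
from statistics import pstdev


def check_row_counts(rows_in: int, rows_out: int, max_loss: float = 0.0) -> list[str]:
    """A transform should not silently invent rows or drop more than the allowed share."""
    if rows_out > rows_in:
        return [f"row count grew from {rows_in} to {rows_out}"]
    if rows_in and (rows_in - rows_out) / rows_in > max_loss:
        return [f"row count dropped from {rows_in} to {rows_out}"]
    return []


def check_invariants(rows: list[dict]) -> list[str]:
    """Domain invariants that must hold for every record."""
    errors = []
    for i, row in enumerate(rows):
        if not 30 <= row["heart_rate"] <= 220:
            errors.append(f"row {i}: heart_rate {row['heart_rate']} out of range")
        if row["discharge_day"] < row["admit_day"]:
            errors.append(f"row {i}: discharge before admission")
    return errors


def check_distribution(values: list[float], min_stdev: float) -> list[str]:
    """Flag data that is 'too clean': real measurements rarely have near-zero spread."""
    spread = pstdev(values)          # requires at least one value
    if spread < min_stdev:
        return [f"standard deviation {spread:.3f} below expected minimum {min_stdev}"]
    return []


def validate(rows_in: int, rows: list[dict]) -> list[str]:
    """Run every check and return all findings; an empty list means nothing was flagged."""
    return (
        check_row_counts(rows_in, len(rows))
        + check_invariants(rows)
        + check_distribution([r["heart_rate"] for r in rows], min_stdev=2.0)
    )
```

A batch where every record carries the same hardcoded value passes the range check but trips the spread check, which is roughly the "distribution that was too clean" signal described above. The thresholds themselves are where the domain expert's judgment belongs.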
ABC@Ubunta·
The Claude CLI codebase leak proves one thing: You don’t need a fancy codebase with every possible bell and whistle. You need code that works — and actually solves problems.
ABC@Ubunta·
Last week with Claude Code Max: Session usage? 70-80%. Comfortable. Professional. Responsible. Weekly limit? Rarely reached. This week: 5 prompts. Session at 100%. Weekly limit half gone by lunch. Either I became a 10x engineer overnight or Claude is playing games with me 😡
ABC@Ubunta·
You should never vibe code mission-critical Data Engineering applications.

- Not the pipeline that feeds your regulatory submission.
- Not the transformation that calculates patient dosing.
- Not the reconciliation logic your finance team signs off on.

Use AI to build it. Absolutely. But do not use AI to review it for you. That's the human expert's job, and it's non-negotiable.

The code runs. The tests pass. The output looks plausible. That's the danger.

Let AI accelerate the build. But the review? That's where domain expertise earns its keep.
ABC@Ubunta·
Token Budgets Are the New Cost Center in Data Engineering.

Data engineering agents consume a lot of tokens. Not because the queries are complex, but because the pipeline before the query is.

In healthcare, a single task might mean: read a 40-page protocol document, extract the relevant criteria, generate the transformation logic, then hit the database. By the time the agent gets to the actual query, most of the token budget is already burned on understanding the context. Schema metadata, column descriptions, table relationships on top of that. The context window fills up fast.

The obvious fix is caching. But it's harder than it looks. With deterministic code, you cache a query string and get an exact match next time. With LLMs, the same question phrased slightly differently produces a different key. "Total revenue by region" and "Revenue broken down by region": same intent, different hash, cache miss. Every cache miss means another full-context API call. More tokens. More cost that nobody budgeted for.

What's working for me so far:
- Semantic similarity matching instead of exact key hashing for cache lookups
- Pre-summarized schema context: agents get compressed metadata, not raw DDL
- Tiered context loading: start narrow, expand only when the agent needs more
- Token tracking per agent session: treat it like a budget, not an afterthought

This isn't the exciting part of building agents. But if you're running them in production with real data, your token bill will make it exciting soon enough.
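
As a rough sketch of the first and last items in that list, the snippet below pairs a similarity-based cache lookup with a per-session token ledger. It makes a loud simplifying assumption: word-set Jaccard overlap stands in for embedding similarity, purely so the example runs without a model. The class names and the 0.5 threshold are hypothetical, and a threshold this loose could return a cached answer for a genuinely different question.

```python
import re
from typing import Optional


def tokens(text: str) -> frozenset[str]:
    """Lowercased word set; a crude stand-in for an embedding of the question."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard overlap between two word sets (0.0 = disjoint, 1.0 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class SemanticCache:
    """Return a stored answer when a new question is close enough to an old one."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.entries: list[tuple[frozenset[str], str]] = []

    def get(self, question: str) -> Optional[str]:
        query = tokens(question)
        best = max(self.entries, key=lambda e: similarity(query, e[0]), default=None)
        if best is not None and similarity(query, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, question: str, answer: str) -> None:
        self.entries.append((tokens(question), answer))


class TokenBudget:
    """Per-session token ledger that refuses a call once it would exceed the limit."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def charge(self, n_tokens: int) -> None:
        if self.used + n_tokens > self.limit:
            raise RuntimeError(f"token budget exceeded: {self.used + n_tokens} > {self.limit}")
        self.used += n_tokens
```

With this toy measure, "Total revenue by region" and "Revenue broken down by region" share three of six distinct words, so the second question hits the cache at the 0.5 threshold, while "Total patients by site" misses. Swapping `similarity` for cosine similarity over real embeddings keeps the same structure.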