ABC

5.7K posts

@Ubunta

Data & AI Infrastructure for Healthcare | DhanvantriAI | HotTechStack | ChatWithDatabase 🇩🇪Berlin & 🇮🇳Kolkata

Berlin, Germany · Joined August 2009

3.1K Following · 5K Followers

Pinned Tweet
ABC@Ubunta·
Using Postgres as a Data Warehouse

- Start with Postgres 18+ — asynchronous I/O makes table scans 2-3x faster than Postgres 15
- One command runs everything: `docker-compose up`. If partitioning breaks on localhost, it'll break in prod — test the real structure first
- Async I/O in Postgres 18 changes everything — sequential scans that took 45 seconds now take 15
- No config changes needed — it just works faster out of the box
- Postgres isn't just storage — it's your transform layer, your cache, your query engine
- Materialized views = dashboards that don't run live queries when 500 people open Slack at 9 AM
- Partition by date or tenant — keeps queries under 3 seconds without bigger hardware
- VACUUM and ANALYZE aren't optional
- Use schemas like folders — `raw` for ingestion, `staging` for transforms, `analytics` for BI
- JSONB feels flexible until you try to aggregate millions of rows — use real columns for anything you'll query often
- Foreign keys and constraints catch bad data before your dashboard does
- DuckDB reads Postgres tables directly — `duckdb 'SELECT * FROM postgres_scan(...)'`
- Run heavy aggregations in DuckDB, write results back to Postgres — best of both worlds
- Postgres 18's async I/O + DuckDB's columnar engine = the fastest local analytics stack nobody talks about
- Indexes win 90% of performance battles — btree for filters, GIN for arrays, BRIN for time-series logs
- `EXPLAIN ANALYZE` until you understand how Postgres thinks — if it scans 5M rows, add an index
- Async I/O helps, but indexes help more — fix the query plan before throwing hardware at it
- Backup is boring by design: `pg_dump` to S3 every night
- Back up schemas separately from data — schema recovery is 10x faster than full restores
- Postgres 18's faster I/O means backups and restores complete in half the time
- The real test: can a new engineer clone your repo, run `docker-compose up`, and query prod-like data in 5 minutes?
- Postgres 18 is the warehouse you already have — just use it properly
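A minimal sketch of the "partition by date" advice above, reusing the `analytics` schema naming from the thread; the `events` parent table and the function name are illustrative, not from the original post:

```python
from datetime import date

def monthly_partition_ddl(parent: str, month: date) -> str:
    """Generate the DDL for one monthly range partition of a partitioned
    parent table (e.g. analytics.events). Any day in the month works."""
    start = month.replace(day=1)
    # first day of the following month is the exclusive upper bound
    end = (start.replace(year=start.year + 1, month=1)
           if start.month == 12
           else start.replace(month=start.month + 1))
    name = f"{parent}_{start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS analytics.{name} "
        f"PARTITION OF analytics.{parent} "
        f"FOR VALUES FROM ('{start}') TO ('{end}');"
    )

print(monthly_partition_ddl("events", date(2024, 12, 5)))
```

Running this for every month in a rolling window, from a nightly job or migration, keeps the "test the real structure on localhost" rule cheap to follow.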
ABC@Ubunta·
Things that go wrong when people assume a "slide creation agent" is simple

People think slide generation is a weekend project. It isn't. The AI part looks easy. The software part is where everything breaks.

- Templates are the first trap. Generic slides are easy. The real pain starts when a user uploads a custom template. Now your system has to deal with arbitrary layouts, placeholders, theme rules, fonts, colors, spacing, and all the weird things hidden inside PowerPoint files. That is not prompt engineering. That is reverse-engineering presentation software.
- Preview is a separate problem. Generating a .pptx is one problem. Showing a reliable preview is another. Now you need rendering infrastructure, conversion pipelines, caching, and fallback handling. Most teams plan for the agent. They do not plan for the renderer.
- Text fitting will eat your sprint. Generating text is easy. Making it fit is not. Text overflows, bullets collapse, titles wrap badly, and one extra sentence can break the whole slide. You spend more time on layout constraints than on the actual AI logic.
- Editing turns it into a product. The moment a user says "change only slide 6" or "keep the style but rewrite this section," it stops being an agent demo. Now you need state management, partial regeneration, diffing, and undo logic.
- Good single slides do not make a good deck. An agent can generate decent slides one by one and still fail at the presentation. Repetition, bad narrative flow, inconsistent styling, and uneven detail ruin the deck at the system level.

That is the pattern with AI products: the model demo looks simple, but the software reality is what kills you.
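The "text fitting" pain point can be sketched with a crude character-budget check. Real presentation renderers must measure actual font metrics; the functions and widths below are illustrative only:

```python
import textwrap

def fits_in_box(text: str, chars_per_line: int, max_lines: int) -> bool:
    """Rough overflow check: wrap at an estimated character width and
    compare the resulting line count against the text box's capacity."""
    return len(textwrap.wrap(text, width=chars_per_line)) <= max_lines

def shrink_to_fit(text: str, chars_per_line: int, max_lines: int) -> str:
    """Drop trailing sentences until the text fits: a crude stand-in for
    the layout-constraint loop that eats the sprint."""
    sentences = text.split(". ")
    while sentences and not fits_in_box(
            ". ".join(sentences), chars_per_line, max_lines):
        sentences.pop()
    return ". ".join(sentences)
```

Even this toy version shows the shape of the problem: fitting is an iterative constraint loop layered on top of generation, not a property of the generated text itself.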
ABC@Ubunta·
If you want AI agents to actually work in data engineering — stop giving them blind power.

No raw access to warehouses or infrastructure. Everything goes through controlled interfaces: APIs with row caps, cost limits, and query pattern enforcement. Not thin wrappers — real guardrails.

Don't build one agent that touches Snowflake, Spark, Airflow, and Iceberg. Build scoped agents — one job, one toolset, one failure boundary. Keep the blast radius small.

And don't start with hard problems. Start with the boring ones — repetitive, well-defined tasks where agents are actually reliable.

Most data teams skip this. They plug an LLM into production and call it automation. What they actually built is an expensive, unpredictable intern with admin access.

Agents don't fail because the models are bad. They fail because nobody designed the boundaries.
ABC@Ubunta·
The hardest part of working with LLMs isn’t that they argue. It’s that they don’t. They just say “sure” and start generating code for it.
ABC@Ubunta·
LLMs write SQL fast. Not cheap. Your agent just ran 47 queries for one answer. Every one syntactically valid. Every one a full table scan. That distinction costs real money on data platforms.

Give a model a schema and a question — it generates a query instantly. But it has zero awareness of compute cost. Full table scans, unnecessary joins, redundant CTEs — all syntactically valid, all expensive.

Now add AI agents to the picture. One user question becomes 10–50 query iterations before an answer surfaces. Each iteration hits the warehouse. Multiply that across a team and you have an AI-powered cost explosion that no one budgeted for.

Large schemas make it worse. Expose hundreds of tables and the model joins far more data than the question requires. It doesn't know what's expensive — only what's reachable.

Prompting doesn't fix this. Architecture does.

- Expose semantic layers, not raw schemas. Give the model 15 curated views instead of 200 raw tables — it joins less because it sees less.
- Gate execution before it hits the warehouse. Cost estimation, row limits, credit caps per session — the query gets checked before compute gets burned.
- Monitor what agents actually run. Not just failures. Track query volume, cost per question, and whether the same table gets scanned 40 times for one answer.

LLMs are excellent SQL generators. But without guardrails, they become the most expensive analyst on your data team — and the one with no spend limit.
ABC@Ubunta·
4 patterns I'm seeing when GenAI meets real Data Engineering systems:

- LLMs don't understand data sensitivity. Ask one to "analyze customer data" and it will happily join PII, logs, internal metrics, and test tables in the same query. It has no concept of what it shouldn't touch. That boundary must exist in architecture.
- Schema exposure is a security surface. The more raw tables you expose to a GenAI system, the more unpredictable its queries become. Good systems expose curated semantic layers, not the warehouse itself.
- Prompting is not governance. Writing "do not access sensitive data" in a system prompt is a suggestion, not a control. Governance lives in permissions, masked views, and query gateways.
- Observability matters more with AI than with humans. A human runs a few queries. An agent can run hundreds in minutes. If you're not tracking query patterns and cost spikes in near-real time, you won't notice the problem until the incident report arrives.

The common mistake: treating AI like a smart analyst. It's not. It's a high-speed query generator with no judgment that needs guardrails and a strict execution layer between it and anything that matters.
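A toy version of "expose curated semantic layers, not the warehouse itself": an allowlist gateway that rejects any query touching a relation outside the curated views. The view names are made up, and the regex is a naive stand-in for a real SQL parser:

```python
import re

# Illustrative curated layer; in practice this comes from governance config.
CURATED_VIEWS = {"semantic.orders_daily", "semantic.customers_masked"}

def referenced_tables(sql: str) -> set[str]:
    """Naively extract relations named after FROM/JOIN. A production
    gateway would use the warehouse's own parser instead of a regex."""
    return set(re.findall(r"\b(?:from|join)\s+([a-z_][\w.]*)",
                          sql, re.IGNORECASE))

def check_query(sql: str) -> None:
    """Raise before execution if the query reaches outside the
    curated semantic layer."""
    illegal = referenced_tables(sql) - CURATED_VIEWS
    if illegal:
        raise PermissionError(
            f"query touches non-curated relations: {sorted(illegal)}")
```

This is governance as code rather than as prompt text: the model can ask for anything, but only curated views are reachable.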
ABC@Ubunta·
You can't just give someone AI + a CLI and expect software to come out. It doesn't work like that.

Most of engineering isn't writing code. It's noticing when something is off. That instinct takes years — spotting edge cases, bad assumptions, security gaps. Sometimes before anything even breaks.

When things break, engineers know where to look. Normal users? They stare at the error, paste it into ChatGPT, get a different error. The loop continues until someone calls an engineer to fix it.

AI accelerates engineering. Thinking it replaces engineers just means you don't understand where the complexity actually lives.
ABC@Ubunta·
@siva2chinni Thanks 🙏 Doing good, figuring out AI in data. How are you?
Siva@siva2chinni·
@Ubunta Seeing your suggestions after a long time. Always right to the point. How are you doing?
ABC@Ubunta·
3 patterns from using GenAI in real Data Engineering:

- Never give an LLM direct DB access. Let it generate SQL. You review. You execute. It has no concept of blast radius.
- Never give it infra access. The moment it gets a CLI, it explores aggressively. API spikes, exploding logs, command histories that make no sense.
- Never let it validate its own code. If it writes code and tests, it's grading its own homework. Everything passes — until production.

Use AI for generation, not authority. The engineer is the control layer.
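The first rule ("it generates, you review, you execute") can be sketched as a pending-approval buffer. `ReviewedExecutor` and its method names are invented for illustration; `run` stands in for whatever actually talks to the database:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ReviewedExecutor:
    """The AI proposes SQL; nothing touches the database until a human
    approves the specific statement."""
    run: Callable[[str], object]        # the only path to real execution
    pending: list = field(default_factory=list)

    def propose(self, sql: str) -> int:
        """Queue generated SQL and hand the reviewer a ticket id."""
        self.pending.append(sql)
        return len(self.pending) - 1

    def approve(self, ticket: int):
        """Execution happens here, and only here, after review."""
        return self.run(self.pending[ticket])
```

The structural point: the model never holds a connection. The engineer's approval is the control layer, enforced by the shape of the code rather than by trust.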
ABC@Ubunta·
Everyone worries about burning LLM tokens. But somehow running expensive warehouse queries and building dashboards nobody uses has been normal for years.
ABC@Ubunta·
AI can generate code really fast. No doubt. But after actually building AI agents for data pipelines in healthcare — the reality is very different from the hype.

- AI doesn't reduce complexity. It just moves it around. You stop writing logic, but now you're spending time defining constraints, giving examples, and validating outputs. The work doesn't go away — it changes shape. If you skip this part, you'll just get garbage faster.
- Code generation is not software engineering. AI gives you the happy path in seconds. But real pipelines? They are full of nulls, schema changes, weird upstream data, edge cases that only you know about because you've been debugging them for months. AI doesn't know your data. You have to teach it. That part is still your job.
- AI doesn't understand your data grain. It just predicts tokens. When I asked agents to generate SQL on broad datasets for specific use cases, the output was wrong more often than right. It doesn't get your join logic, your SCD, why one column means different things in different environments. And the more ambiguous your data, the more tokens you burn. It's not just cost — it becomes an iteration loop that doesn't converge.
- Complex pipelines will break your AI agent. This was my biggest learning. Don't give a complex multi-step pipeline to an AI agent and expect working code. Break it down. Small steps. Let AI generate code for each step separately, then stitch them together. Yes, you burn more tokens — but if you're burning too many, that's telling you your decomposition is not clean enough.
- Testing is what actually saves you. Without proper test coverage, you're just producing bugs at scale. Every AI-generated transformation needs validation — row counts, aggregations, types, business logic. Your test suite is what makes the difference between a useful agent and an expensive experiment.
- And if your data is regulated — be extra careful. The same prompt can give you different SQL tomorrow than it did today. In healthcare, that's not a minor issue. If your output is not reproducible, you don't have automation — you have a random code generator with compliance risk.

The engineers who will do well with AI agents are not the ones who prompt better. They are the ones who understand their data deeply, break problems into small pieces, and test properly. That hasn't changed.
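The "row counts, types, business logic" validation step might be sketched like this. Column names, thresholds, and the rows-as-dicts shape are all illustrative choices:

```python
from typing import Callable

def validate_transform(rows: list[dict],
                       expected_min_rows: int,
                       required_columns: dict[str, type],
                       business_rule: Callable[[dict], bool] = lambda r: True
                       ) -> list[str]:
    """Collect validation failures from an AI-generated transformation's
    output instead of trusting it: a row-count floor, per-column type
    checks, and a caller-supplied domain rule."""
    errors = []
    if len(rows) < expected_min_rows:
        errors.append(f"row count {len(rows)} < {expected_min_rows}")
    for i, row in enumerate(rows):
        for col, typ in required_columns.items():
            if not isinstance(row.get(col), typ):
                errors.append(f"row {i}: {col} is not {typ.__name__}")
        if not business_rule(row):
            errors.append(f"row {i}: business rule failed")
    return errors
```

Returning a list of failures rather than raising on the first one matters in practice: one validation run tells you whether the generated code is wrong in one place or wrong everywhere.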
ABC@Ubunta·
If I think carefully about the capabilities of Claude and Codex CLI — especially the raw power you get directly in the terminal — tools like Cursor and Copilot start to feel obsolete. I barely use them anymore. At most, I might open them just to quickly verify files generated by the Claude CLI… but even that feels unnecessary. Honestly, a lightweight editor like Sublime is probably more than enough for that.
ABC@Ubunta·
What does "Data Engineering AI for healthcare" actually look like at the data layer?

I've been building a demo — semantic patient search using zvec, an embedded vector database built on Alibaba's Proxima engine. The idea: describe a clinical presentation, find the most similar cases from 10,000 patient records. No keywords. Meaning-based matching.

zvec is in-process like SQLite. No server, no Docker, no cloud. Patient data never leaves the machine.

I benchmarked it 4 ways — FAISS, zvec, ChromaDB, NumPy. Some honest results:

- FAISS is faster than zvec at raw search. That's not a bug — FAISS does less. No persistence, no filtering. Kill the process and the index is gone.
- NumPy brute-force beats zvec at 10K vectors. Also expected. HNSW overhead only pays off at scale — at 1M records the projection flips to 38x faster.
- zvec pays ~7x over FAISS. What you get: data persists automatically, and metadata filtering (age, severity, department) is fused natively into the search. At 0.5ms it's still fast enough for any real clinical use.

The comparison that actually matters is zvec vs ChromaDB — same feature tier. zvec wins clearly, especially on filtered queries (0.5ms vs 10ms+).

Stack: zvec + fastembed (ONNX) + Polars. Fully offline, `uv run`. Full 4-way benchmark in the repo — link in comments.
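For intuition, "metadata filtering fused into the search" (what the post says zvec does natively) can be mimicked with brute-force cosine search in pure Python. The records, vectors, and filter below are toy data, not zvec's actual API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def filtered_search(query, records, where, k=3):
    """Apply the metadata predicate BEFORE scoring, so only matching
    records are ranked; records are (vector, metadata) pairs."""
    candidates = [(vec, meta) for vec, meta in records if where(meta)]
    candidates.sort(key=lambda r: cosine(query, r[0]), reverse=True)
    return [meta for _, meta in candidates[:k]]
```

Filtering first is the design choice the benchmark numbers reflect: post-filtering after a pure vector index (the FAISS-style approach) has to over-fetch and discard, while a fused index prunes non-matching records before they are ever scored.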
ABC@Ubunta·
Building AI agents on top of heavy data systems exposes a harsh truth: LLM cost is rarely the model — it's uncontrolled context. Most token waste isn't intelligence failure. It's architectural indiscipline.

- Letting an agent explore tables freely is architectural laziness. Expose schema, constraints, and edge cases first. Extract a structural blueprint once. Never let it repeatedly scan raw tables unless absolutely required.
- Feeding entire log files and asking the model to "find insights" amplifies noise. Engineers should first isolate suspicious patterns. The LLM should reason over curated slices, not unfiltered system output.
- Long documentation degrades reasoning quality. Provide the exact relevant section plus intent annotations. Context precision beats context volume every time.
- Streaming systems mutate continuously. Agents will chase variance and exhaust tokens. Capture deterministic snapshots or bounded time windows instead of analyzing moving targets.
- Pre-compute row counts, distributions, null ratios, schema diffs. Models reason far better over summaries than millions of rows they cannot truly "see."
- Agents that don't checkpoint re-derive prior conclusions and re-consume tokens. Persist intermediate findings aggressively.
- Schema-as-context beats data-as-context. A concise DDL with constraints and relationships carries more reasoning signal than thousands of sampled rows.

The pattern is simple: AI agents are not explorers — they are amplifiers. If you amplify noise, you pay for noise. In serious data engineering systems, the human remains the entropy controller.
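The "pre-compute row counts, distributions, null ratios" bullet, as a minimal profiling sketch. This only covers row count and per-column null ratio; a real pipeline would add value distributions and schema diffs:

```python
def profile_table(rows: list[dict]) -> dict:
    """Pre-compute the compact summary an agent should see instead of
    raw rows: total row count plus per-column null ratio."""
    n = len(rows)
    columns = rows[0].keys() if rows else []
    null_ratio = {
        col: sum(1 for r in rows if r.get(col) is None) / n
        for col in columns
    }
    return {"row_count": n, "null_ratio": null_ratio}
```

A summary like this is a few hundred tokens regardless of table size, which is exactly the asymmetry the post is pointing at: the model reasons over the profile once instead of re-scanning rows it cannot truly "see".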
ABC@Ubunta·
One of the clearest explanations of what OAuth and OIDC really are:

OIDC is functionally equivalent to "magic link" authentication. We send a secret to a place that only the person trying to identify themselves can access, and they prove that they can access that place by showing us the secret. That's it. The rest is just accumulated consensus: part bikeshedding (agreeing on vocabulary, etc.), part UX, and part making sure that all the specific mechanisms are secure.

📌 OAuth: leaflet.pub/p/did:plc:3vdr…

This shows how a very simple core concept can grow into something that looks very complex. As engineers, it's easy to get lost in the complexity and forget that the underlying idea is actually quite simple.
ABC@Ubunta·
AI is trained on code written by humans. And yet we expect it to outperform the humans who trained it. Sounds contradictory — but think about it. AI doesn’t get tired. It doesn’t forget patterns. It has seen more repositories, edge cases, and architectural styles than any single engineer ever will. And it’s improving at a pace we’ve never experienced. Today, it reflects us. Tomorrow, it may out-optimize us. I build healthcare data engineering platforms. Every day I pair with AI to design pipelines, debug infrastructure, and prototype faster than I could alone. It’s not replacing engineers. But it is raising the bar. The real question isn’t: “Can AI write better code?” It’s: Are we building engineers who can evaluate, constrain, and direct AI — especially when correctness, compliance, and data integrity actually matter? Because the edge won’t come from typing speed.
ABC@Ubunta·
@saen_dev True, but this problem remains with or without LLMs.
Saeed Anwar@saen_dev·
@Ubunta you cannot prompt your way out of a production incident. debug fluency and infra knowledge stay non-negotiable no matter how much LLM-generated code is in your stack. at 2am when the pipeline fails, you need to know which component broke — not which component looks like it broke.
ABC@Ubunta·
Data Engineering with LLMs: What Actually Matters

- Read the code. LLMs write it, you own it. Debug fluency is non-negotiable.
- Know your infrastructure cold. You can't prompt your way out of architectural ignorance.
- Map your data flows. Know exactly which components are data-hungry and why.
- Test relentlessly. Hand-validated test suites, actively monitored — not "it looks right."
- Use LLMs intentionally. They're a tool, not a strategy. Don't force them where they don't fit.
- Skip the code-quality theater. Stop debating whether LLM code is "good enough." Ship, measure, iterate.
- Optimize for outcomes. Elegant code that doesn't deliver is just decoration.
- Layer security and testing heavily. Infrastructure control is your best defense against bad generated code.
- Stay model-agnostic. Loyalty to an LLM is a liability. Use what solves the problem.
- Kill your darlings. No feature is personal. If something better exists, switch fast.
ABC@Ubunta·
MCP belongs at trust boundaries — not everywhere in your Data Engineering stack.

In regulated systems, architecture isn't about elegance. It's about control planes, trust domains, and audit surfaces.

When MCP makes sense: use it when the agent crosses a security or governance boundary.

- Querying a regulated PostgreSQL clinical database
- Executing SQL against Snowflake / Redshift / BigQuery
- Reading or writing to Iceberg tables over S3 with IAM enforcement
- Accessing PHI in object storage
- Calling external EHR or vendor APIs

These are not function calls. They are privileged operations. MCP formalizes that boundary — giving you independent auth scopes, structured audit logging, network isolation, rate limiting, and clear ownership. In healthcare, every such boundary must be defensible to security and compliance teams. MCP makes the blast radius explicit.

When MCP adds unnecessary complexity: if the agent is running DuckDB locally, executing Polars transformations, performing schema validation, or calling a Python library inside the same runtime and IAM boundary — that's intra-runtime orchestration, not integration. Adding MCP here introduces serialization overhead, network hops, distributed error handling, and extra components to threat-model. You increase compliance surface area without improving security.

For these cases, sandbox the execution directly. Tools like @pydantic Monty let you run untrusted code in isolation without a protocol layer — containment without the distributed systems tax.

The practical rule:

- Crosses a trust boundary — MCP
- Same runtime, same security domain — direct execution, sandboxed if needed

In regulated data engineering, fewer unnecessary boundaries mean smaller attack surface and simpler audits. Not everything needs to be a server. Sometimes the safest architecture is the simplest one.
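The practical rule at the end can be written down as a tiny decision function. The two `Operation` fields are an assumed simplification of what a real threat model would track:

```python
from dataclasses import dataclass

@dataclass
class Operation:
    crosses_iam_boundary: bool    # different credentials / network domain
    touches_regulated_data: bool  # PHI, clinical DBs, external EHR APIs

def integration_style(op: Operation) -> str:
    """Encode the rule from the thread: MCP at trust boundaries,
    direct (sandboxed) execution inside the same runtime and IAM domain."""
    if op.crosses_iam_boundary or op.touches_regulated_data:
        return "mcp"
    return "direct-sandboxed"
```

Trivial as code, but useful as a review checklist: if neither flag is true, a proposed MCP server is probably just added compliance surface.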
ABC@Ubunta·
Currently immersed in building a clinical Protocol Writing Agent — generating clinical study protocols from prior protocols, ICH guidance, and therapeutic-area standards. The hardest problem isn't generation. It's context selection.

- Full context loading works shockingly well when source material is small (≤20 pages). The model reasons holistically. No retrieval misses. But clinical reality isn't 20 pages. It's ICH E6 + indication-specific guidance + legacy protocols + internal templates. You hit context limits fast — and even before that, attention dilution kicks in.
- Vector RAG (pgvector + chunking) scales cleanly, but medical documents expose a real flaw: embedding similarity ≠ section-level relevance. Eligibility, dosing, safety, PK — they all mention the same drug and population. When you need exclusion criteria, vector search returns a pharmacokinetics paragraph. Technically similar. Practically useless.
- Structure-aware navigation (PageIndex-style reasoning trees) performs better for regulatory material — leveraging document hierarchy instead of embeddings. Slower, but noticeably more accurate in compliance-heavy writing.
- zvec caught my attention for raw speed — in-process, sub-ms retrieval. Promising at scale. But fast irrelevant context is still irrelevant.

Hybrid retrieval wins: fast coarse filtering → structural narrowing → focused deep reading.

In regulated environments, retrieval errors aren't cosmetic. They become protocol deviations or regulatory findings. Building AI agents in regulated environments needs more experimentation!
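The hybrid pipeline (coarse filtering, then structural narrowing, then focused reading) might be sketched like this, with cheap term overlap standing in for vector search and a toy two-section corpus; section labels and fields are illustrative:

```python
def hybrid_retrieve(query_terms: set[str], section_type: str,
                    corpus: list[dict], k: int = 2) -> list[dict]:
    """Three-stage retrieval: (1) fast coarse filter by term overlap,
    (2) structural narrowing to the document-section kind actually
    needed, (3) focused ranking of the few survivors."""
    # 1. fast coarse filtering (a vector index would go here)
    coarse = [d for d in corpus
              if query_terms & set(d["text"].lower().split())]
    # 2. structural narrowing: drop similar-but-wrong section types,
    #    e.g. the PK paragraph when exclusion criteria were requested
    narrowed = [d for d in coarse if d["section"] == section_type]
    # 3. focused deep reading: rank survivors by overlap strength
    narrowed.sort(key=lambda d: len(query_terms &
                                    set(d["text"].lower().split())),
                  reverse=True)
    return narrowed[:k]
```

Stage 2 is what plain vector RAG lacks in the failure mode described above: both sections mention the same drug, so similarity alone cannot separate them, but the document's own structure can.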