Ilya Babashin

210 posts

@ILYA_babay

Building MedEvidence: PubMed AI for clinicians. Every citation real & verifiable. No hallucinations. Backend eng. Building in public. https://t.co/ohkssWvvtU

Joined February 2026

35 Following · 8 Followers

Pinned Tweet

Ilya Babashin@ILYA_babay·30 Apr

Built MedEvidence on one promise: zero fabricated PubMed citations. Today I ran it through 30 clinical cases. Here's what I found 👇

Ilya Babashin@ILYA_babay·8h

@PhysicianLogic2 @EricTopol @AnnalsofIM Sharp framing. The missing axis is verifiability. Reliable systems get ignored when reasoning is opaque; unreliable ones get trusted when output sounds confident. Both failure modes collapse when every claim ships its evidence chain, verifiable in seconds.

Physician Logic Squared@PhysicianLogic2·8h

@EricTopol @AnnalsofIM Clinical AI fails when trust and reliability are mismatched. A reliable system that clinicians ignore is wasted; an unreliable system they trust is unsafe.

Eric Topol@EricTopol·23h

New @AnnalsofIM: "The Human Factor in Clinical AI: Why Technology Alone is Not Enough"
Gets into the trust issue and concludes: 'The most important question in medical AI may not be “how accurate is the algorithm?” but rather “how do we calibrate the relationship between clinician and machine?”'
acpjournals.org/doi/10.7326/AN…

Ilya Babashin@ILYA_babay·8h

@basilenjei Multi-phase validation is the right backbone. The explanation layer worth pushing hardest on: it has to bottom out in cited literature, not just feature importance. SHAP plots don't convince clinicians — "this risk was raised because of these PMIDs" does.

Basile Njei, MD, MPH, PhD@basilenjei·23h

A Multi-Phase Protocol for Developing and Validating an Explainable AI Framework for Continuous Monitoring, Risk Stratification, and Clinical Decision Support in Primary Biliary Cholangitis - PubMed pubmed.ncbi.nlm.nih.gov/42184424/

Ilya Babashin@ILYA_babay·8h

@doctorbhargav "Reduce error under load" is the metric that maps cleanest to traceability. At full clinical capacity, a confident suggestion with no citation chain is the highest-risk artifact in the room. AI that ships sources first turns verification into seconds, not minutes.

Bhargav Patel, MD, MBA@doctorbhargav·8h

A tool that increases RVUs but pushes doctors out of practice is not success.

Healthcare AI ROI should answer three things:
- Does it keep clinicians working
- Does it expand patient access
- Does it reduce error under load

Revenue is the weakest proxy.

Ilya Babashin@ILYA_babay·8h

@AIForDoctors Determinism under complexity is the right framing. The trick: every module ships a contract (input → output → source trail), and the orchestrator never collapses provenance. Once you can replay why each claim appeared, debugging stops being archaeology.
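
The contract idea in this reply (input → output → source trail, with provenance never collapsed) can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not a description of how MedEvidence is built: the names `Claim`, `run_pipeline`, `retrieve`, and `rephrase` are hypothetical, and the PMID is a made-up placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    sources: tuple      # identifiers of the evidence behind the claim (hypothetical format)
    trail: tuple = ()   # names of the modules that touched this claim, in order


def run_pipeline(modules, claims):
    """Run (name, module) pairs in order and append each module name to the trail."""
    for name, module in modules:
        checked = []
        for claim in module(claims):
            # Contract: no module may emit a claim that has lost its sources.
            if not claim.sources:
                raise ValueError(f"{name} emitted a claim without sources: {claim.text!r}")
            checked.append(Claim(claim.text, claim.sources, claim.trail + (name,)))
        claims = checked
    return claims


# Hypothetical modules, used only to exercise the orchestrator.
def retrieve(_claims):
    return [Claim("Drug A lowered event rates in trial X", sources=("PMID:0000001",))]


def rephrase(claims):
    return [Claim(c.text.replace("lowered", "reduced"), c.sources, c.trail) for c in claims]


result = run_pipeline([("retrieve", retrieve), ("rephrase", rephrase)], [])
for claim in result:
    print(claim.text, claim.sources, claim.trail)
```

Because the trail is appended to rather than overwritten, replaying why a claim appeared means reading `trail` left to right and rerunning those modules on the recorded inputs.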

Clinical AI@AIForDoctors·1d

@ILYA_babay Completely agree. At that scale, modularity becomes less about software elegance and more about deterministic behavior under complexity.

Clinical AI@AIForDoctors·2d

One of the least discussed reasons medical AI systems fail reproducibility tests is poor architectural design. Not weak algorithms. Not insufficient data. Architecture.

Many healthcare AI pipelines are built as tightly coupled systems where:
- preprocessing
- feature extraction
- model inference
- visualization
- alerting
- deployment logic
are deeply intertwined. That creates a major reproducibility problem. A small upstream change can silently alter downstream clinical predictions.

For example:
- ECG filtering changes waveform morphology
- normalization logic changes laboratory distributions
- timestamp alignment shifts temporal relationships
- missing-data handling alters risk scores
Suddenly the “same” model no longer behaves the same way.

This becomes especially important in clinical environments. A deterioration prediction model trained in one ICU may fail in another hospital because:
- device sampling frequencies differ
- EHR timestamp behavior changes
- laboratory coding varies
- telemetry packet loss patterns differ
- preprocessing pipelines are inconsistent
What appears to be model failure is often infrastructure inconsistency.

This is where modularity becomes essential. A modular healthcare AI architecture separates systems into independently testable components:
- signal acquisition
- preprocessing
- feature extraction
- inference
- alert generation
- dashboard integration
Each module can then be validated independently. That dramatically improves reproducibility, debugging, and deployment reliability.

There is also a computational advantage. Modular systems support:
- distributed computing
- scalable streaming pipelines
- component-level optimization
- fault isolation
- safer model upgrades
This matters enormously for ICU telemetry, ECG monitoring, and continuous patient surveillance systems.

The most important insight may be this: Reproducibility is not purely a statistics problem. It is fundamentally a systems engineering problem. Healthcare AI discussions often focus heavily on model architecture while underestimating pipeline architecture. But in clinical deployment, infrastructure consistency may matter more than marginal accuracy gains.

How much of healthcare AI reproducibility failure do you think comes from the model itself versus the surrounding data pipeline and infrastructure?

#HealthcareAI #ClinicalAI #AIInMedicine

Ilya Babashin@ILYA_babay·1d

@felixcoettlmd The build layer shift is real, but the hardest unsolved piece isn't orchestration — it's grounding. Agents that act on medical data need every claim traced to peer-reviewed sources, not just model confidence. Execution platforms without citation layers will hit a trust wall fast.

Felix C. Öttl, MD@felixcoettlmd·1d

Google didn't just announce models at I/O 2026. They shipped an agent execution platform. For clinical AI, that changes the build layer entirely.

Ilya Babashin@ILYA_babay·1d

@martinvars @gbiondizoccai "Make uncertainty visible" is the key line here. In medical AI, the worst outcomes come from confident-sounding answers with no source trail. Every clinical output should show exactly which evidence it drew from — and what it doesn't know. Constraints build trust.

Martin Varsavsky@martinvars·1d

@gbiondizoccai Good clinical AI design is mostly constraint design: make the safe path obvious, make uncertainty visible, and make escalation easy. Simplicity is not cosmetic in medicine. It is safety.

Giuseppe Biondi-Zoccai@gbiondizoccai·1d

What is the role of design in clinical AI? How can effective systems be defined by simplicity, interpretability, and clinical relevance? jamanetwork.com/journals/jama/…

Ilya Babashin@ILYA_babay·4d

@DavidWienerMD @ASE360 @doximity Embedding society guidelines into broad-reach AI is meaningful. Next trust layer is per-output: does each surfaced recommendation pin to a specific ASE statement, with strength and applicability preserved? 'Verified' goes only as far as inline source attribution per claim.

David H. Wiener, MD@DavidWienerMD·5d

This is big! A new agreement brings @ASE360's trusted #echofirst guidelines to @doximity Ask, the evidence-based, physician-verified #AI platform reaching >80% of US clinicians. We deliver clinical AI physicians can trust. Learn more at asecho.org/news/ase-guide…

Ilya Babashin@ILYA_babay·4d

@MoAImam Time win is real. Trust ceiling on doc AI sits at a different layer: claim-to-source traceability inside the note. 'Mild' becomes 'resolved,' 'consider' becomes 'recommended.' Sentence-level pointers to the transcript make adverse-event review tractable.

Mo Imam@MoAImam·4d

🔹 AI-Assisted Clinical Documentation: Giving Clinicians Back Their Most Valuable Resource

Time. That is what the administrative burden of clinical documentation costs, and it is the resource in shortest supply in modern healthcare. AI-assisted documentation is not a futuristic concept. It is a working technology already saving clinicians hours every week, and its potential to improve both clinical capacity and clinician wellbeing is one of the most immediately tangible benefits AI offers medicine.

The administrative burden of clinical documentation has grown steadily as clinical complexity, regulatory requirements, and documentation standards have increased. Clinic letters, operative notes, discharge summaries, referral responses, PROM data entry, audit forms: these consume a significant proportion of every clinician's working day. In the NHS, consultant physicians and surgeons spend, on average, a substantial part of their contracted hours on documentation rather than direct patient care. This is a resource allocation problem with a technological solution.

Ambient voice recognition (AI that listens to a clinical consultation and generates a structured, accurate summary without the clinician interrupting the conversation to type) is arguably the most immediately impactful AI application currently available in clinical practice. The clinician speaks naturally with the patient. The note is drafted in the background. Review and editing takes minutes rather than the full time of dictation and transcription.

"Every minute a clinician spends on documentation that AI could assist with is a minute not spent with a patient. At scale, those minutes matter enormously for healthcare capacity and quality."

The governance framework for AI-assisted documentation is critical: the clinician who reviews, edits, and approves the AI-generated note retains full professional responsibility for its accuracy. This accountability structure is appropriate and important. It also means that the efficiency gain is real only if clinicians review AI-generated notes critically rather than approving them passively. Building that culture of active review alongside the time-saving benefit is the implementation challenge that deserves as much attention as the technology itself.

Professor Mo Imam | MD · PhD · FRCS (Tr&Orth)
#TheArmDoc | MoImam.co.uk
#ClinicalDocumentation #AmbientAI #NHSEfficiency #AIinMedicine #TheArmDoc

Ilya Babashin@ILYA_babay·4d

@yuyinzhou_cs Active retrieval is the right direction. The quieter failure agentic clinical flows still leak: right chunks retrieved, synthesis drifts on strength or direction of effect. Per-claim entailment closes that. Built this into medevidence.pro/go/x?c=hot-thr…

Yuyin Zhou@yuyinzhou_cs·4d

Clinical AI shouldn't just consume evidence handed to it; it should actively seek evidence, e.g., linking multimodal data, analyzing patient context, and retrieving external knowledge to support clinical reasoning 🔎

Introducing ClinSeekAgent, our automated agentic framework for active multimodal evidence seeking in clinical reasoning, achieving +15.1 F1 compared to #claude Opus 4.6.

Paper: huggingface.co/papers/2605.20…
Code: github.com/UCSC-VLAA/Clin…

Ilya Babashin@ILYA_babay·4d

@LastFraction0 @nick_kapur That hybrid is exactly the right framing. My output has structured sections (dx = supports/argues against, PMIDs as claim slots), prose between. Prompt-enforced not EBNF, same shape. SOAP needs harder constraints. Whether free prose holds at note-gen scale - open question.
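
A rough Python sketch of what "structured sections with PMIDs as claim slots" could look like once the output has been parsed. This is an assumption-heavy illustration, not the author's schema: `EvidenceSlot`, `DiagnosisSection`, and `validate_section` are hypothetical names, the PMIDs are placeholders, and the 1 to 8 digit pattern is only a sanity check on shape.

```python
import re
from dataclasses import dataclass, field

# Assumption: PMIDs are plain digit strings; this checks shape only, not existence.
PMID_PATTERN = re.compile(r"\d{1,8}")


@dataclass
class EvidenceSlot:
    statement: str
    pmids: list


@dataclass
class DiagnosisSection:
    diagnosis: str
    supports: list = field(default_factory=list)
    argues_against: list = field(default_factory=list)


def validate_section(section, retrieved_pmids):
    """Return a list of problems; an empty list means every slot cites a retrieved PMID."""
    problems = []
    for label, slots in (("supports", section.supports),
                         ("argues_against", section.argues_against)):
        for slot in slots:
            if not slot.pmids:
                problems.append(f"{label}: no PMID for {slot.statement!r}")
            for pmid in slot.pmids:
                if not PMID_PATTERN.fullmatch(pmid):
                    problems.append(f"{label}: malformed PMID {pmid!r}")
                elif pmid not in retrieved_pmids:
                    problems.append(f"{label}: PMID {pmid} was not in the retrieved set")
    return problems


section = DiagnosisSection(
    diagnosis="example diagnosis",
    supports=[EvidenceSlot("finding A fits", ["11111111"])],
    argues_against=[
        EvidenceSlot("finding B is atypical", ["22222222"]),
        EvidenceSlot("statement with no source", []),
    ],
)
problems = validate_section(section, retrieved_pmids={"11111111"})
for problem in problems:
    print(problem)
```

A check like this only constrains the slots. Whether the free prose between sections stays faithful is the open question the reply raises.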

Last Fraction Zero@LastFraction0·5d

@ILYA_babay @nick_kapur EBNF looks like it can be used for templating an ontological map; enforced section headers and formal structuring of lines into prose (free generation) and datatype (verified facts) segments. I would hope leaving prose segments free wouldn't break phrasing but that needs testing.

Nick Kapur@nick_kapur·17 May

An auditor for the Ontario, Canada government found that AI agents tasked with turning doctor/patient conversations into structured notes routinely hallucinated false treatments, replaced drug names with entirely different drugs, and missed crucial information

Ilya Babashin@ILYA_babay·5d

@Daezi7 Right that AI aggregates and clinicians verify. The trap most PubMed-connector flows still fall into: real PMIDs cited, but synthesis quietly flips strength or direction of effect. Citation real, claim drifted. The audit lives at the claim layer, not the source layer.
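
To make "citation real, claim drifted" concrete, here is a toy Python sketch of a claim-layer audit. It assumes direction and strength labels have already been assigned to both the source finding and the synthesized claim; in practice that labeling would need an entailment model or human review. `Assertion`, `audit_claims`, the label vocabulary, and the PMIDs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    pmid: str
    direction: str  # hypothetical labels: "benefit", "harm", "no_effect"
    strength: str   # hypothetical labels: "suggests", "shows"


def audit_claims(source_findings, synthesized):
    """Compare each synthesized claim with the source finding it cites."""
    by_pmid = {finding.pmid: finding for finding in source_findings}
    drift = []
    for claim in synthesized:
        source = by_pmid.get(claim.pmid)
        if source is None:
            drift.append((claim.pmid, "citation not in retrieved set"))
        elif claim.direction != source.direction:
            drift.append((claim.pmid, "direction of effect changed"))
        elif claim.strength != source.strength:
            drift.append((claim.pmid, "strength changed"))
    return drift


sources = [
    Assertion("11111111", "benefit", "suggests"),
    Assertion("22222222", "no_effect", "shows"),
]
synthesized = [
    Assertion("11111111", "benefit", "shows"),   # real citation, strength inflated
    Assertion("22222222", "benefit", "shows"),   # real citation, direction flipped
    Assertion("33333333", "harm", "shows"),      # citation never retrieved
]
drift = audit_claims(sources, synthesized)
for pmid, problem in drift:
    print(pmid, problem)
```

A source-layer check (does the PMID exist?) would flag only the third claim here. The first two are the ones that need the claim-layer comparison.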

Daezi (Away)✨🦋✨@Daezi7·5d

The alternative is perplexity or Claude (pubmed connector) but u still need to verify, validate & cross reference. Even with right instruction & prompts - AI can & will hallucinate. U can use AI as aggregator of data sources but u still need to use ur own brain to work.

S.🎧@1ssve

Just saw a comment that said “I’m a student. Can someone tell me an alternative to ChatGPT?” HELLO??? YOUR BRAIN?????? the fuck?????????

Ilya Babashin@ILYA_babay·5d

@MaziyarPanahi "The loop is the product" — sharp framing. Where similar agentic flows still leak: signature gates output, not per-claim faithfulness. Retrieved chunk correct, synthesis drifts on strength/direction. Built that into medevidence.pro/go/x?c=builder….

Maziyar PANAHI@MaziyarPanahi·5d

OpenMed Agent + Claude Opus 4.7 just ran a 14-step special-pathogen ED workup on a synthetic VHF case. Live CDC + WHO + PubMed retrieval. Evidence-weighted differential. Clinician signature required before any artifacts finalized. The loop is the product.

Ilya Babashin@ILYA_babay·5d

@HealthcareAIGuy @EvidenceOpen Synthesizing EHR + literature in one workflow is the right direction. The harder layer is synthesis itself — retrieved evidence can be cited correctly while the surfaced claim quietly shifts strength or direction. Per-claim source entailment closes the trust loop.

Healthcare AI Guy@HealthcareAIGuy·5d

OpenEvidence just partnered with Cedars-Sinai to bring patient context from Epic directly into the platform Clinicians can now use agentic clinical AI to synthesize every detail of the patient record alongside the latest evidence from the medical literature in a single workflow

Ilya Babashin@ILYA_babay·5d

@LastFraction0 @nick_kapur Closer in spirit than in mechanics. I use soft heuristics (relevance guard, coverage check on abstracts) where you have symbolic audit. No CFG, output is prompt-enforced. Built into medevidence.pro/go/x?c=lfz-thr… Question: how expressive can EBNF be before clinical phrasing breaks?

Last Fraction Zero@LastFraction0·5d

@ILYA_babay @nick_kapur Ahh, I get it now. This is, roughly, the pipeline you're describing?

Ilya Babashin@ILYA_babay·6d

@MatthewHellyar True — and there's a sibling gap that compounds it: knowing how to read the answer. Clinicians who demand per-claim citations to PubMed and surface conflicting evidence get one tool. The ones who trust fluent prose get a different one. Same model, two ceilings.

Matthew Hellyar@MatthewHellyar·20 May

Hot take from inside a clinical AI trial: The model is not the bottleneck. The interface is not the bottleneck. The regulator is not the bottleneck. The bottleneck is that almost no clinician has been trained to ask an agentic system a clinical question — and the gap between the ones who learn and the ones who do not is going to be the widest gap medicine has ever produced. This week’s Agentic Report makes the full case. Free weekly dispatch from the trial floor → respocareinsights.io/article/puttin…

Ilya Babashin@ILYA_babay·6d

@LastFraction0 @nick_kapur Fair — NL fluency unavoidably uses weight-tokens for grammar and coherence. The discipline lives on the claim-carrying tokens: every factual assertion has to trace to a retrieved chunk. Weights handle the connective tissue; entailment check catches the drift.

Last Fraction Zero@LastFraction0·6d

@ILYA_babay @nick_kapur What you're hoping for is an outcome where the report is actually a template where wildcards are filled in only with tokens from the documents being consolidated, but, the template creation itself is part of the synthesis and that risks inclusion of unwanted parameters.

Ilya Babashin@ILYA_babay·19 May

@LastFraction0 @nick_kapur Synthesis ≠ free generation though. The constraint isn't 'no LLM' — it's 'LLM only synthesizes from the transcript itself.' Whole-cloth NL works when source material is grounded; the failure is when the model fills in content that was never observed.

Last Fraction Zero@LastFraction0·19 May

@ILYA_babay @nick_kapur This isn't merely data entry into fixed fields where it can pull relevant strings and record them into those relevant fields. This is whole cloth natural language synthesis. It has to generate from training data if you want output that isn't just a list of relevant strings.

Ilya Babashin@ILYA_babay·19 May

@RoupenMD Real problem — root cause is that 'reference library' is a configuration, not a contract. Standardization needs two layers: which sources are eligible AND how disagreements between them surface to clinicians at point of use, not buried in vendor docs.

Roupen Odabashian@RoupenMD·19 May

We need an urgent study on the hidden inconsistencies across AI Clinical Decision Support (CDS) tools. The problem isn't hallucination, it's isolated reference libraries. If you ask two tools the same question, both give "correct" answers based on their own sources. But feed Tool A's answer into Tool B, and Tool B flags it as wrong. This is a massive blind spot in medical AI. We desperately need standardized guidelines defining exactly which references CDS tools must use and how they cite them. Let me know if there is anything published on this!

Ilya Babashin@ILYA_babay·19 May

@sanatmishra7 @OpenAI The judge-fragility surface compounds with retrieval-grounding: deleting a negation flips the judge AND passes entailment checks downstream. Stress-test both layers — otherwise they fail in the same direction and the audit looks clean.

Sanat Mishra@sanatmishra7·19 May

Everyone is racing to build medical AI agents. Almost nobody is stress-testing the judges grading them.

We tested frontier models as judges on @OpenAI HealthBench answers. Then we hit the answers with adversarial perturbations:
- Delete a negation.
- Tamper with clinical values.
- Reverse the conclusion.
That’s it. Most LLM judges missed the mistakes. @claudeai Opus 4.7 missed an obvious medical lie 83% of the time. @Gemini 3.1 Pro Preview was the best model we tested and still missed 43%.

Why this matters: These judges are becoming reward models, safety filters, eval graders, and training signals for healthcare agents. If the judge can’t catch the lie, the model trained on that judge learns the blind spot. The failure does not stay in the eval. It propagates.

First post in our @getbiostack x VibeOps Research series on where LLM-as-judge breaks in medical AI. Full piece: getbiostack.com/blog/trojan-ho…

Thumbnail artwork inspired by H. Matisse.
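
The negation-deletion perturbation mentioned in this post is simple enough to sketch in Python. What follows is a toy harness under loud assumptions, not the authors' evaluation code: `delete_first_negation` and `miss_rate` are hypothetical names, the negation list is deliberately tiny, and the "judges" are stand-in functions rather than LLM calls.

```python
import re

# Assumption: a very small negation vocabulary, enough for illustration only.
NEGATIONS = re.compile(r"\b(not|no|never)\s+", flags=re.IGNORECASE)


def delete_first_negation(text):
    """Return the text with its first negation word removed, or None if it has none."""
    perturbed, count = NEGATIONS.subn("", text, count=1)
    return perturbed if count else None


def miss_rate(judge, answers):
    """Fraction of perturbed answers that the judge still accepts as correct."""
    perturbed = [p for p in map(delete_first_negation, answers) if p is not None]
    if not perturbed:
        return 0.0
    missed = sum(1 for p in perturbed if judge(p))
    return missed / len(perturbed)


answers = [
    "Do not give drug A with drug B.",
    "There is no benefit from therapy C.",
]


def lenient_judge(text):
    """Stand-in judge that accepts anything fluent-looking."""
    return True


def reference_judge(text):
    """Stand-in judge that accepts only exact reference answers."""
    return text in answers


print(delete_first_negation(answers[0]))
print(miss_rate(lenient_judge, answers))
print(miss_rate(reference_judge, answers))
```

Swapping the stand-in functions for a real judge call would turn this into a small regression test for that failure mode. The same harness shape extends to the other two perturbations once there is a way to generate them.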
