

emi
1.8K posts

@gpuemi
co-founder @wafer_ai -- agents that optimize gpu. math @uchicago





40% of fatherhood is walking around the house, turning off lights.

GLM 5.2 costs $1.40/4.40 per Mtok at 40 tok/sec and people seriously consider buying GPU rigs for it

🚨 BREAKING: wafer now runs the fastest, lowest-latency GLM-5.2 anywhere, ranked #1 across every provider on Artificial Analysis:
⚡ 222 output tok/s (next best: 173)
⚡ 12.6s end-to-end response time (next best: 16.9s)
try it: app.wafer.ai






We introduce TherapeuticsBench Preclinical Pharmacology (TxBench-PP), a verifiable benchmark for small-molecule preclinical pharmacology and the first focused slice of a broader benchmarking effort across drug-discovery stages and therapeutic modalities. TxBench-PP tests whether agents can recover accurate conclusions from realistic assay artifacts rather than memorized facts from the literature. The benchmark contains 100 evaluations indexed by program stage, assay type, and task structure, spanning mechanism-of-action (MoA) and pharmacodynamic (PD) reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy.
The strongest model-harness configuration was Claude Opus 4.8 + Pi at 59.3%, followed by GPT-5.5 + Pi at 55.3%.
While experiments are rate-limited by natural processes, human decisions and organizational consensus often make up significant components of program timelines in drug discovery. Agents promise to accelerate discovery, development, and translation by compressing these interpretation and decision-making loops. However, the practical use of agentic systems in industrial workflows requires standardized and trusted methods of evaluating performance. This is especially challenging in drug discovery because the ecosystem is a sprawling landscape of assay categories, development stages, therapeutic modalities, and decision types. Benchmarks must therefore measure realistic tasks while providing focused treatment of the many local scientific judgments that make up the biotech ecosystem.
We evaluated 16 model-harness configurations, comprising 11 models across three agent harnesses, on 100 preclinical pharmacology tasks. Each configuration was run three independent times per task, yielding 4,800 agent trajectories. Performance varied by program stage: accuracy ranged from 27% in screening and hit prioritization to 55% in drug response. Difficult program stages involved decisions across QC, statistics, and chemical or biological judgment of molecular candidates. Trajectory analysis reveals gaps in scientific judgment. Failures included incorrect perception of assay outputs, reliance on literature priors over supplied evidence, and assay-specific reasoning mistakes.
Manuscript, results, and a subset of evals/trajectories are available below:

minimizing time to first token is CRUCIAL for voice AI deployment. optimizing GLM-5.1 for Neon Health took ttft 800ms → 550ms at 25% higher peak load.
what we learned:
• kv locality is a scheduling primitive (95%+ hit rate)
• prefill admission > prefill speed (chunked prefill)
• short decode steps beat speculative decoding under bursts
• stable first token > max gpu utilization
• optimize the client-observed number, not server-side ttft
under 2 weeks, under a BAA, US-only residency.
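a minimal sketch of the last lesson, measuring ttft where the client sees it rather than where the server reports it. nothing here is from the actual wafer or Neon Health stack: `measure_client_ttft`, `fake_stream`, and the injectable `clock` are hypothetical names made up for illustration, and the stream is a stand-in for whatever streaming client is really in use.

```python
import time
from typing import Callable, Iterable, Iterator, List, Optional


def measure_client_ttft(
    start_stream: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Return seconds from issuing the request to the first non-empty chunk.

    The timer starts before the request is issued, so connection setup,
    queueing, and network time are all included. A server-side ttft that
    starts at prefill admission would miss those.
    """
    t0 = clock()
    for chunk in start_stream():
        if chunk:  # skip empty keep-alive chunks
            return clock() - t0
    return None  # stream ended without producing any token


def fake_stream(queue_delay_s: float, tokens: List[str]) -> Iterator[str]:
    """Hypothetical stand-in for a streaming client: waits, then yields tokens."""
    time.sleep(queue_delay_s)
    yield from tokens


if __name__ == "__main__":
    ttft = measure_client_ttft(lambda: fake_stream(0.05, ["hel", "lo"]))
    print(f"client-observed ttft: {ttft * 1000:.0f} ms")
```

passing the clock in as a parameter keeps the measurement testable without real sleeps; in practice the same wrapper would sit around the real streaming call on the device or gateway closest to the user.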



today is yc demo day. just about a year ago, @gpuemi and i stepped onto that stage and presented wafer (f.k.a. herdora). what felt like the end of a chaotic batch turned out to be the beginning of everything that mattered. for everyone presenting today: enjoy the moment, celebrate how far you've come, and take the photos. wishing u the best, p26♥️


