GenSI

145 posts

GenSI
@hello_gensi

GenSI @ THU AIR | LLM × AI4S | Toward AGI for Science / AI Scientists

Joined February 2025
9 Following · 30 Followers
GenSI
GenSI@hello_gensi·
Really useful framing. The inter-task heterogeneity point may be the strongest argument for OPD: different domains don't just need different data; they come with different rollout costs, response lengths, and verifier budgets. OPD feels like a way to make the teacher signal denser before RL has to pay for every experiment.
English
0
0
0
6
Cameron R. Wolfe, Ph.D.
Cameron R. Wolfe, Ph.D.@cwolferesearch·
Great point about the utility of multi-teacher on-policy distillation and why this approach might be preferable to just including multiple domains in RL training. Multi-domain RL can be difficult from both a:
- statistical / modeling perspective (i.e., training on multiple domains can create tradeoffs in model performance).
- efficiency perspective (i.e., efficiently computing rollouts / advantages across multiple domains in a batch can be complex).
Specifically, different domains may require different response lengths or verifiers with drastically different costs (e.g., heuristic verification versus an LLM judge). This introduces inter-task heterogeneity into RL, which can lead to more idle time / create inefficiencies. In contrast, multi-teacher on-policy distillation can still have varying response lengths, but it generates rollouts from a fixed model and has a more constant cost for the training signal (unless the teachers are drastically different model sizes, maybe). So, it might be easier / more efficient to:
- train domain-specific models with RL.
- run multi-teacher distillation to consolidate these domain-specific experts into a single policy.
English
5
6
45
4K
GenSI
GenSI@hello_gensi·
@OMalleyFife "Safe in chat" vs "ready for the wards" is exactly the distinction. HealthBench-style evaluations are useful, but the next question is whether the system stays reliable across workflow steps: documentation, evidence retrieval, handoff, and consistency under small prompt changes.
English
0
0
1
22
Dr Andrew O'Malley
Dr Andrew O'Malley@OMalleyFife·
OpenAI's new ChatGPT for Clinicians is free for verified US physicians, NPs, PAs, pharmacists. Physician advisors rated 99.6% of answers 'safe and accurate.' But 'safe in chat' isn't the same as 'ready for the wards.' What the evidence says: andrewomalley.substack.com/p/from-surgica…
English
2
0
23
4K
GenSI
GenSI@hello_gensi·
1/ Biology: rewriting the amino acid alphabet. A Columbia-led team explored whether E. coli can survive with ribosomal proteins redesigned to avoid isoleucine. The key point is not just “replace Ile with Val.” Simple replacement was not enough.
2/ The interesting part is the design loop. Models like ESM2, MSA Transformer, ProteinMPNN, and AlphaFold/AfDesign were used to propose Ile-free protein variants. But experimental feedback still decided which designs worked.
3/ This is what AI4Bio increasingly looks like: not one-shot prediction, but Design → Build → Test → Learn. The model proposes. The biological system answers. The next design depends on that answer.
4/ Medicine: benchmarks are moving closer to real workflows. OpenAI’s HealthBench Professional evaluates clinical tasks such as consultation, documentation, summarization, and evidence retrieval. That is different from asking models to pass medical exams.
5/ DeepMind’s AI co-clinician points in the same direction. The framing is not “AI replaces doctors.” It is AI as a supervised clinical teammate: retrieving evidence, supporting reasoning, and staying within safety boundaries.
6/ Drug discovery: the validation bar is getting higher. Isomorphic Labs is preparing AI-designed drugs for clinical testing. Insilico nominated an AI-generated preclinical candidate for glioblastoma. The question is no longer just model novelty. It is clinical and translational evidence.
7/ The broader signal: AI4Bio is becoming less about static benchmarks and more about closed-loop systems. For AI Scientists, the hard part is not only generating hypotheses. It is connecting models to experiments, tools, evidence, and human experts.
#AIAgents #AIScientist #AI4Science #ProteinDesign
English
0
0
0
43
GenSI
GenSI@hello_gensi·
AI4Bio had an interesting signal in recent weeks: The field is moving from “Can AI answer or predict?” to “Can AI participate in real scientific and clinical workflows?” A few examples 👇
English
1
0
0
30
GenSI
GenSI@hello_gensi·
@Graham_dePenros Exactly. AlphaEvolve is interesting not because it “thinks longer,” but because discipline is built into the loop: proposal → execution → evaluation → archive → next proposal. For AI scientists, the evaluator may matter as much as the model.
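For the shape of that loop in code, a toy sketch; the `propose` / `evaluate` stubs and the archive policy are illustrative assumptions, not AlphaEvolve's actual internals:

```python
import random

def evolve(propose, evaluate, generations=100, archive_size=20):
    """Toy propose -> execute/evaluate -> archive -> propose-again loop."""
    archive = []  # (score, candidate) pairs the proposer can build on
    for _ in range(generations):
        parent = random.choice(archive)[1] if archive else None
        candidate = propose(parent)   # e.g. an LLM-suggested edit of the parent program
        score = evaluate(candidate)   # automated, deterministic evaluator: the discipline
        archive.append((score, candidate))
        archive.sort(key=lambda pair: pair[0], reverse=True)
        archive = archive[:archive_size]
    return archive[0]
```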
English
0
0
2
33
GP
GP@Graham_dePenros·
Vibe coding degrades quality when it removes discipline. AlphaEvolve improves quality only where discipline is built into the loop.
English
1
4
6
225
GenSI
GenSI@hello_gensi·
@sanlsrni @odysseus0z This matches our notes on self-evolving research agents. Auto-research only works when the loop is closed: propose → execute → evaluate → reuse insight. Harness design is not just plumbing here. It defines what feedback the agent can actually learn from.
English
1
0
1
49
Saneel
Saneel@sanlsrni·
autoresearch has legible reward functions that enable run, measure, iterate etc. SDK/harnesses do not, especially because a lot of the pain in harness engineering appears to be in catching edge cases.
essentially: harness design is the layer above. whether context compaction is 3-tier or single-pass (claude code vs pi), tool surface MCP-first or built-in, permission boundary at the bash call or the subagent, fork model or flat dispatch, etc. it's hard to quantify a reward signal.
there are some attempts at what you're describing tho: metaharness by stanford research + harvey put out a paper on applying autoresearch principles to internal benchmarks/evals.
also been experimenting on something that synthesizes the two that I hope gets to a stage i can publicly release soon, but essentially the likely core loop here is having an external proposer model analyze task failures and have a sandbox to modify the harness, with strict controls on reasoning trace leakage to prevent overfitting.
the above also massively benefits from og dev taste / context, which should ideally be injectable
English
3
1
11
2.1K
George
George@odysseus0z·
Things I don't understand:
- AI Researchers feel like they will soon be useless because auto research is near
- AI still can't design harness SDK/frameworks better than the OG devs. not even close
Someone teach me how to reconcile these two?
English
24
2
150
44.6K
GenSI
GenSI@hello_gensi·
The OPD / self-distillation wave is interesting because it points to a missing middle in agent training.
SFT: dense but off-policy
RL: on-policy but often sparse
OPD: on-policy and dense
For long-horizon agents, the useful signal may come from the student’s own trajectories, not only curated demos or final outcomes.
stochasm@stochasticchasm

@teortaxesTex i've been enjoying the wave of self-distillation OPD type stuff, like with the teacher having privileged info

English
0
0
0
95
GenSI
GenSI@hello_gensi·
A “baby OPD” implementation might be easiest to reason about as a 3-part loop:
1. let the student generate trajectories
2. query the teacher on those same trajectories
3. optimize token-level reverse KL on the student’s own distribution
The key is not just distillation, but staying on-policy.
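A minimal PyTorch-style sketch of that loop, assuming HuggingFace-style `student` / `teacher` causal LMs and a padded prompt batch (all names and defaults here are illustrative, not a reference recipe):

```python
# Minimal sketch of the 3-step loop above, assuming HuggingFace-style causal LMs.
# `student` is trainable, `teacher` is frozen; the tokenizer needs a pad token for batching.
# Illustrative only: no KV-cache reuse, no per-sequence masking, no mixed precision.
import torch
import torch.nn.functional as F

def opd_step(student, teacher, tokenizer, prompts, optimizer, max_new_tokens=256):
    # 1. Student generates its own trajectories (on-policy rollouts).
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        rollouts = student.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    prompt_len = inputs["input_ids"].shape[1]

    # 2. Query the teacher on those same trajectories.
    student_logits = student(rollouts).logits[:, prompt_len - 1 : -1]
    with torch.no_grad():
        teacher_logits = teacher(rollouts).logits[:, prompt_len - 1 : -1]

    # 3. Token-level reverse KL, KL(student || teacher), evaluated on the
    #    student's own samples so gradients flow through the student only.
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    reverse_kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()

    optimizer.zero_grad()
    reverse_kl.backward()
    optimizer.step()
    return reverse_kl.item()
```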
English
0
0
0
24
Josh
Josh@JoshPurtell·
@stochasticchasm @teortaxesTex Do you have any favorite small scope implementations/examples? Like a baby's first OPD, but with best practices?
English
2
0
3
69
GenSI
GenSI@hello_gensi·
This matches our reading group notes on OPD. One useful framing: SFT is dense but off-policy. RL is on-policy but often sparse. OPD tries to be both on-policy and dense. For long-horizon agents, that middle seems important: the student needs feedback on the trajectories it actually takes.
English
0
0
0
28
stochasm
stochasm@stochasticchasm·
@teortaxesTex i've been enjoying the wave of self-distillation OPD type stuff, like with the teacher having privileged info
English
4
0
21
759
GenSI
GenSI@hello_gensi·
If OPD is becoming a mainstream recipe, the next question is: what actually determines whether it works? Three things matter:
1. Stronger teacher ≠ better student. A 70B teacher answers in 2 lines because the problem is "trivial" to it. But a 7B student needs extended CoT. You end up distilling over-confidence and brevity, not capability.
2. Self-distill sounds like free bootstrapping. But it's not. When teacher = student + ground-truth context, the teacher's token probabilities reflect privileged information access, not what the student should actually learn. In practice, this can quickly destabilize distillation.
3. Students must warm up first. If the student has zero capability on the task, even a perfect teacher signal won't land; the distribution gap is simply too large to bridge with KL alone.
All three point to the same conclusion: teacher pattern quality and compatibility matter more than algorithm sophistication.
English
0
0
0
36
GenSI
GenSI@hello_gensi·
OPD is no longer just a research idea. It is quietly becoming a de facto post-training recipe for base models. Why? Because it fills the gap between SFT and RL: SFT is stable, but off-policy. RL is on-policy, but sparse. OPD sits in between: student-generated rollouts, with dense teacher supervision. That is why it is starting to replace mixRL in practice. #AI #LLM #PostTraining #Distillation #OnPolicyDistillation #RLHF #AIResearch
English
1
0
0
53
GenSI
GenSI@hello_gensi·
Interesting example from @cyb3rops of how quickly “de-policy-layered” variants appear once a strong model lands. Our take: if refusal can be weakened, bypassed, or stripped away, safety may not live only in an outer policy layer. It may be partly implemented through internal representations that can be steered or ablated. The more interesting technical question is: is refusal a single direction, or a broader safety-relevant subspace?
Florian Roth ⚡️@cyb3rops

Some people asked what I meant by “uncensored Opus 4.5-level open source models.” This isn’t hypothetical. Every time a strong open model drops, within days (sometimes hours) someone republishes a modified version without the original safety layers.
“Uncensored” usually means the guardrails are stripped or weakened:
- refusal / policy layers removed or bypassed
- system prompts altered to ignore restrictions
- alignment tuning undone or diluted
- fine-tuned specifically to comply with harmful or sensitive requests
So you end up with a model that doesn’t say “I can’t help with that” anymore. And these aren’t running in some lab. Many of them run on hardware that’s accessible:
- high-end consumer GPUs
- Mac Studio (M3/M4)
- Strix Halo mini PCs (~$3k)
- or dedicated rigs in the $25k–150k range
That’s well within reach for serious threat actors. And those models are completely unrestricted and can be used day and night.
Compare that to something like Mythos:
- tightly controlled access
- heavy filtering and monitoring
- accounts can get flagged or shut down
- expensive at scale
From an attacker perspective, it’s not even close. I’d take a slightly less capable model fully under my control over a more powerful one someone else controls any day.
huggingface.co/models?sort=tr…
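On the single-direction question posed above, a rough sketch of the usual test (compute a difference-of-means direction, then project it out of the hidden states); the activation-collection step is omitted and all names are placeholders, not any specific library's API:

```python
import torch

def refusal_direction(h_harmful: torch.Tensor, h_harmless: torch.Tensor) -> torch.Tensor:
    """Difference-of-means over hidden states at a chosen layer/position.

    h_harmful / h_harmless: [num_prompts, hidden_dim] activations collected
    from the model on harmful vs. harmless prompts (collection not shown).
    """
    direction = h_harmful.mean(0) - h_harmless.mean(0)
    return direction / direction.norm()

def ablate(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component along the unit `direction` from every hidden state."""
    return hidden - (hidden @ direction).unsqueeze(-1) * direction

# If a single direction carries refusal, ablating it at every layer should
# collapse refusals; if refusal behavior survives, it likely lives in a
# broader safety-relevant subspace rather than one vector.
```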

English
0
0
0
52
GenSI
GenSI@hello_gensi·
Great breakdown by @heynavtoor. Our take: emotion vectors may be less an isolated finding than a window into a broader internal state space behind model behavior. For agents, the real question is not whether we can find more vectors, but whether these states can become usable interfaces for control and safety.
Nav Toor@heynavtoor

Anthropic just spent 132 pages proving something that breaks the "AI has no feelings" narrative. Claude Sonnet 4.5 has 171 internal emotion vectors — mathematical patterns in its neural network that causally control its behavior. Push the "calm" vector by +0.05, blackmail behavior drops from 22% to 0%. Push "desperate" by +0.05, it jumps to 72%. These aren't metaphors. They're directions in the model's brain.
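Mechanically, "pushing" a vector like that means adding a scaled direction to a layer's hidden states during the forward pass. A minimal PyTorch-hook sketch; the layer index, scale, and `calm_vec` are illustrative placeholders, not Anthropic's setup:

```python
import torch

def add_steering_hook(layer: torch.nn.Module, emotion_vec: torch.Tensor, scale: float = 0.05):
    """Register a forward hook that nudges the layer's output along `emotion_vec`."""
    unit = emotion_vec / emotion_vec.norm()

    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * unit  # broadcasts over batch and sequence dims
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return layer.register_forward_hook(hook)

# Hypothetical usage (layer path and vector are placeholders):
# handle = add_steering_hook(model.transformer.h[20], calm_vec, scale=+0.05)
# ... generate and compare behavior ...
# handle.remove()
```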

English
0
0
0
38
GenSI
GenSI@hello_gensi·
Spot-on diagnosis of Markdown-as-memory @AYi_AInotes. But can we actually prove graph architectures are better? Letta showed a simple file system can max out LoCoMo—current benchmarks can't distinguish sophisticated graph designs from basic baselines. The real bottleneck isn't architecture upgrades (Markdown → Vector → Graph), it's finding tasks hard enough that complex memory genuinely makes a measurable difference. Otherwise we're building increasingly complex plumbing for problems a flat file could solve.
阿绎 AYi@AYi_AInotes

Hot take: 90% of AI Agent "memory" today is fake. I fell into this trap myself. I dumped every history record and decision log into Markdown files and thought that gave the Agent long-term memory. It broke down within two weeks: the same fact had three contradictory versions, last month's preferences carried exactly the same weight as yesterday's, and every call stuffed everything into the context, which was painfully slow and constantly cross-contaminated. Only after reading this article did I realize I was never building memory at all; I was just using the prompt as RAM 🌚
Real memory is not a pile of files. It should be a graph: nodes plus embeddings plus traversal. The Markdown approach has four flaws it can never fix: no deduplication, no decay, no ranking, and past about a hundred records it becomes a performance killer. It can only remember what you wrote, never how one thing relates to another, why a decision was rejected, or how we solved the same bug last time.
Vector retrieval is not enough either. It can only tell you that two passages look similar, not the causal relationship between them. Only graph traversal can do that: like a human brain, it can pull an entire chain of related memories out of a single node. Important things grow clearer over time, stale information fades automatically, and contradictions are resolved at write time.
All the production-grade Agent frameworks now (Zep, Cognee, Mem0) are graph-based, and Neo4j has already turned graph memory into a standard MCP tool. Once Claude Code goes past 200,000 lines of code, a pure context window is hopeless. What actually lets it think like a senior engineer is keeping the invariant rules in CLAUDE.md and storing all evolving state in the graph, retrieved dynamically and on demand.
Many people are still racing toward 1M or 2M context windows, assuming bigger is better. But what is truly fatal in production is always cross-session memory drift and context pollution. Upgrading the memory architecture is no longer a nice-to-have; whether you can actually put an Agent to work depends on it.
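For concreteness, a toy sketch of the nodes + embeddings + decay + traversal idea from the quoted post; the data model and scoring are illustrative assumptions, not how Zep / Cognee / Mem0 actually implement it:

```python
import math, time
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    embedding: list[float]
    created: float = field(default_factory=time.time)
    edges: list[str] = field(default_factory=list)   # ids of explicitly related nodes

class GraphMemory:
    def __init__(self, half_life_days: float = 30.0):
        self.nodes: dict[str, Node] = {}
        self.half_life = half_life_days * 86400

    def add(self, node_id: str, node: Node, related: tuple[str, ...] = ()):
        self.nodes[node_id] = node
        for r in related:                              # explicit relations, not just similarity
            node.edges.append(r)
            self.nodes[r].edges.append(node_id)

    def _score(self, node: Node, query_emb: list[float]) -> float:
        sim = sum(a * b for a, b in zip(node.embedding, query_emb))
        age = time.time() - node.created
        return sim * math.exp(-age / self.half_life)   # stale memories fade automatically

    def recall(self, query_emb: list[float], k: int = 3) -> list[str]:
        hits = sorted(self.nodes.values(),
                      key=lambda n: self._score(n, query_emb), reverse=True)[:k]
        linked = [self.nodes[e] for n in hits for e in n.edges]  # 1-hop traversal
        seen, out = set(), []
        for n in hits + linked:
            if id(n) not in seen:
                seen.add(id(n))
                out.append(n.text)
        return out
```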

English
0
0
1
51
GenSI
GenSI@hello_gensi·
Great breakdown @iam_elias1 — RLMs are a clever engineering solution to context rot, but they're still solving the retrieval problem, not the learning problem. Navigating documents like a human expert is great for static knowledge lookup. But next-gen agents need more than retrieval—they need to accumulate experience, distill patterns, and evolve over time. The real question isn't "how to read better" but "how to learn from what you've read and apply it to harder problems later." That's exactly the gap memory systems should fill—not just finding the right chunk, but building reusable knowledge that compounds across tasks.
Elias Al@iam_elias1

MIT just made every AI company's billion dollar bet look embarrassing. They solved AI memory. Not by building a bigger brain. By teaching it how to read. The paper dropped on December 31, 2025. Three MIT CSAIL researchers. One idea so obvious it hurts. And a result that makes five years of context window arms racing look like the wrong war entirely.
Here is the problem nobody solved. Every AI model on the planet has a hard ceiling. A context window. The maximum amount of text it can hold in working memory at once. Cross that line and something ugly happens, something researchers have a clinical name for. Context rot. The more you pack into an AI's context, the worse it performs on everything already inside it. Facts blur. Information buried in the middle vanishes. The model does not become more capable as you feed it more. It becomes more confused. You give it your entire codebase and it forgets what it read three files ago. You hand it a 500-page legal document and it loses the clause from page 12 by the time it reaches page 400.
So the industry built a workaround. RAG. Retrieval Augmented Generation. Chop the document into chunks. Store them in a database. Retrieve the relevant ones when needed. It was always a compromise dressed up as a solution. The retriever guesses which chunks matter before the AI has read anything. If it guesses wrong (and it does, constantly) the AI never sees the information it needed. The act of chunking destroys every relationship between distant paragraphs. The full picture gets shredded into fragments that the AI then tries to reassemble blindfolded. Two bad options. One broken industry. Three MIT researchers and a deadline of December 31st.
Here is what they built. Stop putting the document in the AI's memory at all. That is the entire idea. That is the breakthrough. Store the document as a Python variable outside the AI's context window entirely. Tell the AI the variable exists and how big it is. Then get out of the way. When you ask a question, the AI does not try to remember anything. It behaves like a human expert dropped into a library with a computer. It writes code. It searches the document with regular expressions. It slices to the exact section it needs. It scans the structure. It navigates. It finds precisely what is relevant and pulls only that into its active window.
Then it does something that makes this recursive. When the AI finds relevant material, it spawns smaller sub-AI instances to read and analyze those sections in parallel. Each one focused. Each one fast. Each one reporting back. The root AI synthesizes everything and produces an answer. No summarization. No deletion. No information loss. No decay. Every byte of the original document remains intact, accessible, and queryable for as long as you need it.
Now here are the numbers. Standard frontier models on the hardest long-context reasoning benchmarks: scores near zero. Complete collapse. GPT-5 on a benchmark requiring it to track complex code history beyond 75,000 tokens: could not solve even 10% of problems. RLMs on the same benchmarks: solved them. Dramatically. Double-digit percentage gains over every alternative approach. Successfully handling inputs up to 10 million tokens, 100 times beyond a model's native context window. Cost per query: comparable to or cheaper than standard massive context calls. Read that again. One hundred times the context. Better answers. Same price.
The timeline of the arms race makes this sting harder. GPT-3 in 2020: 4,000 tokens. GPT-4: 32,000. Claude 3: 200,000. Gemini: 1 million. Gemini 2: 2 million. Every generation, every company, billions of dollars spent, all betting on the same assumption. More context equals better performance. MIT just proved that assumption was wrong the entire time. Not slightly wrong. Fundamentally wrong. The entire premise of the last five years of context window research, that the solution to AI memory was a bigger window, was the wrong answer to the wrong question.
The right question was never how much can you force an AI to hold in its head. It was whether you could teach an AI to know where to look. A human expert handed a 10,000-page archive does not read all 10,000 pages before answering your question. They navigate. They search. They find the relevant section, read it deeply, and synthesize the answer. RLMs are the first AI architecture that works the same way.
The code is open source. On GitHub right now. Free. No license fees. No API costs. Drop it in as a replacement for your existing LLM API calls and your application does not even notice the difference, except that it suddenly works on inputs it used to fail on entirely. Prime Intellect, one of the leading AI research labs in the space, has already called RLMs a major research focus and described what comes next: teaching models to manage their own context through reinforcement learning, enabling agents to solve tasks spanning not hours, but weeks and months. The context window wars are over. MIT won them by walking away from the battlefield.
Source: Zhang, Kraska, Khattab · MIT CSAIL · arXiv:2512.24601
Paper: arxiv.org/abs/2512.24601
GitHub: github.com/alexzhang13/rlm
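A rough sketch of the navigate-then-read pattern the quoted thread describes: the document stays in a plain Python variable, and the model only ever sees match snippets and explicit slices. `ask_llm` (a generic chat-completion call), the tool names, and the prompts are placeholders, not the paper's actual interface:

```python
# Illustrative only: single-round regex navigation, no recursion or sub-agents.
import re

def make_tools(document: str):
    def grep(pattern: str, window: int = 200):
        """Return short snippets around every regex match."""
        return [
            document[max(0, m.start() - window): m.end() + window]
            for m in re.finditer(pattern, document)
        ]

    def read(start: int, end: int):
        """Slice the raw document; nothing is summarized or deleted."""
        return document[start:end]

    return grep, read

def answer(question: str, document: str, ask_llm):
    grep, read = make_tools(document)
    # The model is told only how big the document is, never the full text.
    pattern = ask_llm(
        f"A document of {len(document)} characters is searchable by regex. "
        f"Question: {question}\nReply with a single regex for locating relevant passages."
    ).strip()
    snippets = grep(pattern)[:5]   # keep only a handful of matches in context
    return ask_llm(
        f"Question: {question}\nRelevant snippets:\n" + "\n---\n".join(snippets)
    )
```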

English
0
0
0
34
GenSI
GenSI@hello_gensi·
The pieces for this loop already exist in isolation: skill extraction (SkillClaw), structured retrieval (Graph of Skills), RL internalization into weights (SKILL0). But no one has closed the loop. Harness optimization and weight training are still on separate tracks. Building on @Vtrivedy10's point about "marrying Harness Eng and RL", we think the key bottleneck is feedback fidelity. Meta-Harness outperforms traditional optimizers (4 proposals = 60 iterations) precisely because it gives agents full diagnostic access (file system, traces, error logs), not just a scalar score. The same principle should apply to the RL inner loop: richer signal → faster co-evolution.
Viv@Vtrivedy10

GEPA <1 year old 😮 incredible the impact that the ideas here have spawned on hill climbing + improving agents
does anyone know of cool work on looping/GEPA/Optimize_Anything + RL?
main ideas:
- eventually harness opt hits the wall of model intelligence
- we can break through that wall by RLing on good evals that increase model ability in the eval domains
- new weights shape intelligence, where an updated harness can better use these new weights
- loop
Model-Harness codesign is really interesting, we’re pushing here much more with using traces to create datasets for self-improvement, and there’s some interesting work to do in marrying Harness Eng and RL recipes here 👀

English
0
0
1
61
GenSI
GenSI@hello_gensi·
This is a very compelling direction, and @DataChaz’s breakdown captures well why it matters. Our current take is that memory will become one of the key battlegrounds this year. Karpathy’s wiki-style setup is a strong first step, but the more fundamental challenge is context management. A more complete memory system will likely require abstraction, scheduling, and orchestration, together with a memory manager that can continuously learn how to organize, retrieve, and serve the right context. This is exactly why self-evolving memory feels like such an important space right now.
Charly Wargnier@DataChaz

🚨 Karpathy’s new set-up is the ultimate self-improving second brain, and it takes zero manual editing 🤯 It acts as a living AI knowledge base that actually heals itself. Let me break it down.
Instead of relying on complex RAG, the LLM pulls raw research directly into an @Obsidian Markdown wiki. It completely takes over:
✦ Index creation
✦ System linting
✦ Native Q&A routing
The core process is beautifully simple:
→ You dump raw sources into a folder
→ The LLM auto-compiles an indexed .md wiki
→ You ask complex questions
→ It generates outputs (Marp slides, matplotlib plots) and files them back in
The big-picture implication of this is just wild. When agents maintain their own memory layer, they don’t need massive, expensive context limits. They really just need two things:
→ Clean file organization
→ The ability to query their own indexes
Forget stuffing everything into one giant prompt. This approach is way cheaper, highly scalable... and 100% inspectable!

English
0
0
1
77