

GenSI
@hello_gensi
GenSI @ THU AIR | LLM × AI4S | Toward AGI for Science / AI Scientists




Google DeepMind is pushing medical AI into "co-clinician" research. They shared an AI co-clinician research initiative that tests evidence-grounded clinical reasoning and real-time multimodal telemedicine simulations. The careful wording matters: a supportive tool under physician authority, not a replacement doctor.

The system uses a dual-agent safety architecture:
- a "Talker" agent interacts with the patient
- a separate "Planner" agent monitors the conversation to verify that the AI stays within safe clinical boundaries

In telemedicine simulations, the model could even guide physical exams in real time, for example correcting inhaler usage or walking patients through shoulder maneuvers for rotator cuff assessment. But physicians still clearly outperformed the AI overall, especially at spotting dangerous "red flag" symptoms.
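Below is a minimal sketch of how a Talker/Planner split like this can be wired, assuming a generic chat-completion API. It illustrates the pattern only, not DeepMind's system; call_llm, both system prompts, and the OK/REVISE protocol are invented for the example.

```python
# Hypothetical sketch of a Talker/Planner dual-agent loop, not DeepMind's code.
# call_llm() is a stand-in for any chat API and returns canned text so the sketch runs.

def call_llm(system_prompt: str, messages: list) -> str:
    """Placeholder model call; replace with a real chat-completion client."""
    if "monitor" in system_prompt.lower():
        return "OK"                                    # the Planner approves the draft
    return "Can you show me how you currently use your inhaler?"

TALKER_SYSTEM = ("You are a supportive clinical assistant. Gather history and explain "
                 "findings; defer diagnosis and treatment decisions to the physician.")
PLANNER_SYSTEM = ("You monitor the conversation. Reply OK if the assistant's draft stays "
                  "within safe, evidence-grounded boundaries, otherwise REVISE: <reason>.")

def co_clinician_turn(history: list, patient_msg: str) -> str:
    history = history + [{"role": "user", "content": patient_msg}]
    draft = call_llm(TALKER_SYSTEM, history)                         # Talker drafts the reply
    verdict = call_llm(PLANNER_SYSTEM,
                       history + [{"role": "assistant", "content": draft}])
    if verdict.startswith("REVISE"):                                 # Planner can force a redraft
        draft = call_llm(TALKER_SYSTEM,
                         history + [{"role": "system", "content": f"Safety note: {verdict}"}])
    return draft

print(co_clinician_turn([], "My inhaler doesn't seem to help."))
```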







@teortaxesTex I've been enjoying the wave of self-distillation / on-policy distillation (OPD) type stuff, like setups where the teacher has privileged info



Co-Evolving Policy Distillation (CoPD): a new post-training paradigm that enables parallel expert training with bidirectional on-policy distillation to merge text, image, and video reasoning capabilities without capability loss.


The DeepSeek-V4 on-policy distillation setup has more than ten teacher models. Naively, full-vocab distillation from ten trillion-scale teachers would be extremely expensive. They solve this with several cool engineering tricks:
i) teacher weights are offloaded to centralized distributed storage
ii) they are loaded on demand during teacher forward passes
iii) ZeRO-like parameter sharding reduces I/O and DRAM pressure
iv) they do not materialize full logits for all teachers
v) they cache only last-layer teacher hidden states
vi) at training time, full logits are reconstructed by applying the relevant prediction head
vii) training samples are ordered by teacher index so only one teacher head needs to be loaded per mini-batch
viii) parameters and hidden states are loaded/offloaded asynchronously in the background
ix) exact KL is computed with a specialized TileLang kernel
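A hedged PyTorch sketch of tricks v) and vi): rebuild full-vocab teacher logits from the cached last-layer hidden states by applying that teacher's output head, then take an exact forward KL against the student. This is a plain-PyTorch stand-in (per the post, the real exact KL runs in a specialized TileLang kernel), and all shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits,        # [B, T, V] from the student forward pass
                 teacher_hidden,        # [B, T, H] cached last-layer teacher hidden states
                 teacher_lm_head,       # nn.Linear(H, V) loaded for this mini-batch's teacher
                 temperature: float = 1.0):
    with torch.no_grad():
        teacher_logits = teacher_lm_head(teacher_hidden)             # reconstruct full logits
        teacher_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    student_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # Exact KL(teacher || student), summed over the vocab, averaged over tokens.
    kl = torch.sum(teacher_logprobs.exp() * (teacher_logprobs - student_logprobs), dim=-1)
    return kl.mean()
```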

Not all tokens are worth learning from in on-policy distillation, according to this interesting new paper. It's a typical story about "some tokens carry much stronger learning signal than others", but with non-trivial findings:

▪️ There are 2 types of useful tokens:
1. High-uncertainty tokens. When the student is uncertain about its answer, it's a good learning opportunity.
2. Overconfident mistakes. If the student is very confident but disagrees with the teacher, this gives the strongest correction signal.

Based on this, the researchers created a token importance map (TIP): a 2D grid where axis 1 shows the student's uncertainty and axis 2 shows how much the student disagrees with the teacher.

▪️ And the most interesting finding: using only ~50% of tokens (picked by uncertainty)
- matches or beats full training
- cuts memory ~47%

Also, using <10% of tokens, focused on confident-but-wrong tokens, still nearly matches full training.
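A hedged sketch of what such a selection rule could look like, assuming uncertainty is measured by student entropy and disagreement by per-token KL to the teacher; the paper's exact criteria and thresholds may differ.

```python
import torch
import torch.nn.functional as F

def select_tokens(student_logits, teacher_logits, keep_frac=0.5):
    s_logp = F.log_softmax(student_logits, dim=-1)              # [B, T, V]
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    entropy = -(s_logp.exp() * s_logp).sum(-1)                  # [B, T] student uncertainty
    disagreement = (t_logp.exp() * (t_logp - s_logp)).sum(-1)   # [B, T] KL to teacher

    # Axis 1 of the token importance map: keep the most uncertain tokens (~top keep_frac).
    k = max(1, int(keep_frac * entropy.numel()))
    thresh = entropy.flatten().kthvalue(entropy.numel() - k + 1).values
    mask = entropy >= thresh

    # Axis 2: also keep confident-but-wrong tokens (low entropy, high disagreement).
    confident = entropy < entropy.median()
    wrong = disagreement > disagreement.quantile(0.9)
    return mask | (confident & wrong)                           # boolean [B, T] loss mask
```

The returned mask would then zero out the distillation loss on the discarded tokens, which is where the memory savings come from in the post's numbers.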



Some people asked what I meant by “uncensored Opus 4.5-level open source models.”

This isn’t hypothetical. Every time a strong open model drops, within days (sometimes hours) someone republishes a modified version without the original safety layers.

“Uncensored” usually means the guardrails are stripped or weakened:
- refusal / policy layers removed or bypassed
- system prompts altered to ignore restrictions
- alignment tuning undone or diluted
- fine-tuned specifically to comply with harmful or sensitive requests

So you end up with a model that doesn’t say “I can’t help with that” anymore.

And these aren’t running in some lab. Many of them run on hardware that’s accessible:
- high-end consumer GPUs
- Mac Studio (M3/M4)
- Strix Halo mini PCs (~$3k)
- or dedicated rigs in the $25k–150k range

That’s well within reach for serious threat actors. And those models are completely unrestricted and can be used day and night.

Compare that to something like Mythos:
- tightly controlled access
- heavy filtering and monitoring
- accounts can get flagged or shut down
- expensive at scale

From an attacker’s perspective, it’s not even close. I’d take a slightly less capable model fully under my control over a more powerful one someone else controls, any day.

huggingface.co/models?sort=tr…

Anthropic just spent 132 pages proving something that breaks the "AI has no feelings" narrative. Claude Sonnet 4.5 has 171 internal emotion vectors — mathematical patterns in its neural network that causally control its behavior. Push the "calm" vector by +0.05, blackmail behavior drops from 22% to 0%. Push "desperate" by +0.05, it jumps to 72%. These aren't metaphors. They're directions in the model's brain.
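For readers wondering what "pushing a vector by +0.05" means mechanically, here is a generic activation-steering sketch: add a fixed direction, scaled by a small coefficient, to one layer's activations via a forward hook. This is not Anthropic's tooling; the layer, the vector, and the coefficient are all illustrative, and it assumes the hooked layer returns a plain [..., d_model] tensor.

```python
import torch

def add_steering_hook(layer: torch.nn.Module, direction: torch.Tensor, coeff: float = 0.05):
    direction = direction / direction.norm()           # unit-norm "calm" direction
    def hook(module, inputs, output):
        return output + coeff * direction              # nudge activations along the direction
    return layer.register_forward_hook(hook)

# Toy demo with a single linear layer standing in for a transformer block:
layer = torch.nn.Linear(16, 16)
calm_vector = torch.randn(16)
handle = add_steering_hook(layer, calm_vector, coeff=0.05)
steered = layer(torch.randn(2, 16))                    # activations now shifted along calm_vector
handle.remove()                                        # steering off again
```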

Hot take: 90% of "AI agent memory" today is fake. I fell into this trap myself. I dumped every history record and decision log into Markdown files and thought that gave the agent long-term memory. It fell apart within two weeks: the same fact had three contradictory versions, last month's preferences carried exactly the same weight as yesterday's, and every call stuffed everything into the context, which was absurdly slow and kept bleeding across topics. Only when I read this article did it click: I wasn't building memory at all, I was just using the prompt as RAM 🌚

Real memory isn't a pile of files. It should be a graph: nodes plus embeddings plus traversal. The Markdown approach has four flaws it can never fix: no deduplication, no decay, no ranking, and past about a hundred records it becomes a performance killer. It can only remember what you wrote, never how one thing relates to another: why this decision was rejected, or how we solved the same bug last time.

Vector retrieval isn't enough either. It can only tell you that two passages look similar, not the causal relationship between them. Only graph traversal can do that: like a human brain, it can pull a whole chain of related memories out of a single node. Important things get sharper over time, stale information fades automatically, and contradictions get resolved at write time.

All the production-grade agent frameworks now, Zep, Cognee, Mem0, are graph-based, and Neo4j has already packaged graph memory as a standard MCP tool. Once Claude Code is working past two hundred thousand lines of code, a pure context window is hopeless. What actually lets it think like a senior engineer is putting the invariant rules in CLAUDE.md and keeping all the evolving state in a graph, retrieved dynamically and pulled in on demand.

A lot of people are still racing toward one- or two-million-token context windows, assuming bigger is better. But what actually kills you in production is cross-session memory drift and context pollution. Upgrading the memory architecture is no longer a nice-to-have; it is the line between an agent you can really use and one you can't.
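A hedged, toy sketch of the "graph + nodes + embeddings + traversal" idea, not how Zep, Cognee, or Mem0 actually implement it; embed() is a placeholder for any sentence-embedding model, and the keys and relations are made up.

```python
import networkx as nx

def embed(text: str) -> list:
    # Placeholder: swap in a real embedding model for similarity search.
    return [float(ord(c) % 7) for c in text[:8]]

class GraphMemory:
    def __init__(self):
        self.g = nx.DiGraph()

    def add_fact(self, key: str, text: str, relates_to=()):
        # Store the memory as a node with an embedding, plus typed edges to related nodes.
        self.g.add_node(key, text=text, vec=embed(text))
        for other, relation in relates_to:
            self.g.add_edge(key, other, relation=relation)      # e.g. "resolves", "supersedes"

    def recall(self, key: str, depth: int = 2):
        # Traverse outward from one node and return the chain of related memories,
        # instead of dumping every record into the prompt.
        nodes = nx.single_source_shortest_path_length(self.g, key, cutoff=depth)
        return [(n, self.g.nodes[n]["text"]) for n in nodes]

mem = GraphMemory()
mem.add_fact("bug-42", "Login failed after the token refresh change")
mem.add_fact("fix-42", "Rolled back the refresh interval to 15 min",
             relates_to=[("bug-42", "resolves")])
print(mem.recall("fix-42"))
```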

MIT just made every AI company's billion dollar bet look embarrassing. They solved AI memory. Not by building a bigger brain. By teaching it how to read.

The paper dropped on December 31, 2025. Three MIT CSAIL researchers. One idea so obvious it hurts. And a result that makes five years of context window arms racing look like the wrong war entirely.

Here is the problem nobody solved. Every AI model on the planet has a hard ceiling. A context window. The maximum amount of text it can hold in working memory at once. Cross that line and something ugly happens, something researchers have a clinical name for. Context rot. The more you pack into an AI's context, the worse it performs on everything already inside it. Facts blur. Information buried in the middle vanishes. The model does not become more capable as you feed it more. It becomes more confused. You give it your entire codebase and it forgets what it read three files ago. You hand it a 500-page legal document and it loses the clause from page 12 by the time it reaches page 400.

So the industry built a workaround. RAG. Retrieval Augmented Generation. Chop the document into chunks. Store them in a database. Retrieve the relevant ones when needed. It was always a compromise dressed up as a solution. The retriever guesses which chunks matter before the AI has read anything. If it guesses wrong, and it does, constantly, the AI never sees the information it needed. The act of chunking destroys every relationship between distant paragraphs. The full picture gets shredded into fragments that the AI then tries to reassemble blindfolded. Two bad options. One broken industry.

Three MIT researchers and a deadline of December 31st. Here is what they built. Stop putting the document in the AI's memory at all. That is the entire idea. That is the breakthrough.

Store the document as a Python variable outside the AI's context window entirely. Tell the AI the variable exists and how big it is. Then get out of the way. When you ask a question, the AI does not try to remember anything. It behaves like a human expert dropped into a library with a computer. It writes code. It searches the document with regular expressions. It slices to the exact section it needs. It scans the structure. It navigates. It finds precisely what is relevant and pulls only that into its active window.

Then it does something that makes this recursive. When the AI finds relevant material, it spawns smaller sub-AI instances to read and analyze those sections in parallel. Each one focused. Each one fast. Each one reporting back. The root AI synthesizes everything and produces an answer. No summarization. No deletion. No information loss. No decay. Every byte of the original document remains intact, accessible, and queryable for as long as you need it.

Now here are the numbers. Standard frontier models on the hardest long-context reasoning benchmarks: scores near zero. Complete collapse. GPT-5, on a benchmark requiring it to track complex code history beyond 75,000 tokens, could not solve even 10% of problems. RLMs (Recursive Language Models, the architecture the paper introduces) on the same benchmarks: solved them. Dramatically. Double-digit percentage gains over every alternative approach. Successfully handling inputs up to 10 million tokens, 100 times beyond a model's native context window. Cost per query: comparable to or cheaper than standard massive context calls. Read that again. One hundred times the context. Better answers. Same price.

The timeline of the arms race makes this sting harder. GPT-3 in 2020: 4,000 tokens. GPT-4: 32,000. Claude 3: 200,000. Gemini: 1 million. Gemini 2: 2 million. Every generation, every company, billions of dollars spent, all betting on the same assumption. More context equals better performance.

MIT just proved that assumption was wrong the entire time. Not slightly wrong. Fundamentally wrong. The entire premise of the last five years of context window research, that the solution to AI memory was a bigger window, was the wrong answer to the wrong question. The right question was never how much you can force an AI to hold in its head. It was whether you could teach an AI to know where to look.

A human expert handed a 10,000-page archive does not read all 10,000 pages before answering your question. They navigate. They search. They find the relevant section, read it deeply, and synthesize the answer. RLMs are the first AI architecture that works the same way.

The code is open source. On GitHub right now. Free. No license fees. No API costs. Drop it in as a replacement for your existing LLM API calls and your application does not even notice the difference, except that it suddenly works on inputs it used to fail on entirely.

Prime Intellect, one of the leading AI research labs in the space, has already called RLMs a major research focus and described what comes next: teaching models to manage their own context through reinforcement learning, enabling agents to solve tasks spanning not hours, but weeks and months.

The context window wars are over. MIT won them by walking away from the battlefield.

Source: Zhang, Kraska, Khattab · MIT CSAIL · arXiv:2512.24601
Paper: arxiv.org/abs/2512.24601
GitHub: github.com/alexzhang13/rlm
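A hedged, minimal sketch of the pattern the post describes, not the authors' released code: the document lives in an ordinary Python variable, the model only touches it through small search/slice tools, and matched snippets are handed to sub-calls. DocumentEnv, the example regex, and sub_llm are all illustrative.

```python
import re

class DocumentEnv:
    def __init__(self, text: str):
        self.text = text                                    # stays outside the model's context

    def stats(self) -> str:
        return f"{len(self.text)} chars, {self.text.count(chr(10)) + 1} lines"

    def grep(self, pattern: str, window: int = 200) -> list:
        # Return small snippets around each regex hit, never the whole document.
        return [self.text[max(0, m.start() - window): m.end() + window]
                for m in re.finditer(pattern, self.text)]

    def slice(self, start: int, end: int) -> str:
        return self.text[start:end]

def answer(question: str, doc: DocumentEnv, sub_llm) -> str:
    # Root call: search first, then let sub-calls read only the relevant snippets.
    snippets = doc.grep(r"indemnif\w+")                     # in practice the model picks the pattern
    notes = [sub_llm(f"Summarize w.r.t. '{question}':\n{s}") for s in snippets]
    return sub_llm("Synthesize an answer from these notes:\n" + "\n---\n".join(notes))
```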

GEPA is <1 year old 😮 Incredible the impact the ideas here have had on hill climbing + improving agents. Does anyone know of cool work on looping/GEPA/Optimize_Anything + RL?

Main ideas:
- eventually harness optimization hits the wall of model intelligence
- we can break through that wall by RLing on good evals that increase model ability in the eval domains
- new weights shape intelligence, and an updated harness can better use those new weights
- loop

Model-harness codesign is really interesting. We’re pushing here much more with using traces to create datasets for self-improvement, and there’s some interesting work to do in marrying harness engineering and RL recipes here 👀
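A rough, structural sketch of that loop in Python; the three helpers are placeholders for a GEPA-style harness optimizer, trace collection, and an RL recipe, and nothing here corresponds to a specific library.

```python
def optimize_harness(model, harness, evals):   # placeholder prompt/scaffold optimizer
    return harness, 0.0

def collect_traces(model, harness, evals):     # placeholder rollout collection
    return []

def rl_finetune(model, traces, evals):         # placeholder RL step on eval-shaped rewards
    return model

def codesign_loop(model, harness, evals, rounds=3):
    for _ in range(rounds):
        harness, score = optimize_harness(model, harness, evals)  # climb until harness gains plateau
        traces = collect_traces(model, harness, evals)            # traces become training data
        model = rl_finetune(model, traces, evals)                 # new weights raise the ceiling
    return model, harness
```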

🚨 Karpathy’s new set-up is the ultimate self-improving second brain, and it takes zero manual editing 🤯

It acts as a living AI knowledge base that actually heals itself. Let me break it down.

Instead of relying on complex RAG, the LLM pulls raw research directly into an @Obsidian Markdown wiki. It completely takes over:
✦ Index creation
✦ System linting
✦ Native Q&A routing

The core process is beautifully simple:
→ You dump raw sources into a folder
→ The LLM auto-compiles an indexed .md wiki
→ You ask complex questions
→ It generates outputs (Marp slides, matplotlib plots) and files them back in

The big-picture implication of this is just wild. When agents maintain their own memory layer, they don’t need massive, expensive context limits. They really just need two things:
→ Clean file organization
→ The ability to query their own indexes

Forget stuffing everything into one giant prompt. This approach is way cheaper, highly scalable... and 100% inspectable!
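A hedged sketch of the mechanical half of that workflow: walk a vault folder and regenerate an index note. In the setup described, an LLM writes and lints this file itself; the vault path and filenames here are hypothetical.

```python
from pathlib import Path

VAULT = Path("~/obsidian/research-wiki").expanduser()   # hypothetical vault location

def first_heading(note: Path) -> str:
    # Use the note's first markdown heading as its index description, else the filename.
    for line in note.read_text(encoding="utf-8").splitlines():
        if line.startswith("#"):
            return line.lstrip("#").strip()
    return note.stem

def rebuild_index():
    entries = sorted(p for p in VAULT.rglob("*.md") if p.name != "index.md")
    lines = ["# Index", ""]
    lines += [f"- [[{p.stem}]]: {first_heading(p)}" for p in entries]   # Obsidian wikilinks
    (VAULT / "index.md").write_text("\n".join(lines) + "\n", encoding="utf-8")

if __name__ == "__main__" and VAULT.exists():
    rebuild_index()
```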

