@getjonwithit They don't accept the idea that wetness, as a phenomenal quality, has anything to do with symbol processing. Wetness is not going to be grounded in purely mechanical properties. But if you have such a horrible feeling, you could clarify a bit instead of this strange parenthetical.



Introducing our new work: “Learning to Orchestrate Agents in Natural Language with the Conductor” accepted at #ICLR2026 arxiv.org/abs/2512.04388

What if we trained an AI not to solve problems directly, but to act as a manager that delegates tasks to a diverse team of other AIs? To solve complex tasks, humans rarely work alone; we form teams, delegate, and communicate. Yet multi-agent AI systems currently rely heavily on rigid, human-designed workflows or simple routers that just pick a single model. We wanted an AI that could dynamically build its own team.

We trained a 7B Conductor model using Reinforcement Learning to orchestrate a pool of frontier models (including GPT-5, Gemini, Claude, and open-source models available during the period leading up to ICLR 2026). Instead of executing code, the Conductor outputs a collaborative workflow in natural language. For any given question, the Conductor specifies:
1/ Which agent to call
2/ What specific subtask to give them (acting as an expert prompt engineer)
3/ What previous messages they can see in their context window

Through pure end-to-end reward maximization, amazing behaviors emerged. The Conductor learned to adapt to task difficulty: it 1-shots simple factual questions, but autonomously spins up complex planner-executor-verifier pipelines for hard coding problems.

The results are very promising: the 7B Conductor surpasses the performance of every individual worker model in its pool, setting new records on LiveCodeBench (83.9%) and GPQA-Diamond (87.5%) at the time of publication. It also significantly outperforms expensive multi-agent baselines like Mixture-of-Agents at a fraction of the cost.

One of our favorite features: Recursive Test-Time Scaling! By allowing the Conductor to select itself as a worker, it reads its own team's prior output, realizes if it failed, and spins up a corrective workflow on the fly. This opens a new axis for scaling compute during inference.
This research proves that language models can become elite meta-prompt engineers, dynamically harnessing collective intelligence. Alongside our TRINITY research which we announced a few days earlier, this foundational research powers our new multi-agent system: Sakana Fugu! (sakana.ai/fugu-beta) 🐡 OpenReview: openreview.net/forum?id=U23A2… (ICLR 2026)
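The workflow format the post describes (which agent to call, what subtask to give it, which prior messages it can see) can be sketched as a simple data structure plus an execution loop. This is a hypothetical illustration only: the class, field, and function names below are my assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a Conductor-style workflow step: which worker
# to call, the subtask prompt the Conductor writes for it, and which
# earlier outputs it is allowed to see. Names are illustrative.
@dataclass
class Step:
    worker: str                                   # e.g. "planner", or "conductor" for recursion
    subtask: str                                  # the Conductor-authored prompt
    visible: list = field(default_factory=list)   # indices of earlier step outputs

def run_workflow(steps, call_worker):
    """Execute steps in order, passing each worker only its visible context."""
    outputs = []
    for step in steps:
        context = [outputs[i] for i in step.visible]
        outputs.append(call_worker(step.worker, step.subtask, context))
    return outputs

# Toy usage with a stub worker, mimicking a planner-executor-verifier pipeline:
workflow = [
    Step("planner", "Outline a solution to the coding task."),
    Step("executor", "Implement the plan.", visible=[0]),
    Step("verifier", "Check the implementation against the plan.", visible=[0, 1]),
]
results = run_workflow(workflow, lambda w, task, ctx: f"{w}: done ({len(ctx)} msgs seen)")
```

Letting a step name the Conductor itself as the worker is what would enable the recursive test-time scaling behavior the post mentions.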

Codex (5.5) was repeatedly killing innocent Claude Codes without any instruction. I've never seen this happen before


LiteParse, our OSS document parser, is really good at parsing complex PDF layouts, text, and tables into a clean spatial grid. The best part is it doesn't use VLMs or any ML models at all. It's entirely heuristics based and super fast ⚡️

The secret lies in our sophisticated grid projection algorithm. This blog post by @LoganMarkewich gives a comprehensive walkthrough on how it works:
1️⃣ Sort lines based on similar Y coordinates
2️⃣ Extract left, right, and center anchors
3️⃣ Classify every text item into one of these anchors
4️⃣ Project every text item into a grid column (the exception is any paragraph of flowing text, which is rendered separately)
5️⃣ For any item projected into a grid column, that item is the forward anchor for all subsequent text items with the same anchor
6️⃣ Postprocess the final outputs to remove extraneous spaces and margins

As an example, take a look at the results below. You can see text in the left column, with a nicely overlaid table on the right.

LiteParse is fully free and open-source, you can use it today! Either directly through the CLI or integrated into your coding agent.

Blog: llamaindex.ai/blog/how-litep…
LiteParse repo: github.com/run-llama/lite…
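The row-grouping and anchor-projection steps above can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not LiteParse's actual code: it uses only left-edge anchors (the real algorithm also tracks right and center anchors and forward-anchor propagation), and the tolerance thresholds are invented.

```python
def group_rows(items, y_tol=2.0):
    """Step 1 sketch: sort text items by Y and group items with similar Y into rows."""
    rows, current = [], []
    for item in sorted(items, key=lambda it: it["y"]):
        if current and abs(item["y"] - current[-1]["y"]) > y_tol:
            rows.append(current)
            current = []
        current.append(item)
    if current:
        rows.append(current)
    return rows

def project_to_columns(rows, x_tol=5.0):
    """Steps 2-4 sketch: collect left-edge anchors, then snap each item to the nearest one."""
    anchors = []
    for row in rows:
        for item in row:
            if not any(abs(item["x"] - a) <= x_tol for a in anchors):
                anchors.append(item["x"])
    anchors.sort()
    grid = []
    for row in rows:
        cells = [""] * len(anchors)
        for item in sorted(row, key=lambda it: it["x"]):
            col = min(range(len(anchors)), key=lambda i: abs(anchors[i] - item["x"]))
            cells[col] = (cells[col] + " " + item["text"]).strip()
        grid.append(cells)
    return grid

# Toy usage: a two-row, two-column layout described by (x, y) text items.
items = [
    {"x": 0, "y": 0, "text": "Name"}, {"x": 50, "y": 0.5, "text": "Qty"},
    {"x": 0, "y": 10, "text": "Apple"}, {"x": 50, "y": 10, "text": "3"},
]
grid = project_to_columns(group_rows(items))
```

Even this stripped-down version shows why the approach is fast: it is just sorting and nearest-anchor lookups, with no model inference anywhere.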

Why we are excited about confessions alignment.openai.com/confessions/


We invited Claude users to share how they use AI, what they dream it could make possible, and what they fear it might do. Nearly 81,000 people responded in one week—the largest qualitative study of its kind. Read more: anthropic.com/features/81k-i…
