Giorgio Robino

29.4K posts

@solyarisoftware

Conversational LLM-based Applications Specialist @almawave | Former ITD-CNR Researcher | Soundscapes (Orchestral) Composer.

Genova, Italia · Joined April 2009
4.4K Following · 3.2K Followers
Pinned Tweet
Giorgio Robino @solyarisoftware
My preprint "Conversation Routines: A Prompt Engineering Framework for Task-Oriented Dialog Systems" now has a revised version on @arXiv with updated experimental results. Here’s a thread with the changes! 🧵 ➡️ Paper: arxiv.org/abs/2501.11613 1/ What’s CR?
Giorgio Robino retweeted
Qwen @Alibaba_Qwen
Today we're releasing Qwen-Scope 🔭, an open suite of sparse autoencoders for the Qwen model family. It turns SAE features into practical tools:
🎯 Inference — Steer model outputs by directly manipulating internal features, no prompt engineering needed
📂 Data — Classify & synthesize targeted data with minimal seed examples, boosting long-tail capabilities
🏋️ Training — Trace code-switching & repetitive generation back to their source, fix them at the root
📊 Evaluation — Analyze feature activation patterns to select smarter benchmarks and cut redundancy
We hope the community uses Qwen-Scope to uncover new mechanisms inside Qwen models and build applications beyond what we explored. Excited to see what you build! 🚀
🔗🔗 Blog: qwen.ai/blog?id=qwen-s…
HuggingFace: huggingface.co/collections/Qw…
ModelScope: modelscope.cn/collections/Qw…
Technical Report: …anwen-res.oss-accelerate.aliyuncs.com/qwen-scope/Qwe…
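To make "steer model outputs by directly manipulating internal features" concrete, here is a minimal sketch of SAE-style activation steering with a PyTorch forward hook. The model id, layer index, feature id, steering strength, and the stand-in decoder matrix are all illustrative assumptions, not the Qwen-Scope release or its API.

```python
# Minimal sketch of steering generation with one sparse-autoencoder (SAE) feature.
# Assumptions (NOT the Qwen-Scope API): the SAE decoder is a matrix of shape
# [n_features, d_model]; the model id, layer index, feature id and strength are
# made-up values, and a real suite would ship trained decoder weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"   # placeholder model id
LAYER, FEATURE_ID, ALPHA = 16, 1234, 8.0

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

d_model = model.config.hidden_size
W_dec = torch.randn(65536, d_model)                       # stand-in SAE decoder
direction = W_dec[FEATURE_ID] / W_dec[FEATURE_ID].norm()  # unit feature direction

def steer(module, inputs, output):
    # Decoder layers return a tuple; the residual-stream activations are element 0.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * direction.to(hidden.dtype).to(hidden.device)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

# Attach the hook to one transformer block (module path assumed for Qwen-style models).
handle = model.model.layers[LAYER].register_forward_hook(steer)
try:
    ids = tok("Tell me about the ocean.", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()   # detach so later generations run unsteered
```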
Giorgio Robino retweeted
David Hendrickson @TeksEdge
☀️ Qwen just dropped something big for personal AI.
✨ They released Qwen-Scope, the first major open Sparse Autoencoder (SAE) toolkit for real models.
💡 Instead of wrestling with prompts, you can now directly steer Qwen models by manipulating internal features.
Why this matters?
🧠 Precise, reliable control when running models locally
🛠️ Fix repetition, hallucinations & bad behaviors at the source
📊 Smarter data synthesis and evaluation
🚀 A real step toward controllable, sovereign personal agents
This is unique, as no other top lab has open-sourced practical tools for mechanistic control of open models like this (that I know of).
The future of personal AI isn't just bigger models. It's controllable ones. Qwen-Scope just took a huge leap forward. 🔥
Quoting Qwen @Alibaba_Qwen (the Qwen-Scope announcement above)
Giorgio Robino retweeted
Richard Palethorpe @jichiep
New model release! LocalVQE: Tiny ~1M param audio model that cancels echo, noise and reverberations in real-time and comes with a @ggml_org implementation out of the gate.
Giorgio Robino retweeted
Eric @Ex0byt
I cannot be the only one who noticed this. Qwen just quietly ended black-box AI today. I had to implement it myself just to show y'all how big this is. You can now literally see every concept firing in a model and turn any feature on or off. My Demo on HuggingFace: hf.co/spaces/Ex0bit/…
Giorgio Robino retweeted
antirez @antirez
Europe's AI strategy should be to specialize in AI inference and in improving large open-weight models, while we try to close the GPU / companies gap and build a viable internal path. A large Chinese open-weight model that works is simply better than a weak European-trained one.
Giorgio Robino retweeted
elvis @omarsar0
// Agentic Harness Engineering //
Pay attention to this one, AI devs. (bookmark it)
Most coding-agent harnesses are still tuned by hand or brittle trial-and-error self-evolution. This new work introduces Agentic Harness Engineering, a framework that makes harness evolution observable.
They do this through three layers: components as revertible files, experience as condensed evidence from millions of trajectory tokens, and decisions as falsifiable predictions checked against task outcomes. Each edit becomes a contract you can verify or revert.
Results: pass@1 on Terminal-Bench 2 climbs from 69.7% to 77.0% in ten iterations, beating human-designed Codex-CLI (71.9%) and self-evolving baselines like ACE and TF-GRPO. The evolved harness also transfers across model families with +5.1 to +10.1 point gains, while using 12% fewer tokens than the seed on SWE-bench-verified.
Harness work is the biggest hidden cost in most agent systems. This is the first credible recipe for letting the harness improve itself without drifting into noise.
Paper: arxiv.org/abs/2604.25850
Learn to build effective AI agents in our academy: academy.dair.ai
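The "each edit becomes a contract you can verify or revert" idea boils down to a simple loop: snapshot the harness file, state a falsifiable prediction with a threshold, run the benchmark, and revert if the prediction fails. Below is a schematic sketch of that loop; the file layout and the `run_benchmark` callable are assumptions for illustration, not the paper's implementation.

```python
# Schematic sketch of a revertible, falsifiable harness edit (not the paper's code).
# Assumptions: harness components live in plain files, and `run_benchmark()` returns
# a pass@1 score for the current harness; both are placeholders for illustration.
import shutil
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

@dataclass
class HarnessEdit:
    path: Path          # harness component stored as a revertible file
    new_text: str       # proposed content of that component
    prediction: str     # falsifiable claim, e.g. "pass@1 improves by >= 1 point"
    min_gain: float     # threshold that makes the prediction checkable

def apply_edit(edit: HarnessEdit, run_benchmark: Callable[[], float]) -> bool:
    backup = edit.path.with_name(edit.path.name + ".bak")
    shutil.copy(edit.path, backup)            # snapshot for rollback
    baseline = run_benchmark()                # evidence before the change
    edit.path.write_text(edit.new_text)       # apply the component edit
    score = run_benchmark()                   # evidence after the change
    if score - baseline >= edit.min_gain:     # prediction confirmed: keep the edit
        backup.unlink()
        return True
    shutil.copy(backup, edit.path)            # prediction falsified: revert the file
    backup.unlink()
    return False
```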
Giorgio Robino retweeted
Alex Prompter @alex_prompter
Both OpenAI and Anthropic just released official prompting guides. Both say the same thing. Your old prompts don’t work anymore. But for opposite reasons. Claude Opus 4.7 stopped guessing what you meant. It does exactly what you type. Nothing more, nothing less. Vague instructions that worked on 4.6? They now produce narrow, literal, sometimes worse results. Not because the model got dumber. Because it stopped compensating for sloppy thinking. GPT-5.5 went the other direction. OpenAI’s guide literally says: “Don’t carry over instructions from older prompt stacks.” Legacy prompts over-specify the process because older models needed hand-holding. GPT-5.5 doesn’t. That extra detail now creates noise and produces mechanical output. Claude got more literal. GPT got more autonomous. Both now punish the same thing: prompts written without clear thinking behind them. One developer on Reddit captured it perfectly after analyzing hundreds of community posts. The complaints tracked almost perfectly with prompt specificity. Precise prompts got better results on 4.7. Vague prompts got worse. The model didn’t regress. The prompts did. OpenAI’s new framework is “outcome-first prompting.” Describe what good looks like. Define success criteria. Set constraints. Then get out of the way. The model picks the path. Anthropic’s framework is the inverse: be surgically specific about what you want, because the model won’t fill in your blanks anymore. Two different architectures. Two different philosophies. One identical conclusion: the person writing the prompt is now the bottleneck, not the model. Boris Cherny, the engineer who built Claude Code, posted on launch day that even he needed a few days to adjust. That post got 936 likes. Meanwhile, Anthropic increased rate limits for all subscribers because the new tokenizer uses up to 35% more tokens on the same input. The model is more expensive to run lazily. Cheaper to run precisely. The models are converging in capability. The gap between good and bad output is no longer about which model you pick. It’s about the 2 minutes of structured thinking you do before you type anything. That thinking system is the skill. The prompt is just what it produces.
Giorgio Robino retweeted
Jerry Liu @jerryjliu0
This is really well thought out. Filesystems are the new default abstraction for agents to interact with documents (the new RAG stack in 2026). The issue is actually figuring out how to productize this; you can't "productize" Claude Code over a local file system. Seems like this tool has all the semantics of filesystems with the versioning of git
Oliver@olvrgln

Introducing Mesa: the most powerful filesystem ever built, designed specifically for enterprise AI agents.
Every team building agents eventually hits the same wall: where do the files live? Not the chat history, the actual artifacts the agent works on.
> The contracts your agent redlined
> The claim files it updated
> The 200-page audit report it edited overnight while you were asleep
Today those documents live in a sandbox that dies in 30 minutes, an S3 bucket where concurrent writes clobber each other, or a GitHub repo that was never built to absorb agent-scale traffic.
So we built Mesa. The world's first POSIX-compatible filesystem with built-in version control, designed from the ground up for agents.
You mount it into your sandbox like any other filesystem. Your agent reads and writes files normally. Behind the scenes every change is versioned, branchable, reviewable, and rollback-able — like a codebase, for any file type.
Mesa provides
– Branches so agents work in parallel without locking
– Durable storage that survives sandbox death
– Sparse materialization so massive document sets load instantly
– Fine-grained access control per agent
– Full history for human review and audit
Design partners are running Mesa in production across legal, healthcare, GTM, business ops, and coding agents.
Private beta is open: link in the comments
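Mesa's actual interface is not shown here, so the sketch below only illustrates the pattern Jerry describes: the agent does ordinary file reads and writes under a mounted path while a versioning layer turns each change into a snapshot it can branch, review, or roll back. The `VersionedMount` class is a hypothetical stand-in, not Mesa's API.

```python
# Hypothetical stand-in for a versioned agent filesystem (NOT Mesa's API): the agent
# uses ordinary file I/O under `root`, and every `commit()` snapshots the tree so a
# reviewer can inspect or roll back, roughly like git but for any file type.
import shutil, time
from pathlib import Path

class VersionedMount:
    def __init__(self, root: str):
        self.root = Path(root)
        self.history = self.root / ".versions"
        self.history.mkdir(parents=True, exist_ok=True)

    def commit(self, message: str) -> Path:
        # Copy the whole workspace (minus history) into a timestamped snapshot.
        snap = self.history / f"{int(time.time() * 1000)}-{message[:32].replace(' ', '_')}"
        shutil.copytree(self.root, snap, ignore=shutil.ignore_patterns(".versions"))
        return snap

    def rollback(self, snapshot: Path) -> None:
        # Wipe the working tree (keeping history) and restore the chosen snapshot.
        for item in self.root.iterdir():
            if item.name != ".versions":
                shutil.rmtree(item) if item.is_dir() else item.unlink()
        shutil.copytree(snapshot, self.root, dirs_exist_ok=True)

# Agent workflow: write normally, commit after each durable artifact.
fs = VersionedMount("agent_workspace")
(fs.root / "contract_redline.md").write_text("Clause 4.2: liability capped at ...")
checkpoint = fs.commit("redline clause 4.2")
(fs.root / "contract_redline.md").write_text("oops, bad edit")
fs.rollback(checkpoint)   # durable history survives a broken agent step
```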

Giorgio Robino retweeted
Jo Kristian Bergum @jobergum
Progressive disclosure and skills are a big part of agent harness engineering. Effective retrieval over skills is going to be big, might even become "web scale". "Under this paradigm, we propose Skill Retrieval Augmented Agents (SR-Agents), which dynamically retrieve and use relevant skills from large-scale skill corpora to expand their problem-solving capabilities" arxiv.org/abs/2604.24594…
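Skill retrieval, as described, is essentially RAG over skill descriptions rather than documents: embed the skill corpus once, then pull the top-k skills whose descriptions match the task and splice them into the agent's context. A minimal sketch under those assumptions follows; the embedding model and the tiny skill corpus are placeholders, not the paper's setup.

```python
# Minimal sketch of skill retrieval for an agent (illustrative, not the paper's code):
# embed skill descriptions once, then at runtime retrieve the top-k skills whose
# descriptions are closest to the task and splice them into the agent's context.
from sentence_transformers import SentenceTransformer, util

SKILLS = {
    "pdf_tables": "Extract tables from PDF reports and convert them to CSV.",
    "git_bisect": "Locate the commit that introduced a regression with git bisect.",
    "sql_migration": "Write reversible SQL schema migrations with safety checks.",
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder embedding model
names = list(SKILLS)
skill_vecs = encoder.encode(list(SKILLS.values()), convert_to_tensor=True)

def retrieve_skills(task: str, k: int = 2) -> list[str]:
    task_vec = encoder.encode(task, convert_to_tensor=True)
    scores = util.cos_sim(task_vec, skill_vecs)[0]
    top = scores.topk(min(k, len(names))).indices.tolist()
    return [f"## Skill: {names[i]}\n{SKILLS[names[i]]}" for i in top]

task = "Find which commit broke the nightly build."
prompt = "\n\n".join(retrieve_skills(task)) + f"\n\n## Task\n{task}"
print(prompt)   # only the retrieved skills enter the context window
```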
Giorgio Robino retweeted
Piotr Żelasko @PiotrZelasko
Today we released Nemotron-3-Nano-Omni-30B-A3B - our first Omni model, with speech and audio understanding capabilities powered by parakeet-tdt-0.6b-v2 encoder.
🫡 1st position on VoiceBench
🌏 English only
🎙️ 5.95% WER on Open ASR Leaderboard
📽️ Video+audio understanding
Giorgio Robino retweeted
Sumanth @Sumanth_077
Open protocol for AI agent perception!
World2Agent (W2A) standardizes how AI agents perceive the real world. Install a sensor, your agent gets structured, real-time data. Swap sensors freely - they all speak the same schema.
The problem: Every agent has its own way of watching data sources. You build custom integrations for Hacker News, market data, production alerts, weather APIs. None of it is portable. When you switch agent frameworks, you rebuild everything.
W2A fixes this with a standard protocol. Sensors watch data sources and emit structured signals. Your agent receives these signals and decides what to do.
The architecture is three layers: World (data sources) → Sensor (watches and structures) → Agent (receives and acts)
Sensors are distributed as npm packages. Need production alerts? Install sensor-prod-alerts. Need market data? Install sensor-markets. Each sensor emits signals in the same schema.
Anyone can build a sensor. A Hacker News sensor is about 50 lines - poll the HN API, structure the data into W2A signals, emit. Ship it to npm and it's installable by any agent.
It also comes with SensorHub where you can browse sensors by category (markets, news, production, weather, AI labs), view their signal schemas, and install.
Integration works through plugins for Claude Code or direct SDK for custom runtimes. Run multiple sensors simultaneously - your agent sees all signals in real time.
Why this matters: Agent perception is fragmented. Every framework reinvents the same integrations. W2A creates a standard layer. Build a sensor once, it works everywhere. Switch agent runtimes, your sensors come with you.
It's 100% Open source
Link to the GitHub repo in the replies!
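The protocol shape is easy to illustrate: a sensor polls a source, normalizes what it sees into one shared signal schema, and the agent consumes only that schema. Real W2A sensors are described as npm packages; the Python sketch below is only a schematic of the idea, and the field names are assumptions rather than the published spec.

```python
# Schematic Hacker News sensor in the spirit of W2A (field names are assumptions,
# not the published schema): poll the source, normalize into one signal shape,
# and hand structured signals to whatever agent runtime is listening.
import time
import requests

def hn_sensor(limit: int = 5):
    """Yield W2A-style signals for the current top Hacker News stories."""
    ids = requests.get(
        "https://hacker-news.firebaseio.com/v0/topstories.json", timeout=10
    ).json()[:limit]
    for story_id in ids:
        item = requests.get(
            f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json", timeout=10
        ).json()
        yield {
            "source": "hackernews",      # which sensor produced the signal
            "kind": "news.story",        # schema-level signal type
            "observed_at": time.time(),
            "payload": {"title": item.get("title"), "url": item.get("url"),
                        "score": item.get("score")},
        }

def agent_on_signal(signal: dict) -> None:
    # The agent never touches the HN API directly; it only sees the shared schema.
    print(f"[{signal['kind']}] {signal['payload']['title']}")

for sig in hn_sensor():
    agent_on_signal(sig)
```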
Giorgio Robino retweeted
hardmaru @hardmaru
For the past few years, humans have been doing “prompt engineering” to coax the best performance out of different LLMs. In this work, we explored what happens if we train an AI to do that job instead. By training a Conductor model with RL, we found that it naturally learns to write highly effective, custom instructions for a whole pool of other models. It essentially learns to ‘manage’ them in natural language. What surprised me most was how it dynamically adapts. For simple factual questions, it just queries one model. But for hard coding problems, it autonomously spins up a whole pipeline of planners, coders, and verifiers. Really excited to see where this paradigm of “AI managing AI” goes next, especially as we start moving from single-agent chain-of-thought to multi-agent “chain-of-command”. Link to our #ICLR2026 paper: arxiv.org/abs/2512.04388 Along with our TRINITY paper which we announced earlier, this work also powers our new multi-agent system: Sakana Fugu (sakana.ai/fugu-beta) 🐡
Sakana AI@SakanaAILabs

Introducing our new work: “Learning to Orchestrate Agents in Natural Language with the Conductor” accepted at #ICLR2026 arxiv.org/abs/2512.04388
What if we trained an AI not to solve problems directly, but to act as a manager that delegates tasks to a diverse team of other AIs?
To solve complex tasks, humans rarely work alone; we form teams, delegate, and communicate. Yet, multi-agent AI systems currently rely heavily on rigid, human-designed workflows or simple routers that just pick a single model. We wanted an AI that could dynamically build its own team.
We trained a 7B Conductor model using Reinforcement Learning to orchestrate a pool of frontier models (including GPT-5, Gemini, Claude, and open-source models available during the period leading up to ICLR 2026). Instead of executing code, the Conductor outputs a collaborative workflow in natural language. For any given question, the Conductor specifies:
1/ Which agent to call
2/ What specific subtask to give them (acting as an expert prompt engineer)
3/ What previous messages they can see in their context window
Through pure end-to-end reward maximization, amazing behaviors emerged. The Conductor learned to adapt to task difficulty: it 1-shots simple factual questions, but autonomously spins up complex planner-executor-verifier pipelines for hard coding problems.
The results are very promising: The 7B Conductor surpasses the performance of every individual worker model in its pool, setting new records on LiveCodeBench (83.9%) and GPQA-Diamond (87.5%) at the time of publication. It also significantly outperforms expensive multi-agent baselines like Mixture-of-Agents at a fraction of the cost.
One of our favorite features: Recursive Test-Time Scaling! By allowing the Conductor to select itself as a worker, it reads its own team's prior output, realizes if it failed, and spins up a corrective workflow on the fly. This opens a new axis for scaling compute during inference.
This research proves that language models can become elite meta-prompt engineers, dynamically harnessing collective intelligence.
Alongside our TRINITY research which we announced a few days earlier, this foundational research powers our new multi-agent system: Sakana Fugu! (sakana.ai/fugu-beta) 🐡
OpenReview: openreview.net/forum?id=U23A2… (ICLR 2026)
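Stripped of the RL training, the Conductor's per-step output reduces to the three decisions listed above: which worker to call, what subtask prompt to give it, and which earlier messages it may see. A minimal dispatch loop under those assumptions might look like the sketch below; the `call_model` stub and worker names are placeholders, not Sakana's system.

```python
# Schematic conductor/worker dispatch loop (placeholder code, not Sakana's Conductor):
# a conductor step names a worker, writes its subtask prompt, and selects which
# prior messages that worker is allowed to see in its context window.
from dataclasses import dataclass, field

@dataclass
class Step:
    worker: str                 # which agent to call (1/)
    subtask: str                # natural-language instruction for it (2/)
    visible: list[int] = field(default_factory=list)   # indices of visible messages (3/)

def call_model(worker: str, prompt: str) -> str:
    # Placeholder for a real LLM call routed to `worker`.
    return f"[{worker}] response to: {prompt[:40]}..."

def run_workflow(question: str, plan: list[Step]) -> list[str]:
    transcript: list[str] = [question]
    for step in plan:
        context = "\n".join(transcript[i] for i in step.visible)
        answer = call_model(step.worker, f"{context}\n\nSubtask: {step.subtask}")
        transcript.append(answer)
    return transcript

# A hard coding task might get a planner -> coder -> verifier pipeline,
# while a simple factual question would be a single one-step plan.
plan = [
    Step("planner", "Break the problem into test cases and edge cases.", [0]),
    Step("coder", "Implement a solution that passes the planner's cases.", [0, 1]),
    Step("verifier", "Check the code against the plan; flag failures.", [0, 1, 2]),
]
print(run_workflow("Write a function that merges overlapping intervals.", plan)[-1])
```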

Giorgio Robino retweeted
Daily Dose of Data Science @DailyDoseOfDS_
Vibe train your AI agents.
This new method can replace LLM-as-a-judge for production agents.
Most teams point a giant LLM at their agent's output and call it evaluation. It works, but it comes with two real costs:
- It's slow and expensive at inference time
- It misses the domain-specific failures that actually matter to your use case
Vibe training flips the whole setup. Researchers at Plurai distill a small language model that's specialized for your agent's exact behavior, your edge cases, and your failure modes. The SLM becomes your evaluator and your runtime guardrail in one.
Here's why this is a big deal:
- Cheap enough to run inline on every agent step, not just offline batches
- Catches the failures that generic LLM judges shrug off
- Same model guards production and grades it, so eval and runtime stay in sync
A small specialized model beating a giant general one is becoming a pattern. Distillation is quietly turning into one of the most underrated techniques for shipping reliable agents.
Try it here: plurai.ai/launch
Paper: plurai.ai/papers
Ilan Kadar@ilan_kadar

Big day for us, finally sharing what we've been cooking for a while.
Over the past year, we kept seeing the same pattern: AI agents look great in demos, until real users break them.
Today, we're fixing that with vibe-training to build real-time, tailored evals and guardrails for your agents, in minutes.
Define your intent with a prompt or a few examples. We generate edge-case datasets, and train a model aligned to your use case, outperforming state-of-the-art LLMs at a fraction of the cost. (Research paper with benchmarks in the comments)
If you're building AI agents, don't let your users be the ones who discover the failures. Be the one who makes AI agents reliable in production and takes control at scale.
Start vibe-training for free: plurai.ai/launch
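The operational claim in the thread above is that a distilled evaluator is cheap enough to run inline on every agent step rather than in offline batches. Here is a minimal sketch of that wiring, assuming a small classifier fine-tuned on your own failure modes; the model id and its label set are assumptions, not Plurai's product.

```python
# Inline guardrail sketch (illustrative; the model id and labels are assumptions,
# not Plurai's distilled evaluator): a small classifier scores every agent step
# before the step's output is allowed to reach the user or the next tool call.
from transformers import pipeline

# A distilled SLM fine-tuned on your agent's failure modes would go here.
judge = pipeline("text-classification", model="your-org/agent-step-guardrail")

def guarded_step(step_input: str, step_output: str, threshold: float = 0.8) -> str:
    verdict = judge(f"INPUT: {step_input}\nOUTPUT: {step_output}")[0]
    if verdict["label"] == "FAIL" and verdict["score"] >= threshold:
        # The same model that grades offline evals blocks the step at runtime,
        # so evaluation and production guardrails cannot drift apart.
        return "Step blocked: retry with a revised plan."
    return step_output

print(guarded_step("Refund request for order 1234",
                   "Sure, I have deleted the customer's account."))
```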

Giorgio Robino retweeted
Simplifying AI @simplifyinAI
Psychology solved the AI memory problem decades ago. We just haven't been reading the right papers. Current AI architectures are failing because they treat memory like a hard drive. Vector databases (RAG) are just flat embedding spaces. Conversation summaries compress a life into a bio. Episodic buffers give agents a 30-second memory span. Past 10k documents, semantic search is basically a coin flip. But in 2005, a landmark psychology paper mapped exactly how human memory actually scales. It’s called the Self-Memory System. Humans don't store memories like database rows. We construct them. Our brains organize memory hierarchically: Lifetime periods. General events. Episodic details. When you remember something, your brain doesn't perform a vector similarity search across billions of flat tokens. It filters the past through the "Working Self", a dynamic system that retrieves only what is directly relevant to your current active goals. This changes everything for how we build AI agents. Right now, we are force-feeding models massive context windows and hoping they figure it out. We are trying to solve a cognitive problem with a database engineering solution. If we want AI that can actually reason across a lifetime of data, we have to stop building better hard drives. We have to build an artificial Working Self. An AI shouldn't retrieve the most "semantically similar" document. It should retrieve the memory that is most relevant to its current objective. The blueprint for agentic memory has been sitting in psychology journals for 20 years. We just have to stop thinking like software engineers. And start thinking like psychologists.
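Read as an engineering recipe, the argument suggests a two-stage lookup: first prune whole branches of the memory hierarchy using the agent's active goal, then rank only the surviving episodic details against the query. Below is a toy sketch of that "Working Self" filter, using naive scoring purely for illustration; it is a sketch of the argument, not a published architecture.

```python
# Toy sketch of a "Working Self"-style memory lookup (illustrative only):
# memories are stored hierarchically and retrieval is filtered by the agent's
# active goal before any similarity scoring happens, instead of one flat
# nearest-neighbour search over everything the agent has ever seen.
from dataclasses import dataclass

@dataclass
class Episode:
    period: str        # lifetime period, e.g. "Q3 migration project"
    event: str         # general event within that period
    detail: str        # episodic detail
    tags: set[str]     # coarse goal tags used by the Working Self filter

MEMORY = [
    Episode("Q3 migration", "schema redesign", "orders table split into 3 tables", {"database"}),
    Episode("Q3 migration", "rollout", "canary failed on replica lag", {"database", "incident"}),
    Episode("onboarding", "team intro", "Ana owns the billing service", {"people"}),
]

def working_self_retrieve(active_goal_tags: set[str], query: str, k: int = 2) -> list[Episode]:
    # Stage 1: the active goal prunes whole branches of the hierarchy.
    candidates = [m for m in MEMORY if m.tags & active_goal_tags]
    # Stage 2: only the survivors are ranked against the query (naive word overlap here).
    score = lambda m: len(set(query.lower().split()) & set(m.detail.lower().split()))
    return sorted(candidates, key=score, reverse=True)[:k]

for m in working_self_retrieve({"database"}, "why did the replica canary fail"):
    print(m.period, "/", m.event, "->", m.detail)
```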
Giorgio Robino retweeted
huihui.ai @support_huihui
New Model: huihui-ai/Huihui4-8B-A4B-v2
Huihui4-8B-A4B-v2 is a lightweight MoE (Mixture of Experts) conversational model optimized from Google's gemma-4-26B-A4B-it architecture. Through expert pruning and supervised fine-tuning on high-quality dialogue data (the dataset adopts the GLM-5.1-format thinking mode, so in thinking mode the model better reflects GLM-5.1's reasoning style), this model significantly reduces computational overhead while preserving core reasoning and interaction capabilities. It is specifically designed for deployment on consumer-grade hardware and code-related conversational tasks. This model is not an ablation variant.
huggingface.co/huihui-ai/Huih…
Giorgio Robino retweeted
Sukh Sroay @sukh_saroy
MICROSOFT AND SALESFORCE JUST PROVED THAT THE WAY YOU ACTUALLY USE CHATGPT IS THE WAY IT FAILS. Not the dumb way. The normal way. The way you use it every single day. The researchers ran 200,000+ simulated conversations across 15 of the top LLMs in the world. GPT-4.1. Claude 3.7 Sonnet. Gemini 2.5 Pro. DeepSeek-R1. o3. Every model people pay for. Every model people trust to get real work done. They tested two scenarios. In the first, they handed the model the entire prompt at once. Every detail. Every constraint. Every requirement. One clean message. In the second, they fed the same exact information across multiple turns the way a real human asks questions. Same task. Same information. Just delivered differently. Performance dropped 39% across the board. Not 5%. Not 10%. Thirty nine percent. A model getting 90% accuracy in a single clean prompt collapsed to ~60% the moment you talked to it like a human being. This wasn't one model. This was every model they tested. The researchers gave the failure mode a name. They called it "getting lost in conversation." When the model takes a wrong turn early, it does not recover. It locks in the assumption it made in turn 2 and drags it through the entire conversation, no matter how much you correct it later. The most unsettling number in the paper: Aptitude (raw capability) only dropped 15%. But UNRELIABILITY jumped 112%. The model didn't get dumber. It got wildly inconsistent. The gap between its best run and worst run on the SAME task could exceed 50 points. Same prompt. Same model. Different day. Completely different answer. Here is what makes this scary. The researchers found that smarter models did not save you. GPT-4.1 and Gemini 2.5 Pro had slightly better multi-turn aptitude, but their unreliability scores were nearly identical to weaker open-source models. Spending more on a better model does not fix this. They also tested whether the new "reasoning" models would solve it. o3 and DeepSeek-R1, the models with extra thinking time built in, degraded just as badly as the non-reasoning ones. More compute did not help. More thinking tokens did not help. The architecture itself is the problem. The paper identified four behaviors driving the collapse: The model jumps to a full answer too early before it has all the information. It locks in assumptions from turn 1 and refuses to update them. It loses track of what was said in the middle of the conversation. And it over-relies on its own previous responses instead of yours. Sound familiar? That's because you have experienced every single one of these and probably blamed yourself for it. The real takeaway is uncomfortable. Every benchmark you have ever seen. Every "GPT-5 scored 95%" headline. Every model leaderboard that made you upgrade your subscription. All of it tested in single-turn, fully-specified prompts. The exact opposite of how you actually use the tool. The paper's exact words: "LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, when LLMs take a wrong turn in a conversation, they get lost and do not recover." The fix the researchers suggest is uncomfortable too. Stop having conversations. Write one complete prompt with every detail upfront. Or restart the chat the moment you notice the model going off track. Do not try to correct it mid-conversation. It will not listen. 
Every time you go back and forth with ChatGPT to "refine" an answer, you are walking it deeper into the exact failure mode this paper documented. You are not collaborating with the AI. You are slowly making it worse.
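The experimental contrast described above is straightforward to reproduce in miniature: give a model the full specification in one message, then reveal the identical constraints one turn at a time, and compare accuracy over repeated runs. Here is a schematic harness under those assumptions; `chat` is a placeholder for any chat-completions API and the correctness check is task-specific, so this is a sketch of the setup, not the paper's benchmark code.

```python
# Schematic version of the single-turn vs. multi-turn ("sharded") comparison
# described above. `chat` is a placeholder for any chat-completions API; the
# task, constraints, and checker are toy stand-ins for the paper's benchmarks.
from typing import Callable

Message = dict[str, str]

def run_full(chat: Callable[[list[Message]], str], constraints: list[str]) -> str:
    # Condition 1: every requirement arrives in one fully-specified prompt.
    spec = "Write a function that: " + "; ".join(constraints)
    return chat([{"role": "user", "content": spec}])

def run_sharded(chat: Callable[[list[Message]], str], constraints: list[str]) -> str:
    # Condition 2: same information, revealed one turn at a time, like a real user.
    history: list[Message] = []
    answer = ""
    for piece in constraints:
        history.append({"role": "user", "content": piece})
        answer = chat(history)
        history.append({"role": "assistant", "content": answer})
    return answer

def degradation(chat, constraints, check: Callable[[str], bool], trials: int = 20) -> float:
    full = sum(check(run_full(chat, constraints)) for _ in range(trials)) / trials
    sharded = sum(check(run_sharded(chat, constraints)) for _ in range(trials)) / trials
    return full - sharded   # the thread above cites ~39% average drops on this gap
```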
Giorgio Robino retweeted
stevibe @stevibe
Gemma4 vs Gemma4: Who Overthinks the Most?
Round 2 of the overthinking series. Last time, Qwen's tiny 9B burned the most tokens AND failed the most (3/5 wrong). Gemma4's turn. Same 5 nasty math problems.
🥇 The overthinker: Gemma4 26B A4B (MoE)
> 8,178 tokens on Q1 ❌
> 5,832 on Q2 ❌
> 899 on Q3 ❌
Only model to fail. 3 out of 5 wrong.
💎 The underrated star: Gemma4 E4B
> Lowest total tokens.
> Perfect 5/5.
> One of the smallest models in the lineup quietly beat everyone.
If you're running Gemma locally, E4B deserves a serious look.
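For anyone wanting to rerun this kind of comparison locally, the measurement is just tokens generated per problem plus a correctness check. A minimal sketch against an OpenAI-compatible local endpoint (such as those exposed by Ollama or llama.cpp); the model tags, endpoint, and answer check are placeholders, not the setup used in the thread.

```python
# Minimal sketch of an "overthinking" comparison: send the same questions to each
# model, record completion tokens and correctness. Assumes an OpenAI-compatible
# local server (e.g. Ollama / llama.cpp); model tags and answers are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
PROBLEMS = [("What is 17 * 24?", "408")]          # stand-in for the 5 math problems
MODELS = ["gemma-moe-26b-a4b", "gemma-e4b"]       # placeholder local model tags

for model in MODELS:
    tokens, correct = 0, 0
    for question, answer in PROBLEMS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        tokens += resp.usage.completion_tokens     # how much the model "thought"
        correct += answer in resp.choices[0].message.content
    print(f"{model}: {tokens} tokens, {correct}/{len(PROBLEMS)} correct")
```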