Dr. Carlos Toxtli

13.6K posts


@ctoxtli

📜 Assistant Professor @ClemsonUniv 🥼 Director Human-AI Empowerment Lab @ClemsonAI 🤖 Past: Google, United Nations, Snap, Microsoft Research

Clemson · Joined June 2009
3.9K Following · 3.3K Followers
Dr. Carlos Toxtli retweeted
Harmonic
Harmonic@HarmonicMath·
Aristotle fixes this
Nav Toor@heynavtoor

🚨SHOCKING: Apple just proved that AI models cannot do math. Not advanced math. Grade school math. The kind a 10-year-old solves. And the way they proved it is devastating.

Apple researchers took the most popular math benchmark in AI — GSM8K, a set of grade-school math problems — and made one change. They swapped the numbers. Same problem. Same logic. Same steps. Different numbers. Every model's performance dropped. Every single one. 25 state-of-the-art models tested.

But that wasn't the real experiment. The real experiment broke everything. They added one sentence to a math problem. One sentence that is completely irrelevant to the answer. It has nothing to do with the math. A human would read it and ignore it instantly.

Here's the actual example from the paper: "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?"

The correct answer is 190. The size of the kiwis has nothing to do with the count. A 10-year-old would ignore "five of them were a bit smaller" because it's obviously irrelevant. It doesn't change how many kiwis there are.

But o1-mini, OpenAI's reasoning model, subtracted 5. It got 185. Llama did the same thing. Subtracted 5. Got 185. They didn't reason through the problem. They saw the number 5, saw a sentence that sounded like it mattered, and blindly turned it into a subtraction. The models do not understand what subtraction means. They see a pattern that looks like subtraction and apply it. That is all.

Apple tested this across all models. They call the dataset "GSM-NoOp" — as in, the added clause is a no-operation. It does nothing. It changes nothing.

The results are catastrophic. Phi-3-mini dropped over 65%. More than half of its "math ability" vanished from one irrelevant sentence. GPT-4o dropped from 94.9% to 63.1%. o1-mini dropped from 94.5% to 66.0%. o1-preview, OpenAI's most advanced reasoning model at the time, dropped from 92.7% to 77.4%.

Even giving the models 8 examples of the exact same question beforehand, with the correct solution shown each time, barely helped. The models still fell for the irrelevant clause. This means it's not a prompting problem. It's not a context problem. It's structural.

The Apple researchers also found that models convert words into math operations without understanding what those words mean. They see the word "discount" and multiply. They see a number near the word "smaller" and subtract. Regardless of whether it makes any sense.

The paper's exact words: "current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data." And: "LLMs likely perform a form of probabilistic pattern-matching and searching to find closest seen data during training without proper understanding of concepts."

They also tested what happens when you increase the number of steps in a problem. Performance didn't just decrease. The rate of decrease accelerated. Adding two extra clauses to a problem dropped Gemma2-9b from 84.4% to 41.8%. Phi-3.5-mini from 87.6% to 44.8%. The more thinking required, the more the models collapse. A real reasoner would slow down and work through it. These models don't slow down. They pattern-match. And when the pattern becomes complex enough, they crash.

This paper was published at ICLR 2025, one of the most prestigious AI conferences in the world.

You are using AI to help you make financial decisions. To check legal documents. To solve problems at work. To help your children with homework. And Apple just proved that the AI is not thinking about any of it. It is pattern matching. And the moment something unexpected shows up in your question, it breaks. It does not tell you it broke. It just quietly gives you the wrong answer with full confidence.
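The two manipulations described here (number swapping and the no-op clause) are mechanical enough to sketch. Below is a toy generator in the spirit of GSM-Symbolic / GSM-NoOp; this is my illustration, not Apple's code, and the function name and template are made up:

```python
import random

# Toy GSM-NoOp-style generator (illustrative, not Apple's code): swap in
# fresh numbers each seed, and optionally append a clause that mentions a
# quantity but never changes the count.
def kiwi_problem(seed, add_noop=False):
    rng = random.Random(seed)
    fri, sat = rng.randint(10, 90), rng.randint(10, 90)
    text = (f"Oliver picks {fri} kiwis on Friday. Then he picks {sat} kiwis "
            f"on Saturday. On Sunday, he picks double the number of kiwis "
            f"he did on Friday")
    if add_noop:
        # The no-op clause: sounds numeric, changes nothing.
        text += ", but five of them were a bit smaller than average"
    text += ". How many kiwis does Oliver have?"
    answer = fri + sat + 2 * fri  # the clause must not affect this
    return text, answer
```

Any model that answers `answer - 5` on the no-op variant has pattern-matched "smaller" to subtraction, which is exactly the failure mode the thread describes.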

5
5
48
5.5K
Dr. Carlos Toxtli retweeted
Victor M
Victor M@victormustar·
NVIDIA's Kimodo is the release of the week 🔥 Prompt the timeline with whatever you want, like: "a person walks forward" → "a person starts jumping", hit Generate, and watch a 3D character do it in seconds (700 hrs of pro mocap training. Works on human + robot skeletons. Super fast + free to use on HF)
57
396
3.1K
413K
Dr. Carlos Toxtli retweeted
Nainsi Dwivedi
Nainsi Dwivedi@NainsiDwiv50980·
Holy shit... Microsoft open sourced an inference framework that runs a 100B parameter LLM on a single CPU. It's called BitNet. And it does what was supposed to be impossible. No GPU. No cloud. No $10K hardware setup. Just your laptop running a 100-billion parameter model at human reading speed.

Here's how it works: Every other LLM stores weights in 32-bit or 16-bit floats. BitNet uses 1.58 bits. Weights are ternary: just -1, 0, or +1. That's it. No floats. No expensive matrix math. Pure integer operations your CPU was already built for.

The result:
- 100B model runs on a single CPU at 5-7 tokens/second
- 2.37x to 6.17x faster than llama.cpp on x86
- 82% lower energy consumption on x86 CPUs
- 1.37x to 5.07x speedup on ARM (your MacBook)
- Memory drops by 16-32x vs full-precision models

The wildest part: Accuracy barely moves. BitNet b1.58 2B4T, their flagship model, was trained on 4 trillion tokens and benchmarks competitively against full-precision models of the same size. The quantization isn't destroying quality. It's just removing the bloat.

What this actually means:
- Run AI completely offline. Your data never leaves your machine
- Deploy LLMs on phones, IoT devices, edge hardware
- No more cloud API bills for inference
- AI in regions with no reliable internet

The model supports ARM and x86. Works on your MacBook, your Linux box, your Windows machine. 27.4K GitHub stars. 2.2K forks. Built by Microsoft Research. 100% Open Source. MIT License.
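The "no expensive matrix math" claim follows directly from the ternary format: a dot product against weights in {-1, 0, +1} reduces to integer additions and subtractions. A minimal sketch of the idea (illustrative only, not Microsoft's optimized bitnet.cpp kernels):

```python
import numpy as np

# With ternary weights in {-1, 0, +1}, a matrix-vector product needs no
# floating-point multiplies: sum the inputs where the weight is +1 and
# subtract them where it is -1.
def ternary_matvec(W, x):
    pos = (W == 1)   # boolean mask of +1 weights
    neg = (W == -1)  # boolean mask of -1 weights
    return pos @ x - neg @ x  # pure adds/subtracts

W = np.array([[1, 0, -1],
              [0, 1,  1]])
x = np.array([3, 5, 2])
y = ternary_matvec(W, x)  # [3 - 2, 5 + 2] = [1, 7]
```

The result matches an ordinary `W @ x`, but the kernel only ever accumulates, which is what makes CPU-only inference and the reported energy savings plausible.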
153
441
2.3K
296.6K
Dr. Carlos Toxtli retweeted
BURKOV
BURKOV@burkov·
This joint effort from UIUC, Meta, Google, and other major AI labs presents a unified roadmap for transforming LLMs into autonomous agents capable of planning, acting, and learning in dynamic environments. Read with AI tutor: chapterpal.com/s/219d7f0e/age… Read alone: arxiv.org/pdf/2601.12538
BURKOV tweet media
16
55
311
29.1K
Dr. Carlos Toxtli retweeted
Connor Davis
Connor Davis@connordavis_ai·
MIT just published a paper that quietly explains why LLM reasoning hits a wall and how to push past it.

The usual story is that models fail on hard problems because they lack scale, data, or intelligence. This paper argues something much more structural: models stop improving because the learning signal disappears. Once a task becomes too difficult, success rates collapse toward zero, reinforcement learning has nothing to optimize, and reasoning stagnates. The failure isn't cognitive, it's pedagogical.

The authors propose a simple but radical reframing. Instead of asking how to make models solve harder problems, they ask how models can generate problems that teach them. Their system, SOAR, splits a single pretrained model into two roles: a student that attempts extremely hard target tasks, and a teacher that generates new training problems. The catch is that the teacher is not rewarded for producing clever or realistic questions. It is rewarded only if the student's performance improves on a fixed set of real evaluation problems. No improvement means zero reward.

That incentive reshapes everything. The teacher learns to generate intermediate, stepping-stone problems that sit just inside the student's current capability boundary. These problems are not simplified versions of the target task, and strikingly, they do not even require correct solutions. What matters is that their structure forces the student to practice the right kind of reasoning, allowing gradient signal to emerge even when direct supervision fails.

The experimental results make the point painfully clear. On benchmarks where models start with zero success and standard reinforcement learning completely flatlines, SOAR breaks the deadlock and steadily improves performance. The model escapes the edge of learnability not by thinking harder, but by constructing a better learning environment for itself.

The deeper implication is uncomfortable. Many supposed "reasoning limits" may not be limits of intelligence at all. They are artifacts of training setups that assume the world provides learnable problems for free. This paper suggests that if models can shape their own curriculum, reasoning plateaus become engineering problems, not fundamental barriers. No new architectures, no extra human data, no larger models. Just a shift in what we reward: learning progress instead of answers.
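The incentive structure is simple enough to caricature in a few lines. Below is a toy sketch (all classes and numbers are illustrative, not the paper's code): the teacher's only reward is the change in the student's score on a fixed evaluation set, so difficulty rises exactly when the stepping-stone problems actually helped.

```python
# Toy SOAR-style loop (illustrative names and dynamics, not the paper's code).
class ToyStudent:
    def __init__(self):
        self.skill = 0.2  # can solve tasks up to this difficulty
    def solve(self, difficulty):
        return difficulty <= self.skill
    def train(self, problems):
        # Only problems near the capability boundary produce learning signal.
        near = [p for p in problems if abs(p - self.skill) < 0.15]
        self.skill += 0.1 * len(near)

class ToyTeacher:
    def __init__(self):
        self.level = 0.2
    def generate(self):
        return [self.level, self.level + 0.05]  # stepping-stone tasks
    def update(self, reward):
        if reward > 0:          # no improvement -> zero reward,
            self.level += 0.1   # so difficulty rises only when it helped

def evaluate(student, eval_set):
    return sum(student.solve(p) for p in eval_set) / len(eval_set)

eval_set = [0.3, 0.5, 0.7]              # hard target tasks, fixed throughout
student, teacher = ToyStudent(), ToyTeacher()
scores = [evaluate(student, eval_set)]  # starts at 0.0: plain RL would flatline
for _ in range(3):
    before = evaluate(student, eval_set)
    student.train(teacher.generate())
    after = evaluate(student, eval_set)
    teacher.update(after - before)
    scores.append(after)
```

In this toy, the student begins with zero success on the eval set, yet the score climbs once the teacher supplies boundary-level problems, which is the qualitative behavior the thread attributes to SOAR.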
Connor Davis tweet media
45
174
799
45.5K
Dr. Carlos Toxtli retweeted
Google Research
Google Research@GoogleResearch·
Sequencing a human genome, which once took 13 years and $3B, can now be done in days with the help of AI. By using AI tools like DeepVariant and DeepConsensus, we’re now helping researchers sequence the genomes of endangered species with incredible speed and accuracy. From the Grevy’s zebra to the African penguin, see how AI is helping pull species back from the brink.
36
189
1K
58.2K
Dr. Carlos Toxtli retweeted
Javi Lopez ⛩️
Javi Lopez ⛩️@javilopen·
⚡ Google Genie 3 but OPEN SOURCE Not even 48h later and the Chinese did it again: they just dropped a free real-time playable world generator. - LingBot-World - Built on Alibaba's Wan2.2 - REAL-TIME interaction at 16fps 100% open source 🧵
186
598
5.1K
563.1K
Dr. Carlos Toxtli retweeted
Google DeepMind
Google DeepMind@GoogleDeepMind·
We're helping AI to see the 3D world in motion as humans do. 🌐 Enter D4RT: a unified model that turns video into 4D representations faster than previous methods - enabling it to understand space and time. This is how it works 🧵
91
431
3.2K
367.6K
Dr. Carlos Toxtli retweeted
Google Research
Google Research@GoogleResearch·
Announcing our latest open medical AI models for developers: MedGemma 1.5, which is small enough to run offline & improves performance on 3D imaging (CT & MRI), & MedASR, a speech-to-text model for medical dictation. Both available on Hugging Face + Vertex AI. goo.gle/3L9oiII #MedGemma #HealthAI #GenerativeAI
67
534
3.6K
393.7K
Dr. Carlos Toxtli retweeted
elvis
elvis@omarsar0·
Major new research from Google and MIT.

"More agents is all you need" has become a mantra for AI developers. We know multi-agent systems can be effective, but we do this mostly based on heuristics. The default approach to building complex AI systems today remains adding more agents, more coordination, more communication. It would be helpful to have a more principled way to scale agentic systems.

This new research introduces the first quantitative scaling principles for agent systems, testing 180 configurations across three LLM families (OpenAI, Google, Anthropic) and four agentic benchmarks spanning financial reasoning, web navigation, game planning, and workflow execution.

The findings: Multi-agent systems show an overall mean MAS improvement of -3.5% across all benchmarks, with massive variance ranging from +81% improvement to -70% degradation depending on task structure and architecture.

Three dominant effects emerge from the data:

The tool-coordination trade-off: tool-heavy tasks suffer disproportionately from multi-agent overhead. The efficiency penalty compounds as environmental complexity increases. A task with 16 tools makes even the most efficient multi-agent architecture paradoxically less effective than a single agent.

The capability ceiling: once single-agent baselines exceed approximately 45% accuracy, coordination yields diminishing or negative returns. This is quantified as a statistically significant effect. Additional agents simply cannot overcome the coordination tax when baseline performance is already reasonable.

Architecture-dependent error amplification: independent multi-agent systems amplify errors 17.2x through unchecked propagation. Centralized coordination contains this to 4.4x via validation bottlenecks (these catch errors before propagation). The presence or absence of inter-agent verification determines whether collaboration corrects or catastrophically compounds mistakes.

The performance heterogeneity is also interesting to look at:
- On parallelizable financial reasoning tasks, centralized multi-agent coordination achieves +80.9% improvement.
- On sequential planning tasks requiring constraint satisfaction, every multi-agent variant tested degraded performance by 39-70%.
- Decentralized coordination excels on dynamic web navigation (+9.2%) but provides essentially no benefit elsewhere.

The researchers derive a predictive model achieving a cross-validated R^2 = 0.513 that correctly predicts the optimal architecture for 87% of held-out configurations. This model contains no dataset-specific parameters, enabling generalization to unseen task domains.

Overall, architecture-task alignment, not the number of agents, determines collaborative success. The research replaces heuristic guidance with quantitative principles: measure task decomposability, tool complexity, and baseline difficulty, then select a coordination structure accordingly.

Paper: arxiv.org/abs/2512.08296
Learn to build effective AI agents in my academy: dair-ai.thinkific.com
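The three effects can be read as a rough decision rule. A hedged sketch follows (the thresholds are the ones quoted in the thread; the paper's actual predictor is a fitted regression, not this if/else, and the function name is mine):

```python
# Rough architecture-selection heuristic distilled from the reported effects
# (illustrative only; the paper derives a fitted predictive model instead).
def pick_architecture(n_tools, single_agent_acc,
                      parallelizable=False, dynamic_env=False):
    if n_tools >= 16:
        return "single-agent"               # tool-coordination trade-off
    if single_agent_acc > 0.45:
        return "single-agent"               # capability ceiling: coordination tax
    if parallelizable:
        return "multi-agent (centralized)"  # validation contains errors (4.4x vs 17.2x)
    if dynamic_env:
        return "multi-agent (decentralized)"  # e.g. dynamic web navigation (+9.2%)
    return "single-agent"                   # default: avoid the coordination tax
```

The point of the sketch is the ordering: tool count and baseline accuracy veto multi-agent setups before task structure is even consulted.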
elvis tweet media
53
168
903
74.6K
Dr. Carlos Toxtli retweeted
Lior Alexander
Lior Alexander@LiorOnAI·
You can now transform LLMs into diffusion models. dLLM released an open recipe that converts any autoregressive model into a diffusion LLM.

How the conversion works:
1. Remove the causal mask and enable bidirectional attention
2. Mask random tokens and train the model to fill the gaps
3. Add light supervised training to stabilize outputs
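The first two steps of the recipe can be made concrete. A toy illustration (my code, not dLLM's): the difference between a causal and a bidirectional attention mask, and a corruption function that masks random tokens and records the fill-in targets a diffusion LLM would be trained to recover.

```python
import random

MASK = "<mask>"

def causal_mask(n):
    # Autoregressive: token i may only attend to positions j <= i.
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    # Step 1 of the recipe: drop causality, every token sees every token.
    return [[True] * n for _ in range(n)]

def make_fill_in_example(tokens, mask_rate, seed=0):
    # Step 2: corrupt random positions; loss is computed only on masked slots.
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, t in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted[i] = MASK
            targets[i] = t  # the model must restore this token
    return corrupted, targets
```

In a real conversion these would act on attention logits and token ids inside the transformer, but the training pair has exactly this shape: corrupted input, targets only at masked positions.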
23
110
612
40.1K
Dr. Carlos Toxtli retweeted
Chris Laub
Chris Laub@ChrisLaubAI·
This Stanford University paper just broke my brain. They just built an AI agent framework that evolves from zero data (no human labels, no curated tasks, no demonstrations) and it somehow gets better than every existing self-play method.

It's called Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning. And it's insane what they pulled off.

Every "self-improving" agent you've seen so far has the same fatal flaw: they can only generate tasks slightly harder than what they already know. So they plateau. Immediately. Agent0 breaks that ceiling.

Here's the twist: They spawn two agents from the same base LLM and make them compete.
• Curriculum Agent - generates harder and harder tasks
• Executor Agent - tries to solve them using reasoning + tools

Whenever the executor gets better, the curriculum agent is forced to raise the difficulty. Whenever the tasks get harder, the executor is forced to evolve. This creates a closed-loop, self-reinforcing curriculum spiral, and it all happens from scratch: no data, no humans, nothing. Just two agents pushing each other into higher intelligence.

And then they add the cheat code: a full Python tool interpreter inside the loop. The executor learns to reason through problems with code. The curriculum agent learns to create tasks that require tool use. So both agents keep escalating.

The results?
→ +18% gain in math reasoning
→ +24% gain in general reasoning
→ Beats R-Zero, SPIRAL, Absolute Zero, even frameworks using external proprietary APIs
→ All from zero data, just self-evolving cycles

They even show the difficulty curve rising across iterations: tasks start as basic geometry and end at constraint satisfaction, combinatorics, logic puzzles, and multi-step tool-reliant problems.

This is the closest thing we've seen to autonomous cognitive growth in LLMs. Agent0 isn't just "better RL." It's a blueprint for agents that bootstrap their own intelligence. The agent era just got unlocked.
Chris Laub tweet media
113
389
1.7K
113.2K
Dr. Carlos Toxtli retweeted
Akshay 🚀
Akshay 🚀@akshay_pachaar·
Google just dropped "Attention Is All You Need (V2)". This paper could solve AI's biggest problem: catastrophic forgetting.

When AI models learn something new, they tend to forget what they previously learned. Humans don't work this way, and now Google Research has a solution: Nested Learning. This is a new machine learning paradigm that treats models as a system of interconnected optimization problems running at different speeds - just like how our brain processes information.

Here's why this matters: LLMs don't learn from experiences; they remain limited to what they learned during training. They can't learn or improve over time without losing previous knowledge. Nested Learning changes this by viewing the model's architecture and training algorithm as the same thing - just different "levels" of optimization.

The paper introduces Hope, a proof-of-concept architecture that demonstrates this approach:
↳ Hope outperforms modern recurrent models on language modeling tasks
↳ It handles long-context memory better than state-of-the-art models
↳ It achieves this through "continuum memory systems" that update at different frequencies

This is similar to how our brain manages short-term and long-term memory simultaneously. We might finally be closing the gap between AI and the human brain's ability to continually learn. I've shared the link to the paper in the next tweet!
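The "different frequencies" idea can be caricatured as parameter groups on different update clocks. A toy sketch (my illustration of the concept, not Google's Hope architecture; all names and rates are made up): fast weights update every step, slow weights consolidate only every k-th step, like short-term vs. long-term memory.

```python
# Toy multi-timescale update (illustrative of "levels of optimization at
# different speeds"; not the Hope architecture).
def nested_update(params, grads, step, period_slow=10, lr_fast=0.1, lr_slow=0.01):
    # Fast level: updated on every step, like short-term memory.
    params["fast"] = [p - lr_fast * g for p, g in zip(params["fast"], grads["fast"])]
    if step % period_slow == 0:
        # Slow level: consolidates rarely, like long-term memory.
        params["slow"] = [p - lr_slow * g for p, g in zip(params["slow"], grads["slow"])]
    return params
```

Because the slow group changes an order of magnitude less often (and with a smaller step size), new gradients cannot immediately overwrite it, which is the intuition behind using nested timescales against catastrophic forgetting.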
Akshay 🚀 tweet media
257
1K
6K
511.4K
Dr. Carlos Toxtli retweeted
Google AI Studio
Google AI Studio@GoogleAIStudio·
gemini 3 pro • our most intelligent model yet • SOTA reasoning • 1501 Elo on LMArena • next-level vibe coding capabilities • complex multimodal understanding available now in Google AI Studio and the Gemini API
Google AI Studio tweet media
303
1.5K
13.9K
639.1K
Dr. Carlos Toxtli retweeted
AK
AK@_akhaliq·
DeepAgent: A General Reasoning Agent with Scalable Toolsets
AK tweet media
2
27
153
17.2K
Dr. Carlos Toxtli retweeted
AK
AK@_akhaliq·
ByteDance presents Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents
3
25
194
18.7K
Dr. Carlos Toxtli retweeted
DailyPapers
DailyPapers@HuggingPapers·
NVIDIA just released Audio Flamingo 3 on Hugging Face! This fully open, state-of-the-art Large Audio-Language Model excels at understanding & reasoning across speech, sounds, and music, setting new benchmarks on 20+ tasks. huggingface.co/nvidia/audio-f…
7
115
673
59.7K
Dr. Carlos Toxtli retweeted
Tencent HY
Tencent HY@TencentHunyuan·
Today, we are open-sourcing Hunyuan World 1.1 (WorldMirror), a universal feed-forward 3D reconstruction model. 🚀🚀🚀

While our previously released Hunyuan World 1.0 (open-sourced, lite version deployable on consumer GPUs) focused on generating 3D worlds from text or single-view images, Hunyuan World 1.1 significantly expands the input scope by unlocking video-to-3D and multi-view-to-3D world creation.

Highlights:
🔹 Any Input, Maximized Flexibility and Fidelity: Flexibly integrates diverse geometric priors (camera poses, intrinsics, depth maps) to resolve structural ambiguities and ensure geometrically consistent 3D outputs.
🔹 Any Output, SOTA Results: This elegant architecture simultaneously generates multiple 3D representations: dense point clouds, multi-view depth maps, camera parameters, surface normals, and 3D Gaussian splats.
🔹 Single-GPU & Fast Inference: As an all-in-one, feed-forward model, Hunyuan World 1.1 runs on a single GPU and delivers all 3D attributes in a single forward pass, within seconds.

🌐 Project Page: 3d-models.hunyuan.tencent.com/world/
🔗 Github: github.com/Tencent-Hunyua…
🤗 Hugging Face: huggingface.co/tencent/Hunyua…
✨ Demo: huggingface.co/spaces/tencent…
📄 Technical Report: 3d-models.hunyuan.tencent.com/world/worldMir…
46
263
1.6K
167.9K
Dr. Carlos Toxtli retweeted
Millie Marconi
Millie Marconi@MillieMarconnni·
🚨 This MIT paper just broke everything we thought we knew about AI reasoning. These researchers built something called Tensor Logic that turns logical reasoning into pure mathematics. Not symbolic manipulation. Not heuristic search. Just tensor algebra.

Here's how it works: Logical propositions become vectors. Inference rules become tensor operations. Truth values propagate through continuous transformations. Translation? Deduction and neural computation finally speak the same language. This isn't symbolic AI bolted onto deep learning. It's not deep learning pretending to do logic. It's a unified framework where both happen simultaneously.

Every major AI model today hits a wall with consistency because logic is discrete and gradients are continuous. You can't backpropagate through "true or false." Tensor Logic erases that boundary completely. The system embeds Boolean reasoning, probabilistic inference, and predicate logic inside a single differentiable framework. That means you can train it end-to-end like a neural network while maintaining logical guarantees.

In experiments, the system performs logical inference as matrix operations. Neural nets can now reason with symbolic precision. Symbolic systems can learn from data like neural nets. The numbers are wild. The system handles complex logical queries with the same computational efficiency as matrix multiplication. No expensive search. No combinatorial explosion.

But here's the part that should terrify the incumbents: this scales. Traditional symbolic AI chokes on ambiguity. Neural networks hallucinate logical structures. Tensor Logic gets both right simultaneously. If this approach spreads, we might finally get models that don't just predict truths but can prove them. Systems that reason with mathematical certainty while learning from messy real-world data.

The implications go way beyond academic AI. Every system that needs both learning and guarantees (autonomous vehicles, medical diagnosis, financial systems, legal reasoning) just got a new foundation. Current AI is either good at learning or good at logic. Never both. That dichotomy just ended. The fusion of logic and learning isn't coming. It's already here.
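The "inference as matrix operations" claim has a classic concrete instance: treat a binary relation as a 0/1 matrix, and a Datalog-style rule such as path(x, z) <- edge(x, y), path(y, z) becomes a matrix product followed by a threshold. A toy sketch (my example of the general idea, not the paper's formalism):

```python
import numpy as np

# A relation as a 0/1 matrix: edge[i][j] == 1 means an edge i -> j.
E = np.array([[0, 1, 0],   # edges: 0 -> 1, 1 -> 2
              [0, 0, 1],
              [0, 0, 0]])

# Rule application as a matmul: (E @ path)[x][z] counts the y's with
# edge(x, y) and path(y, z); thresholding makes it a truth value again.
path = E.copy()
for _ in range(len(E)):                    # iterate to a fixed point
    path = np.minimum(1, path + E @ path)  # one rule application = one matmul

# path is now the transitive closure: node 0 reaches node 2 via node 1.
```

Replacing the hard threshold with a differentiable squashing function is what lets this style of inference sit inside an end-to-end trainable network, which is the boundary-erasing move the thread is describing.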
Millie Marconi tweet media
112
275
1.5K
240.3K