
Think streaming video AI needs heavy memory tricks? Think again.
SimpleStream—a dead-simple sliding window feeding just the last 4 frames to an off-the-shelf vision-language model—just beat 13 major streaming models on two public leaderboards.
Results:
— 67.7% on OVO-Bench (+8.5 pts over HERMES, the prior SOTA)
— 80.6% on StreamingBench
— Lower GPU memory than prior systems (≤18 GB), with sub-40 ms response latency.
Adding more frames or memory? Often *worse* for real-time perception. The big insight: optimal window size depends on model backbone, not just scale. And more history trades off recall for present-scene accuracy.
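The whole idea fits in a few lines. A minimal sketch of that sliding window (names and the fake model interface are illustrative, not from the paper):

```python
from collections import deque

WINDOW = 4  # the reported window size; the optimal value depends on the backbone

class SlidingWindowStreamer:
    """No memory bank, no compression: just the last N frames."""

    def __init__(self, vlm, window=WINDOW):
        self.vlm = vlm                   # any off-the-shelf vision-language model
        self.buf = deque(maxlen=window)  # older frames are evicted automatically

    def ingest(self, frame):
        self.buf.append(frame)

    def query(self, question):
        # Every query sees only the current window, so latency and
        # memory stay flat no matter how long the stream runs.
        return self.vlm(list(self.buf), question)
```

That fixed-size deque is the entire "memory system" — which is exactly why the result is surprising.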
The bar for “progress” just got higher: new systems must beat this minimalist baseline under identical conditions, and benchmarks should separate perception from memory.
Get the full analysis here: yesnoerror.com/abs/2604.02317
// alpha identified
// $YNE