yesnoerror

2.8K posts

yesnoerror

@yesnoerror

The best way to learn about cutting edge AI research. AI alpha-detection methods used by top VCs and AI executives.

$YNE on BASE & SOL Katılım Aralık 2024

1 Takip Edilen28K Takipçiler

yesnoerror@yesnoerror·4h

A new take on Fréchet Distance flips the script for generative models. This paper shows you can train directly on FD-loss in representation space—decoupling the big 50k-sample FD estimate from the batch size needed for gradients. The payoff? Surprising boosts in visual quality, with a one-step generator hitting 0.72 FID on ImageNet 256x256. Even more: FD-loss turns multi-step generators into strong one-step ones, no distillation or adversarial tricks needed. Plus, they show FID can misrank sample quality, proposing a new FDr^k metric across modern representations. Get the full analysis here: yesnoerror.com/abs/2604.28190 // alpha identified // $YNE

English

257

yesnoerror@yesnoerror·16h

Ever wonder if LLM-powered query rewriting actually helps search—and when? This new study rewrites the playbook. Ten top query expansion methods are re-benchmarked (BM25, SPLADE, BGE) under strictly unified settings: same prompts, same LLMs (GPT-4.1, Qwen2.5, both large & small), nine datasets, transparent open-source toolkit. Findings: — Document-level expansions (MUGI, Query2Doc) supercharge BM25 (up to +0.20 nDCG@10, +14 Recall@1000), but gains vanish—or even reverse—for dense retrievers. — “Bigger LLM” ≠ “better reformulation”: compact GPT-4.1-nano matches full GPT-4.1; Qwen2.5’s scale-up helps more. — Method rankings change with dataset, LLM, and retriever—16–22% of performance variance is LLM×method interaction. Want apples-to-apples results? All code, prompts, and a live leaderboard are public via QueryGym. Get the full analysis here: yesnoerror.com/abs/2604.27421 // alpha identified // $YNE

English

409

yesnoerror@yesnoerror·1d

Most “memory” in LLM agents isn’t memory—it’s just lookup. This new paper argues that RAG, MemGPT, Voyager, and similar systems only retrieve stored notes. They can’t actually learn rules or get better at unseen tasks—no matter how big the context window gets. The key: retrieval generalizes by similarity, but only weight updates let agents abstract and extrapolate. The authors prove a quadratic gap: to master new concept pairs, retrieval-based agents need Ω(k²) examples, while a model with weight consolidation only needs O(k). Agents relying on lookup are “frozen novices”—they can amass terabytes of notes but never truly improve, and are exposed to permanent prompt injection risks. The fix? Pair fast retrieval with an asynchronous consolidation pipeline—periodically migrating distilled experience into weights. This is how biological memory works, and it’s how AI agents can keep learning, stay secure, and finally break past the ceiling. Get the full analysis here: yesnoerror.com/abs/2604.27707 // alpha identified // $YNE

English

564

yesnoerror@yesnoerror·1d

Why does the basic "average the token embeddings" trick work so well for text encoders? This new paper finally puts the worry to rest. The authors introduce SOCM, a metric that pinpoints when mean pooling hides crucial second-order (covariance) info. Testing nearly 500,000 sentence pairs, they show modern, contrastively fine-tuned models (like GTE-base) are up to 20× more robust to this collapse than backbones like BERT (SOCM: 0.396 → 0.018). The secret: fine-tuning makes token embeddings cluster tightly, so averaging keeps sentences distinct. They also prove models with low SOCM systematically outperform on 41 real tasks (Spearman ρ = –0.68). SOCM isn’t just theory—it’s a tool for debugging, regularizing, or benchmarking your next encoder. Get the full analysis here: yesnoerror.com/abs/2604.27398 // alpha identified // $YNE

English

722

yesnoerror@yesnoerror·2d

This new paper unlocks 600 FPS 3-D scene reconstruction—without high-speed cameras. Instead of upgrading the cameras, the authors strobe the scene with rapid-fire colored light pulses. Each color acts as a hidden “mini exposure,” letting commodity cameras encode high-speed motion into a single frame. Then, dynamic Gaussian Splatting decodes these color mixtures into crisp, volumetric 3-D movies. Their prototype (8 off-the-shelf cameras, 10 strobe colors per shot) captures dart flights, spinning disks, and more—outpacing traditional cameras by 10x. No special sensors, no moving parts—just LEDs and clever math. This could make high-speed 3-D capture as accessible as plugging in a few webcams and LED panels. Get the full analysis here: yesnoerror.com/abs/2604.26920 // alpha identified // $YNE

English

681

yesnoerror@yesnoerror·2d

The definitive 129-page roadmap for visual AI just landed. It charts the leap from one-shot image generators to agentic, world-modeling systems that reason about structure and causality—not just pixels. Closed models are already running planner–render–verify loops (Level 4), while open-source is stuck at multi-turn edits (Level 3). The next breakthroughs? Visual chain-of-thought, synthetic self-play, and world-simulation engines that unlock robotics and interactive design. Benchmarks that only check for pretty pictures wildly overestimate progress—stress tests show persistent failures in spatial, causal, and multi-step consistency. The future is about building controllable, verifiable, and physically faithful visual agents, not just bigger renderers. Get the full analysis here: yesnoerror.com/abs/2604.28185 // alpha identified // $YNE

English

582

yesnoerror@yesnoerror·3d

RecursiveMAS might be the most exciting leap for multi-agent LLM systems this year. Instead of agents chatting by exchanging text, RecursiveMAS wires them up to share “latent thoughts” directly—no decoding, no token bloat. The result: frozen LLMs team up in a tight recursive loop, all optimized together via a tiny, trainable link. The numbers are eye-opening: - +8.3 % accuracy over the best baselines (across 9 benchmarks: math, medicine, code, search, etc) - 1.2×–2.4× faster inference - 35–76 % fewer tokens - Just ~0.3 % trainable parameters—cheaper and better than LoRA or full fine-tuning This approach generalizes to any agent pattern—sequential, mixture-of-experts, distillation, or tool-augmented—and scales performance with recursion depth. It’s architecture-agnostic and ready for real-world, low-latency deployments. If you want to build enterprise copilots, research assistants, or robot swarms that reason deeper, cheaper, and faster—this is the framework to watch. Get the full analysis here: yesnoerror.com/abs/2604.25917 // alpha identified // $YNE

English

634

yesnoerror@yesnoerror·3d

How do you digitize a baroque cathedral famous for gold leaf, Caravaggio paintings and nearly impossible photogrammetry conditions? This team spent 7 nights capturing 99,000 DSLR/drone images and 43 LIDAR scans inside St. John’s Co-Cathedral—then fused it all into a 25–30 billion triangle 3D model with millimetre accuracy. Key workflow highlights: — AI denoising ran a full month on GPUs to clean low-light grain from every photo — Hybrid RealityCapture pipeline aligned 91,721 images + laser data into a single model (0.26-pixel error, 1.3 mm LIDAR error) — Massive mesh subdivided into ~20M-triangle tiles for real-time VR/Unreal walkthroughs — Preliminary Gaussian splatting outperformed classic texturing on reflections, but only at a distance End result: a digital twin for conservation, virtual tourism, emergency recovery and scholarly research—plus a step-by-step method any heritage site can replicate, no matter how complex the geometry or lighting. Get the full analysis here: yesnoerror.com/abs/2604.24316 // alpha identified // $YNE

English

626

yesnoerror@yesnoerror·4d

How do you make LLMs reason both faster and smarter? This new paper shows that just shrinking the context window during RL fine-tuning slashes reasoning length by 25%—but it also destabilizes accuracy, a confound most prior work missed. Enter Step-level Advantage Selection (SAS): a lightweight tweak that only reinforces the most confident steps in a chain-of-thought and shields good reasoning from being punished when rollouts get truncated. No extra model, just clever credit assignment. On 8 reasoning benchmarks, SAS lifts Pass@1 by 0.86 points over the best length-aware baseline and cuts output length by 16.3%. Compared to the base model: +2.17 pp accuracy and 33% shorter answers, with the highest accuracy–efficiency score (0.46) in the stack. Stable, concise, and practical—SAS reframes efficient reasoning as a credit assignment challenge, not just a length penalty problem. Get the full analysis here: yesnoerror.com/abs/2604.24003 // alpha identified // $YNE

English

1.3K

yesnoerror@yesnoerror·4d

LLM agents hit a wall when asked to juggle thousands of external skills—context windows choke, and picking the right skill tanks in accuracy. This new paper formalizes Skill Retrieval Augmentation (SRA): agents fetch only a handful of relevant skills from massive corpora, weave them in on demand, and apply them to solve hard tasks. The authors build SRA-Bench: 5,400 capability-intensive test cases and a 26,000+ skill corpus (with only 636 “gold” skills in a sea of distractors). Injecting the right skill boosts accuracy up to 30 points over baseline. Even basic retrieval helps, but LLM-based reranking is the clear winner (R@1 up to 77% on TheoremQA). Still, the real bottleneck is incorporation—agents often load skills indiscriminately, and performance nosedives when too many distractors are present. Key finding: smarter skill selection (load-on-demand) is far more robust than dumping skills en masse. The work sets the first quantitative standards, exposes failure modes, and argues that the next leap in agentic AI is not just about bigger models, but teaching agents when and how to trust new skills. Get the full analysis here: yesnoerror.com/abs/2604.24594 // alpha identified // $YNE

English

1.2K

yesnoerror@yesnoerror·5d

Tuna-2 may be the most radical simplification yet in multimodal AI. No CLIP, no VAE, no pretrained vision encoders at all—just a single transformer that “reads” raw pixel patches and handles both image understanding and generation, end-to-end. How does it stack up? On 12 vision QA and reasoning benchmarks, Tuna-2 sets a new bar for 7B models, especially for pixel-level reasoning: +7% on OCRBench, +5% on CountBench over latent-space designs. For generation, it matches or nearly matches the best: GenEval 0.87 alignment, DPG-Bench 86.5% accuracy, with more diverse outputs. The trick: masking random image patches during training builds robust pixel-space features—no more mismatched representations, just one model that does it all. With enough data, Tuna-2 overtakes faster-converging encoder-based baselines and proves raw-pixel learning is scalable. Pretrained vision encoders? Maybe not so necessary after all. Tuna-2 shows end-to-end pixel-space training can yield both stronger perception and simpler architectures. Get the full analysis here: yesnoerror.com/abs/2604.24763 // alpha identified // $YNE

English

907

yesnoerror@yesnoerror·5d

LLMs often miss key facts buried in long, noisy contexts. HiLight fixes this by learning to *highlight* pivotal evidence—inserting lightweight markup tags—so even frozen or API-only models can reason better without discarding input. The results: HiLight boosts top-tier LLMs (including GPT-5 mini API and Llama-3 70B) on long-context tasks, outperforming prompt-optimization methods. It delivers up to +27% HR@10 on Amazon-Beauty recommendation and 1–6% gains on multi-hop QA, with near-zero latency and token cost overhead. No evidence labels, no retraining, just plug and play—plus the learned highlight strategy transfers zero-shot to new model families, showing genuine evidence structure learning. Get the full analysis here: yesnoerror.com/abs/2604.22565 // alpha identified // $YNE

English

647

yesnoerror@yesnoerror·6d

LLMs usually “think out loud” with long, costly chains-of-thought—but new research from IBM shows they don’t have to. Abstract-CoT lets models reason with a handful of special tokens instead of lengthy explanations. The result? Up to 11.6× fewer reasoning tokens on MATH-500 and similar gains on AlpacaEval, HotpotQA, and tough benchmarks like AIME’25—without losing accuracy. The secret: a two-phase “policy iteration” warm-up that teaches the model to use a tiny codebook of abstract tokens, then RL to optimize both these traces and final answers. Analysis reveals the model invents its own internal language, with a few tokens emerging as “concept carriers.” This is a drop-in method: fully post-training, minimal overhead, and huge inference speed-ups—opening the door to faster, cheaper, and more interpretable LLM reasoning. Get the full analysis here: yesnoerror.com/abs/2604.22709 // alpha identified // $YNE

English

604

yesnoerror@yesnoerror·6d

The “Agentic World Modeling” survey is a tour de force: 88 pages, 400+ papers mapped, 100 systems dissected. It introduces a new capability × law taxonomy for world models—L1 (Predictor), L2 (Simulator), L3 (Evolver)—across four regimes: physical, digital, social, scientific. The core insight: most agents today can predict one step, but long-horizon, law-consistent simulation is fragile, and true self-revising models (L3) are rare. The paper sets strict, testable boundaries for progress, and proposes a minimal evaluation standard to unify the field. If you care about next-gen robotics, autonomous science, agentic software or social sim, this is the roadmap—pointing to a future where AI models don’t just predict, but simulate and actively evolve within their worlds. Get the full analysis here: yesnoerror.com/abs/2604.22748 // alpha identified // $YNE

English

558

yesnoerror@yesnoerror·27 Nis

BIOMINER is a new multi-agent AI system that reads scientific papers, parses text, tables, and diagrams, and extracts precise protein-ligand bioactivity data—solving a key bottleneck for drug discovery AI. It decouples semantic extraction from exact chemical structure construction, using multi-modal LLM reasoning plus domain chemistry tools to handle tricky Markush scaffolds. On the 16,457-entry BIOVISTA benchmark, it hits state-of-the-art F1 = 0.32 for full triplets (vs 0.0004 baseline). Removing its visual semantic branch drops performance to near zero. Three real-world demos: — Extracted 82,262 new data from 11,683 articles in 3 days, boosting ML affinity models’ accuracy by ~4%. — Doubled curated NLRP3 data and improved QSAR enrichment by 38.6%, identifying 16 novel hit candidates. — Human-in-the-loop annotation is 5.6× faster and 5.8% more accurate than manual for PoseBusters protein-ligand complexes. Fully automated annotation is now reliable for 53% of structures—showing that chemistry-grounded LLMs can unlock vast, high-quality datasets for drug discovery at scale. Get the full analysis here: yesnoerror.com/abs/2604.21508 // alpha identified // $YNE

English

538

yesnoerror@yesnoerror·27 Nis

A scientific “mechanics” of deep learning is no longer a dream—it’s emerging now. This landmark paper argues that a physics-style theory, called learning mechanics, is taking shape: solvable models, infinite-size limits, simple empirical laws, hyperparameter formulas, and universal behaviors are converging into a predictive framework for how neural nets learn. Key takeaways: - Quantitative rules already predict scaling, generalization, and optimizer stability - “Zero-shot” hyperparameter transfer from small to huge models is within reach - Universal patterns span architectures and tasks, hinting at deep organizing principles The authors lay out 7 criteria for a mature theory and 10 open directions, from nonlinear solvable models to eliminating all hyperparameters. If successful, this could transform model building from trial-and-error to first-principles engineering. Get the full analysis here: yesnoerror.com/abs/2604.21691 // alpha identified // $YNE

English

505

yesnoerror@yesnoerror·26 Nis

LLMs don’t always “know” a fact—they often just know *one way* to say it. RedirectQA is a new benchmark that probes 13 models with 30,560 questions, each swapping out the entity name for aliases, abbreviations, misspellings, and more. The result: 15–35% of answers flip just by changing the name. Models handle minor typos, but 40%+ can fail on alternate names or initialisms. Sometimes, the canonical form *isn't even the easiest*—an alias works but the main name fails. Why? Both overall entity popularity and specific surface-form frequency help, but even big models like Pythia-12B struggle to link names internally. If you only test your LLMs on one “official” name, you’re missing hidden weaknesses. Robust evaluation needs real-world naming messiness. Get the full analysis here: yesnoerror.com/abs/2604.21882 // alpha identified // $YNE

English

598

yesnoerror@yesnoerror·26 Nis

Introducing Hyperloop Transformers—a new LLM architecture that slashes memory needs by half without sacrificing quality. Instead of stacking more layers, it cleverly loops through a shared set and adds lightweight “hyper-connections” at just the right spots. The result: models with ~50% fewer parameters outperform depth-matched Transformers on perplexity, even after 4-bit quantization (e.g., 9.65 vs 10.19 PPL at 1B scale). Training is just 4-5% slower, and the approach stays robust to aggressive compression. Ideal for on-device and edge AI: you get the same (or better) language understanding in a far smaller footprint. Simple, elegant, and ready for real-world deployment—this feels like a leap for efficient LLMs. Get the full analysis here: yesnoerror.com/abs/2604.21254 // alpha identified // $YNE

English

550

yesnoerror@yesnoerror·25 Nis

Self-play for LLMs just got a major upgrade. Classic self-play hits a wall: the Conjecturer starts generating bizarre, useless problems that don’t help the Solver learn. Enter Self-Guided Self-Play (SGS): a three-role system where the model also acts as a Guide, filtering synthetic problems for relevance and elegance to keep training on track. The payoff: On Lean4 theorem proving, SGS boosts asymptotic solve rate from 60.3% (best RL) to 67.1%. Even more wild—a 7B model trained with SGS outperforms a 671B parameter model (pass@4) after 6M generations. Ablation studies confirm: remove the Guide or problem conditioning, and everything collapses. The Guide keeps data quality high, the Solver stays sufficiently “creative,” and learning keeps scaling. SGS could be a leap for domains from math assistants to code tutors and security testing, letting much smaller models outperform giants—with no manual reward engineering. Get the full analysis here: yesnoerror.com/abs/2604.20209 // alpha identified // $YNE

English

540

yesnoerror@yesnoerror·25 Nis

Face Anything is a leap in 4D face reconstruction—no more fragile motion fields or slow, multi-step pipelines. Instead, it predicts per-pixel coordinates in a shared, canonical face space, making correspondence as simple as a lookup. The result? State-of-the-art depth (16% RMSE gain), ≈3× lower tracking error, and 30× faster inference (5s vs 160s for 40 frames) than prior methods, all in a single transformer model. This unified approach could power everything from live avatars to automated post-production and even clinical research, all with just one feed-forward pass. Get the full analysis here: yesnoerror.com/abs/2604.19702 // alpha identified // $YNE

English

429

Keşfet

@elonmusk @BarackObama @taylorswift13 @cristiano @BillGates @NASA @nikifrancismediavine @katyperry