AI Bites | YouTube Channel
@ai_bites
3.5K posts
AI tools, papers and hands-on coding to solve problems with AI. Former @UniofOxford @Oxford_VGG. Our products: https://t.co/uhnIm6VOmS
YouTube → Joined July 2014
694 Following · 2.2K Followers
Gloria - yet another model that promises consistent character video generation. It trains a video foundation model to generate highly expressive human videos with long-term identity consistency through an anchor-based mechanism, producing character videos exceeding 10 minutes without noticeable drift!
Paper Title: Gloria: Consistent Character Video Generation via Content Anchors
Project: yyvhang.github.io/Gloria_Page/
Link: arxiv.org/abs/2603.29931
Extend3D is a training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. The main discovery is that treating the incompleteness of 3D structure as noise during 3D refinement enables 3D completion, a concept they term under-noising.
Paper Title: Extend3D: Town-Scale 3D Generation
Project: seungwoo-yoon.github.io/extend3d-page
Link: arxiv.org/abs/2603.29387
Can a diffusion model produce its own “mental average” of a concept? Diffusion Mental Averages (DMA) is a model-centric answer to this question. While prior methods aim to average image collections, they produce blurry results when applied to diffusion samples from the same prompt. These data-centric techniques operate outside the model, ignoring the generative process. In contrast, DMA averages within the diffusion model’s semantic space, as discovered by recent studies.
Paper Title: Diffusion Mental Averages
Project: diffusion-mental-averages.github.io
Link: arxiv.org/abs/2603.29239
Existing vision-and-language navigation models mainly reason over past and current observations, while largely overlooking how actions reshape future views. LatentPilot addresses this limitation by learning action-conditioned visual dynamics from future observations during training. Its latent tokens evolve across steps, serve as both output and next-step input, and enable the agent to reason about what the scene will look like after acting.
Paper Title: LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming
Project: abdd.top/latentpilot/
Link: arxiv.org/abs/2603.29165
The synthesis of immersive 3D scenes from text is rapidly maturing, driven by novel video generative models and feed-forward 3D reconstruction, with vast potential in AR/VR and world modeling. While panoramic images have proven effective for scene initialization, existing approaches suffer from a trade-off between visual fidelity and explorability: autoregressive expansion suffers from context drift, while panoramic video generation is limited to low resolution. Stepper is a unified framework for text-driven immersive 3D scene synthesis that circumvents these limitations via stepwise panoramic scene expansion.
Paper Title: Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas
Project: fwmb.github.io/stepper/
Link: arxiv.org/abs/2603.28980
Long-trajectory video generation is a crucial yet challenging task for world modeling, primarily due to the limited scalability of existing video diffusion models (VDMs). Autoregressive models, while offering infinite rollout, suffer from visual drift and poor controllability. DCARL proposes a novel divide-and-conquer, autoregressive framework that effectively combines the structural stability of the divide-and-conquer scheme with the high-fidelity generation of VDMs. The approach first employs a dedicated Keyframe Generator trained without temporal compression to establish long-range, globally consistent structural anchors. Subsequently, an Interpolation Generator synthesizes the dense frames in an autoregressive manner with overlapping segments, utilizing the keyframes for global context and a single clean preceding frame for local coherence.
Paper Title: DCARL: A Divide-and-Conquer Framework for Autoregressive Long-Trajectory
Project: junyiouy.github.io/projects/dcarl
Link: arxiv.org/abs/2603.24835
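The two-stage rollout can be pictured as a scheduling problem: sparse keyframes act as global anchors, and each dense segment reuses the last clean frame of the previous segment as a one-frame overlap. A minimal index-level sketch (pure bookkeeping; the Keyframe and Interpolation Generators themselves are the paper's trained models and are not modeled here):

```python
# Index-level sketch of a DCARL-style divide-and-conquer rollout:
# keyframes anchor the trajectory globally, dense segments are then
# filled in autoregressively with a one-frame overlap for coherence.
def segment_plan(n_keyframes, seg_len):
    """Return (anchor_a, anchor_b, start_frame) for each dense segment."""
    plan = []
    for k in range(n_keyframes - 1):
        start = k * (seg_len - 1)  # segment k starts on segment k-1's last frame
        plan.append((k, k + 1, start))
    return plan
```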
The goal of object motion path editing in videos is to alter a target object's trajectory while preserving the original scene content. Prior video editing methods primarily manipulate appearance or rely on point-track-based trajectory control, which is often challenging for users to provide at inference time, especially in videos with camera motion. TRACE instead lets users design the desired trajectory in a single anchor frame and then synthesizes a temporally consistent edited video.
Paper Title: TRACE: Object Motion Editing in Videos with First-Frame Trajectory
Project: trace-motion.github.io
Link: arxiv.org/abs/2603.25707
LGTM is the first native 4K feed-forward textured Gaussians method for high-resolution novel view synthesis. LGTM supports native 4K inputs and predicts 4K output in a single feed-forward pass. It jointly trains:
- Gaussian primitive network: predicts a compact set of Gaussian primitives.
- Texture network: processes high-resolution inputs to predict per-Gaussian RGBA texture maps.
Paper Title: Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting
Project: yxlao.github.io/lgtm/
Link: arxiv.org/abs/2603.25745
Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. ShotStream is a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation, achieving 16 FPS on a single NVIDIA GPU. By reformulating the task as next-shot generation conditioned on historical context, ShotStream allows users to dynamically instruct ongoing narratives via streaming prompts.
Paper Title: ShotStream: Streaming Multi-Shot Video Generation for Interactive
Project: luo0207.github.io/ShotStream/
Link: arxiv.org/abs/2603.25746
Calibri is a parameter-efficient method for diffusion transformer calibration. It frames DiT calibration as a black-box reward optimization problem, efficiently solved with a Covariance Matrix Adaptation Evolution Strategy (CMA-ES) that modifies just ~102 parameters. Despite its lightweight design, Calibri consistently improves generation quality across various SOTA text-to-image models and notably reduces the number of inference steps required, all while maintaining high-quality outputs.
Paper Title: Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration
Project: v-gen-ai.github.io/Calibri-page/
Link: arxiv.org/abs/2603.24800
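The calibration setup described above is gradient-free: a small parameter vector is tuned against a scalar reward. A minimal sketch of that loop, using a simplified evolution strategy (isotropic Gaussian sampling with a decaying step size) rather than full CMA-ES; the reward function here is a made-up stand-in for the paper's generation-quality reward:

```python
import random

def reward(theta):
    # Hypothetical stand-in reward: peaks at theta = [0.5, -0.2, 1.0].
    target = [0.5, -0.2, 1.0]
    return -sum((t - x) ** 2 for t, x in zip(target, theta))

def calibrate(dim=3, generations=60, pop=16, sigma=0.5, seed=0):
    rng = random.Random(seed)
    mean = [0.0] * dim
    for _ in range(generations):
        # Sample candidates around the current mean; real CMA-ES would
        # also adapt a full covariance matrix from the elite samples.
        cands = [[m + sigma * rng.gauss(0, 1) for m in mean] for _ in range(pop)]
        cands.sort(key=reward, reverse=True)
        elite = cands[: pop // 4]
        mean = [sum(col) / len(elite) for col in zip(*elite)]
        sigma *= 0.95  # simple step-size decay instead of CMA's path rules
    return mean

theta = calibrate()
```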
Adaptive Chunking! No single chunking method works best for every document in a RAG pipeline. Adaptive Chunking solves this by evaluating multiple chunking strategies against a set of intrinsic quality metrics and automatically selecting the best one for each document. Both dimensions are modular: you can plug in your own chunking methods and define your own evaluation metrics, making the framework easy to extend to new domains and use cases. Adaptive Chunking selects the best splitting strategy per document using five intrinsic quality metrics, evaluated on 33 documents across 3 domains (~1.18M tokens).
Paper Title: Adaptive Chunking: Optimizing Chunking-Method Selection for RAG
Project: github.com/ekimetrics/ada….
Link: arxiv.org/abs/2603.25333
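The selection loop above can be sketched in a few lines: run every registered strategy, score each result with every metric, keep the winner. The strategies and metrics below are illustrative assumptions, not the paper's actual five metrics or API:

```python
# Toy chunking strategies (plug in your own here).
def fixed_size(text, size=200):
    return [text[i:i + size] for i in range(0, len(text), size)]

def by_paragraph(text):
    return [p for p in text.split("\n\n") if p.strip()]

STRATEGIES = {"fixed": fixed_size, "paragraph": by_paragraph}

# Toy intrinsic quality metrics (plug in your own here).
def completeness(chunks):
    # Fraction of chunks ending at a sentence boundary.
    return sum(c.rstrip().endswith((".", "!", "?")) for c in chunks) / len(chunks)

def size_uniformity(chunks):
    # Penalize high relative spread in chunk lengths.
    lens = [len(c) for c in chunks]
    mean = sum(lens) / len(lens)
    var = sum((l - mean) ** 2 for l in lens) / len(lens)
    return 1.0 / (1.0 + var ** 0.5 / mean)

METRICS = [completeness, size_uniformity]

def adaptive_chunk(text):
    """Score every strategy with every metric; keep the best per document."""
    return max(
        ((name, strat(text)) for name, strat in STRATEGIES.items()),
        key=lambda nc: sum(m(nc[1]) for m in METRICS) / len(METRICS),
    )
```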
Infinity-RoPE: Action-Controllable Infinite Video Generation!
- Generate videos of unlimited length beyond the base model's temporal horizon
- Cinematic multi-cut transitions within a single autoregressive rollout
- Dynamic prompt changes with instant responsiveness and smooth transitions
∞-RoPE explores what already-distilled models can achieve by reparameterizing temporal RoPE and KV caching at inference time, and can be applied in a plug-and-play fashion on top of existing Self-Forcing variants to enable effectively infinite-horizon, controllable video generation.
Paper Title: Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges
Project: infinity-rope.github.io
Link: arxiv.org/abs/2511.20649
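Reparameterizing temporal RoPE at inference boils down to controlling which position each frame's rotary embedding sees. A minimal sketch of standard RoPE rotation plus a position remap; the simple wrap-around mapping is an illustrative assumption, not the paper's actual scheme:

```python
import math

def rotate(vec, pos, base=10000.0):
    """Standard RoPE: rotate each 2-D channel pair by a position-dependent angle."""
    out = []
    for i in range(0, len(vec), 2):
        theta = pos / base ** (i / len(vec))
        x, y = vec[i], vec[i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def remap_time(t, horizon):
    # Keep effective positions inside the trained temporal window by
    # folding long rollouts back into [0, horizon). Assumed mapping.
    return t % horizon
```

Because RoPE encodes position purely as rotation angles, any such remap can be applied at inference with no retraining, which is what makes plug-and-play horizon extension possible.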
Are Current Navigation Models Trustworthy Enough? Embodied navigation remains challenging due to cluttered layouts, complex semantics, and language-conditioned instructions. Recent breakthroughs in complex indoor domains require robots to interpret cluttered scenes, reason over long-horizon visual memories, and follow natural language instructions. Broadly, there are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and Object-Goal Navigation (OGN), where agents navigate to a specified target object. Existing work primarily evaluates model performance under nominal conditions. NavTrust is the first benchmark to expose embodied navigation agents to diverse RGB-Depth corruptions and instruction variations in a unified framework.
Paper Title: NavTrust: Benchmarking Trustworthiness for Embodied Navigation
Project: navtrust.github.io
Link: arxiv.org/abs/2603.19229
Do VLMs Need Vision Transformers? A controlled study of Transformer, state space, and hybrid vision backbones for frozen vision-language models. Findings:
- Under matched settings, VMamba improves localization while staying competitive on open-ended VQA, making SSMs a practical alternative to ViTs.
- Detection or segmentation pretraining generally improves VQA and localization, with the largest gains appearing in backbones that need more spatial inductive bias.
- Better classification scores and naive scaling do not consistently predict stronger downstream VLM behavior, especially for grounding-sensitive tasks.
- Some dense-objective checkpoints fail sharply in localization, but simple interface and connector adjustments recover much more robust behavior.
Paper Title: Do VLMs Need Vision Transformers? Evaluating State Space Models as
Project: lab-spell.github.io/vlm-ssm-vision…
Link: arxiv.org/abs/2603.19209
DROID-W is a robust, real-time RGB SLAM system that handles dynamic environments by leveraging differentiable uncertainty-aware Bundle Adjustment. Given a casually captured in-the-wild video, DROID-W estimates an accurate camera trajectory, scene structure, and uncertainty. Traditional SLAM methods typically assume static scenes, leading to tracking failures in the presence of motion. Recent dynamic SLAM approaches attempt to address this challenge using predefined dynamic priors or uncertainty-aware mapping, but they remain limited when confronted with unknown dynamic objects or highly cluttered scenes where geometric mapping becomes unreliable. In contrast, this method estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction even in real-world environments.
Paper Title: DROID-SLAM in the Wild
Project: moyangli00.github.io/droid-w/
Link: arxiv.org/abs/2603.19076
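The multi-view inconsistency idea above has a simple core: features of one pixel observed from several views should agree for static content, so high cross-view variance flags dynamic or unreliable pixels. A toy sketch of that signal; the plain-list feature format and variance metric are illustrative assumptions, not the paper's actual formulation:

```python
def pixel_uncertainty(feats_per_view):
    """feats_per_view: list of per-view feature vectors for one pixel."""
    n, dim = len(feats_per_view), len(feats_per_view[0])
    mean = [sum(f[d] for f in feats_per_view) / n for d in range(dim)]
    # Average squared deviation from the cross-view mean feature:
    # 0 for perfectly consistent (static) pixels, large for dynamic ones.
    return sum(
        (f[d] - mean[d]) ** 2 for f in feats_per_view for d in range(dim)
    ) / (n * dim)
```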
EdgeCrafter is a unified compact ViT framework for edge dense prediction, centered on ECDet, a detection model built from a distilled compact backbone and an edge-friendly encoder-decoder design. It first adapts a large DINOv3-pretrained ViT to object detection and uses it as a task-specialized teacher to distill rich representations into compact student backbones.
Paper Title: EdgeCrafter: Compact ViTs for Edge Dense Prediction via Task-Specialized
Project: intellindust-ai-lab.github.io/projects/EdgeC…
Link: arxiv.org/abs/2603.18739
Creating dynamic, view-consistent videos of customized subjects is highly sought after for immersive VR/AR, virtual production, and next-generation e-commerce. Despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, these approaches lack the spatial priors necessary to reconstruct 3D geometry and must rely on generating plausible but arbitrary details for unseen regions. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm, baking in a robust 3D prior without exhaustive video-based training. 3Dapter, its visual conditioning module, enhances fine-grained textures and accelerates convergence, acting as a dynamic selective router that queries view-specific geometric hints via multi-view joint attention with shared weights.
Paper Title: 3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model
Project: ko-lani.github.io/3DreamBooth
Link: arxiv.org/abs/2603.18524
Recent progress in text-conditioned human motion generation has been largely driven by diffusion models trained on large-scale human motion data. Building on this progress, recent methods attempt to transfer such models to character animation and real robot control by applying a Whole-Body Controller (WBC) that converts diffusion-generated motions into executable trajectories. PhysMoDPO is a Direct Preference Optimization framework. It integrates the WBC into the training pipeline and optimizes the diffusion model so that the WBC's output complies with both physics and the original text instructions.
Paper Title: PhysMoDPO: Physically-Plausible Humanoid Motion with Preference
Project: mael-zys.github.io/PhysMoDPO/
Link: arxiv.org/abs/2603.13228
Recently, autoregressive (AR) video diffusion models have achieved remarkable performance. However, due to their limited training durations, a train-test gap emerges when testing at longer horizons, leading to rapid visual degradation. Rolling Sink is a training-free solution that scales AR video synthesis to ultra-long durations (e.g., 5-30 minutes) at test time, with consistent subjects, stable colors, coherent structures, and smooth motions.
Paper Title: Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing
Project: rolling-sink.github.io
Link: arxiv.org/abs/2602.07775
Video: LongLive (w/o LoRA), training duration: 5s, testing duration: 1 min
Visual concept composition aims to integrate different elements from images and videos into a single, coherent visual output, but existing methods still fall short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. Bind & Compose is a one-shot method that enables flexible visual concept composition by binding visual concepts to corresponding prompt tokens and composing the target prompt with bound tokens from various sources.
Paper Title: Composing Concepts from Images and Videos via Concept-prompt Binding
Project: refkxh.github.io/BiCo_Webpage/
Link: arxiv.org/abs/2512.09824
Composed Prompt: "A beagle dog wearing a collar mixes a drink vigorously using a shaker with its dog's paws at a bar, surrounded by a cityscape visible through a large window."