
Phúc Lê

@skalskip92 Hi, thanks for checking out FP. Are you getting roughly the same number as us, also on an A100? This is for a 1024px image. We do try to make it fast in our minimal repo while keeping things simple (pure torch etc), but it's AR, so it will still be slower than SAM when predicting many objects.
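If anyone wants to compare numbers apples-to-apples, here's a minimal timing sketch, assuming only that `model(image)` is a GPU-side forward call (both are placeholders, not the FP API). CUDA events avoid counting host-side launch overhead:

```python
import torch

def time_inference(model, image, n_warmup=3, n_runs=10):
    """Average GPU latency in ms, measured with CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.inference_mode():
        for _ in range(n_warmup):   # warm up kernels / torch.compile
            model(image)
        start.record()
        for _ in range(n_runs):
            model(image)
        end.record()
    torch.cuda.synchronize()        # wait for the timed work to finish
    return start.elapsed_time(end) / n_runs
```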


I used Gemma4 + Falcon Perception from this mlx-vlm release to build a grounded reasoning agent that runs fully local on an M3. The idea: VLMs are great at reasoning but not at measuring. Falcon Perception is great at segmentation but can't reason. So you loop them: Gemma4 decides what to look for, FP segments it and returns pixel-accurate coordinates, and Gemma4 reasons over the numbers. Ask "is the blue player offside?" → it grounds the players, finds the second-to-last defender, compares centroid positions, and applies the rule. A rough sketch of the loop is below; check the video for some examples. @Prince_Canuma I can submit a PR with this demo if you want
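Roughly, the loop looks like this. It's a sketch: `vlm_reason` and `fp_segment` are hypothetical callables standing in for the Gemma4 and Falcon Perception calls, not the actual mlx-vlm API.

```python
import numpy as np

def mask_centroid(mask: np.ndarray) -> tuple[float, float]:
    """Pixel-space centroid (x, y) of a binary segmentation mask."""
    ys, xs = np.nonzero(mask)
    return float(xs.mean()), float(ys.mean())

def grounded_answer(image, question, vlm_reason, fp_segment):
    """One reason -> ground -> reason round trip.

    `vlm_reason(prompt) -> str` and `fp_segment(image, text) -> list of
    binary masks` are hypothetical stand-ins for the two models.
    """
    # 1) The VLM decides what to look for.
    targets = vlm_reason(
        f"List, one per line, the objects to segment to answer: {question}"
    ).splitlines()

    # 2) FP grounds each referring expression; keep only the numbers.
    centroids = {
        t: [mask_centroid(m) for m in fp_segment(image, t)] for t in targets
    }

    # 3) The VLM reasons over pixel-accurate coordinates, not raw pixels.
    return vlm_reason(f"Centroids: {centroids}. Now answer: {question}")
```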


Our Visual Reasoning Benchmark has been accepted to #CVPR2026. We wanted to know if VLMs can actually reason visually or if they're relying on text shortcuts. Turns out: take away the text context, and even the best models struggle hard. We built a benchmark with 19k real images across 3 levels of difficulty:
level 1: basic perception. can you complete the pattern or fix the occlusion?
level 2: single rules. think Raven's matrices but with real objects (color, count, orientation).
level 3: multi-attribute. complex rules mixing everything together.
paper: arxiv.org/abs/2512.21194
hf page: visres-bench.github.io
with the amazing team @BrigiMala @andyhuynh1111 @NarayanSanath @lkhphuc @ChaySofian @griffintaur

Happy that SigLino is a #CVPR2026 Highlight. It started as AMoE, focused purely on efficient MoE distillation (loss, data, multi-res management), and it is now a full series of Agglomerative ViTs (dense and MoE, from 30M to 0.6B params) distilled from SigLIP2 and DINOv3. We used the AMoE variant to initialize the vision experts of an early-fusion grounding MoE with modality-specific experts, and showed it is a strong baseline in the small-scale training-data regime on the RefCOCO benchmarks. Later we found that full early fusion with a dense model works just as well, and even better, which led to Falcon Perception.
Models: huggingface.co/collections/ti…
Paper: arxiv.org/abs/2512.20157
Code: github.com/tiiuae/siglino…
x.com/dahou_yasser/s…
With @NarayanSanath @dahou_yasser @lkhphuc @griffintaur @HildeKuehne @hhacid






Falcon-Perception can now be installed with `pip install falcon-perception`
- 1 model file
- 2 variants: Perception + OCR
- Paged / batched inference engine with KV cache
- torch.compile + CUDA graphs
- Upsampler with async cache for high-res masks
Plus: MLX batch inference support 🧵:
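For readers unfamiliar with the term, this is roughly what a paged KV cache buys you: keys/values live in fixed-size pages allocated on demand, so batched AR decoding doesn't pre-reserve max-length memory per sequence. A toy conceptual sketch of the technique, not the code in the repo:

```python
import torch

class PagedKVCache:
    """Toy paged KV cache: pages are handed out as sequences grow, so a
    batch of variable-length sequences shares one flat pool."""

    def __init__(self, n_pages, page_size, n_heads, head_dim, device="cpu"):
        self.page_size = page_size
        self.k = torch.zeros(n_pages, page_size, n_heads, head_dim, device=device)
        self.v = torch.zeros_like(self.k)
        self.free_pages = list(range(n_pages))
        self.page_tables = {}  # seq_id -> ordered list of page ids

    def append(self, seq_id, k, v, pos):
        """Store the (n_heads, head_dim) k/v for token `pos` of `seq_id`."""
        page_idx, slot = divmod(pos, self.page_size)
        table = self.page_tables.setdefault(seq_id, [])
        if page_idx == len(table):        # sequence grew past its pages
            table.append(self.free_pages.pop())
        pid = table[page_idx]
        self.k[pid, slot] = k
        self.v[pid, slot] = v

    def gather(self, seq_id, length):
        """Materialize the first `length` cached tokens for attention."""
        pages = self.page_tables[seq_id]
        k = torch.cat([self.k[p] for p in pages])[:length]
        v = torch.cat([self.v[p] for p in pages])[:length]
        return k, v
```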


We are releasing Falcon Perception, an open-vocabulary referring-expression segmentation model, along with a 0.3B OCR model that is on par with 3-10x larger competitors. Current systems solve this task with complex pipelines (separate encoders, late fusion, matching algorithms). We developed a novel, simpler "bitter-lesson" approach: one early-fusion Transformer (image + text fused from the first layer) with a shared parameter space, letting scale + training signal do the work. A minimal sketch of the idea is below. Please check out our work!
📄 Paper: arxiv.org/pdf/2603.27365
💻 Code: github.com/tiiuae/falcon-…
🎮 Playground: vision.falcon.aidrc.tii.ae
🤗 Blogpost: huggingface.co/blog/tiiuae/fa…
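To make "early fusion" concrete, here is a minimal sketch of the idea with illustrative sizes and layer choices (not the Falcon Perception architecture): image patch tokens and text tokens share one Transformer from the first layer on.

```python
import torch
import torch.nn as nn

class EarlyFusionBackbone(nn.Module):
    """Minimal early-fusion sketch: patchify the image, embed the text,
    concatenate, and run everything through one shared Transformer."""

    def __init__(self, dim=512, depth=6, patch=16, vocab=32000):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.text_embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, pixels, token_ids):
        img = self.patch_embed(pixels).flatten(2).transpose(1, 2)  # (B, N_img, dim)
        txt = self.text_embed(token_ids)                           # (B, N_txt, dim)
        # Both modalities attend to each other from layer 1 onward,
        # with a single shared parameter space.
        return self.blocks(torch.cat([img, txt], dim=1))
```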



