Sofian Chaybouti

17 posts

@ChaySofian

PhD candidate at the University of Tübingen | RS intern at TII | Prev. at Huawei Noah's Ark Lab, Paris | MVA and ENSTA Paris

Abu Dhabi, UAE · Joined March 2023
583 Following · 72 Followers
Pinned Tweet
Sofian Chaybouti retweeted
Google Gemma@googlegemma·
Check out this amazing combo using Gemma4 + Falcon Perception for video tracking!
1️⃣ Give Gemma 4 video frames
2️⃣ It describes what it sees
3️⃣ Falcon Perception takes those descriptions, segments the objects, and tracks them across the video!
The best part? All running locally!
24 replies · 143 reposts · 1.2K likes · 82.6K views
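The three-step combo above, as a minimal Python sketch. Neither the Gemma 4 nor the Falcon Perception API is shown in the thread, so the `describe` and `segment_and_track` callables below are hypothetical placeholders; only the control flow (sample frames, describe, then segment and track) comes from the tweet.

```python
# Minimal sketch of the describe-then-track loop, assuming two hypothetical
# callables: `describe` (a VLM such as Gemma 4) and `segment_and_track`
# (an open-vocabulary segmenter/tracker such as Falcon Perception).
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class TrackedObject:
    track_id: int
    description: str          # free-form text produced by the VLM
    masks: Sequence[object]   # one segmentation mask per frame


def describe_then_track(
    frames: Sequence[object],
    describe: Callable[[Sequence[object]], list[str]],
    segment_and_track: Callable[[Sequence[object], list[str]], list[TrackedObject]],
    sample_every: int = 30,
) -> list[TrackedObject]:
    """1) sample frames, 2) ask the VLM what it sees, 3) segment & track."""
    sampled = list(frames[::sample_every])          # step 1: give the VLM a few frames
    descriptions = describe(sampled)                # step 2: "what do you see?"
    return segment_and_track(frames, descriptions)  # step 3: segment + track across the video
```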
Sofian Chaybouti retweeted
Yasser Dahou@dahou_yasser·
okay, another demo of Gemma4 + Falcon Perception for automated video segmentation & tracking, no human prompts needed

the idea: you feed Gemma4 a few sampled frames and ask it to describe what it sees. those descriptions get passed to Falcon Perception, which segments and tracks them across the full video (using ByteTrack github.com/FoundationVisi…)

you can steer what Gemma4 focuses on with different prompt levels:
describe by visible text or brand -> dog with number 2 bib
describe by spatial position -> horse on the right, horse in center, horse on the left
describe by relationships -> rhinoceros walking with zebra

same pipeline, different instructions -> different segmentation results. zero human labeling from raw video to tracked output. all local on M3 using mlx-vlm @Prince_Canuma @MaziyarPanahi

Work done by @NarayanSanath

Check our Falcon Perception repo: github.com/tiiuae/Falcon-…
9 replies · 32 reposts · 347 likes · 51.3K views
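The "prompt levels" in the demo above come down to changing the instruction given to the describing VLM while keeping the rest of the pipeline fixed. A small sketch, assuming hypothetical prompt wordings; only the three level names and the example outputs are from the tweet.

```python
# Sketch of prompt-level steering: same pipeline, different instruction ->
# different descriptions -> different segmentation targets.
# The exact wording of each prompt is an assumption.
PROMPT_LEVELS = {
    "text_or_brand": (
        "Describe each object by any visible text or branding, "
        "e.g. 'dog with number 2 bib'."
    ),
    "spatial_position": (
        "Describe each object by its position in the frame, "
        "e.g. 'horse on the right', 'horse in center', 'horse on the left'."
    ),
    "relationships": (
        "Describe each object by its relationship to other objects, "
        "e.g. 'rhinoceros walking with zebra'."
    ),
}


def build_prompt(level: str) -> str:
    """Build the instruction handed to the describing VLM for one prompt level."""
    return f"Look at these video frames. {PROMPT_LEVELS[level]} One description per object."
```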
Sofian Chaybouti retweeted
Yasser Dahou@dahou_yasser·
tested Meta's Muse Spark @AIatMeta on level-1 tasks from our visres-bench (#CVPR2026)

what it does is impressive, it doesn't just pick an answer. it crops the region, zooms into each boundary, traces edge continuity, checks lighting gradients. actual visual chain-of-thought. and tbh the reasoning is spot on. it identifies exactly the right cues to look at, but it still gets some wrong.

and that's the interesting part, the failure isn't a reasoning failure. it's a perception issue imo. the model knows what to look for, it just can't resolve the fine-grained visual signal precisely enough to land on the right patch in all cases

like the gap isn't "can it think about images", it clearly can. the gap is low-level spatial precision. and that's kinda an easier problem to solve ... maybe

full traces here: YasserdahouML.github.io/visres-Bench
Yasser Dahou@dahou_yasser

Our Visual Reasoning Benchmark has been accepted to #CVPR2026

We wanted to know if VLMs can actually reason visually or if they're relying on text shortcuts. well -> take away the text context, and even the best models struggle hard

we built a benchmark with 19k real images across 3 levels of difficulty.
level 1: basic perception. can you complete the pattern or fix the occlusion?
level 2: single rules. think raven's matrices but with real objects (color, count, orientation).
level 3: multi-attribute. complex rules mixing everything together.

paper: arxiv.org/abs/2512.21194
hf page: visres-bench.github.io

with the amazing team @BrigiMala @andyhuynh1111 @NarayanSanath @lkhphuc @ChaySofian @griffintaur

0 replies · 3 reposts · 22 likes · 2.6K views
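The crop-and-zoom behaviour described above ("crops the region, zooms into each boundary") can be illustrated with a small helper. This is not Muse Spark's actual tool interface, just the kind of operation the trace suggests, assuming Pillow is available.

```python
# Illustrative crop-and-zoom helper, not Muse Spark's real tooling.
from PIL import Image


def zoom_into_region(image: Image.Image, box: tuple[int, int, int, int],
                     scale: int = 4) -> Image.Image:
    """Crop `box` (left, upper, right, lower) and upscale it `scale`x so
    fine-grained cues (edge continuity, lighting gradients) are easier to resolve."""
    region = image.crop(box)
    return region.resize((region.width * scale, region.height * scale),
                         Image.Resampling.LANCZOS)
```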
Sofian Chaybouti@ChaySofian·
Happy that SigLino is a #CVPR2026 Highlight.

It started as AMoE, focused purely on efficient MoE distillation (loss, data, multi-res management), and it is now a full series of Agglomerative ViTs (dense and MoE, from 30M to 0.6B params) distilled from SigLIP2 and DINOv3.

We used the AMoE variant to initialize the vision experts of an early-fusion grounding MoE with modality-specific experts and showed that it is a strong baseline in the small-scale training-data regime on the RefCOCO benchmarks. Later we found that full early fusion with a dense model works just as well, and even better, which led to Falcon Perception.

Models: huggingface.co/collections/ti…
Paper: arxiv.org/abs/2512.20157
Code: github.com/tiiuae/siglino…
x.com/dahou_yasser/s…

With @NarayanSanath @dahou_yasser @lkhphuc @griffintaur @HildeKuehne @hhacid
4 replies · 47 reposts · 278 likes · 12.7K views
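A hedged sketch of the multi-teacher distillation the tweet above refers to (one student matching SigLIP2 and DINOv3 features). The actual AMoE/SigLino loss may differ, since the tweet only mentions "loss, data, multi-res management"; this is a generic cosine-matching formulation, and the per-teacher projection heads are an assumption.

```python
# Generic multi-teacher feature-distillation loss, not the paper's exact loss.
import torch
import torch.nn.functional as F


def multi_teacher_distill_loss(
    student_feats: torch.Tensor,             # (B, N, D_s) patch tokens from the student ViT
    teacher_feats: dict[str, torch.Tensor],  # e.g. {"siglip2": (B, N, D1), "dinov3": (B, N, D2)}
    heads: dict[str, torch.nn.Module],       # one projection head per teacher (assumption)
) -> torch.Tensor:
    loss = student_feats.new_zeros(())
    for name, target in teacher_feats.items():
        pred = heads[name](student_feats)  # project student tokens into this teacher's space
        loss = loss + (1 - F.cosine_similarity(pred, target, dim=-1)).mean()
    return loss / len(teacher_feats)
```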
Sofian Chaybouti retweeted
Yasser Dahou@dahou_yasser·
haha fair questions, FP does open vocab + referring expressions, so the prompts for the agent are more flexible than SAM3's prompting. it can pass things like "the player on the right" or "the sign with whatever written on it" and FP handles it, fewer tool calls overall ... check the paper and PBench please arxiv.org/pdf/2603.27365, table 7, where SAM3 is restricted to levels 0 and 1 whereas FP can go up to level 4 (Relationships & interactions); check table 1 for the level definitions
Yasser Dahou tweet media
2 replies · 3 reposts · 7 likes · 341 views
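A toy illustration of the point above about referring expressions versus class-only prompting: one free-form call can name the exact instance, whereas a class-only detector needs extra post-processing (hence fewer tool calls with FP). The `segment` and `detect` callables are hypothetical, not APIs from the paper.

```python
# Toy comparison of referring-expression vs class-only prompting; both
# callables are hypothetical stand-ins, not real model interfaces.
from typing import Callable

Mask = object  # stand-in for a segmentation-mask type


def find_player_on_the_right(segment: Callable[[str], list[Mask]]) -> list[Mask]:
    # Referring-expression prompting: a single call resolves the instance.
    return segment("the player on the right")


def find_player_on_the_right_class_only(
    detect: Callable[[str], list[tuple[Mask, tuple[int, int, int, int]]]],
) -> list[Mask]:
    # Class-only prompting: detect every "person", then pick the right-most
    # box yourself, an extra post-processing step on top of the tool call.
    people = detect("person")
    if not people:
        return []
    rightmost = max(people, key=lambda item: item[1][2])  # largest right-edge x
    return [rightmost[0]]
```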
Sofian Chaybouti retweeted
Maziyar PANAHI@MaziyarPanahi·
I showed you SAM 3 all week. This is a 0.6B model that outperforms it. Falcon Perception. Type "detect the plane" and it segments every plane in the frame. Pixel-accurate masks from natural language. Fighter jets. Fire. Crowds. All on a MacBook via MLX. No cloud.
18 replies · 79 reposts · 892 likes · 62.7K views
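A generic helper for working with the kind of output described above (per-object masks for a text query like "detect the plane"). How Falcon Perception actually returns masks is not shown in the tweet, so boolean HxW NumPy arrays are an assumption.

```python
# Generic mask-overlay helper; the mask format (boolean HxW arrays) is assumed.
import numpy as np


def overlay_masks(frame: np.ndarray, masks: list[np.ndarray],
                  color: tuple[int, int, int] = (255, 0, 0),
                  alpha: float = 0.5) -> np.ndarray:
    """Blend each boolean HxW mask onto an HxWx3 uint8 frame."""
    out = frame.astype(np.float32).copy()
    tint = np.array(color, dtype=np.float32)
    for mask in masks:
        out[mask] = (1 - alpha) * out[mask] + alpha * tint
    return out.clip(0, 255).astype(np.uint8)
```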
Sofian Chaybouti retweeted
Prince Canuma@Prince_Canuma·
mlx-vlm v0.4.4 is out 🚀🔥

New models:
🦅 Falcon-Perception 300M by @TIIuae

Highlights:
⚡️ TurboQuant Metal kernels optimized — up to 1.90x decode speedup over baseline on longer context with 89% KV cache savings.
👀 VisionFeatureCache — multi-turn image caching so you don't re-encode the same image every turn.
🔧 Gemma 4 fixes — chunked prefill for KV-shared models & thinking, vision + text degradation, processor config, and nested tool parsing
📹 Video CLI fixes

Get started today:
> uv pip install -U mlx-vlm

Shoutout to the awesome @N8Programs for helping me spot and fix some critical yet subtle issues on Gemma 4 ❤️

Happy Easter everyone 🐣 and remember to leave us a star ⭐️ github.com/Blaizzy/mlx-vlm
Prince Canuma tweet media
16 replies · 41 reposts · 370 likes · 87.3K views
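For reference, the usual mlx-vlm Python entry points look roughly like this. `load`, `generate`, `apply_chat_template`, and `load_config` are the library's documented helpers, but argument names shift between releases, and the Falcon-Perception model id below is a placeholder, so treat this as a sketch rather than copy-paste.

```python
# Rough mlx-vlm usage sketch; the model id is a placeholder and signatures
# may differ by version (install with: uv pip install -U mlx-vlm).
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "tiiuae/Falcon-Perception-300M"  # placeholder id, not verified
model, processor = load(model_path)
config = load_config(model_path)

image = ["frame.jpg"]
prompt = "Describe what you see."

formatted = apply_chat_template(processor, config, prompt, num_images=len(image))
output = generate(model, processor, formatted, image, verbose=False)
print(output)
```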
Sofian Chaybouti retweeted
Prince Canuma@Prince_Canuma·
mlx-vlm v0.4.3 is here 🚀

Day-0 support:
🔥 Gemma 4 (vision, audio, MoE) by @GoogleDeepMind
🦅 Falcon-OCR + Falcon Perception by @TIIuae
🪨 Granite Vision 4.0 by @IBMResearch

New models:
🎯 SAM 3.1 with Object Multiplex by @facebook
🔍 RF-DETR detection & segmentation by @roboflow

Infra:
⚡ TurboQuant (KV cache compression)
🖥️ CUDA support for vision models (SAM and RF-DETR)

Get started today:
> uv pip install -U mlx-vlm

Leave us a star ⭐️ github.com/Blaizzy/mlx-vlm
Prince Canuma tweet media
77 replies · 192 reposts · 2K likes · 999.9K views
Sofian Chaybouti retweeted
Yasser Dahou@dahou_yasser·
People can find all model variants here huggingface.co/collections/ti… We’ve added dense variants alongside the Agglomerative MoE models.
Yasser Dahou tweet media
Yasser Dahou@dahou_yasser

Happy to share that our paper AMoE is accepted at #CVPR2026! we distill SigLIP2 and DINOv3 into a single MoE student.

📄 Paper: arxiv.org/pdf/2512.20157
🤗 Models: huggingface.co/tiiuae/amoe
💻 Code: github.com/tiiuae/amoe

with the amazing team @ChaySofian @lkhphuc @griffintaur @HildeKuehne @NarayanSanath

0 replies · 4 reposts · 12 likes · 1.6K views
Sofian Chaybouti retweeted
Yasser Dahou@dahou_yasser·
Our Visual Reasoning Benchmark has been accepted to #CVPR2026

We wanted to know if VLMs can actually reason visually or if they're relying on text shortcuts. well -> take away the text context, and even the best models struggle hard

we built a benchmark with 19k real images across 3 levels of difficulty.
level 1: basic perception. can you complete the pattern or fix the occlusion?
level 2: single rules. think raven's matrices but with real objects (color, count, orientation).
level 3: multi-attribute. complex rules mixing everything together.

paper: arxiv.org/abs/2512.21194
hf page: visres-bench.github.io

with the amazing team @BrigiMala @andyhuynh1111 @NarayanSanath @lkhphuc @ChaySofian @griffintaur
Yasser Dahou tweet media
2 replies · 10 reposts · 61 likes · 7.4K views
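A minimal sketch of per-level scoring for a 3-level benchmark like the one above. The record fields ("level", "answer", "prediction") are illustrative, not the benchmark's actual schema.

```python
# Per-level accuracy aggregation for a benchmark with difficulty levels 1-3.
# The record schema here is an assumption made for illustration.
from collections import defaultdict


def accuracy_by_level(records: list[dict]) -> dict[int, float]:
    correct: dict[int, int] = defaultdict(int)
    total: dict[int, int] = defaultdict(int)
    for r in records:
        total[r["level"]] += 1
        correct[r["level"]] += int(r["prediction"] == r["answer"])
    return {level: correct[level] / total[level] for level in sorted(total)}


# example: level 1 = basic perception, level 2 = single rules, level 3 = multi-attribute
print(accuracy_by_level([
    {"level": 1, "answer": "A", "prediction": "A"},
    {"level": 1, "answer": "B", "prediction": "C"},
    {"level": 3, "answer": "D", "prediction": "D"},
]))
```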