
Phúc Lê

@skalskip92 Hi, thanks for checking out FP. Are you getting roughly the same number as us, also on an A100? This is for a 1024px image. We do try to make it fast in our minimal repo while keeping things simple (pure torch etc), but it's AR, so it will still be slower than SAM when predicting many objects.
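If anyone wants to compare numbers apples-to-apples, here's a minimal timing sketch, assuming only that `model(image)` is a GPU-side forward call (both are placeholders, not the FP API). CUDA events avoid counting host-side launch overhead:

```python
import torch

def time_inference(model, image, n_warmup=3, n_runs=10):
    """Average GPU latency in ms, measured with CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.inference_mode():
        for _ in range(n_warmup):   # warm up kernels / torch.compile
            model(image)
        start.record()
        for _ in range(n_runs):
            model(image)
        end.record()
    torch.cuda.synchronize()        # wait for the timed work to finish
    return start.elapsed_time(end) / n_runs
```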


I used Gemma4 + Falcon Perception from this mlx-vlm release to build a grounded reasoning agent that runs fully local on an M3. The idea: VLMs are great at reasoning but not at measuring. Falcon Perception is great at segmentation but can't reason. So you loop them: Gemma4 decides what to look for, FP segments it and returns pixel-accurate coordinates, and Gemma4 reasons over the numbers. Ask "is the blue player offside?" → it grounds the players, finds the second-to-last defender, compares centroid positions, and applies the rule. A rough sketch of the loop is below; check the video for some examples. @Prince_Canuma I can submit a PR with this demo if you want
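Roughly, the loop looks like this. It's a sketch: `vlm_reason` and `fp_segment` are hypothetical callables standing in for the Gemma4 and Falcon Perception calls, not the actual mlx-vlm API.

```python
import numpy as np

def mask_centroid(mask: np.ndarray) -> tuple[float, float]:
    """Pixel-space centroid (x, y) of a binary segmentation mask."""
    ys, xs = np.nonzero(mask)
    return float(xs.mean()), float(ys.mean())

def grounded_answer(image, question, vlm_reason, fp_segment):
    """One reason -> ground -> reason round trip.

    `vlm_reason(prompt) -> str` and `fp_segment(image, text) -> list of
    binary masks` are hypothetical stand-ins for the two models.
    """
    # 1) The VLM decides what to look for.
    targets = vlm_reason(
        f"List, one per line, the objects to segment to answer: {question}"
    ).splitlines()

    # 2) FP grounds each referring expression; keep only the numbers.
    centroids = {
        t: [mask_centroid(m) for m in fp_segment(image, t)] for t in targets
    }

    # 3) The VLM reasons over pixel-accurate coordinates, not raw pixels.
    return vlm_reason(f"Centroids: {centroids}. Now answer: {question}")
```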


Our Visual Reasoning Benchmark has been accepted to #CVPR2026. We wanted to know if VLMs can actually reason visually or if they're relying on text shortcuts. Turns out: take away the text context, and even the best models struggle hard. We built a benchmark with 19k real images across 3 levels of difficulty:
level 1: basic perception. can you complete the pattern or fix the occlusion?
level 2: single rules. think Raven's matrices but with real objects (color, count, orientation).
level 3: multi-attribute. complex rules mixing everything together.
paper: arxiv.org/abs/2512.21194
hf page: visres-bench.github.io
with the amazing team @BrigiMala @andyhuynh1111 @NarayanSanath @lkhphuc @ChaySofian @griffintaur

Happy that SigLino is a #CVPR2026 Highlight. It started as AMoE, focused purely on efficient MoE distillation (loss, data, multi-res management), and it is now a full series of Agglomerative ViTs (dense and MoE, from 30M to 0.6B params) distilled from SigLIP2 and DINOv3. We used the AMoE variant to initialize the vision experts of an early-fusion grounding MoE with modality-specific experts, and showed it is a strong baseline in the small-scale training-data regime on the RefCOCO benchmarks. Later we found that full early fusion with a dense model works just as well, and even better, which led to Falcon Perception.
Models: huggingface.co/collections/ti…
Paper: arxiv.org/abs/2512.20157
Code: github.com/tiiuae/siglino…
x.com/dahou_yasser/s…
With @NarayanSanath @dahou_yasser @lkhphuc @griffintaur @HildeKuehne @hhacid






Falcon-Perception can now be installed with `pip install falcon-perception`
- 1 model file
- 2 variants: Perception + OCR
- Paged / batched inference engine with KV cache
- torch.compile + CUDA graphs
- Upsampler with async cache for high-res masks
Plus: MLX batch inference support 🧵:
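For readers unfamiliar with the term, this is roughly what a paged KV cache buys you: keys/values live in fixed-size pages allocated on demand, so batched AR decoding doesn't pre-reserve max-length memory per sequence. A toy conceptual sketch of the technique, not the code in the repo:

```python
import torch

class PagedKVCache:
    """Toy paged KV cache: pages are handed out as sequences grow, so a
    batch of variable-length sequences shares one flat pool."""

    def __init__(self, n_pages, page_size, n_heads, head_dim, device="cpu"):
        self.page_size = page_size
        self.k = torch.zeros(n_pages, page_size, n_heads, head_dim, device=device)
        self.v = torch.zeros_like(self.k)
        self.free_pages = list(range(n_pages))
        self.page_tables = {}  # seq_id -> ordered list of page ids

    def append(self, seq_id, k, v, pos):
        """Store the (n_heads, head_dim) k/v for token `pos` of `seq_id`."""
        page_idx, slot = divmod(pos, self.page_size)
        table = self.page_tables.setdefault(seq_id, [])
        if page_idx == len(table):        # sequence grew past its pages
            table.append(self.free_pages.pop())
        pid = table[page_idx]
        self.k[pid, slot] = k
        self.v[pid, slot] = v

    def gather(self, seq_id, length):
        """Materialize the first `length` cached tokens for attention."""
        pages = self.page_tables[seq_id]
        k = torch.cat([self.k[p] for p in pages])[:length]
        v = torch.cat([self.v[p] for p in pages])[:length]
        return k, v
```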


We are releasing Falcon Perception, an open-vocabulary referring-expression segmentation model, along with a 0.3B OCR model that is on par with 3-10x larger competitors. Current systems solve this task with complex pipelines (separate encoders, late fusion, matching algorithms). We developed a novel, simpler "bitter-lesson" approach: one early-fusion Transformer (image + text fused from the first layer) with a shared parameter space, letting scale + training signal do the work. A minimal sketch of the idea is below. Please check out our work!
📄 Paper: arxiv.org/pdf/2603.27365
💻 Code: github.com/tiiuae/falcon-…
🎮 Playground: vision.falcon.aidrc.tii.ae
🤗 Blogpost: huggingface.co/blog/tiiuae/fa…
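To make "early fusion" concrete, here is a minimal sketch of the idea with illustrative sizes and layer choices (not the Falcon Perception architecture): image patch tokens and text tokens share one Transformer from the first layer on.

```python
import torch
import torch.nn as nn

class EarlyFusionBackbone(nn.Module):
    """Minimal early-fusion sketch: patchify the image, embed the text,
    concatenate, and run everything through one shared Transformer."""

    def __init__(self, dim=512, depth=6, patch=16, vocab=32000):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.text_embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, pixels, token_ids):
        img = self.patch_embed(pixels).flatten(2).transpose(1, 2)  # (B, N_img, dim)
        txt = self.text_embed(token_ids)                           # (B, N_txt, dim)
        # Both modalities attend to each other from layer 1 onward,
        # with a single shared parameter space.
        return self.blocks(torch.cat([img, txt], dim=1))
```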



