Phúc Lê
@lkhphuc
336 posts
Joined December 2014
575 Following · 343 Followers
Phúc Lê reposted
Maziyar PANAHI @MaziyarPanahi
Turned real footage into a Call of Duty HUD. 3 scenes. 1 open-vocab stack. 0 training runs. Type the thing, the HUD finds the thing: green containers, cargo near the water, incoming aircraft. No GPU. Just a MacBook. What game should the HUD cosplay next?
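For anyone wondering how the "type the thing, the HUD finds the thing" part fits together, here is a minimal sketch of the overlay loop. The `detect(frame, prompt)` callable is a hypothetical stand-in for whatever open-vocab detector the stack uses; only the OpenCV calls are real.

```python
# Minimal sketch of an open-vocabulary "HUD" overlay on video.
# `detect(frame, prompt)` is a hypothetical wrapper around an open-vocab
# detector and is assumed to return (x1, y1, x2, y2, label) tuples.
import cv2

def draw_hud(frame, detections, color=(0, 255, 0)):
    """Draw game-style boxes and labels on a BGR frame."""
    for x1, y1, x2, y2, label in detections:
        cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), color, 2)
        cv2.putText(frame, label, (int(x1), int(y1) - 6),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 1, cv2.LINE_AA)
    return frame

def run(video_path, prompts, detect):
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        dets = [d for p in prompts for d in detect(frame, p)]
        cv2.imshow("hud", draw_hud(frame, dets))
        if cv2.waitKey(1) == 27:  # Esc to quit
            break
    cap.release()
    cv2.destroyAllWindows()
```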
SkalskiP @skalskip92
is it just me, or is Falcon Perception pretty slow? even on an A100
Phúc Lê @lkhphuc
@skalskip92 colab.research.google.com/drive/1Jy6lRYu… here is our walkthrough notebook
SkalskiP @skalskip92
@lkhphuc mine is a lot slower... can you take a look at my notebook?
Phúc Lê reposted
Yasser Dahou @dahou_yasser
okay another demo of Gemma4 + Falcon Perception for automated video segmentation & tracking, no human prompts needed

the idea: you feed Gemma4 a few sampled frames and ask it to describe what it sees. those descriptions get passed to Falcon Perception, which segments and tracks them across the full video (using ByteTrack github.com/FoundationVisi…)

you can steer what Gemma4 focuses on with different prompt levels:
describe by visible text or brand -> dog with number 2 bib
describe by spatial position -> horse on the right, horse in center, horse on the left
describe by relationships -> rhinoceros walking with zebra

same pipeline, different instructions -> different segmentation results. zero human labeling from raw video to tracked output. all local on M3 using mlx-vlm @Prince_Canuma @MaziyarPanahi

Work done by @NarayanSanath
Check our Falcon Perception repo: github.com/tiiuae/Falcon-…
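Roughly, the describe-then-ground loop in that demo can be sketched as below. `vlm_describe` and `ground` are hypothetical callables standing in for Gemma4 (via mlx-vlm) and Falcon Perception, and the cross-frame association step (ByteTrack in the demo) is left out; only the pipeline shape is shown.

```python
# Sketch of the zero-prompt auto-labeling loop: a VLM names what appears in a
# few sampled frames, then an open-vocab grounder localizes those names on
# every frame. `vlm_describe` and `ground` are hypothetical wrappers, and the
# tracker that links detections across frames (ByteTrack) is omitted.
import cv2

def sample_frames(video_path, n=4):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * max(total // n, 1))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

def auto_track(video_path, vlm_describe, ground, instruction):
    # 1) Ask the VLM what to look for, steered by the prompt level
    #    ("describe by visible text", "describe by spatial position", ...).
    prompts = vlm_describe(sample_frames(video_path), instruction)
    # 2) Ground every description on every frame; a tracker would then link
    #    the per-frame boxes/masks into identities.
    per_prompt = {p: [] for p in prompts}
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        for p in prompts:
            per_prompt[p].append((idx, ground(frame, p)))
        idx += 1
    cap.release()
    return per_prompt
```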
Phúc Lê reposted
Maziyar PANAHI @MaziyarPanahi
Gemma 4 analyzes the video. Generates key questions. Calls Falcon Perception. "Find all the people." 156 found. "Detect only white cars." 8 found. A 26B model is running agentic multi-QA vision orchestration. The models are running locally on a MacBook with MLX. No API.
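The loop behind that workflow is short enough to sketch. `ask_vlm` and `detect` below are hypothetical wrappers for illustration, not the real mlx-vlm or Falcon Perception APIs; the point is the orchestration: the VLM writes the questions, the grounding model answers them with boxes, and the counts come back as text.

```python
# Sketch of an agentic multi-QA vision loop (hypothetical wrappers):
# a VLM proposes detection questions about the video, a grounding model
# answers each one with boxes, and the counts are summarized at the end.
def agentic_video_qa(frames, ask_vlm, detect):
    # 1) Let the VLM write its own questions, e.g. "Find all the people."
    #    or "Detect only white cars."
    questions = ask_vlm(
        frames, "List the key detection questions for this scene, one per line."
    )
    answers = {}
    for q in questions.splitlines():
        q = q.strip()
        if not q:
            continue
        boxes = [b for f in frames for b in detect(f, q)]
        answers[q] = len(boxes)
    # 2) Hand the structured counts back to the VLM for a final summary.
    summary = ask_vlm(frames, f"Summarize these detection results: {answers}")
    return answers, summary
```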
Phúc Lê reposted
Yasser Dahou @dahou_yasser
tested Meta's Muse Spark @AIatMeta on level-1 tasks from our visres-bench (#CVPR2026)

what it does is impressive. it doesn't just pick an answer: it crops the region, zooms into each boundary, traces edge continuity, checks lighting gradients. actual visual chain-of-thought.

and tbh the reasoning is spot on. it identifies exactly the right cues to look at, but it still gets some wrong. and that's the interesting part: the failure isn't a reasoning failure, it's a perception issue imo. the model knows what to look for, it just can't resolve the fine-grained visual signal precisely enough to land on the right patch in all cases.

like the gap isn't "can it think about images", it clearly can. the gap is low-level spatial precision. and that's kinda an easier problem to solve ... maybe

full traces here: YasserdahouML.github.io/visres-Bench
Yasser Dahou@dahou_yasser

Our Visual Reasoning Benchmark has been accepted to #CVPR2026. We wanted to know if VLMs can actually reason visually or if they're relying on text shortcuts. well -> take away the text context, and even the best models struggle hard.

we built a benchmark with 19k real images across 3 levels of difficulty.
level 1: basic perception. can you complete the pattern or fix the occlusion?
level 2: single rules. think raven's matrices but with real objects (color, count, orientation).
level 3: multi-attribute. complex rules mixing everything together.

paper: arxiv.org/abs/2512.21194
hf page: visres-bench.github.io

with the amazing team @BrigiMala @andyhuynh1111 @NarayanSanath @lkhphuc @ChaySofian @griffintaur

Phúc Lê @lkhphuc
@rohanpaul_ai x.com/chaysofian/sta… check out our tiny encoder distilled from SigLIP and DINO directly
Sofian Chaybouti@ChaySofian

Happy that SigLino is a #CVPR2026 Highlight. It started as AMoE, focused purely on efficient MoE distillation (loss, data, multi-res management), and it is now a full series of Agglomerative ViTs (dense and MoE, from 30M to 0.6B params) distilled from SigLIP2 and DINOv3.

We used the AMoE variant to initialize the vision experts of an early-fusion grounding MoE with modality-specific experts and show that it is a strong baseline in the small-scale training data regime on the RefCOCO benchmarks. Later, we figured out that full early-fusion with a dense model works well, and even better, which led to Falcon Perception.

Models: huggingface.co/collections/ti…
Paper: arxiv.org/abs/2512.20157
Code: github.com/tiiuae/siglino…
x.com/dahou_yasser/s…

With @NarayanSanath @dahou_yasser @lkhphuc @griffintaur @HildeKuehne @hhacid

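For the gist of the "agglomerative" distillation idea in code: the student gets one projection head per teacher and is trained to match each frozen teacher's features on the same images. This is a generic sketch under the assumption that the student and teacher token grids are aligned (in practice this needs resizing or interpolation); it is not the SigLino training code.

```python
# Generic sketch of agglomerative feature distillation: one small student,
# several frozen teachers (e.g. a SigLIP-style and a DINO-style encoder),
# one projection head per teacher. Not the actual SigLino recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgglomerativeStudent(nn.Module):
    def __init__(self, backbone: nn.Module, student_dim: int, teacher_dims: dict):
        super().__init__()
        self.backbone = backbone  # small ViT producing [B, N, student_dim] tokens
        self.heads = nn.ModuleDict(
            {name: nn.Linear(student_dim, d) for name, d in teacher_dims.items()}
        )

    def forward(self, images):
        feats = self.backbone(images)
        return {name: head(feats) for name, head in self.heads.items()}

def distill_step(student, teachers, images, optimizer):
    """One step matching the student's projected features to every frozen teacher."""
    preds = student(images)
    loss = 0.0
    for name, teacher in teachers.items():
        with torch.no_grad():
            target = teacher(images)  # [B, N, teacher_dim], assumed token-aligned
        # cosine distance per token; L2 on normalized features is another common choice
        loss = loss + (1 - F.cosine_similarity(preds[name], target, dim=-1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```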
Rohan Paul @rohanpaul_ai
Meta just gave a training recipe that says small models should not learn from many experts directly. They should first learn from one unified teacher that already blended the experts together.

Introduced EUPE, or Efficient Universal Perception Encoder, a vision encoder that packs image recognition, dense prediction, and VLM-ready features into one 86M-parameter model without giving up specialist-level accuracy.

Before EUPE, you usually had to choose between a model that was good at understanding whole images, a model that was good at pixel-level tasks like segmentation and depth, or a model that worked well with language in VLM systems. That is a real problem on phones, glasses, and other edge devices, because running 2 or 3 separate encoders costs too much memory, power, and latency.

People already tried to merge several expert models into one small model, but the small model usually became a mediocre compromise that lost too much from each expert. EUPE fixes that by adding a big 1.9B proxy model in the middle, so the experts first teach a large model that has enough room to combine their knowledge, and only after that does Meta compress the result into a small model. That is why an 86M EUPE model can match or beat specialist models on image understanding, dense prediction, and several VLM tasks instead of being good at only one of them.

Paper Link: arxiv.org/abs/2603.22387
Paper Title: "Efficient Universal Perception Encoder"
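The recipe reads as two ordinary feature-distillation stages chained together, which is easy to sketch. This is a schematic reading of the idea described above, not the paper's code; the loss choice and the projection heads are assumptions.

```python
# Schematic of the two-stage recipe (illustrative, not the paper's code):
# stage 1: several frozen specialist experts teach one large "proxy" encoder,
# stage 2: the frozen proxy becomes the single unified teacher for the small model.
import torch
import torch.nn.functional as F

def feature_loss(student_feats, teacher_feats):
    # simple smooth-L1 on (projected) features; many variants would work
    return F.smooth_l1_loss(student_feats, teacher_feats)

def stage1_step(proxy, experts, projections, images, optim):
    """Experts (frozen) jointly teach the large proxy model."""
    feats = proxy(images)
    loss = 0.0
    for name, expert in experts.items():
        with torch.no_grad():
            target = expert(images)
        loss = loss + feature_loss(projections[name](feats), target)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()

def stage2_step(small, proxy, projection, images, optim):
    """The frozen proxy is now the single, already-blended teacher."""
    with torch.no_grad():
        target = proxy(images)
    loss = feature_loss(projection(small(images)), target)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```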
Phúc Lê reposted
thestreamingdev() @thestreamingdev
launching `data-label-factory`, a generic auto-labeling pipeline. You write a YAML for your object class and run one command, and you get a vision dataset on a 16 GB MacBook. No GPU, no labelers, no vendor. Using Gemma 4 @Google mlx-vlm @Prince_Canuma + Falcon Perception @lkhphuc. here's how: 🧵 (point your /agent at the claude.md file to start asap)
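To make the "write a YAML, run one command" idea concrete, here is a minimal sketch of what such a pipeline could look like. The config keys and the `vlm_caption` / `ground` callables are invented for illustration and are not data-label-factory's actual interface.

```python
# Minimal sketch of a config-driven auto-labeling run (illustrative only;
# config keys and model wrappers are NOT data-label-factory's real interface).
import glob
import json
import yaml

def auto_label(config_path, vlm_caption, ground):
    with open(config_path) as f:
        cfg = yaml.safe_load(f)  # e.g. {"classes": ["green container"], "images": "data/*.jpg"}
    records = []
    for path in glob.glob(cfg["images"]):
        caption = vlm_caption(path)  # VLM description / sanity check of the image
        for cls in cfg["classes"]:
            for box in ground(path, cls):  # open-vocab detector returns [x, y, w, h]
                records.append({"image": path, "label": cls, "bbox": box, "caption": caption})
    with open(cfg.get("output", "labels.json"), "w") as f:
        json.dump(records, f, indent=2)
    return records
```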
Phúc Lê @lkhphuc
@dahou_yasser @_tianyu @wanchao We use torchtitan to train everything from massive models down to tiny models, and its simplicity helps us a great deal when implementing non-standard LLM components.
Phúc Lê @lkhphuc
Plus an API server and MLX support. Please check us out here github.com/tiiuae/Falcon-… and share with us what you want. We are excited for what comes next for Falcon Perception: better KV-cache usage, quantization, finetuning code, integration with llama.cpp, vLLM, ...
Phúc Lê @lkhphuc
We balance good perf and simplicity with pure Python, in a small and readable package: torch.compile, FlexAttention, and a little CUDA-graph manipulation, all techniques from the LLM serving ecosystem. This reinforces our bet on bringing first-class grounding vision to the LLM world.
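For context on the FlexAttention part, these are real PyTorch APIs (torch >= 2.5, best run on a GPU): a `score_mod` written in plain Python gets compiled into the attention kernel by `torch.compile`. A tiny self-contained example, with the KV-cache and CUDA-graph pieces omitted:

```python
# FlexAttention + torch.compile (PyTorch >= 2.5). A score_mod expresses the
# mask/bias (here: causal) in plain Python; torch.compile fuses it into the kernel.
import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, b, h, q_idx, kv_idx):
    # mask out future positions by sending their scores to -inf
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

compiled_attn = torch.compile(flex_attention)

device = "cuda" if torch.cuda.is_available() else "cpu"
B, H, S, D = 1, 8, 128, 64
q, k, v = (torch.randn(B, H, S, D, device=device) for _ in range(3))
out = compiled_attn(q, k, v, score_mod=causal)
print(out.shape)  # torch.Size([1, 8, 128, 64])
```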