Aviraj Bevli
@LiAvBev
15 posts
Joined March 2026
86 Following · 9 Followers

Aviraj Bevli retweeted
thestreamingdev()@thestreamingdev·
Launching `data-label-factory`, a generic auto-labeling pipeline. You write a YAML file for your object class, run one command, and get a vision dataset on a 16 GB MacBook. No GPU, no labelers, no vendor. Built on Gemma 4 @Google, mlx-vlm @Prince_Canuma + Falcon Perception @lkhphuc. Here's how: 🧵 (point your /agent at the claude.md file to start asap)
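To make the "write a YAML for your object class" step concrete, here is a purely illustrative sketch of what such a config might contain; the field names are my guesses, not data-label-factory's actual schema:

```yaml
# Hypothetical data-label-factory config — every field name here is
# illustrative, not the tool's real schema.
object_class: "shipping_container"
prompts:
  - "a rusty shipping container"
  - "a stacked shipping container"
source_images: ./raw_images/        # unlabeled images to auto-label
output_format: coco                 # e.g. COCO-style JSON annotations
models:
  vlm: gemma-4                      # proposes/verifies labels
  segmenter: falcon-perception      # produces masks/boxes
limits:
  max_ram_gb: 16                    # fits a 16 GB MacBook
```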
Aviraj Bevli retweeted
Yasser Dahou@dahou_yasser·
People are asking what's the difference between Falcon Perception and SAM3, so here's my opinion. SAM3: arxiv.org/pdf/2511.16719 · Falcon Perception: arxiv.org/pdf/2603.27365

First, SAM3 does "promptable concept segmentation": simple noun phrases (like "yellow bus", "red apple") + exemplars + interactivity + tracking in video. Falcon Perception is more like "perception as a generation interface": text can get more compositional (OCR / spatial constraints / relations), and the model can emit a lot of instances without redesigning the whole system.

Architecture. SAM3 -> a full system: aligned VL backbone (PE) + prompt/exemplar encoder + fusion encoder + DETR decoder with fixed object queries (default Q=200) + MaskFormer-style mask head + presence token/head. Then for video: tracker + memory bank + heuristics, and they even propose that "multiplex" thing to make multi-object more efficient. Falcon -> intentionally "bitter lesson"-ish for this task: one early-fusion transformer from layer 1, text and image tokens sharing the same parameter space, a structured AR interface per instance (roughly label -> box -> seg token), and light heads only when the output is continuous/dense. Masks are not generated token-by-token; they come from a seg token embedding + high-res image features (dot product).

Where SAM3 is genuinely stronger today, tbh: it has clear practical priors. Parallel inference on images is predictable + fast. Video identity + memory + interactivity is first-class. Presence calibration is explicit (presence head). Negatives are handled per query as part of the "nothing" class. Set prediction helps with duplicates.

So why do I still lean FP for the "final model"? Because Falcon is way more "LLM-shaped" at the core, and that matters a lot for future work. When your backbone is basically "tokens + kv cache + long context + sampling", you can reuse the whole LLM playbook and it transfers almost directly. Prefix / KV caching: in real apps you reuse the same image and ask many prompts; Falcon can cache the image+prompt prefix and just decode the new stuff, which is basically default vLLM serving. Paged / continuous batching / caching policies: LLM infra is insanely good now, and Falcon-style inference can ride that wave instead of reinventing custom perception serving tricks. Quant + speed tricks: all the TurboQuant / rabiq / kernel fusion / compile pipelines are built around transformer backbones, so if your perception system is "mostly one transformer", you benefit immediately.

Also worth saying: moondream2 @vikhyatk and Isaac (Perceptron) @ArmenAgha are in the same "LLM-shaped grounding" family. They also do AR detection/grounding by serializing geometry into text (points/boxes/polygons), so segmentation can burn a lot of tokens as instance counts grow -> higher latency. Falcon Perception is different here: AR is the interface for instances (label -> box -> seg token), but the mask itself is computed in parallel from an upsampled feature map (seg token embedding ⨉ high-res features via dot product), so mask resolution/detail is not bottlenecked by decoding length.

And dense scenes: SAM3 is capped by Q queries unless you scale Q (possible, but not free). Falcon is capped by context/decoding budget, but conceptually it can just keep emitting instances (and pbench dense is exactly testing this regime). A concrete example: SAM3.1's recent "multiplex" (multi-object efficiency) is a good idea and shows the benefits. But for Falcon Perception, a lot of this is just "LLM serving 101": cache prefixes, pack sequences, paged KV, don't waste padding, etc. The ecosystem already solved a big part of the systems problem.

My take: if you want a strong "product system" today for concept prompts + video + calibration, SAM3 is very good; FP is still very good too, and the repo offers fast paged inference: github.com/tiiuae/Falcon-…; up to people's preferences and vibe checks. If you want the long-term "perception engine" that agents will call (compositional prompts + dense scenes + reuse caching + quant + all the LLM tricks), Falcon Perception has better chances to win.

SAM3: more modules, more priors; super good negatives/presence, set prediction, video identity. Falcon: cleaner primitive, more "LLM-like", benefits from the entire LLM community roadmap (caching, batching, quant, speculative decoding, RL post-training). Not saying SAM3 is worse; it's just a different bet. SAM3's bet: engineer the right priors + system pieces now. Falcon Perception's bet: one scalable backbone + the right interface, then scale with data/compute, steal LLM tricks, and use RL to fix selection/calibration.
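To make that mask head concrete, here is a minimal numpy sketch of the dot-product idea, with made-up shapes (the paper's real dimensions aren't quoted here): the AR decoder emits one seg-token embedding per instance, and every mask falls out of a parallel dot product against high-res image features, independent of decoding length.

```python
import numpy as np

# Hypothetical dims — illustrative only, not Falcon Perception's real config.
D = 256            # embedding dim
H, W = 128, 128    # high-res feature map (upsampled from the backbone)

rng = np.random.default_rng(0)
feat_hr = rng.standard_normal((H, W, D))   # high-res image features
seg_embs = rng.standard_normal((3, D))     # one seg-token embedding per decoded instance

# Mask logits = dot product of each seg embedding with every spatial feature.
# Computed in parallel for all pixels and instances, so mask detail is not
# bottlenecked by autoregressive decoding length.
mask_logits = np.einsum("hwd,nd->nhw", feat_hr, seg_embs)
masks = mask_logits > 0                    # threshold to binary masks
print(masks.shape)                         # (3, 128, 128)
```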
Aviraj Bevli retweeted
Omar Sanseviero@osanseviero·
Introducing a Visual Guide to Gemma 4 👀 An in-depth architectural deep dive into the Gemma 4 family of models, from Per-Layer Embeddings to the vision and audio encoders. Take a look!
Aviraj Bevli retweeted
Prince Canuma@Prince_Canuma·
mlx-vlm v0.4.4 is out 🚀🔥

New models:
🦅 Falcon-Perception 300M by @TIIuae

Highlights:
⚡️ TurboQuant Metal kernels optimized — up to 1.90x decode speedup over baseline on longer contexts, with 89% KV cache savings.
👀 VisionFeatureCache — multi-turn image caching so you don't re-encode the same image every turn.
🔧 Gemma 4 fixes — chunked prefill for KV-shared models & thinking, vision + text degradation, processor config, and nested tool parsing.
📹 Video CLI fixes

Get started today:
> uv pip install -U mlx-vlm

Shoutout to the awesome @N8Programs for helping me spot and fix some critical yet subtle issues on Gemma 4 ❤️ Happy Easter everyone 🐣 and remember to leave us a star ⭐️ github.com/Blaizzy/mlx-vlm
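For context, minimal mlx-vlm usage looks roughly like this; the API shape follows the project's README (`load` / `apply_chat_template` / `generate`), but the exact generate signature has shifted across versions, and the model path below is a placeholder, not a verified checkpoint name:

```python
# Minimal mlx-vlm sketch — API shape per the project's README; the model
# path is a placeholder, not a verified Hub repo name.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "someorg/falcon-perception-300m-mlx"  # placeholder
model, processor = load(model_path)
config = load_config(model_path)

prompt = "Segment every red apple in the image."
formatted = apply_chat_template(processor, config, prompt, num_images=1)

# Per the release notes above, VisionFeatureCache should avoid re-encoding
# the same image across turns.
output = generate(model, processor, formatted, ["apples.jpg"], verbose=False)
print(output)
```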
Aviraj Bevli retweeted
Yasser Dahou@dahou_yasser·
I used Gemma 4 + Falcon Perception from this mlx-vlm release to build a grounded reasoning agent that runs fully locally on an M3.

The idea: VLMs are great at reasoning but not great at measuring. Falcon Perception is great at segmentation but can't reason. So you loop them: Gemma 4 decides what to look for, FP segments it and returns pixel-accurate coordinates, and Gemma 4 reasons over the numbers.

Ask "is the blue player offside?" → it grounds the players, finds the second-to-last defender, compares centroid positions, and applies the rule. Check the video for some examples. @Prince_Canuma I can submit a PR with this demo if you want
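A sketch of that loop under stated assumptions: `gemma_chat` and `fp_segment` below are hypothetical wrappers around the two models (the tweet doesn't show the demo's code), but the control flow matches the description: the VLM plans, the segmenter measures, the VLM reasons over the numbers.

```python
# Sketch of the Gemma 4 <-> Falcon Perception loop described above.
# gemma_chat() and fp_segment() are HYPOTHETICAL wrappers — not the demo's code.

def gemma_chat(prompt: str) -> str:
    """Hypothetical: send a text prompt to a local Gemma 4 and return its reply."""
    raise NotImplementedError

def fp_segment(image_path: str, query: str) -> list[dict]:
    """Hypothetical: Falcon Perception returns per-instance masks and centroids."""
    raise NotImplementedError

def grounded_answer(image_path: str, question: str) -> str:
    # 1) The VLM decides what to ground.
    targets = gemma_chat(
        f"Question: {question}\nList the objects to segment, comma-separated."
    )
    # 2) The segmenter measures: pixel-accurate instances per target.
    detections = {t.strip(): fp_segment(image_path, t.strip())
                  for t in targets.split(",")}
    # 3) The VLM reasons over the numbers (centroids), not the raw pixels.
    facts = {k: [d["centroid"] for d in v] for k, v in detections.items()}
    return gemma_chat(f"Given centroid coordinates {facts}, answer: {question}")
```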
Aviraj Bevli retweeted
Yasser Dahou@dahou_yasser·
We are releasing Falcon Perception, an open-vocabulary referring-expression segmentation model, along with a 0.3B OCR model that is on par with 3-10x larger competitors. Current systems solve this with complex pipelines (separate encoders, late fusion, matching algorithms). We developed a novel, simpler "bitter lesson" approach: one early-fusion Transformer (image + text from the first layer) with a shared parameter space, and let scale + training signal do the work. Please check out our work! 📄 Paper: arxiv.org/pdf/2603.27365 💻 Code: github.com/tiiuae/falcon-… 🎮 Playground: vision.falcon.aidrc.tii.ae 🤗 Blogpost: huggingface.co/blog/tiiuae/fa…
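A toy sketch of the early-fusion idea with made-up dimensions (not the released model's config): project image patches and text tokens into one shared space, then run a single Transformer over the concatenated stream from layer 1, in contrast to late-fusion pipelines where separate encoders only meet at the end.

```python
import torch
import torch.nn as nn

# Toy early-fusion sketch — dims and vocab are made up, not the released model.
D, VOCAB, PATCHES, LAYERS = 256, 32000, 196, 4

patch_proj = nn.Linear(768, D)      # project raw image patch features into D
tok_emb = nn.Embedding(VOCAB, D)    # text token embeddings in the same D
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=LAYERS,
)  # ONE stack: both modalities share parameters from layer 1

img_feats = torch.randn(1, PATCHES, 768)        # e.g. ViT-style patch features
text_ids = torch.randint(0, VOCAB, (1, 12))     # e.g. "segment the red apple"

# Early fusion: concatenate the two token streams and process them jointly.
x = torch.cat([patch_proj(img_feats), tok_emb(text_ids)], dim=1)
out = encoder(x)
print(out.shape)                                # (1, 196 + 12, 256)
```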
Aviraj Bevli@LiAvBev·
6) MLX: Apple's own machine-learning framework, optimized for Apple silicon, and under rapid development. Since so many software engineers use MacBooks, it could be huge for Apple if they can get them to run local AI on MLX!
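A tiny taste of the MLX programming model: arrays live in unified memory on Apple silicon, and computation is lazy until you force evaluation.

```python
import mlx.core as mx

# MLX arrays live in unified memory (CPU and GPU share it on Apple silicon),
# and computation is lazy: nothing runs until the graph is evaluated.
a = mx.array([1.0, 2.0, 3.0])
b = mx.exp(a) + a * 2          # builds a lazy compute graph
mx.eval(b)                     # forces evaluation
print(b)
```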
Aviraj Bevli@LiAvBev·
Here is a quick summary of the state of the LLM inference ecosystem: 1) vLLM: the undisputed king of LLM inference. Supports a huge array of autoregressive models. This is your best bet for blazing-fast inference in production!
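Minimal offline-inference usage, per the vLLM docs (the model name is just an example):

```python
# Standard vLLM offline inference — the model name is only an example.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=64)

# Continuous batching + PagedAttention happen under the hood.
outputs = llm.generate(["Explain KV caching in one sentence."], params)
print(outputs[0].outputs[0].text)
```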
Aviraj Bevli@LiAvBev·
5) Ollama: basically an inference wrapper around llama.cpp. Makes local AI accessible to non-engineers, enabling quick, easy experimentation with supported models.
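For example, once `ollama serve` is running and a model has been pulled, any script can hit its documented REST API (the model name is an example):

```python
# Query a local Ollama server over its documented REST API.
# Assumes `ollama serve` is running and the model has been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```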
Aviraj Bevli@LiAvBev·
4) TensorRT-LLM: NVIDIA's official inference engine. Use this if you want to squeeze maximum performance out of an LLM running on NVIDIA GPUs. Less flexible than vLLM, but higher throughput!
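A sketch using TensorRT-LLM's newer high-level LLM API, which deliberately mirrors vLLM's; the shape follows the project's quick-start, but availability and signatures vary by version, and the model name is an example:

```python
# TensorRT-LLM high-level LLM API sketch — shape per the project's
# quick-start; exact availability varies by version. Example model name.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Engine build/optimization for the target GPU happens behind this call.
outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```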
Aviraj Bevli@LiAvBev·
3) llama.cpp: for local LLM inference. Written in pure C/C++. Designed to run on pretty much anything — a normal laptop, a Raspberry Pi, edge devices. Because of that edge focus, it provides extensive support for quantization (in the GGUF format).
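For instance, via the llama-cpp-python bindings (API per that project's README; the GGUF path is a placeholder):

```python
# Run a quantized GGUF model through the llama-cpp-python bindings.
# The model path is a placeholder — point it at any downloaded GGUF file.
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-3.2-1b-q4_k_m.gguf", n_ctx=2048)
out = llm("Q: What is a GGUF file? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```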
Aviraj Bevli@LiAvBev·
2) SGLang: similar to vLLM for most practical purposes. The main difference is RadixAttention, which keeps cached KV prefixes in a radix tree so requests sharing a prefix reuse computation. Some people say it is faster than vLLM; some claim the other way around. Not sure who is correct; I've never run a benchmark myself to verify!
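To make the radix idea concrete, here is a toy illustration of prefix reuse (my sketch, not SGLang code): cached KV entries are keyed by token prefixes, so a new request only pays for its unshared suffix.

```python
# Toy illustration of RadixAttention-style prefix reuse — my sketch, not
# SGLang's implementation. KV cache entries are keyed by token prefixes.
cache: dict[tuple[int, ...], str] = {}

def tokens_to_compute(prompt_tokens: list[int]) -> list[int]:
    """Return only the suffix whose KV entries aren't cached yet."""
    longest = 0
    for i in range(len(prompt_tokens), 0, -1):
        if tuple(prompt_tokens[:i]) in cache:
            longest = i
            break
    # Cache every new prefix (a real radix tree shares these nodes compactly).
    for i in range(longest + 1, len(prompt_tokens) + 1):
        cache[tuple(prompt_tokens[:i])] = "kv"
    return prompt_tokens[longest:]

print(len(tokens_to_compute([1, 2, 3, 4, 5])))  # 5: cold cache, full prompt
print(len(tokens_to_compute([1, 2, 3, 9, 9])))  # 2: shared prefix [1,2,3] reused
```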
Aviraj Bevli@LiAvBev·
Taalas HC1: pretty impressive! A consistent 17,000 tokens/second (test it at chatjimmy.ai) comfortably beats the current speed king, Cerebras, at 2,000 tokens/second. The only question now: is this approach of hardcoding the model into silicon scalable enough? Let's see.
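Putting the two quoted throughputs side by side (the 10,000-token response length is just an illustration):

```python
# Quick arithmetic on the figures quoted above; the 10k-token response
# length is an arbitrary illustration.
hc1_tps, cerebras_tps = 17_000, 2_000
print(hc1_tps / cerebras_tps)  # 8.5x faster
for name, tps in [("HC1", hc1_tps), ("Cerebras", cerebras_tps)]:
    print(name, round(10_000 / tps, 2), "s for a 10k-token response")
# -> HC1 0.59 s vs Cerebras 5.0 s
```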