Aviraj Bevli
@LiAvBev
15 posts
Joined March 2026
86 Following · 9 Followers

Aviraj Bevli retweeted
thestreamingdev()@thestreamingdev·
Launching `data-label-factory`, a generic auto-labeling pipeline. You write a YAML file for your object class, run one command, and get a vision dataset on a 16 GB MacBook. No GPU, no labelers, no vendor. Built on Gemma 4 @Google, mlx-vlm @Prince_Canuma + Falcon Perception @lkhphuc. Here's how: 🧵 (point your /agent at the claude.md file to start asap)
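To make the "write a YAML for your object class" step concrete, here is a purely illustrative sketch of what such a config might contain; the field names are my guesses, not data-label-factory's actual schema:

```yaml
# Hypothetical data-label-factory config — every field name here is
# illustrative, not the tool's real schema.
object_class: "shipping_container"
prompts:
  - "a rusty shipping container"
  - "a stacked shipping container"
source_images: ./raw_images/        # unlabeled images to auto-label
output_format: coco                 # e.g. COCO-style JSON annotations
models:
  vlm: gemma-4                      # proposes/verifies labels
  segmenter: falcon-perception      # produces masks/boxes
limits:
  max_ram_gb: 16                    # fits a 16 GB MacBook
```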
Aviraj Bevli retweeted
Yasser Dahou@dahou_yasser·
People are asking what's the difference between Falcon Perception and SAM3, so here's my opinion. SAM3: arxiv.org/pdf/2511.16719 · Falcon Perception: arxiv.org/pdf/2603.27365

First, SAM3 does "promptable concept segmentation": simple noun phrases (like "yellow bus", "red apple") + exemplars + interactivity + tracking in video. Falcon Perception is more like "perception as a generation interface": text can get more compositional (OCR / spatial constraints / relations), and the model can emit a lot of instances without redesigning the whole system.

Architecture. SAM3 -> a full system: aligned VL backbone (PE) + prompt/exemplar encoder + fusion encoder + DETR decoder with fixed object queries (default Q=200) + MaskFormer-style mask head + presence token/head. Then for video: tracker + memory bank + heuristics, and they even propose that "multiplex" thing to make multi-object more efficient. Falcon -> intentionally "bitter lesson"-ish for this task: one early-fusion transformer from layer 1, text and image tokens sharing the same parameter space, a structured AR interface per instance (roughly label -> box -> seg token), and light heads only when the output is continuous/dense. Masks are not generated token-by-token; they come from a seg token embedding + high-res image features (dot product).

Where SAM3 is genuinely stronger today, tbh: it has clear practical priors. Parallel inference on images is predictable + fast. Video identity + memory + interactivity is first-class. Presence calibration is explicit (presence head). Negatives are handled per query as part of the "nothing" class. Set prediction helps with duplicates.

So why do I still lean FP for the "final model"? Because Falcon is way more "LLM-shaped" at the core, and that matters a lot for future work. When your backbone is basically "tokens + kv cache + long context + sampling", you can reuse the whole LLM playbook and it transfers almost directly. Prefix / KV caching: in real apps you reuse the same image and ask many prompts; Falcon can cache the image+prompt prefix and just decode the new stuff, which is basically default vLLM serving. Paged / continuous batching / caching policies: LLM infra is insanely good now, and Falcon-style inference can ride that wave instead of reinventing custom perception serving tricks. Quant + speed tricks: all the TurboQuant / rabiq / kernel fusion / compile pipelines are built around transformer backbones, so if your perception system is "mostly one transformer", you benefit immediately.

Also worth saying: moondream2 @vikhyatk and Isaac (Perceptron) @ArmenAgha are in the same "LLM-shaped grounding" family. They also do AR detection/grounding by serializing geometry into text (points/boxes/polygons), so segmentation can burn a lot of tokens as instance counts grow -> higher latency. Falcon Perception is different here: AR is the interface for instances (label -> box -> seg token), but the mask itself is computed in parallel from an upsampled feature map (seg token embedding ⨉ high-res features via dot product), so mask resolution/detail is not bottlenecked by decoding length.

And dense scenes: SAM3 is capped by Q queries unless you scale Q (possible, but not free). Falcon is capped by context/decoding budget, but conceptually it can just keep emitting instances (and pbench dense is exactly testing this regime). A concrete example: SAM3.1's recent "multiplex" (multi-object efficiency) is a good idea and shows the benefits. But for Falcon Perception, a lot of this is just "LLM serving 101": cache prefixes, pack sequences, paged KV, don't waste padding, etc. The ecosystem already solved a big part of the systems problem.

My take: if you want a strong "product system" today for concept prompts + video + calibration, SAM3 is very good; FP is still very good too, and the repo offers fast paged inference: github.com/tiiuae/Falcon-…; up to people's preferences and vibe checks. If you want the long-term "perception engine" that agents will call (compositional prompts + dense scenes + reuse caching + quant + all the LLM tricks), Falcon Perception has better chances to win.

SAM3: more modules, more priors; super good negatives/presence, set prediction, video identity. Falcon: cleaner primitive, more "LLM-like", benefits from the entire LLM community roadmap (caching, batching, quant, speculative decoding, RL post-training). Not saying SAM3 is worse; it's just a different bet. SAM3's bet: engineer the right priors + system pieces now. Falcon Perception's bet: one scalable backbone + the right interface, then scale with data/compute, steal LLM tricks, and use RL to fix selection/calibration.
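To make that mask head concrete, here is a minimal numpy sketch of the dot-product idea, with made-up shapes (the paper's real dimensions aren't quoted here): the AR decoder emits one seg-token embedding per instance, and every mask falls out of a parallel dot product against high-res image features, independent of decoding length.

```python
import numpy as np

# Hypothetical dims — illustrative only, not Falcon Perception's real config.
D = 256            # embedding dim
H, W = 128, 128    # high-res feature map (upsampled from the backbone)

rng = np.random.default_rng(0)
feat_hr = rng.standard_normal((H, W, D))   # high-res image features
seg_embs = rng.standard_normal((3, D))     # one seg-token embedding per decoded instance

# Mask logits = dot product of each seg embedding with every spatial feature.
# Computed in parallel for all pixels and instances, so mask detail is not
# bottlenecked by autoregressive decoding length.
mask_logits = np.einsum("hwd,nd->nhw", feat_hr, seg_embs)
masks = mask_logits > 0                    # threshold to binary masks
print(masks.shape)                         # (3, 128, 128)
```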
Aviraj Bevli retweeted
Omar Sanseviero@osanseviero·
Introducing a Visual Guide to Gemma 4 👀 An in-depth architectural deep dive into the Gemma 4 family of models, from Per-Layer Embeddings to the vision and audio encoders. Take a look!
Aviraj Bevli retweeted
Prince Canuma@Prince_Canuma·
mlx-vlm v0.4.4 is out 🚀🔥

New models:
🦅 Falcon-Perception 300M by @TIIuae

Highlights:
⚡️ TurboQuant Metal kernels optimized — up to 1.90x decode speedup over baseline on longer contexts, with 89% KV cache savings.
👀 VisionFeatureCache — multi-turn image caching so you don't re-encode the same image every turn.
🔧 Gemma 4 fixes — chunked prefill for KV-shared models & thinking, vision + text degradation, processor config, and nested tool parsing.
📹 Video CLI fixes

Get started today:
> uv pip install -U mlx-vlm

Shoutout to the awesome @N8Programs for helping me spot and fix some critical yet subtle issues on Gemma 4 ❤️ Happy Easter everyone 🐣 and remember to leave us a star ⭐️ github.com/Blaizzy/mlx-vlm
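For context, minimal mlx-vlm usage looks roughly like this; the API shape follows the project's README (`load` / `apply_chat_template` / `generate`), but the exact generate signature has shifted across versions, and the model path below is a placeholder, not a verified checkpoint name:

```python
# Minimal mlx-vlm sketch — API shape per the project's README; the model
# path is a placeholder, not a verified Hub repo name.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "someorg/falcon-perception-300m-mlx"  # placeholder
model, processor = load(model_path)
config = load_config(model_path)

prompt = "Segment every red apple in the image."
formatted = apply_chat_template(processor, config, prompt, num_images=1)

# Per the release notes above, VisionFeatureCache should avoid re-encoding
# the same image across turns.
output = generate(model, processor, formatted, ["apples.jpg"], verbose=False)
print(output)
```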
Aviraj Bevli retweeted
Yasser Dahou@dahou_yasser·
I used Gemma 4 + Falcon Perception from this mlx-vlm release to build a grounded reasoning agent that runs fully locally on an M3.

The idea: VLMs are great at reasoning but not great at measuring. Falcon Perception is great at segmentation but can't reason. So you loop them: Gemma 4 decides what to look for, FP segments it and returns pixel-accurate coordinates, and Gemma 4 reasons over the numbers.

Ask "is the blue player offside?" → it grounds the players, finds the second-to-last defender, compares centroid positions, and applies the rule. Check the video for some examples. @Prince_Canuma I can submit a PR with this demo if you want
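A sketch of that loop under stated assumptions: `gemma_chat` and `fp_segment` below are hypothetical wrappers around the two models (the tweet doesn't show the demo's code), but the control flow matches the description: the VLM plans, the segmenter measures, the VLM reasons over the numbers.

```python
# Sketch of the Gemma 4 <-> Falcon Perception loop described above.
# gemma_chat() and fp_segment() are HYPOTHETICAL wrappers — not the demo's code.

def gemma_chat(prompt: str) -> str:
    """Hypothetical: send a text prompt to a local Gemma 4 and return its reply."""
    raise NotImplementedError

def fp_segment(image_path: str, query: str) -> list[dict]:
    """Hypothetical: Falcon Perception returns per-instance masks and centroids."""
    raise NotImplementedError

def grounded_answer(image_path: str, question: str) -> str:
    # 1) The VLM decides what to ground.
    targets = gemma_chat(
        f"Question: {question}\nList the objects to segment, comma-separated."
    )
    # 2) The segmenter measures: pixel-accurate instances per target.
    detections = {t.strip(): fp_segment(image_path, t.strip())
                  for t in targets.split(",")}
    # 3) The VLM reasons over the numbers (centroids), not the raw pixels.
    facts = {k: [d["centroid"] for d in v] for k, v in detections.items()}
    return gemma_chat(f"Given centroid coordinates {facts}, answer: {question}")
```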
Aviraj Bevli retweeted
Yasser Dahou@dahou_yasser·
We are releasing Falcon Perception, an open-vocabulary referring-expression segmentation model, along with a 0.3B OCR model that is on par with 3-10x larger competitors. Current systems solve this with complex pipelines (separate encoders, late fusion, matching algorithms). We developed a novel, simpler "bitter lesson" approach: one early-fusion Transformer (image + text from the first layer) with a shared parameter space, and let scale + training signal do the work. Please check out our work! 📄 Paper: arxiv.org/pdf/2603.27365 💻 Code: github.com/tiiuae/falcon-… 🎮 Playground: vision.falcon.aidrc.tii.ae 🤗 Blogpost: huggingface.co/blog/tiiuae/fa…
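A toy sketch of the early-fusion idea with made-up dimensions (not the released model's config): project image patches and text tokens into one shared space, then run a single Transformer over the concatenated stream from layer 1, in contrast to late-fusion pipelines where separate encoders only meet at the end.

```python
import torch
import torch.nn as nn

# Toy early-fusion sketch — dims and vocab are made up, not the released model.
D, VOCAB, PATCHES, LAYERS = 256, 32000, 196, 4

patch_proj = nn.Linear(768, D)      # project raw image patch features into D
tok_emb = nn.Embedding(VOCAB, D)    # text token embeddings in the same D
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=LAYERS,
)  # ONE stack: both modalities share parameters from layer 1

img_feats = torch.randn(1, PATCHES, 768)        # e.g. ViT-style patch features
text_ids = torch.randint(0, VOCAB, (1, 12))     # e.g. "segment the red apple"

# Early fusion: concatenate the two token streams and process them jointly.
x = torch.cat([patch_proj(img_feats), tok_emb(text_ids)], dim=1)
out = encoder(x)
print(out.shape)                                # (1, 196 + 12, 256)
```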
Aviraj Bevli@LiAvBev·
6) MLX: Apple's own machine-learning framework, optimized for Apple silicon, and under rapid development. Since so many software engineers use MacBooks, it could be huge for Apple if they can get them to run local AI on MLX!
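A tiny taste of the MLX programming model: arrays live in unified memory on Apple silicon, and computation is lazy until you force evaluation.

```python
import mlx.core as mx

# MLX arrays live in unified memory (CPU and GPU share it on Apple silicon),
# and computation is lazy: nothing runs until the graph is evaluated.
a = mx.array([1.0, 2.0, 3.0])
b = mx.exp(a) + a * 2          # builds a lazy compute graph
mx.eval(b)                     # forces evaluation
print(b)
```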
Aviraj Bevli@LiAvBev·
Here is a quick summary of the state of the LLM inference ecosystem: 1) vLLM: the undisputed king of LLM inference. Supports a huge array of autoregressive models. This is your best bet for blazing-fast inference in production!
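Minimal offline-inference usage, per the vLLM docs (the model name is just an example):

```python
# Standard vLLM offline inference — the model name is only an example.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=64)

# Continuous batching + PagedAttention happen under the hood.
outputs = llm.generate(["Explain KV caching in one sentence."], params)
print(outputs[0].outputs[0].text)
```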
Aviraj Bevli@LiAvBev·
5) Ollama: basically an inference wrapper around llama.cpp. Makes local AI accessible to non-engineers, enabling quick, easy experimentation with supported models.
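For example, once `ollama serve` is running and a model has been pulled, any script can hit its documented REST API (the model name is an example):

```python
# Query a local Ollama server over its documented REST API.
# Assumes `ollama serve` is running and the model has been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```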
Aviraj Bevli@LiAvBev·
4) TensorRT-LLM: NVIDIA's official inference engine. Use this if you want to squeeze maximum performance out of an LLM running on NVIDIA GPUs. Less flexible than vLLM, but higher throughput!
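A sketch using TensorRT-LLM's newer high-level LLM API, which deliberately mirrors vLLM's; the shape follows the project's quick-start, but availability and signatures vary by version, and the model name is an example:

```python
# TensorRT-LLM high-level LLM API sketch — shape per the project's
# quick-start; exact availability varies by version. Example model name.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Engine build/optimization for the target GPU happens behind this call.
outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```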
Aviraj Bevli@LiAvBev·
3) llama.cpp: for local LLM inference. Written in pure C/C++. Designed to run on pretty much anything — a normal laptop, a Raspberry Pi, edge devices. Because of that edge focus, it provides extensive support for quantization (in the GGUF format).
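For instance, via the llama-cpp-python bindings (API per that project's README; the GGUF path is a placeholder):

```python
# Run a quantized GGUF model through the llama-cpp-python bindings.
# The model path is a placeholder — point it at any downloaded GGUF file.
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-3.2-1b-q4_k_m.gguf", n_ctx=2048)
out = llm("Q: What is a GGUF file? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```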
Aviraj Bevli@LiAvBev·
2) SGLang: similar to vLLM for most practical purposes. The main difference is RadixAttention, which keeps cached KV prefixes in a radix tree so requests sharing a prefix reuse computation. Some people say it is faster than vLLM; some claim the other way around. Not sure who is correct; I've never run a benchmark myself to verify!
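To make the radix idea concrete, here is a toy illustration of prefix reuse (my sketch, not SGLang code): cached KV entries are keyed by token prefixes, so a new request only pays for its unshared suffix.

```python
# Toy illustration of RadixAttention-style prefix reuse — my sketch, not
# SGLang's implementation. KV cache entries are keyed by token prefixes.
cache: dict[tuple[int, ...], str] = {}

def tokens_to_compute(prompt_tokens: list[int]) -> list[int]:
    """Return only the suffix whose KV entries aren't cached yet."""
    longest = 0
    for i in range(len(prompt_tokens), 0, -1):
        if tuple(prompt_tokens[:i]) in cache:
            longest = i
            break
    # Cache every new prefix (a real radix tree shares these nodes compactly).
    for i in range(longest + 1, len(prompt_tokens) + 1):
        cache[tuple(prompt_tokens[:i])] = "kv"
    return prompt_tokens[longest:]

print(len(tokens_to_compute([1, 2, 3, 4, 5])))  # 5: cold cache, full prompt
print(len(tokens_to_compute([1, 2, 3, 9, 9])))  # 2: shared prefix [1,2,3] reused
```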
Aviraj Bevli@LiAvBev·
Taalas HC1: pretty impressive! A consistent 17,000 tokens/second (test it at chatjimmy.ai) comfortably beats the current speed king, Cerebras, at 2,000 tokens/second. The only question now: is this approach of hardcoding the model into silicon scalable enough? Let's see.
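Putting the two quoted throughputs side by side (the 10,000-token response length is just an illustration):

```python
# Quick arithmetic on the figures quoted above; the 10k-token response
# length is an arbitrary illustration.
hc1_tps, cerebras_tps = 17_000, 2_000
print(hc1_tps / cerebras_tps)  # 8.5x faster
for name, tps in [("HC1", hc1_tps), ("Cerebras", cerebras_tps)]:
    print(name, round(10_000 / tps, 2), "s for a 10k-token response")
# -> HC1 0.59 s vs Cerebras 5.0 s
```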