Satya Mallick

3.3K posts

@LearnOpenCV

CEO, https://t.co/CzUdJlxzJM. Course Director, https://t.co/O2Tz9vUOQ8. Entrepreneur. Ph.D. (Computer Vision & Machine Learning). Author: https://t.co/olraDEG5Ue

San Diego, CA · Joined June 2008
898 Following · 15.1K Followers
Satya Mallick@LearnOpenCV·
Why every frontier LLM is converging on Mixture of Experts 🧵
Trillion-parameter model. Single query. You don't need the whole thing.
A router picks a subset of "experts." Medical question → medical expert. Legal → legal. Some models keep one generalist always on.
Saves compute. Not memory.
→ vist.ly/54azz
#MoE #LLM #MachineLearning #Qwen3
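The routing idea above can be sketched in a few lines of numpy. This is an illustrative toy, not any model's actual implementation: the expert count, dimensions, and function names are all invented for the example.

```python
import numpy as np

def moe_forward(x, router_w, experts, shared_expert, k=2):
    """Route one token through the top-k experts plus an always-on generalist."""
    logits = router_w @ x                              # one router score per expert
    topk = np.argsort(logits)[-k:]                     # indices of the k best experts
    gates = np.exp(logits[topk] - logits[topk].max())
    gates /= gates.sum()                               # softmax over the chosen k only
    out = shared_expert(x)                             # the generalist runs on every token
    for g, i in zip(gates, topk):
        out = out + g * experts[i](x)                  # only k of the n experts execute
    return out

rng = np.random.default_rng(0)
d, n = 8, 4
make_expert = lambda: (lambda W: (lambda v: W @ v))(rng.standard_normal((d, d)))
experts = [make_expert() for _ in range(n)]
y = moe_forward(rng.standard_normal(d), rng.standard_normal((n, d)), experts, make_expert(), k=2)
print(y.shape)  # → (8,)
```

Note that all n expert matrices still sit in memory even though only k run per token, which is exactly the "saves compute, not memory" point.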
Replies: 1 · Reposts: 0 · Likes: 5 · Views: 227
Satya Mallick@LearnOpenCV·
"VLM" is doing a lot of heavy lifting as a label. CLIP → image-text alignment, zero-shot recognition Moondream → grounding ("find the guy in red") Qwen3-VL → agentic + GUI + long video understanding Same category. Wildly different tools. Dr. Satya Mallick explains → vist.ly/54avy #VLM #ComputerVision #MultimodalAI #CLIP #Qwen3VL
Replies: 1 · Reposts: 1 · Likes: 1 · Views: 293
Satya Mallick@LearnOpenCV·
Pt. 2 — YOLO26-Seg is wild:
→ Distribution Focal Loss removed
→ MuSGD optimizer (hybrid borrowed from LLM training)
→ NMS baked into the model
→ Boundary-aware supervision = razor-sharp masks
→ Up to 43% faster on CPU
→ One ONNX export → Pi, drone, phone
Deep dive: vist.ly/54825
Replies: 0 · Reposts: 0 · Likes: 1 · Views: 527
Satya Mallick@LearnOpenCV·
YOLO26 vs. the NMS bottleneck — Part 1 🧵
8,400 noisy boxes → external NMS cleanup → latency spikes.
YOLO26 outputs 300 clean detections. NMS baked into the network.
Segmentation that doesn't bleed. True end-to-end architecture, runs on CPU.
More parts coming. Full breakdown → vist.ly/545ww
#YOLO26 #ComputerVision #EdgeAI #InstanceSegmentation
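For context, the external cleanup step being removed is classical greedy non-maximum suppression. A minimal numpy sketch of that bottleneck (box format and threshold are illustrative):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the best box, drop everything that overlaps it too much."""
    order = np.argsort(scores)[::-1]       # highest score first
    keep = []
    while len(order):
        i = order[0]
        keep.append(i)
        # Retain only boxes that do not overlap the kept one above the threshold.
        order = order[1:][[iou(boxes[i], boxes[j]) < thresh for j in order[1:]]]
    return keep

keep = nms(np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float),
           np.array([0.9, 0.8, 0.7]))
# The two near-duplicate boxes collapse to one; the distant box survives.
```

The quadratic pairwise-IoU pass over thousands of candidate boxes is what spikes latency on CPU, and it is this step an NMS-free head folds into the network itself.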
Replies: 0 · Reposts: 0 · Likes: 1 · Views: 303
Satya Mallick@LearnOpenCV·
What if accurate depth maps could be generated from a single RGB image — without LiDAR or stereo cameras? That's exactly what Depth Anything V2 achieves.
In 2024, monocular depth estimation reached a major breakthrough:
✔ Fast
✔ Lightweight
✔ Temporally stable
✔ Edge-device friendly
Instead of relying on massive diffusion pipelines, Depth Anything V2 uses a highly optimized Vision Transformer architecture trained on millions of pseudo-labeled real-world images. The result? Real-time, surprisingly stable depth estimation from just one camera.
This has massive implications for:
• Robotics
• AR/VR
• Autonomous systems
• Smart cameras
• 3D scene understanding
One of the most exciting things is how deployable it is compared to heavier depth models.
Technical breakdown by LearnOpenCV: LearnOpenCV – Depth Anything Explained
Research Paper: Depth Anything V2 Paper
#AI #ComputerVision #OpenCV #DepthAnythingV2 #MachineLearning #DeepLearning #Robotics #EdgeAI #VisionTransformer #ArtificialIntelligence
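One practical wrinkle worth knowing: models in this lineage predict affine-invariant depth, i.e. valid up to an unknown scale and shift. A common step before comparing such a prediction against metric ground truth is a least-squares alignment; a sketch under that assumption (function name and toy data are mine, not from the article):

```python
import numpy as np

def align_depth(pred, gt):
    """Fit scale s and shift t minimizing ||s*pred + t - gt||^2, then apply them."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt.ravel(), rcond=None)
    return s * pred + t

# Toy check: a prediction that is correct up to an affine transform aligns exactly.
pred = np.linspace(0.1, 1.0, 100)
gt = 2.0 * pred + 3.0
aligned = align_depth(pred, gt)
```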
Replies: 0 · Reposts: 0 · Likes: 5 · Views: 923
Satya Mallick reposted
OpenAI@OpenAI·
Introducing GPT-Realtime-2 in the API: our most intelligent voice model yet, bringing GPT-5-class reasoning to voice agents. Voice agents are now real-time collaborators that can listen, reason, and solve complex problems as conversations unfold. Now available in the API alongside streaming models GPT-Realtime-Translate and GPT-Realtime-Whisper — a new set of audio capabilities for the next generation of voice interfaces.
Replies: 687 · Reposts: 1.4K · Likes: 14.8K · Views: 3.5M
Satya Mallick@LearnOpenCV·
The four benefits in order of impact:
1. Prevents overfitting (the big one)
2. Adversarial robustness
3. Augments small datasets
4. Softer decision boundaries
Used by experts. Skipped by most novices. Don't be a novice.
Replies: 0 · Reposts: 0 · Likes: 0 · Views: 290
Satya Mallick@LearnOpenCV·
The full formula:
x_mix = λ·x₁ + (1−λ)·x₂
y_mix = λ·y₁ + (1−λ)·y₂
where λ ~ Beta(α, α)
Same λ for pixels AND labels — that consistency is the whole trick.
Paper: arxiv.org/abs/1710.09412
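The formula translates almost directly to code. A minimal numpy sketch (α = 0.2 and the one-hot label layout are arbitrary choices for the example):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=np.random.default_rng(0)):
    """Blend two examples using ONE lambda ~ Beta(alpha, alpha) for pixels and labels."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Two toy "images" with one-hot labels for a 2-class problem.
x_mix, y_mix = mixup(np.zeros(4), np.array([1.0, 0.0]),
                     np.ones(4), np.array([0.0, 1.0]))
```

Because the same λ mixes both tensors, the label mass assigned to class 2 exactly equals the pixel fraction contributed by image 2: that is the consistency the post calls the whole trick.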
Replies: 0 · Reposts: 0 · Likes: 0 · Views: 240
Satya Mallick@LearnOpenCV·
Most CV novices skip this. Most experts use it on every classifier.
Mixup: blend two training images + blend their labels with the same λ.
Result: less overfitting, smoother boundaries, adversarial robustness.
Part 1 explains how it works ↓
Part 2 (PyTorch how-to) coming soon — follow for the drop. 🎥
Replies: 1 · Reposts: 0 · Likes: 2 · Views: 832
Satya Mallick@LearnOpenCV·
Part 2 🧊
In Part 1: accuracy is a trap. In Part 2: failure modes ARE your fine-tuning dataset.
Probe the public model → collect data on exactly what it breaks on → fine-tune → repeat.
That's the loop most CV teams skip.
Dr. Satya Mallick 👇
#ComputerVision #AI #YOLO #FineTuning
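The probe-and-collect step of that loop can be sketched in a few lines. Everything here is illustrative: the `(label, confidence)` model interface and the 0.6 threshold are my assumptions, not anyone's published pipeline.

```python
def mine_failures(model, samples, labels, conf_thresh=0.6):
    """Keep every example the model mislabels or is unsure about.

    model(x) -> (predicted_label, confidence). The returned pairs become
    the next fine-tuning set; fine-tune, probe again, repeat.
    """
    hard = []
    for x, y in zip(samples, labels):
        pred, conf = model(x)
        if pred != y or conf < conf_thresh:
            hard.append((x, y))
    return hard

# Toy probe: a parity "classifier" that is wrong on x=2 and unsure above 5.
toy_model = lambda x: (1 if x == 2 else x % 2, 0.4 if x > 5 else 0.9)
hard_set = mine_failures(toy_model, range(8), [x % 2 for x in range(8)])
```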
Replies: 1 · Reposts: 0 · Likes: 3 · Views: 583
Satya Mallick@LearnOpenCV·
Accuracy is table stakes. Failure modes decide whether your CV model survives production. Same benchmark scores. Opposite real-world performance. Dr. Satya Mallick on what to audit before you ship 👇 #ComputerVision #MachineLearning
Replies: 0 · Reposts: 0 · Likes: 1 · Views: 585
Satya Mallick@LearnOpenCV·
Karpathy's clearest take of the year: The AI debate isn't a debate. It's two groups looking at different parts of the curve.
→ Consumer AI still fumbles simple tasks
→ Frontier coding agents collapse days of work into hours
Both are real. The progress is just uneven. 🧵 Part 2 ↓
1/ Why is the gap so wide? Karpathy says: capability follows capital. The most valuable AI use cases sit in B2B technical work. Save an engineer 2 days → that's worth real money. So that's where optimization goes.
2/ Which is why coders feel "AI shock" the hardest. They're using state-of-the-art models with terminal + repo + test access — watching them solve problems that used to take a sprint. The steepest part of the curve is hitting them first.
3/ The mistake on both sides:
❌ Don't judge all AI by weak public demos.
❌ Don't assume strong technical demos = equal progress everywhere.
The truth is messier. AI is clumsy in some places and astonishing in others — at the same time. That gap is the whole story.
#AI #Karpathy #LLM #AIShock
Replies: 0 · Reposts: 0 · Likes: 3 · Views: 444
Satya Mallick@LearnOpenCV·
YOLOE = real-time object detection with NO retraining.
Type "delivery driver in a red jacket" → it finds them.
Zero-shot. Open vocabulary. YOLO speed.
The closed-world era of computer vision is over. 🧵👇
🔗 vist.ly/42jd3
#YOLOE #ComputerVision #AI #DeepLearning #YOLO
1/ Standard YOLO models are locked into a fixed list of categories (e.g. COCO's 80 classes). Want to detect something new? Weeks of labeling + retraining.
2/ YOLOE solves this with RepRTA — Re-parameterizable Region-Text Alignment. It maps language directly onto pixels, with zero inference overhead.
3/ The speed problem is solved by SAVPE — Semantic-Activated Visual Prompt Encoder. It bakes your prompt into the detection head, so you keep traditional YOLO real-time performance.
4/ Net result: change a text string → reprogram your detector. On a drone. A factory cam. A robot. No new dataset. No retraining.
5/ This is the foundation for embodied AI — robots that navigate messy rooms and find objects they've never seen before.
Full breakdown 👉 vist.ly/42jd3
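Conceptually, open-vocabulary detection scores candidate regions against a text embedding, and RepRTA's contribution is folding that alignment into the head with no inference overhead. A toy cosine-similarity sketch of the underlying matching idea (the embeddings and threshold here are made up, not YOLOE's actual values):

```python
import numpy as np

def match_regions(text_emb, region_embs, thresh=0.3):
    """Return indices of regions whose cosine similarity to the prompt clears thresh."""
    t = text_emb / np.linalg.norm(text_emb)
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    scores = r @ t
    return np.where(scores > thresh)[0], scores

# Toy prompt embedding and three candidate region embeddings.
prompt = np.array([1.0, 0.0, 0.0])
regions = np.array([[1.0, 0.0, 0.0],   # strong match
                    [0.0, 1.0, 0.0],   # unrelated
                    [0.9, 0.1, 0.0]])  # near match
hits, scores = match_regions(prompt, regions)
```

Changing the prompt string only changes `text_emb`; the detector weights stay fixed. That is the "reprogram your detector with a text string" idea in miniature.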
Replies: 0 · Reposts: 1 · Likes: 8 · Views: 711
Satya Mallick@LearnOpenCV·
What if image generators actually understand what they create? Vision Banana proves it.
One model handles:
→ Object detection
→ Instance segmentation
→ Metric depth from a single photo
→ Surface normal estimation
No specialist models. Beats SAM 3 on multiple segmentation benchmarks.
Generative backbones were generalist vision learners all along.
#VisionBanana #ComputerVision #AI
Replies: 0 · Reposts: 0 · Likes: 11 · Views: 1.1K
Satya Mallick reposted
Jitendra MALIK@JitendraMalikCV·
@jon_barron "World models" has a technical meaning - the transition model/dynamics model from Bellman/Kalman in the context of MDPs/ state space approach to control theory ~ 1960. I gave a talk on this history youtube.com/watch?v=9B4kka…
Replies: 5 · Reposts: 35 · Likes: 292 · Views: 58.4K
Satya Mallick@LearnOpenCV·
YOLO26-Pose tracks 17 human keypoints in a single forward pass. Smallest variant: 1.8 ms on a T4 GPU. ⚡
→ RLE for sharper localization
→ NMS-free inference (predictable latency)
→ MuSGD for stable training
Full breakdown 👇 learnopencv.com/yolo26-pose-es…
#ComputerVision #YOLO26
1/ YOLO26-Pose is here. It predicts the full human skeleton in a single forward pass — shoulders, elbows, wrists, hips, knees, ankles. 17 COCO keypoints, real-time.
2/ The smallest variant runs at ~1.8 ms on a T4. That's deployable. Fitness, sports analytics, gesture control, rehab, safety — all on the table.
3/ What's new architecturally:
RLE → better keypoint localization
NMS-free → predictable latency
MuSGD (SGD + Muon hybrid) → more stable training
4/ We tested it on yoga, karate, dance, gym, parkour, multi-person. Full LearnOpenCV tutorial walks through architecture, code, and raw outputs:
🔗 vist.ly/428rz
5/ Want the deep dive on why NMS-free matters for edge deployment? Companion piece here:
🔗 vist.ly/428r2
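For reference, these are the 17 COCO keypoints such pose models predict per person, plus one common convention for the skeleton edges used when drawing (edge sets vary slightly between libraries; the helper function is mine, for illustration):

```python
import math

# The 17 COCO keypoints, in their standard annotation order.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

SKELETON = [  # index pairs into COCO_KEYPOINTS
    (5, 7), (7, 9), (6, 8), (8, 10),         # arms
    (5, 6), (5, 11), (6, 12), (11, 12),      # torso
    (11, 13), (13, 15), (12, 14), (14, 16),  # legs
    (0, 1), (0, 2), (1, 3), (2, 4),          # head
]

def limb_lengths(kpts):
    """kpts: 17 (x, y) points in COCO order; returns the length of each skeleton edge."""
    return [math.dist(kpts[i], kpts[j]) for i, j in SKELETON]
```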
Replies: 0 · Reposts: 0 · Likes: 4 · Views: 613
Satya Mallick@LearnOpenCV·
Karpathy's framing of the AI debate is the cleanest I've seen: Two groups. Same industry. Opposite conclusions.
→ Group 1: judged AI on free/old models. Saw the failures. Wrote it off.
→ Group 2: uses frontier models for hard technical work. Progress feels shocking.
But here's the part most people miss — even frontier users underestimate where the gains are concentrated.
It's not casual writing. Not general search. Not everyday advice.
It's coding. Math. Research. Terminal work. Anywhere a system can cleanly verify: did this work, yes or no?
Clean signals → RL eats the domain.
The capability frontier and public perception are diverging fastest exactly where the economic value is.
#AI #Karpathy #LLM
Replies: 0 · Reposts: 0 · Likes: 2 · Views: 395
Satya Mallick reposted
Nassim Nicholas Taleb@nntaleb·
Those who treat humans as machines are also treating machines as humans.
Replies: 160 · Reposts: 1.2K · Likes: 6.1K · Views: 187.3K