Satya Mallick

3.3K posts

@LearnOpenCV

CEO, https://t.co/CzUdJlxzJM. Course Director, https://t.co/O2Tz9vUOQ8. Entrepreneur. Ph.D. (Computer Vision & Machine Learning). Author: https://t.co/olraDEG5Ue

San Diego, CA · Joined June 2008
898 Following · 15.1K Followers
Satya Mallick @LearnOpenCV
What if accurate depth maps could be generated from a single RGB image — without LiDAR or stereo cameras? That's exactly what Depth Anything V2 achieves.

In 2024, monocular depth estimation reached a major breakthrough:
✔ Fast
✔ Lightweight
✔ Temporally stable
✔ Edge-device friendly

Instead of relying on massive diffusion pipelines, Depth Anything V2 uses a highly optimized Vision Transformer architecture trained on millions of pseudo-labeled real-world images. The result? Real-time, surprisingly stable depth estimation from just one camera.

This has massive implications for:
• Robotics
• AR/VR
• Autonomous systems
• Smart cameras
• 3D scene understanding

One of the most exciting things is how deployable it is compared to heavier depth models.

Technical breakdown by LearnOpenCV: LearnOpenCV – Depth Anything Explained
Research paper: Depth Anything V2 Paper

#AI #ComputerVision #OpenCV #DepthAnythingV2 #MachineLearning #DeepLearning #Robotics #EdgeAI #VisionTransformer #ArtificialIntelligence
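A quick way to try this yourself is the Hugging Face depth-estimation pipeline. A minimal sketch, assuming the Depth-Anything-V2-Small checkpoint ID on the Hub and a local test image:

```python
# Minimal monocular depth sketch via Hugging Face transformers.
# The checkpoint ID is an assumption; verify the exact name on the Hub.
from transformers import pipeline
from PIL import Image

depth = pipeline("depth-estimation",
                 model="depth-anything/Depth-Anything-V2-Small-hf")

image = Image.open("room.jpg")   # any single RGB image, no LiDAR/stereo
result = depth(image)            # one forward pass of the ViT backbone

result["depth"].save("room_depth.png")  # relative depth map as an image
```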
Satya Mallick retweeted
OpenAI @OpenAI
Introducing GPT-Realtime-2 in the API: our most intelligent voice model yet, bringing GPT-5-class reasoning to voice agents. Voice agents are now real-time collaborators that can listen, reason, and solve complex problems as conversations unfold. Now available in the API alongside streaming models GPT-Realtime-Translate and GPT-Realtime-Whisper — a new set of audio capabilities for the next generation of voice interfaces.
Satya Mallick @LearnOpenCV
The four benefits in order of impact:
1. Prevents overfitting (the big one)
2. Adversarial robustness
3. Augments small datasets
4. Softer decision boundaries

Used by experts. Skipped by most novices. Don't be a novice.
Satya Mallick @LearnOpenCV
The full formula:

x_mix = λ·x₁ + (1−λ)·x₂
y_mix = λ·y₁ + (1−λ)·y₂

where λ ~ Beta(α, α)

Same λ for pixels AND labels — that consistency is the whole trick.

Paper: arxiv.org/abs/1710.09412
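A minimal PyTorch sketch of that formula; the function name and the α = 0.2 default are illustrative (the paper treats α as a hyperparameter):

```python
# Mixup sketch: blend a batch with a shuffled copy of itself,
# using the SAME λ for pixels and labels, per the formula above.
import torch
import torch.nn.functional as F

def mixup(x, y, num_classes, alpha=0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample()  # λ ~ Beta(α, α)
    perm = torch.randperm(x.size(0))        # random partner for each sample
    x_mix = lam * x + (1 - lam) * x[perm]   # blend pixels
    y_soft = F.one_hot(y, num_classes).float()
    y_mix = lam * y_soft + (1 - lam) * y_soft[perm]  # blend labels, same λ
    return x_mix, y_mix

# Train on the soft targets, e.g. F.cross_entropy(model(x_mix), y_mix)
```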
Satya Mallick @LearnOpenCV
Most CV novices skip this. Most experts use it on every classifier.

Mixup: blend two training images + blend their labels with the same λ.

Result: less overfitting, smoother boundaries, adversarial robustness.

Part 1 explains how it works ↓
Part 2 (PyTorch how-to) coming soon — follow for the drop. 🎥
Satya Mallick @LearnOpenCV
Part 2 🧊

In Part 1: accuracy is a trap.
In Part 2: failure modes ARE your fine-tuning dataset.

Probe the public model → collect data on exactly what it breaks on → fine-tune → repeat.

That's the loop most CV teams skip. Dr. Satya Mallick 👇

#ComputerVision #AI #YOLO #FineTuning
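A hedged sketch of that loop in Python; every name here is a hypothetical placeholder for whatever inference and training interface your stack actually exposes, not a real library API:

```python
# Probe -> collect failures -> fine-tune -> repeat.
# `model.predict` and `model.finetune` are hypothetical stand-ins.
def hardening_loop(model, labeled_stream, rounds=3, min_conf=0.5):
    for _ in range(rounds):
        failures = []
        for image, label in labeled_stream():
            pred = model.predict(image)
            # keep exactly the cases the current model breaks on
            if pred.label != label or pred.confidence < min_conf:
                failures.append((image, label))
        if not failures:
            break
        model = model.finetune(failures)  # failure modes ARE the dataset
    return model
```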
Satya Mallick @LearnOpenCV
Accuracy is table stakes. Failure modes decide whether your CV model survives production. Same benchmark scores. Opposite real-world performance. Dr. Satya Mallick on what to audit before you ship 👇 #ComputerVision #MachineLearning
Satya Mallick @LearnOpenCV
Karpathy's clearest take of the year: The AI debate isn't a debate. It's two groups looking at different parts of the curve.

→ Consumer AI still fumbles simple tasks
→ Frontier coding agents collapse days of work into hours

Both are real. The progress is just uneven. 🧵 Part 2 ↓

1/ Why is the gap so wide? Karpathy says: capability follows capital. The most valuable AI use cases sit in B2B technical work. Save an engineer 2 days → that's worth real money. So that's where optimization goes.

2/ Which is why coders feel "AI shock" the hardest. They're using state-of-the-art models with terminal + repo + test access — watching them solve problems that used to take a sprint. The steepest part of the curve is hitting them first.

3/ The mistake on both sides:
❌ Don't judge all AI by weak public demos.
❌ Don't assume strong technical demos = equal progress everywhere.

The truth is messier. AI is clumsy in some places and astonishing in others — at the same time. That gap is the whole story.

#AI #Karpathy #LLM #AIShock
Satya Mallick @LearnOpenCV
YOLOE = real-time object detection with NO retraining.

Type "delivery driver in a red jacket" → it finds them.

Zero-shot. Open vocabulary. YOLO speed.

The closed-world era of computer vision is over. 🧵👇

🔗 vist.ly/42jd3

#YOLOE #ComputerVision #AI #DeepLearning #YOLO

Optional thread continuation (if you want to expand):

1/ Standard YOLO models are locked into a fixed list of categories (e.g. COCO's 80 classes). Want to detect something new? Weeks of labeling + retraining.

2/ YOLOE solves this with RepRTA — Re-parameterizable Region-Text Alignment. It maps language directly onto pixels, with zero inference overhead.

3/ The speed problem is solved by SAVPE — Semantic-Activated Visual Prompt Encoder. It bakes your prompt into the detection head, so you keep traditional YOLO real-time performance.

4/ Net result: change a text string → reprogram your detector. On a drone. A factory cam. A robot. No new dataset. No retraining.

5/ This is the foundation for embodied AI — robots that navigate messy rooms and find objects they've never seen before.

Full breakdown 👉 vist.ly/42jd3
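In practice, the text-prompt workflow looks something like the sketch below. It follows the Ultralytics YOLOE interface as I understand it; the weight filename and image path are assumptions, so check the official docs before relying on it:

```python
# Open-vocabulary detection sketch: the prompt is just a text string.
# Weight name and image path are assumptions; see the Ultralytics docs.
from ultralytics import YOLOE

model = YOLOE("yoloe-11s-seg.pt")

# "Retraining" is replaced by encoding the prompt into class embeddings.
names = ["delivery driver in a red jacket"]
model.set_classes(names, model.get_text_pe(names))

results = model.predict("street.jpg")  # standard YOLO-speed inference
results[0].show()
```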
Satya Mallick @LearnOpenCV
What if image generators actually understand what they create? Vision Banana proves it.

One model handles:
→ Object detection
→ Instance segmentation
→ Metric depth from a single photo
→ Surface normal estimation

No specialist models. Beats SAM 3 on multiple segmentation benchmarks.

Generative backbones were generalist vision learners all along.

#VisionBanana #ComputerVision #AI
Satya Mallick retweeted
Jitendra MALIK @JitendraMalikCV
@jon_barron "World models" has a technical meaning: the transition model/dynamics model from Bellman/Kalman in the context of MDPs / the state-space approach to control theory, ~1960. I gave a talk on this history: youtube.com/watch?v=9B4kka…
Satya Mallick @LearnOpenCV
YOLO26-Pose tracks 17 human keypoints in a single forward pass. Smallest variant: 1.8 ms on a T4 GPU. ⚡

→ RLE for sharper localization
→ NMS-free inference (predictable latency)
→ MuSGD for stable training

Full breakdown 👇 learnopencv.com/yolo26-pose-es…

#ComputerVision #YOLO26

Optional thread version:

1/ YOLO26-Pose is here. It predicts the full human skeleton in a single forward pass — shoulders, elbows, wrists, hips, knees, ankles. 17 COCO keypoints, real-time.

2/ The smallest variant runs at ~1.8 ms on a T4. That's deployable. Fitness, sports analytics, gesture control, rehab, safety — all on the table.

3/ What's new architecturally:
RLE → better keypoint localization
NMS-free → predictable latency
MuSGD (SGD + Muon hybrid) → more stable training

4/ We tested it on yoga, karate, dance, gym, parkour, multi-person. Full LearnOpenCV tutorial walks through architecture, code, and raw outputs:
🔗 vist.ly/428rz

5/ Want the deep dive on why NMS-free matters for edge deployment? Companion piece here:
🔗 vist.ly/428r2
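A minimal inference sketch, assuming YOLO26-Pose ships through the usual Ultralytics API; the "yolo26n-pose.pt" weight name is inferred from the post, and the call pattern mirrors existing YOLO pose models:

```python
# Single-pass pose inference sketch via the Ultralytics API.
# The weight filename is an assumption based on the post above.
from ultralytics import YOLO

model = YOLO("yolo26n-pose.pt")
results = model("gym.jpg")        # one forward pass, NMS-free

kpts = results[0].keypoints.xy    # (num_people, 17, 2) COCO keypoints
print(kpts.shape)
```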
Satya Mallick @LearnOpenCV
Karpathy's framing of the AI debate is the cleanest I've seen: Two groups. Same industry. Opposite conclusions.

→ Group 1: judged AI on free/old models. Saw the failures. Wrote it off.
→ Group 2: uses frontier models for hard technical work. Progress feels shocking.

But here's the part most people miss — even frontier users underestimate where the gains are concentrated.

It's not casual writing. Not general search. Not everyday advice.

It's coding. Math. Research. Terminal work. Anywhere a system can cleanly verify: did this work, yes or no? Clean signals → RL eats the domain.

The capability frontier and public perception are diverging fastest exactly where the economic value is.

#AI #Karpathy #LLM
Satya Mallick retweeted
Nassim Nicholas Taleb @nntaleb
Those who treat humans as machines are also treating machines as humans.
Satya Mallick @LearnOpenCV
Vision Banana: Rethinking How AI Models See and Generalize

In this episode of Artificial Intelligence: Papers and Concepts, we explore Vision Banana, a concept that challenges how vision models learn and generalize from visual data. Instead of focusing purely on performance metrics, Vision Banana highlights how models can latch onto shortcuts and fail to truly understand the underlying structure of images.

We break down why modern vision systems can misinterpret simple variations, how dataset biases influence model behavior, and what this reveals about the gap between recognition and real understanding.

If you’re interested in computer vision, model robustness, or the limitations of current AI systems, this episode explains why Vision Banana offers an important perspective on building more reliable and generalizable visual intelligence.
Satya Mallick @LearnOpenCV
The biggest AI model is not always the best solution, especially for real-world problems that are narrow and specific. Small, purpose-built models can run faster, cost less, and be deployed directly on devices, making them far more practical. The future of AI is about using the right model for the right job, not just the largest one.
Satya Mallick @LearnOpenCV
Position Encoding: How Transformers Understand Order in Data

In this episode of Artificial Intelligence: Papers and Concepts, we explore Position Encoding, a fundamental concept that enables transformer models to understand the order of information. Since transformers process data in parallel rather than sequentially, position encoding provides the missing sense of sequence, helping models distinguish between “what came first” and “what comes next.”

We break down why order matters in language and sequence-based tasks, how different encoding techniques inject positional information into models, and what this means for performance in applications like text generation, translation, and beyond.

If you’re interested in transformer architecture, sequence modeling, or the building blocks behind modern AI systems, this episode explains why position encoding is essential for making sense of structured data.
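For a concrete taste, here is the classic sinusoidal scheme from the original Transformer paper, shown as a minimal PyTorch sketch; it is one common technique among those the episode surveys, and assumes an even d_model:

```python
# Sinusoidal position encoding from "Attention Is All You Need".
# Each position gets a unique pattern of sines and cosines that the
# model can use to tell "what came first" from "what comes next".
import torch

def sinusoidal_pe(seq_len, d_model):
    pos = torch.arange(seq_len).unsqueeze(1).float()   # (seq_len, 1)
    i = torch.arange(0, d_model, 2).float()            # even dimension indices
    angle = pos / torch.pow(10000.0, i / d_model)      # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                     # even dims get sin
    pe[:, 1::2] = torch.cos(angle)                     # odd dims get cos
    return pe                                          # added to token embeddings

pe = sinusoidal_pe(seq_len=128, d_model=512)
print(pe.shape)  # torch.Size([128, 512])
```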