Sameer goel
118 posts

Sameer goel
@sameer_goel
Computer Vision Engineer. Fixing broken CV deployments (ONNX, opsets, latency). PyTorch → real-time inference systems. Production-first.
India · Joined February 2026
46 Following · 38 Followers

“Please” and “thank you” just tokenize into a handful of extra subword units: literally a few more integers in the sequence. The transformer still runs full attention over the entire context, so the marginal compute from politeness is noise next to the ~O(n²) attention cost of long prompts, chain-of-thought, or giant context windows.
If you’re optimizing GPU usage, those words are a rounding error; the real cost is in sequence length and repeated tokens, not courtesy.
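A rough back-of-the-envelope sketch of that claim, assuming attention cost per layer scales as about 4·n²·d FLOPs (QKᵀ plus the weighted sum over V). The function names and the `d_model=4096` default are illustrative assumptions, not measurements from any specific model:

```python
def attention_flops(n_tokens: int, d_model: int) -> int:
    # Assumption: QK^T and the attention-weighted sum over V
    # each cost roughly 2 * n^2 * d FLOPs per layer.
    return 4 * n_tokens * n_tokens * d_model


def marginal_overhead(n_tokens: int, extra_tokens: int, d_model: int = 4096) -> float:
    # Relative increase in attention FLOPs from appending extra_tokens
    # to a prompt that already has n_tokens.
    base = attention_flops(n_tokens, d_model)
    longer = attention_flops(n_tokens + extra_tokens, d_model)
    return (longer - base) / base


if __name__ == "__main__":
    # Four "polite" tokens on a 4,000-token prompt vs. doubling the prompt.
    print(f"politeness: {marginal_overhead(4000, 4):.4%}")
    print(f"doubling the prompt: {marginal_overhead(4000, 4000):.0%}")
```

Under these assumptions a few courtesy tokens add a fraction of a percent, while doubling the context triples the added attention cost.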

@sflorimm Human–AI integration: tighter loops, possibly through interfaces like brain-computer links, where the boundary between using AI and thinking with it blurs.

@chribjel Great, so now we’re optimizing LLM costs by inventing employees again. Full circle innovation.

If quantum breaks Bitcoin, it does not instantly nuke the entire internet.
•HTTPS doesn’t rely on a single primitive
•Systems can rotate keys, switch algorithms, and deploy post-quantum crypto
•Most infrastructure is upgradeable; it’s not a one-shot collapse
Also:
•Bitcoin specifically relies on exposed public keys → easier target surface
•Many banking systems don’t expose keys the same way
•Military / critical systems already plan for crypto migration
So no, it’s not:
“quantum = instant apocalypse”
It’s:
gradual break → patch → migrate → repeat
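The "switch algorithms" point is often called crypto agility. A minimal sketch, assuming a message envelope that names its own algorithm so a verifier can retire a broken one by policy. The registry, the envelope shape, and the use of HMAC variants are stand-ins for illustration only; a real post-quantum migration would swap signature or key-exchange schemes, not hash functions:

```python
import hashlib
import hmac

# Hypothetical registry: algorithm identifier -> digest constructor.
ALGORITHMS = {
    "hmac-sha256": hashlib.sha256,
    "hmac-sha512": hashlib.sha512,
}


def sign(message: bytes, key: bytes, alg: str) -> dict:
    # The envelope carries the algorithm name alongside the tag.
    tag = hmac.new(key, message, ALGORITHMS[alg]).hexdigest()
    return {"alg": alg, "tag": tag}


def verify(message: bytes, key: bytes, envelope: dict, allowed: set) -> bool:
    # Policy check first: a deprecated algorithm is rejected
    # even if its tag is mathematically valid.
    alg = envelope["alg"]
    if alg not in allowed or alg not in ALGORITHMS:
        return False
    expected = hmac.new(key, message, ALGORITHMS[alg]).hexdigest()
    return hmac.compare_digest(expected, envelope["tag"])


if __name__ == "__main__":
    key, msg = b"demo-key", b"transfer 10 units"
    old = sign(msg, key, "hmac-sha256")
    new = sign(msg, key, "hmac-sha512")
    # After migration, only the newer algorithm is accepted.
    print(verify(msg, key, old, {"hmac-sha512"}))  # rejected by policy
    print(verify(msg, key, new, {"hmac-sha512"}))  # accepted
```

Migration then becomes a change to the allowed set plus re-issuing keys, which is the "patch → migrate" step rather than a one-shot collapse.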

What’s even more interesting is that this turns optimization into a runtime property, not a training artifact: the model isn’t getting smarter, the system design is.
Given a fixed set of weights, it’s effectively performing online meta-optimization over its own execution graph: tuning prompts, tool policies, memory structures, and control flow in a closed loop.
At that point, the real question isn’t performance, it’s stability:
•Does the harness converge or drift?
•Can it overfit to its own eval loop?
•What prevents degenerate self-reinforcing behaviors?
Because once agents can rewrite their own scaffolding,
you’re no longer optimizing outputs; you’re optimizing the process that generates them.

The first AI that improves without retraining.
(it rewrites its own agent harness)
Every developer I know has one thing in common: they obsess over their setup. The terminal, the scripts, the shortcuts. They don't just write code. They constantly refine how they work.
The code gets better because the environment gets better.
MiniMax just released M2.7, and I think the most interesting thing about it isn't a benchmark number. It's the fact that M2.7 improves its own agent harness. Autonomously.
Let's break this down:
When you run an AI agent today, it operates inside a "harness." Think of it as the agent's operating environment: the skills it can invoke, the tools it can call, its memory, and the rules it follows. Normally, a human engineer builds this harness, and the agent operates within it. The harness stays fixed.
M2.7 treats its harness as something it can rewrite.
Here's what the loop looks like:
- The agent runs a task and analyzes where things went wrong
- It plans changes to its own scaffold: skills, MCPs, memory
- It applies those changes, runs evaluations against a benchmark
- It compares the results and decides whether to keep or revert
- It writes self-criticism into memory so the next round starts smarter
Then it loops back and does it again. And again.
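The loop above can be sketched as a keep-or-revert search over a harness configuration. This is a toy illustration under stated assumptions, not MiniMax's implementation: the harness is reduced to a dict of numbers, and `propose`, `evaluate`, and the memory notes are hypothetical placeholders for the agent's planning, benchmark runs, and self-criticism:

```python
import random
from typing import Callable, Dict, List, Tuple

Harness = Dict[str, float]  # hypothetical: a harness reduced to tunable numbers


def self_improve(
    harness: Harness,
    propose: Callable[[Harness, List[str]], Harness],
    evaluate: Callable[[Harness], float],
    rounds: int,
) -> Tuple[Harness, float, List[str]]:
    memory: List[str] = []  # notes carried into later rounds
    best_score = evaluate(harness)
    for i in range(rounds):
        candidate = propose(harness, memory)  # plan and apply a change
        score = evaluate(candidate)           # run the benchmark
        if score > best_score:                # keep only improvements
            harness, best_score = candidate, score
            memory.append(f"round {i}: kept change, score {score:.4f}")
        else:                                 # otherwise revert
            memory.append(f"round {i}: reverted change, score {score:.4f}")
    return harness, best_score, memory


def toy_evaluate(h: Harness) -> float:
    # Toy benchmark: pretend the best sampling temperature is 0.3.
    return -(h["temperature"] - 0.3) ** 2


def make_toy_propose(seed: int) -> Callable[[Harness, List[str]], Harness]:
    rng = random.Random(seed)

    def propose(h: Harness, memory: List[str]) -> Harness:
        # Returns a new dict so the current harness is never mutated.
        return {"temperature": h["temperature"] + rng.uniform(-0.2, 0.2)}

    return propose


if __name__ == "__main__":
    final, score, notes = self_improve(
        {"temperature": 1.0}, make_toy_propose(0), toy_evaluate, rounds=50
    )
    print(final, score, len(notes))
```

Because a change is kept only when the score improves, the tracked score never decreases; whether the eval itself stays trustworthy is a separate question.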
Think of it like a developer who finishes a project, writes a retrospective, restructures their workflow based on what they learned, and shows up the next day with a better setup. Except the developer here is the model itself.
MiniMax ran this self-optimization loop for over 100 rounds internally. Along the way, the model discovered things on its own: it systematically searched for optimal sampling parameters (temperature, penalties), wrote workflow-specific guidelines for itself (like automatically checking for the same bug pattern in other files after a fix), and even added loop detection to avoid getting stuck.
No human had to tell it to do any of this.
They also tested this in a more controlled setting. They had M2.7 compete in 22 ML competitions from OpenAI's MLE Bench Lite. Each trial ran for 24 hours, fully autonomous. After each iteration, the agent wrote a memory file and performed self-criticism, feeding those insights into the next round.
With every round, the ML models it trained achieved higher medal rates. The best run earned 9 gold medals.
I've summarized the self-evolving architecture in the graphic below.
The reason I find this compelling: this isn't about making a smarter model. It's about making a model that makes itself smarter. The weights never change. What changes is the system around it: better skills, better memory, better workflow rules. And that distinction matters because it means the improvement loop can run continuously without any retraining.
We're entering a phase where agents don't just follow instructions. They redesign their own playbook.
If you want to learn more, I've shared a link to their official blog post in the next tweet.

2 questions:
What is the latency improvement from removing NMS compared to total inference time on GPU?
When an image is passed through a traditional YOLO model, at what stage is NMS applied, and how does it process multiple overlapping bounding box predictions to produce the final detections?
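For context on the second question, a minimal sketch of greedy NMS as it is commonly described: it runs as post-processing on the decoded boxes, after the network's forward pass. This is plain Python for illustration, not the Ultralytics implementation; the box format `(x1, y1, x2, y2)` and the 0.5 threshold are assumptions:

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    # Intersection-over-union of two axis-aligned boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float], iou_threshold: float = 0.5) -> List[int]:
    # Visit boxes from highest to lowest confidence; keep a box only if it
    # does not overlap too much with any box already kept.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))  # indices of the surviving boxes
```

The pairwise comparison is the part whose cost grows with the number of candidate boxes, which is why its share of total latency depends on how many detections survive the confidence filter.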

Real-time object detection will never be the same.
Traditional YOLO needs NMS to remove duplicate boxes; it's slow and inconsistent.
YOLO26 skips it entirely: single-pass predictions, faster inference and up to 300 detections per image.
Download model: platform.ultralytics.com/ultralytics/yo…
Akshay 🚀@akshay_pachaar

@tom_doerr How would you handle noisy raster PDFs (scanned images) vs vector PDFs in the same pipeline?

@GergelyOrosz What’s interesting is not the leak itself, but that a from-scratch Python reimplementation is reportedly matching or surpassing the original from Anthropic.

The interesting part isn’t that Meta is optimizing harnesses — it’s that they’re trying to solve the credit assignment mess we’ve all been ignoring.
Because in practice:
•a change in prompt/tooling today
•affects eval scores hours or days later
•across multiple tasks and traces
and nobody really knows which change actually mattered.
Meta-Harness feels like the inevitable direction:
treat the entire harness (prompts, tools, routing, evals) as one optimization surface, instead of patching pieces blindly.
If this works, “prompt engineering” stops being artisanal tweaking and starts looking more like gradient descent over workflows.

Meta just dropped the Efficient Universal Perception Encoder on Hugging Face — curious how it plays with VLM-style fine-tuning.
Like:
•does it take LoRAs cleanly across both vision + language bridges
•how well multi-teacher distilled features adapt under low-rank updates
•whether you can stack lightweight adapters instead of retraining heads

The inevitable part isn’t the attack — it’s the speed.
AI is already writing code, reviewing PRs, and publishing packages, so of course it’s also accelerating supply chain attacks. The window between malicious publish and production impact is collapsing to minutes.
Which basically forces a new equilibrium:
AI attackers vs AI defenders, both operating faster than humans can even context-switch.
The takeaway isn’t “nice catch”; it’s that manual review as a security layer is quietly becoming irrelevant.

Devin Review caught the axios supply chain attack for multiple Cognition customers before the attack was publicly known.
These attacks will be 10x more frequent in the age of AI; it is critical that repo maintainers start using AI for defense as well.
(showing one example below where Devin Review caught the attack within an hour of its release - text minorly edited for anonymization)


RF-DETR is the best open-source detector for aerial and drone footage…
but now I’m just wondering how it handles:
•a half-visible bike under a tree shadow at 200m altitude
•3 pixels pretending to be a human
•cars that are basically just vibes + motion blur
•and that one guy wearing camouflage who is literally the background

RF-DETR is the best open-source detector for aerial and drone footage
link: github.com/roboflow/rf-de…

Claude rolling out “enterprise-grade security”…
Meanwhile:
attack vector: view-source
impact: full source disclosure
severity: politely labeled “oops”
Somewhere in a security report:
Threat model included prompt injection, jailbreaks, adversarial attacks…
Did not include “someone opens the .map file”
Red team:
“Did you exfiltrate weights?”
“No.”
“Find a jailbreak?”
“No.”
“…so what did you do?”
“I clicked a CDN link.”
Zero-day exploits ❌
Zero-click curiosity exploit ✅

Claude Code source code has been leaked via a map file in their npm registry!
Code: …a8527898604c1bbb12468b1581d95e.r2.dev/src.zip


Not that surprising if you look under the hood.
MoE ≠ full model capacity per token.
You’re routing through a few experts, not the entire network.
So in tasks like UI recreation:
• layout consistency
• spatial reasoning
• deterministic structure
a smaller dense model can actually be more stable.
Bigger helps with breadth.
Not always with precision.
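A minimal sketch of top-k expert routing, to show why only a slice of an MoE model's capacity touches each token. The gate logits and function names are made up for illustration, and real routers add details (load balancing, shared experts) that are omitted here:

```python
import math
from typing import List, Tuple


def top_k_route(gate_logits: List[float], k: int) -> List[Tuple[int, float]]:
    # Pick the k experts with the highest gate logits, then softmax
    # over just those k so their mixing weights sum to 1.
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def active_expert_fraction(num_experts: int, k: int) -> float:
    # Share of expert blocks that actually run for one token.
    return k / num_experts


if __name__ == "__main__":
    # Hypothetical gate output for one token over 4 experts.
    print(top_k_route([0.1, 2.0, -1.0, 1.5], k=2))
    print(active_expert_fraction(num_experts=4, k=2))
```

Each token sees only the experts the router picks, so two similar tokens can land on different experts, which is one plausible source of the instability described above.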

@heynavtoor 397B parameters… running on a MacBook.
At this point the laptop isn’t “running a model” — it’s just politely pretending not to be a data center.
Cloud GPU startups watching this like:
“yeah… but can your SSD scale horizontally?”

🚨 397 billion parameters. On a MacBook. No cloud. No GPU cluster. No data center. A laptop.
Someone ran one of the largest AI models on Earth on a machine you can buy at the Apple Store.
It's called flash-moe.
A pure C and Metal inference engine that runs Qwen3.5-397B on a MacBook Pro with 48GB RAM. At 4.4 tokens per second. With tool calling.
No Python. No PyTorch. No frameworks. Just raw C and hand-tuned Metal shaders.
Here's why this should not be possible:
→ The model is 209GB. The laptop has 48GB of RAM.
→ It streams the entire model from the SSD in real time
→ Only loads the 4 experts needed per token out of 512
→ Uses just 5.5GB of actual memory during inference
→ Production-quality output with full tool calling
→ 58 experiments. Hand-optimized Metal compute kernels.
→ The entire engine is ~7,000 lines of C and ~1,200 lines of Metal shaders
Here's the wildest part:
One person built this. A VP of AI at CVS Health. Not Google. Not OpenAI. A healthcare company executive. Side project. Used Claude Code as his coding partner. Built the entire engine in 24 hours.
Running a 397B model on cloud GPUs costs hundreds of dollars per hour. Companies spend millions per year on inference infrastructure for models this size.
This runs on a $3,499 laptop. Offline. Private. No API key. No monthly bill. Forever.
Trending on GitHub. 332 points on Hacker News.
100% Open Source.


@hxxwhite No they aren’t.
Manual QA doesn’t die; it gets pushed up the stack.
Vision agents can click buttons.
They can’t:
• understand intent behind features
• catch subtle UX regressions
• question why something exists
What dies is repetitive QA.
What survives is judgment.