Frank Lin

1.8K posts

@developerlin

Create Limitless Value with AI | Exploring AI’s Future

Joined October 2007

713 Following · 87 Followers

Frank Lin retweeted

Songyou Peng@songyoupeng·20h

Yay, finally! Introducing Vision Banana🍌 from @GoogleDeepMind, our unified model that outperforms SoTA specialist models on various vision tasks! By treating 2D/3D vision tasks as image generation, we unlock a new foundation for CV. Project page: vision-banana.github.io (1/5)

257

1.9K

180.6K

Frank Lin@developerlin·7h

@fMinZhou coool

151

Min Zhou@fMinZhou·1d

GPT Image 2 is insanely good...I generated a 360° equirectangular panorama in Happycapy with just a skill + prompt. Step 1: Select the generate-image skill Step 2: Enter a prompt like: “Use a frontend 360 viewer to display an equirectangular image of […] using the GPT-Image-2 model.” Wanna see how you all get creative with this

397

3.8K

270.4K

Frank Lin@developerlin·15 Apr

very cool

Andrew Ng@AndrewYNg

I'm excited about voice as a UI layer for existing visual applications — where speech and screen update together. This goes well beyond voice-only use cases like call center automation. The barrier has been a hard technical tradeoff: low-latency voice models lack reliability, while agentic pipelines (speech-to-text → LLM → text-to-speech) are intelligent but too slow for conversation. Ashwyn Sharma and team at Vocal Bridge (an AI Fund portfolio company) address this with a dual-agent architecture: a foreground agent for real-time conversation, a background agent for reasoning, guardrails, and tool calls. I used Vocal Bridge to add voice to a math-quiz app I'd built for my daughter; this took less than an hour with Claude Code. She speaks her answers, the app responds verbally and updates the questions and animations on screen. Only a tiny fraction of developers have ever built a voice app. If you'd like to try building one, check out Vocal Bridge for free: vocalbridgeai.com

Frank Lin retweeted

Carlos Miguel Patiño@cmpatino_·12 Apr

On-policy distillation with 100B+ teacher models is now possible in TRL, and up to 40x faster than naive implementations! We distilled Qwen3-235B into a 4B student and gained 39+ points on AIME25. Two engineering optimizations made it work. Blogpost: huggingface.co/spaces/Hugging…

354

27.2K

Frank Lin retweeted

Guri Singh@heygurisingh·11 Apr

🚨BREAKING: An open-source agentic video production system just dropped. 11 pipelines, 49 tools, and a full product ad produced for $0.69 total.
It's called OpenMontage. And it's not a text-to-video tool. It's a full production orchestration system where your AI coding assistant (Claude Code, Cursor, Copilot, Windsurf) becomes the director. Describe what you want in plain language. The agent researches, scripts, generates assets, edits, and renders the final video.
Here's what the pipeline actually does:
→ Live web research first: 15-25+ searches across YouTube, Reddit, news sites before writing a single word of script
→ 12 video generation providers: Kling, Runway Gen-4, Google Veo 3, MiniMax, plus local GPU options (WAN 2.1, Hunyuan, CogVideo)
→ 8 image generation providers: FLUX, Google Imagen 4, DALL-E 3, Stable Diffusion locally
→ 4 TTS providers: ElevenLabs, Google (700+ voices), OpenAI, and Piper offline for free
→ WhisperX word-level subtitles burned in automatically
→ Remotion for React-based animated composition with spring physics, transitions, TikTok-style captions
→ Budget governance: cost estimate before execution, per-action approval above $0.50, hard cap at $10
Here's the wildest part: One product ad. 4 AI-generated images, TTS narration, royalty-free music, word-level subtitles, Remotion data visualizations. Total cost: $0.69. Zero manual asset work.
Works with zero API keys too. Piper narrates locally, Pexels/Pixabay provide free stock, Remotion animates everything. No spend required to start.
100% Open Source. AGPL v3 License. (Link in the comments)
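The budget-governance rule described in the post (cost estimate before execution, per-action approval above $0.50, hard cap at $10) could be sketched roughly as follows. This is not OpenMontage's code: the `Budget` class and its method names are hypothetical, and only the two dollar thresholds come from the post.

```python
from dataclasses import dataclass

APPROVAL_THRESHOLD = 0.50  # from the post: per-action approval above $0.50
HARD_CAP = 10.00           # from the post: hard cap at $10


@dataclass
class Budget:
    """Hypothetical budget tracker; not OpenMontage's actual API."""

    spent: float = 0.0

    def check(self, estimate: float) -> str:
        """Classify an action by its estimated cost before it runs."""
        if self.spent + estimate > HARD_CAP:
            return "reject"          # would push total spend past the hard cap
        if estimate > APPROVAL_THRESHOLD:
            return "needs_approval"  # expensive enough to ask the user first
        return "auto"                # cheap enough to run without prompting

    def record(self, cost: float) -> None:
        """Add the actual cost of a completed action to the running total."""
        self.spent += cost
```

Checking the cap before the approval threshold means an action that cannot fit in the remaining budget is rejected outright instead of prompting the user for something that would fail anyway.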

175

1.1K

110.6K

Frank Lin@developerlin·8 Apr

@PawelHuryn Does it support codex sessions?

Paweł Huryn@PawelHuryn·7 Apr

Claude Code doesn't show you how many tokens you're using for subscriptions. No breakdown by model. No breakdown by project. Just a progress bar that says "63% used." So I built a local dashboard that reads the files Claude Code already writes to your machine.
Turns out every session, every turn, every token is logged to ~/.claude/projects/ in JSONL files. Input tokens, output tokens, cache reads, cache creation, model name, timestamp. It's all there. You just can't see it.
My numbers over the last 30 days: 440 sessions. 18,000 turns. $1,588 in API-equivalent costs. On one day, the cache spiked to 700M tokens - visible cache bug, two days in a row.
The dashboard scans those local files, builds a SQLite database, and serves charts on localhost:8080. Filter by model (Opus, Sonnet, Haiku). Filter by time range (7d, 30d, 90d, all time). Cost estimates based on current Anthropic API pricing. Works retroactively. First run processes your entire Claude Code history.
Install:
git clone github.com/phuryn/claude-…
cd claude-usage
python3 cli.py dashboard
Windows: use python instead of python3.
Zero dependencies. Python standard library only. Open source, MIT. Star it. Fork it. Make it your own.
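The scan-then-aggregate idea in the post (read JSONL records, load them into SQLite, total tokens per model) might look something like the sketch below, using only the standard library. It is not the dashboard's code. The flat record keys `model`, `input_tokens` and `output_tokens` are assumptions made for illustration, since the post names the logged quantities but not the schema, and the function names are hypothetical.

```python
import json
import sqlite3
from typing import Iterable


def load_usage(lines: Iterable[str]) -> sqlite3.Connection:
    """Parse JSONL usage records into an in-memory SQLite table.

    The keys "model", "input_tokens" and "output_tokens" are assumed
    for illustration; the real log schema may differ.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE usage (model TEXT, input_tokens INTEGER, output_tokens INTEGER)"
    )
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the scan
        if not isinstance(rec, dict):
            continue  # only JSON objects are treated as records
        conn.execute(
            "INSERT INTO usage VALUES (?, ?, ?)",
            (
                rec.get("model", "unknown"),
                int(rec.get("input_tokens", 0)),
                int(rec.get("output_tokens", 0)),
            ),
        )
    return conn


def totals_by_model(conn: sqlite3.Connection) -> dict:
    """Return {model: (total_input_tokens, total_output_tokens)}."""
    rows = conn.execute(
        "SELECT model, SUM(input_tokens), SUM(output_tokens) "
        "FROM usage GROUP BY model ORDER BY model"
    )
    return {model: (inp, out) for model, inp, out in rows}
```

To run it against real logs, one would feed it the lines of each `*.jsonl` file found under `~/.claude/projects/` (the directory named in the post) and adjust the key names to whatever the files actually contain.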

128

219

2.3K

292.1K

Frank Lin retweeted

Simplifying AI@simplifyinAI·6 Apr

🚨 Someone just built a fully open-source mocap system that works with any camera. It's called FreeMoCap, a markerless 3D tracking system that runs on ordinary webcams. It turns multiple camera feeds into research-grade skeletal data automatically. 100% Open Source.

736

6.3K

355.3K

Frank Lin retweeted

Nav Toor@heynavtoor·5 Apr

🚨BREAKING: Researchers built an AI that designs better AI than humans can. It discovered 105 new architectures that beat human-designed models. Nobody guided it. It taught itself. The paper is called "ASI-Evolve: AI Accelerates AI." Published this week by researchers at Shanghai Jiao Tong University. Fully open-sourced. And what it demonstrates should stop every AI researcher cold. They built a system that runs the entire AI research loop on its own. It reads scientific papers. It forms hypotheses. It designs experiments. It runs them. It analyzes the results. Then it uses what it learned to design better experiments. Over and over. Without human intervention. They pointed it at neural architecture design first. Over 1,773 rounds of autonomous exploration, the system generated 1,350 candidate architectures. 105 of them beat the best human-designed model. The top architecture surpassed DeltaNet by +0.97 points. That is nearly 3 times the gain of the most recent human-designed state-of-the-art improvement. Humans spent years to get +0.34 points. The AI got +0.97 on its own. Then they pointed it at training data. The AI designed its own data curation strategies and improved average benchmark performance by +3.96 points. On MMLU, the most widely used knowledge benchmark, the improvement exceeded 18 points. Then they pointed it at learning algorithms. The AI invented novel reinforcement learning algorithms that outperformed the leading human-designed method GRPO by up to +12.5 points on competition math. Three pillars of AI development. Data. Architecture. Algorithms. The AI improved all three by itself. Then they tested whether what the AI built actually works in the real world. They applied an AI-discovered architecture to drug-target interaction prediction. It achieved a +6.94 point improvement in scenarios involving completely unseen drugs. The AI designed something that works better than human experts in biomedicine. 
This is the first system to demonstrate AI-driven discovery across all three foundational components of AI development in a single framework. The recursive loop is now closed. AI is building AI. And it is already better at it than we are.

212

97.3K

Frank Lin@developerlin·6 Apr

@MaziyarPanahi This idea can also be used to build an auto-annotation agent

177

Maziyar PANAHI@MaziyarPanahi·4 Apr

Gemma 4 watches raw video. Understands the scene. Then prompts SAM 3 to segment and RF-DETR to track. One AI directing two others. Fighter jets. Crowds. Aerial defense footage. All three models running locally on a MacBook. No cloud. What scene should I point this at next?

102

182

364K

Frank Lin@developerlin·6 Apr

This concept is highly valuable. Currently based on workspaces, it could also function as a skill and is capable of self-evolution. Furthermore, it can be utilized as an agent's memory.

Andrej Karpathy@karpathy

Wow, this tweet went very viral! I wanted to share a possibly slightly improved version of the tweet in an "idea file". The idea of the idea file is that in this era of LLM agents, there is less of a point/need of sharing the specific code/app, you just share the idea, then the other person's agent customizes & builds it for your specific needs. So here's the idea in a gist format: gist.github.com/karpathy/442a6… You can give this to your agent and it can build you your own LLM wiki and guide you on how to use it etc. It's intentionally kept a little bit abstract/vague because there are so many directions to take this in. And ofc, people can adjust the idea or contribute their own in the Discussion which is cool.

Frank Lin retweeted

Sida Peng@pengsida·3 Apr

The training code for InfiniDepth is now open-source. Feel free to use our framework to train a monocular depth estimation model as well as a depth sensor augmentation model on your own data. github.com/zju3dv/InfiniD…

Sida Peng@pengsida

Excited to share our work InfiniDepth (CVPR 2026) — casting monocular depth estimation as a neural implicit field, which enables: 🔍 Arbitrary-Resolution 📐 Accurate Metric Depth 📷 Large-View Novel View Synthesis Feel free to try our code: github.com/zju3dv/InfiniD…

201

15.5K

Frank Lin@developerlin·4 Apr

Interesting

Bo Wang@BoWang87

Apple Research just published something really interesting about post-training of coding models. You don't need a better teacher. You don't need a verifier. You don't need RL. A model can just… train on its own outputs. And get dramatically better. Simple Self-Distillation (SSD): sample solutions from your model, don't filter them for correctness at all, fine-tune on the raw outputs. That's it. Qwen3-30B-Instruct: 42.4% → 55.3% pass@1 on LiveCodeBench. +30% relative. On hard problems specifically, pass@5 goes from 31.1% → 54.1%. Works across Qwen and Llama, at 4B, 8B, and 30B. One sample per prompt is enough. No execution environment. No reward model. No labels. SSD sidesteps this by reshaping distributions in a context-dependent way — suppressing distractors at locks while keeping diversity alive at forks. The capability was already in the model. Fixed decoding just couldn't access it. The implication: a lot of coding models are underperforming their own weights. Post-training on self-generated data isn't just a cheap trick — it's recovering latent capacity that greedy decoding leaves on the table. paper: arxiv.org/abs/2604.01193 code: github.com/apple/ml-ssd
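The "+30% relative" figure in the post can be checked from the two pass@1 numbers it quotes. A small sketch of the arithmetic, in which the helper name `relative_gain` is just illustrative:

```python
def relative_gain(before: float, after: float) -> float:
    """Return the improvement of `after` over `before` as a fraction of `before`."""
    return (after - before) / before


abs_gain = 55.3 - 42.4                # absolute gain, in percentage points
rel_gain = relative_gain(42.4, 55.3)  # relative gain, the post's "+30% relative"
print(f"{abs_gain:.1f} points absolute, {rel_gain:.1%} relative")
```

The absolute gain is 12.9 percentage points, and dividing by the 42.4% baseline gives roughly 30.4%, which matches the rounded figure in the post.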

Frank Lin@developerlin·4 Apr

This is really cool

Massimiliano Viola@massiviola01

EUPE: Efficient Universal Perception Encoder👋 Looking for the most powerful small image models today? Guess what: researchers at @AIatMeta cooked again!🍳 This time around, not some large vision encoder. Instead, a set of lightweight and efficient ones, both ViTs and ConvNeXts, all under 100M. The smallest is a ViT-Tiny at just 6M params! But hear me out: this is the COOLEST thing ever...

Frank Lin retweeted

Google Gemma@googlegemma·2 Apr

Meet Gemma 4! Purpose-built for advanced reasoning and agentic workflows on the hardware you own, and released under an Apache 2.0 license. We listened to invaluable community feedback in developing these models. Here is what makes Gemma 4 our most capable open models yet: 👇

166

841

7.2K

621.4K

Frank Lin retweeted

Yasser Dahou@dahou_yasser·1 Apr

We are releasing Falcon Perception, an open-vocabulary referring expression segmentation model. Along with it, a 0.3B OCR model that is on par with 3-10x larger competitors. Current systems solve this with complex pipelines (separate encoders, late fusion, matching algorithms). We developed a novel simpler "bitter" approach: one early-fusion Transformer (image + text from first layer) with a shared parameter space, and let scale + training signal do the work. Please check our work ! 📄 Paper: arxiv.org/pdf/2603.27365 💻 Code: github.com/tiiuae/falcon-… 🎮 Playground: vision.falcon.aidrc.tii.ae 🤗 Blogpost: huggingface.co/blog/tiiuae/fa…

165

990

117.2K

Frank Lin retweeted

Jim Fan@DrJimFan·1 Apr

The power of the Claw, in the palm of a robot hand. Agentic robotics is here! Today, we open-source CaP-X: vibe agents, alive in the physical world. They incarnate as robot arms and humanoids with a rich set of perception APIs, actuation APIs, and auto synthesize skill libraries as they go. CaP-X is a strict superset of our old stack, because policies like VLAs are “just” API calls as well. It solves many tasks zero-shot that a learned policy would struggle with.
And we are doing much more than vibing. CaP-X is our most systematic, scientific study on agentic robotics so far:
- We build a comprehensive agentic toolkit: perception (SAM3 segmentation, Molmo pointing, depth, point cloud), control (IK solvers, grasp planner, navigation), and visualization (EEF, mask overlays) that work across different robots.
- CaP-Gym: LLM’s first Physical Exam! 187 manipulation tasks across RoboSuite, LIBERO-PRO, and BEHAVIOR. Tabletop, bimanual, mobile manipulation. Sim and real. Can’t wait to see the gradients flow from CaP-Gym to the next wave of frontier LLM releases.
- CaP-Bench: we benchmark 12 frontier LLMs/VLMs (Gemini, GPT, Opus, Qwen, DeepSeek, Kimi, and more) across 8 evaluation tiers. We systematically vary API abstraction level, agentic harness, and visual grounding methods. Lots of insights in our paper.
- CaP-Agent0: a training-free agentic harness that matches or exceeds human expert code on 4 out of 7 tasks without task-specific tuning.
- CaP-RL: if you get a gym, you get RL ;). A 7B OSS model jumps from 20% to 72% success after only 50 training iterations. The synthesized programs transfer to real robots with minimal sim-to-real gap.
3 years ago, our team created Voyager, one of the earliest agentic AI that plays and learns in Minecraft continuously. Its key ideas — skill libraries, self-reflection loops, and in-context planning — have since influenced many modern agentic designs. Today, the agent graduates from Minecraft and gets a real job.
It’s April Fool’s, but this Claw is getting its hands dirty for real! Link in thread:

114

692

65.9K

Frank Lin retweeted

Markets & Mayhem@Mayhem4Markets·31 Mar

TurboQuant is looking pretty solid. 🔥
> Original idea was to use it just for KV cache where context tokens are stored
> Now it is expanding to be used with models
> On Qwen 3.5-27B it shrinks the model down to 12.9B
> 6X memory savings vs 16-bit precision
> Stays accurate

155

1.6K

289.1K

Frank Lin retweeted

Andrej Karpathy@karpathy·31 Mar

New supply chain attack this time for npm axios, the most popular HTTP client library with 300M weekly downloads. Scanning my system I found a use imported from googleworkspace/cli from a few days ago when I was experimenting with gmail/gcal cli. The installed version (luckily) resolved to an unaffected 1.13.5, but the project dependency is not pinned, meaning that if I did this earlier today the code would have resolved to latest and I'd be pwned. It's possible to personally defend against these to some extent with local settings e.g. release-age constraints, or containers or etc, but I think ultimately the defaults of package management projects (pip, npm etc) have to change so that a single infection (usually luckily fairly temporary in nature due to security scanning) does not spread through users at random and at scale via unpinned dependencies. More comprehensive article: stepsecurity.io/blog/axios-com…

Feross@feross

🚨 CRITICAL: Active supply chain attack on axios -- one of npm's most depended-on packages. The latest axios@1.14.1 now pulls in plain-crypto-js@4.2.1, a package that did not exist before today. This is a live compromise. This is textbook supply chain installer malware. axios has 100M+ weekly downloads. Every npm install pulling the latest version is potentially compromised right now.
Socket AI analysis confirms this is malware. plain-crypto-js is an obfuscated dropper/loader that:
• Deobfuscates embedded payloads and operational strings at runtime
• Dynamically loads fs, os, and execSync to evade static analysis
• Executes decoded shell commands
• Stages and copies payload files into OS temp and Windows ProgramData directories
• Deletes and renames artifacts post-execution to destroy forensic evidence
If you use axios, pin your version immediately and audit your lockfiles. Do not upgrade.
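The advice to pin versions can be partly checked mechanically. The sketch below flags any dependency in a `package.json` whose specifier is not an exact version. The function name is hypothetical, and the regular expression is a deliberate simplification of npm's range syntax: it accepts only plain `MAJOR.MINOR.PATCH` strings (with optional pre-release or build suffix) and treats everything else as unpinned. It does not inspect lockfiles, which still need a separate audit.

```python
import json
import re

# An exact version such as "1.13.5". Anything else ("^1.13.5", "~1.13",
# "latest", "*", git URLs) is treated as unpinned. This simplifies npm's
# real range grammar and is meant only as an illustration.
EXACT_VERSION = re.compile(
    r"^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$"
)


def unpinned_dependencies(package_json_text: str) -> dict[str, str]:
    """Return {package: specifier} for every dependency that is not pinned."""
    manifest = json.loads(package_json_text)
    unpinned = {}
    for section in ("dependencies", "devDependencies"):
        for name, spec in manifest.get(section, {}).items():
            if not EXACT_VERSION.match(spec):
                unpinned[name] = spec
    return unpinned
```

For example, a manifest declaring `"axios": "^1.13.5"` would be flagged, because the caret range lets a fresh install resolve to a newer release such as the compromised one described above, while `"axios": "1.13.5"` would pass.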

558

1.1K

10.5K

1.5M

Frank Lin retweeted

Utopic e/λ@UtopicDev·29 Mar

@DailyDoseOfDS_ github.com/google-researc…

7.7K

Frank Lin retweeted

Akshay 🚀@akshay_pachaar·29 Mar

Microsoft did it again! Speech AI models have a major limitation. They slice long recordings into tiny chunks, lose track of who's speaking, and forget all context halfway through. This is exactly what Microsoft's VibeVoice solves. It's an open-source family of frontier voice AI models for both speech recognition and speech generation.
Here's what it can do:
> VibeVoice-ASR processes up to 60 minutes of audio in a single pass. No chunking. It outputs structured transcriptions with who spoke, when they spoke, and what they said.
> You can feed it custom hotwords like names, technical jargon, or domain-specific terms. The model uses them to significantly improve accuracy on specialized content.
> VibeVoice-TTS generates up to 90 minutes of multi-speaker speech with up to 4 distinct speakers. Natural turn-taking, emotional expression, all in one pass.
> VibeVoice-Realtime is a 0.5B streaming TTS model with ~300ms first-audio latency. Small enough to deploy practically anywhere.
All of this is powered by continuous speech tokenizers running at just 7.5 Hz. This ultra-low frame rate preserves audio quality while making long sequences computationally feasible.
I have shared the link to the GitHub repo in the replies!
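To see why a 7.5 Hz frame rate keeps long recordings tractable, the frame counts implied by the post's numbers can be worked out directly. This is plain arithmetic on the quoted figures. The helper name `frames_for` is illustrative, and how many tokens each frame carries is not stated in the post, so the result is a frame count only.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate quoted in the post


def frames_for(minutes: float, rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * rate_hz)


print(frames_for(60))  # frames for a 60-minute ASR input
print(frames_for(90))  # frames for a 90-minute TTS output
```

At 7.5 frames per second, an hour of audio is 27,000 frames and ninety minutes is 40,500.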

519

44.9K
