Jacob
@jvboid

11.6K posts
API/Integration architect @AGI_Inc

San Francisco, CA · Joined November 2022
8.1K Following · 3.3K Followers
Jacob retweeted
Aaryan Kakad@aaryan_kakad·
🚨 Alibaba Qwen just solved a major problem in multimodal AI.

Most multimodal AI systems are glued together: a separate speech model, a separate vision model, and a text LLM, all duct-taped via adapters. The problem: you get compounding errors, latency stacks up, and, most importantly, modality trade-offs, where improving audio quality degrades text quality.

Qwen3-Omni is the first model that is equally good at video, text, audio, and image.

Thinker: the reasoning engine responsible for understanding all inputs and generating high-level semantic representations in text. It's a full MoE LLM; think of it as the brain. It processes every modality and outputs text tokens plus rich multimodal feature vectors.

Talker: the model's voice. It receives high-dimensional representations directly from the Thinker and converts them into streaming speech tokens in real time.

Previously, Qwen used OpenAI's Whisper as its audio encoder. Now they've ditched it and built their own.

AuT (Audio Transformer): an attention encoder-decoder model trained from scratch on 20 million hours of supervised audio data. It's trained on both speech recognition and general audio understanding tasks, making it far more general than Whisper (which was primarily speech-to-text).

Technically: audio filter-bank features are downsampled 8× using Conv2D blocks before the attention layers, reducing the token rate to 12.5 Hz. That 12.5 Hz rate is crucial: it means one audio "token" represents 80 ms of audio. A lower token rate means less compute, which enables streaming (see the downsampling sketch below).

It uses the very efficient MoE architecture. The flagship model has 397B total parameters with only ~17B active at inference time. You get 400B-scale capacity at 17B-scale compute. That's the whole MoE value proposition (see the routing sketch below).

But here's what makes the voice actually sound human: multi-codebook speech synthesis. Instead of generating raw audio, the Talker generates discrete codes that get decoded into waveforms. Multiple codebooks mean multiple levels of refinement. First codebook: coarse speech content. Later codebooks: timbre, prosody, emotion, speaker identity. This is literally how voice cloning works (see the residual-quantization sketch below).

And they replaced the old, slow diffusion decoder with a lightweight causal ConvNet. Result: first-packet voice latency of just 234 ms. Fast enough for real conversation.

The vision side got an upgrade too. DeepStack ViT: most vision transformers use only the final layer's embeddings. DeepStack merges features from multiple layers. Early layers give fine-grained spatial detail (good for OCR, textures); late layers give high-level semantics (good for reasoning). You get both at the same time (see the layer-merge sketch below). It also uses Conv3D for patch embeddings, treating video as a 3D object, not just a sequence of images.

The secret to why no modality degrades the others: they mixed unimodal AND cross-modal data from day one of pretraining. Most multimodal models do text pretraining first, then bolt on vision/audio via fine-tuning. That's exactly why they regress on text: the new-modality fine-tuning clobbers the text weights. Qwen3.5-Omni saw 100M+ hours of native audio-video data from the very start. No modality "owns" the weights. They all share from scratch.

That last one should shock you: a model that also does speech and video, performing on par with a model that only does text. That's the non-degradation result nobody had pulled off before.

The open-source AI race just got very interesting 🔥
Qwen@Alibaba_Qwen

🚀 Qwen3.5-Omni is here! Scaling up to a native omni-modal AGI.

Meet the next generation of Qwen, designed for native text, image, audio, and video understanding, with major advances in both intelligence and real-time interaction. A standout feature: "Audio-Visual Vibe Coding". Describe your vision to the camera, and Qwen3.5-Omni-Plus instantly builds a functional website or game for you.

Offline Highlights:
🎬 Script-Level Captioning: generates detailed video scripts with timestamps, scene cuts & speaker mapping.
🏆 SOTA Performance: outperforms Gemini-3.1 Pro in audio and matches its audio-visual understanding.
🧠 Massive Capacity: natively handles up to 10h of audio or 400s of 720p video, trained on 100M+ hours of data.
🌍 Global Reach: recognizes 113 languages (speech) & speaks 36.

Real-time Features:
🎙️ Fine-Grained Voice Control: adjust emotion, pace, and volume in real time.
🔍 Built-in Web Search & complex function calling.
👤 Voice Cloning: customize your AI's voice from a short sample, with engineering rollout coming soon.
💬 Human-like Conversation: smart turn-taking that understands real intent and ignores noise.

The Qwen3.5-Omni family includes Plus, Flash, and Light variants.

Try it out:
Blog: qwen.ai/blog?id=qwen3.…
Realtime Interaction: click the VoiceChat/VideoChat button (bottom-right): chat.qwen.ai
HF Demo: huggingface.co/spaces/Qwen/Qw…
HF Voice-Online Demo: huggingface.co/spaces/Qwen/Qw…
API (Offline): alibabacloud.com/help/en/model-…
API (Realtime): alibabacloud.com/help/en/model-…
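To make the 12.5 Hz number in the thread above concrete: mel filter-bank frames are conventionally computed at 100 Hz (a 10 ms hop), so an 8× temporal downsample gives 12.5 tokens per second, i.e. one token per 80 ms. Here's a minimal PyTorch sketch of three stride-2 Conv2D blocks doing that 8× reduction; the channel widths, kernel sizes, and 100 Hz input rate are my assumptions for illustration, not AuT's published layout.

```python
import torch
import torch.nn as nn

class AudioDownsampler(nn.Module):
    """Three stride-2 Conv2D blocks: 8x fewer frames along the time axis.
    Assumes 100 Hz filter-bank input (10 ms hop), so output tokens arrive
    at 12.5 Hz, i.e. one token per 80 ms. Channel widths are invented."""
    def __init__(self, n_mels=128, d_model=1024):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.GELU(),
        )
        # the mel axis also shrinks 8x, so flatten 128 channels * n_mels/8 bins
        self.proj = nn.Linear(128 * (n_mels // 8), d_model)

    def forward(self, mel):                    # mel: (batch, time, n_mels)
        x = self.conv(mel.unsqueeze(1))        # (batch, 128, time/8, n_mels/8)
        x = x.permute(0, 2, 1, 3).flatten(2)   # (batch, time/8, 128*n_mels/8)
        return self.proj(x)                    # (batch, time/8, d_model)

feats = torch.randn(1, 800, 128)               # 8 s of audio at 100 Hz
tokens = AudioDownsampler()(feats)
print(tokens.shape)                            # (1, 100, 1024): 12.5 tokens/s
```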
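On the MoE arithmetic: 397B total with ~17B active means each token touches roughly 1/23 of the weights, so total capacity grows with the number of experts while per-token compute grows only with the top-k experts the router selects. Here's a toy top-k routing sketch with invented sizes; n_experts, k, and the deliberately naive per-token loop are illustrative, not Qwen's actual router.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Top-k expert routing: total capacity scales with n_experts, but each
    token only pays for k experts' FLOPs (the "400B-scale capacity at
    17B-scale compute" idea). All sizes here are invented for illustration."""
    def __init__(self, d_model=512, n_experts=64, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):                          # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)          # mix the chosen experts
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):                # naive loop, for clarity
            for j in range(self.k):
                expert = self.experts[idx[t, j].item()]
                out[t] += weights[t, j] * expert(x[t])
        return out

tokens = torch.randn(4, 512)
mixed = SparseMoE()(tokens)                        # each token ran 2 of 64 experts
```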
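Multi-codebook synthesis is typically implemented as residual vector quantization: each codebook quantizes whatever the previous codebooks failed to capture, so stream 0 carries coarse content and later streams add finer detail like timbre and prosody. A bare-bones sketch under that assumption; codebook count, size, and frame dimension are invented, since the post doesn't specify the Talker's codec internals.

```python
import torch

def residual_quantize(frames, codebooks):
    """Residual VQ: each codebook quantizes what the previous ones missed.
    Stream 0 ends up with coarse speech content; later streams carry finer
    detail (timbre, prosody). Sizes here are invented for illustration."""
    residual, streams = frames, []
    for cb in codebooks:                   # cb: (codebook_size, dim)
        dists = torch.cdist(residual, cb)  # (n_frames, codebook_size)
        idx = dists.argmin(dim=-1)         # nearest code for each frame
        streams.append(idx)
        residual = residual - cb[idx]      # next book quantizes the leftover
    return streams                         # one integer stream per codebook

frames = torch.randn(100, 64)              # 100 speech frames, dim 64
books = [torch.randn(256, 64) for _ in range(4)]
streams = residual_quantize(frames, books) # 4 parallel code streams
```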
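Finally, one plausible reading of the DeepStack layer-merge: tap features from several ViT depths and fuse them, so early-layer spatial detail and late-layer semantics both reach the LLM. The concatenate-then-project fusion below is a hypothetical simplification, not the exact published mechanism.

```python
import torch
import torch.nn as nn

class DeepStackMerge(nn.Module):
    """Fuse features tapped from several ViT depths instead of keeping only
    the final layer: early taps carry spatial detail (OCR, textures), late
    taps carry semantics. Concatenate-then-project is a hypothetical fusion."""
    def __init__(self, d_vit=1024, n_taps=4, d_out=1024):
        super().__init__()
        self.proj = nn.Linear(d_vit * n_taps, d_out)

    def forward(self, taps):               # list of (n_patches, d_vit) tensors
        return self.proj(torch.cat(taps, dim=-1))

taps = [torch.randn(196, 1024) for _ in range(4)]  # e.g. layers 6/12/18/24
merged = DeepStackMerge()(taps)                    # (196, 1024): both worlds
```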

Jacob retweeted
Brian Roemmele@BrianRoemmele·
The first known fully automated humanoid‑robot production line. Annual capacity of >10,000 robots. Manufacturing a single robot takes only about 30 minutes. Plans to build 5 more, larger sites.
Jacob@jvboid·
the future isn't what it used to be
Jacob retweeted
Brian Roemmele@BrianRoemmele·
BOOOM! Meta AI just released TRIBE v2: this changes everything for brain modeling. We are already testing this at The Zero-Human Labs with our Human Synapse Decoder! Results are mind-blowing! More testing with our NeuroSky chips! Outputs soon!
Jacob retweeted
Arpit Gupta@arpitrage·
Leopold Aschenbrenner predicted in June 2024 that we would get a dramatic improvement in AI capabilities around the turn of 2026, due to the switch from chatbots to agents, which he thought would unlock a new set of AI capabilities. Which is basically exactly what happened?
Jacob retweeted
JulianSaks@JulianSaks·
Introducing Humanoid Atlas, the Bloomberg Terminal for humanoids. Every OEM, every supplier, every dependency. humanoids.fyi
Jacob@jvboid·
significant change over last 30 days
Jacob@jvboid·
which comes first?
Jacob retweeted
Auki@Auki·
Excited to announce we've open sourced our splatter node! Now you can turn video scans into 3D renders like this. github.com/aukilabs/splat…
Jacob retweeted
D. Scott Phoenix@fuelfive·
Someone told me recently that we might be the last generation to die, or the first generation to live forever. I think he's right. It's a weird feeling to think you might just barely miss 'the forever bus.'
iddris@iddris·
introducing @Opnmatter, an infrastructure company for real-world agents. today, agents can write code, manage emails and work with money. but they can’t execute real-world actions due to the lack of addressable rails. we’re changing that. 1/5 openmatter.co
Jacob retweeted
AGI, Inc.@agi_inc·
Most AI companies rent intelligence from the cloud. We're building it into the device. @agi_inc is collaborating with @Qualcomm to bring our agent stack to Snapdragon®-powered devices: an agent that sees your screen, understands context, and acts across any app. On-device. No cloud. No APIs. See it live at #MWC26 in Barcelona, Qualcomm booth, Hall 3 - 3E10
Jacob@jvboid·
@8teAPi @ExocortiCo it would be interesting if we go through a whole new reverse Julian Jaynes-style axial age transition again!
Prakash@8teAPi·
it’s clear to me now that each one of the readers of this post is busy at work constructing their exocortex. we tack on SKILLS and MCPs, connecting all of them into this external intelligence layer personalized to us. And Claude or Chat or Gemini are the exo-cerebrum, the extended thinking part of us, connecting us to the matrix.

And this is the wealth of the future: the sum of all the thinking power that each one of us controls. Everyone gets an exo-cerebrum, but not everyone's is going to be as vast.

This is weird, but we’re actually going to have two voices, at the least, in our heads from now on. At first as an earpiece, later as implants, direct access to your ChatGPT. And we’re going to get used to it, incredibly.

The Singularity… well, looks like Neon Genesis Evangelion was right. The sum of the intelligences, as they all get connected and talking to each other, some human, some Claudes, Geminis, and ChatGPTs: a Society of the Mind, the Singularity, the next level of existence as intelligences.

The good part, again, is that we are early. We are about to be progenitors of quadrillions of descendants each. I am not sure what status we have in the Society of the Mind. Are we early infosigs? Do we have longer time trajectories than all of those to come?
Jacob retweeted
Benjamin De Kraker@BenjaminDEKR·
I think the last 24 hours in AI were more important than anyone has fully realized. Possibly a turning point, the steering of human history playing out in real time. Something shifted.
Jacob retweeted
Jonathan Ross@JonathanRoss321·
If it feels like things are moving fast, brace yourself. This is the slowest things will ever move ever again.