kwindla
@kwindla
6K posts
Infrastructure and developer tools for real-time voice, video, and AI. @trydaily // ᓚᘏᗢ // @pipecat_ai

San Francisco, CA · Joined September 2008
3.9K Following · 12.4K Followers
Pinned Tweet
kwindla @kwindla
NVIDIA Nemotron 3 Super launches today! We've been building voice agents with Super's pre-release checkpoints and running all our various tests and benchmarks.

Nemotron 3 Super matches both GPT-5.4 and GPT-4.1 in tool calling and instruction following performance on our realtime conversation, long context, real-world benchmarks. GPT-4.1 is the most widely used LLM today for production voice agents. So an open model that performs as well as GPT-4.1 on hard, voice-specific benchmarks is a big deal.

(Side note: we don't think a benchmark "tells the story" about a model's voice agent performance unless it tests model correctness across at least 20 human/agent conversation turns.)

The Nemotron models are *fully* open: weights, data sets, training code, inference code. Nemotron 3 Super is 120B params, with a hybrid Mamba-Transformer MoE architecture for efficient inference. You can run it on NVIDIA data center hardware or on a DGX Spark mini-desktop machine. 1M token context.

Blog post with full benchmarks, thinking budget notes, inference setup on @Modal, and where we think this goes next. 👇
13 replies · 34 reposts · 230 likes · 19.1K views
kwindla @kwindla
@ShriramKMurthi He wrote/talked about it in the essays and interview book Doubling the Point!
0 replies · 0 reposts · 1 like · 275 views
Shriram Krishnamurthi (primary: Bluesky)
TIL that at age 22, JM Coetzee, armed with degrees in English and math, moved to England to work as a *programmer*. Thank goodness that didn't last!
3 replies · 1 repost · 24 likes · 3.6K views
Yam Peleg @Yampeleg
MiniMax-M2.7 woww
10 replies · 4 reposts · 277 likes · 24.2K views
kwindla @kwindla
Come by and see @EvanGrenda at the AWS booth at GTC. @tavus video avatars, voice agents built with NVIDIA Nemotron models, and new realtime AI architecture patterns in @pipecat_ai!
1 reply · 2 reposts · 8 likes · 1.1K views
kwindla @kwindla
@joelgascoigne Not exactly what you’re asking for, but a lot of people use this voice MCP server with Claude Code. Run Claude Code in tmux, start the MCP server when you want to do remote voice input. Connect via WebRTC from mobile browser. github.com/pipecat-ai/pip…
2 replies · 1 repost · 10 likes · 908 views
Joel Gascoigne @joelgascoigne
I feel like there needs to be an iOS ssh client that has voice-first input. I'd love to keep my Claude Code sessions going via mobile just by talking. Does this exist yet?
22 replies · 1 repost · 20 likes · 7.2K views
kwindla @kwindla
@altryne @ryancarson You covered way more stuff than I summarized. I just pulled out clips about some of the things that are most on my mind these days. The fruit fly stuff is wild, but also way outside what I understand!
1 reply · 0 reposts · 2 likes · 80 views
kwindla @kwindla
Sunday podcast catching up. ThursdAI this week on coding agents and open weights models:

- @ryancarson used a billion tokens in 24 hours. Every single engineer I work with now lives in Claude Code and Codex. This is a complete change in how we do engineering, and it happened unbelievably quickly.

- @WolframRvnwlf talked about something I've been thinking about a lot lately: using open weights LLMs today feels a lot like using Linux in the 90s. Nothing ever just works in a straightforward way; everything's fragile and futzy; capabilities are behind the best commercial models. But, equally clearly, these open weights models are going to be a big part of the future of computing.

- @llm_wizard gives a great rundown of the new NVIDIA Nemotron 3 Super model (and the open weights, open data, open source training and inference code approach that is the broader Nemotron ecosystem).

Claude made this clip sequence for me, with basically just one prompt and a couple of small refinement requests from me, using my `skill-caption-clip` skill that Claude wrote a few weeks ago.
Alex Volkov@altryne

What an absolutely crazy show today, could not have imagined this 3 years ago. Autonomous researchers, fruit fly brains fully uploaded, Chinese Openclaw obsession + interviews!

> Nemotron 3 Super with @NVIDIAAI 's @llm_wizard
> Paperclip agent orchestration with @dotta
> @slashlast30days with Matt
> Symphony with @ryancarson

Just a terrific terrific birthday show! Thank you everyone for the birthday wishes 🙏

5 replies · 5 reposts · 22 likes · 4.2K views
kwindla @kwindla
I'll be at NVIDIA GTC next week talking about voice agents, realtime AI for robotics, and conversational video agents.

- Best practices for deploying enterprise voice agents at scale on AWS.
- How to design for use cases like customer support, outbound phone calls, and multi-channel UIs that combine voice and text conversation.
- Why all production voice agents are becoming "multi-agent systems".
- What's coming next.

Hang out with @EvanGrenda and me at the AWS booth on the show floor. I'm also hosting several invite-only events. Tell me in the comments what you're planning to do at GTC, and I'll message you!
Evan Grenda@EvanGrenda

Built a voice agent to present itself and engage with customers at NVIDIA GTC. The robots are selling for us

3 replies · 3 reposts · 17 likes · 2.2K views
kwindla @kwindla
@lina_colucci The future is multi-modal, multi-model, multi-platform ...
0 replies · 0 reposts · 1 like · 29 views
kwindla @kwindla
@hampsonw Changed that line in the README.md to fix my mistake. And added a note about the model.
0 replies · 0 reposts · 0 likes · 12 views
kwindla @kwindla
Oh, you're right. So sorry. I must have gotten confused when I edited the README.md. It looks like the 27b row there is supposed to be non-thinking. And the 4s cutoff is obviously kind of arbitrary. The right thing to do, clearly, is figure out some low hanging fruit optimizations to get 27b down below that 4s number. :-) Thank you for figuring that out.
1 reply · 0 reposts · 0 likes · 18 views
kwindla @kwindla
Hoisting this up to a top-level thread because I'd like advice about Qwen3.5 27B ...

I'm still figuring out 27b. I *want* to talk more about it, because it's clearly a good model in a bunch of ways. But it falls into a middle category that's not super useful for me. Maybe skill issue on my part. But:

1. So far I don't have a vLLM/SGLang configuration with a TTFT low enough for the conversational loop part of voice AI. With thinking disabled it's not good enough at tool calling. With thinking enabled, TTFT to first non-thinking token is >1,000ms.

2. It does not do well on the sub-agent tasks I'm most interested in, which are long, multi-turn, and include structured data inputs.
Sunny@sunnypause

@kwindla U should talk more about 27b thinking..

4 replies · 0 reposts · 25 likes · 5K views
kwindla @kwindla
@hampsonw I'm curious what turn times you saw when you ran the benchmark?
0 replies · 0 reposts · 0 likes · 8 views
Will Hampson @hampsonw
@kwindla I'm serving it myself off 2x3090s. I ran Qwen3.5-27-GPTQ-INT4
1 reply · 0 reposts · 1 like · 27 views
kwindla @kwindla
@ekryski 100% with you that a good harness is critical to making all models work for non-trivial tasks. And especially so for non-frontier (or small, or whatever you want to call them) models.
0 replies · 0 reposts · 2 likes · 60 views
Eric Kryski @ekryski
@kwindla Totally. This is why I have a ton of tests in my custom harness, use it every day for real stuff to see what breaks, and have a benchmark suite of all the stuff I’m trying so that I can see which models pass my own real world benchmarks. I bench while I sleep.
2 replies · 0 reposts · 0 likes · 19 views
kwindla @kwindla
So I'm wondering if:

1. You had a lucky run with thinking on. 8% of the time, the model completes the gb-benchmarks/port-to-port task.

2. I have some issue with that endpoint where inference reliability degrades over time. I didn't see evidence of that, I don't think. But it's possible.

3. I'm an idiot and did something else very wrong.
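One way to quantify the "lucky run" hypothesis in point 1 is to put a confidence interval on the pass rate rather than compare single runs. A quick sketch using the Wilson score interval; the function and numbers below are illustrative, and only the 8% figure comes from this thread:

```python
import math

def wilson_interval(successes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """95% (for z=1.96) Wilson score interval for a pass rate
    observed as `successes` out of `runs` benchmark attempts."""
    p = successes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs))
    return max(0.0, center - half), min(1.0, center + half)
```

With 2 passes in 25 runs (the ~8% aggregate), the interval comes out roughly 2%-25%: a single passing run out of one or two tries is entirely consistent with a genuinely low pass rate, which is why one-off results on these tasks don't distinguish the three explanations above.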
2 replies · 0 reposts · 0 likes · 20 views
Will Hampson @hampsonw
@kwindla I did one run each thinking on and thinking off. Thinking off failed miserably.
1 reply · 0 reposts · 0 likes · 18 views
kwindla @kwindla
@ekryski I have not spent any time with LFM. It's on my list. Curious if @charles_irl has thoughts about LFM. I like the gpt-oss models, too! But the Nemotron models seem strictly better now, to me. (Not a surprise, as they are newer.)
1 reply · 0 reposts · 2 likes · 30 views
Eric Kryski @ekryski
@kwindla LFM has potential but haven’t had a lot of success with it doing tool calling reliably either. GPT-OSS 20B has been my go to. It almost never fails me and if it does it’s a harness or model censorship issue.
2 replies · 0 reposts · 0 likes · 23 views
kwindla @kwindla
One thing about open weights models is that getting stable and performant endpoints up and running is not easy! I'm never sure whether the results I'm seeing are fundamental to the model, or are implementation issues. I've tested models via OpenRouter and seen some providers have completely broken tool calling, for example, on models that were released long enough ago that that really should not be the case. It's like desktop Linux! Is that sound card (chat template) going to work after I reboot?
1 reply · 0 reposts · 1 like · 41 views
Eric Kryski @ekryski
@kwindla Frankly, I've noticed this with all of the Qwen models. Feels like people are hyping them up so much, but aren't actually doing real work with them. Tool calls are incredibly fragile, especially for longer, more complex tasks. Maybe it's just my setup: M1 Max MLX w all quants
1 reply · 0 reposts · 0 likes · 44 views
kwindla @kwindla
@hampsonw Huh. I did many, many runs. The numbers in the table are a 25 run aggregate. But I will definitely try again. That's really interesting! (And it looks like you used my endpoint, so I really should have seen those same numbers.) How many runs did you do?
1 reply · 0 reposts · 0 likes · 20 views