Thomas Ip
6.9K posts
@_thomasip
Writing about AI, tech and startups. Building an AI companion app; follow for the road to $10M ARR.
Waitlist 👉
Joined May 2014
518 Following · 925 Followers
Thomas Ip@_thomasip·
@elonmusk every time you open sourced the algo, you didn't release the params and weights. totally useless. it's like open sourcing an LLM without the weights.
0 replies · 4 reposts · 10 likes · 288 views
Elon Musk@elonmusk·
To give people confidence that we are not secretly manipulating the 𝕏 recommendations, it is critical that we open source anything that influences what people are shown
8.3K replies · 10.4K reposts · 121.7K likes · 19.8M views
Wiktoria Milczyńska, MD@w_milczynska·
UK founder starter pack (save this):
→ incorporate via Companies House: £100, 30 mins
→ bank: Mercury (US) / Starling (UK)
→ legals: SeedLegals (~50% of new UK startups)
→ EMI BEFORE first hire
→ HMRC advance assurance pre-raising
→ cap table tools from day 1
a weekend, <£1k
25 replies · 67 reposts · 815 likes · 109.6K views
Thomas Ip@_thomasip·
@UnsocialB86776 @Hesamation it's the opposite way round. 4o shipped; thinky is a research preview that isn't likely to be launched.
0 replies · 0 reposts · 0 likes · 12 views
UnsocialBuddy@UnsocialB86776·
@Hesamation GPT-4o demo was impressive but Thinking Machines shipped something people can actually use. Demos don't win, products do.
2 replies · 0 reposts · 0 likes · 664 views
Thomas Ip@_thomasip·
@VraserX 4o voice is a massive production model with turn-based interaction, whereas thinky's model is "full duplex"/micro-turns and a much smaller research model. I think the gap is simply from an unclear demo and a very early preview of their work.
0 replies · 0 reposts · 0 likes · 164 views
VraserX e/acc@VraserX·
I’m not really impressed by the Thinking Machines voice mode demo. It feels like they basically recreated GPT-4o voice mode, and the comparison is impossible to unsee. It is even the same researcher in the demo. Yes, full duplex is a nice upgrade. But the responses feel slower, the voice feels less natural, and the whole thing has less magic than GPT-4o did two years ago. OpenAI GPT-4o VS Thinking Machines
26 replies · 8 reposts · 166 likes · 25K views
Thomas Ip@_thomasip·
Superintelligence will be achieved by whoever takes the bitter lesson most seriously: a single all-to-all MLP with no human shortcuts, RL'ed in an equally bitter-lessoned world model.
[image]
Thinking Machines@thinkymachines
People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. thinkingmachines.ai/blog/interacti…
0 replies · 0 reposts · 2 likes · 62 views
Thomas Ip@_thomasip·
@LaurenceBrem @miramurati that is not that fast? any of the current voice models like chatgpt voice or grok voice have the same latency. thinky is doing something different though, looks like they want to handle input, output and tool calls at the same time.
1 reply · 0 reposts · 3 likes · 53 views
Mira Murati@miramurati·
Today we're sharing our work on interaction models. A new class of model trained from scratch to handle real-time interaction natively, instead of gluing it onto a turn-based one. youtu.be/A12AVongNN4
[YouTube video]
294 replies · 864 reposts · 8.2K likes · 995.9K views
Thomas Ip@_thomasip·
@_architected @miramurati interruption is still turn based. i think thinky is trying to train a full duplex model that can take input, produce output and run tool calls at the same time. they didn't demo the full duplex between input and output though, only tool calls and output.
0 replies · 0 reposts · 2 likes · 192 views
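The turn-based vs "full duplex" distinction above can be sketched as a toy event loop. This is not Thinking Machines' (or anyone's) actual architecture; all names and the two-task structure are made up purely to illustrate that in the duplex case listening, speaking, and tool calls interleave instead of running strictly turn by turn.

```python
import asyncio

async def turn_based(user_turns):
    """Classic loop: hear a full turn, then respond. Nothing overlaps."""
    log = []
    for turn in user_turns:
        log.append(f"heard:{turn}")
        log.append(f"said:reply-to-{turn}")
    return log

async def full_duplex(user_turns):
    """Listening, speaking and tool calls run as concurrent tasks."""
    log = []

    async def listen():
        for turn in user_turns:
            log.append(f"heard:{turn}")
            await asyncio.sleep(0)  # yield so the other tasks interleave

    async def speak():
        for turn in user_turns:
            log.append(f"said:reply-to-{turn}")
            await asyncio.sleep(0)

    async def call_tools():
        log.append("tool:search")  # a tool call landing mid-conversation
        await asyncio.sleep(0)

    await asyncio.gather(listen(), speak(), call_tools())
    return log

turns = ["hi", "weather?"]
serial = asyncio.run(turn_based(turns))
duplex = asyncio.run(full_duplex(turns))
print(serial)
print(duplex)
```

In the duplex trace the tool call shows up between conversational events rather than waiting for a turn boundary, which is the behavior the tweet is describing.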
vlad/r@_architected·
@miramurati don’t they already have interruption mode in Gemini? forgive me if i’m stupid but how’s it different for the end customer? i do understand what you’re trying to achieve but if it looks and feels the same, what’s the point?
2 replies · 0 reposts · 5 likes · 2.9K views
Thomas Ip@_thomasip·
Longest codex /goal run yet at 22 hours! I am optimizing Qwen3 Omni on SGLang to be able to serve 30+ concurrent real time voice chats on an AMD MI300X. Starting with one active chat not even hitting real time, it's almost reaching the concurrency goal after almost a day of optimization and profiling. I thought it was cool how people were able to get agents to run autonomously for a long time; turns out this is actually boring and unproductive. Still a long way to go for LLMs to get so good they can think for a few minutes and oneshot the solution. Make no mistake, the frontier models are very much AGI, but I can't wait for superintelligence to come fast enough.
[image]
0 replies · 0 reposts · 1 like · 77 views
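The "30+ concurrent real time voice chats" goal above reduces to a throughput budget: each stream must receive audio tokens at least as fast as they play back. A rough sketch, where both the aggregate decode rate and the 12.5 tokens/sec playback rate (one token per 80 ms frame, common in speech codecs) are illustrative numbers, not measurements from this run:

```python
def max_realtime_chats(agg_tokens_per_s: float, audio_tokens_per_s: float = 12.5) -> int:
    """Upper bound on real-time voice streams a server can sustain:
    total decode throughput divided by the rate one stream consumes.
    Both rates here are assumed/illustrative, not Qwen3 Omni specifics."""
    return int(agg_tokens_per_s // audio_tokens_per_s)

# e.g. a server decoding 500 tok/s in aggregate across its batch:
print(max_realtime_chats(500))  # 40 streams under these assumptions
```

The optimization work described in the tweet is essentially pushing `agg_tokens_per_s` up (batching, caching, profiling) until this bound clears the concurrency target with headroom.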
Thomas Ip@_thomasip·
@aschmelyun skills are basically functions but for llms. not using skills is like writing your software all in the main block
0 replies · 0 reposts · 3 likes · 123 views
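The "skills are functions" analogy above can be made concrete with a toy registry. This is not any real agent framework's API; `SKILLS`, the `skill` decorator, and both runner functions are invented purely to show the factoring the tweet is pointing at: named, reusable capabilities versus everything inlined "in the main block".

```python
SKILLS = {}

def skill(name):
    """Register a reusable capability, like defining a function
    instead of inlining the logic everywhere it's needed."""
    def register(fn):
        SKILLS[name] = fn
        return fn
    return register

@skill("summarize")
def summarize(text):
    # Stand-in for a prompt template / instructions the model would follow.
    return text[:20] + "..."

@skill("translate")
def translate(text):
    return f"[translated] {text}"

def run_without_skills(text):
    # "All in the main block": the same logic duplicated inline each time.
    return (text[:20] + "...", f"[translated] {text}")

def run_with_skills(text):
    # Named units the agent can select from by name.
    return (SKILLS["summarize"](text), SKILLS["translate"](text))

print(run_with_skills("hello world, this is a long document"))
```

Both runners produce the same result; the point is maintainability and reuse, which mirrors why agent "skills" beat pasting the same instructions into every prompt.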
Andrew Schmelyun@aschmelyun·
Feel like I'm missing out because I don't use skills, or a lot of MCP, or multi-agent orchestrations when using AI dev tools. I'm just like "implement this feature" or "how does this work" or "no not like that, do this instead". Idk, I feel fast and accurate so why change?
188 replies · 21 reposts · 883 likes · 133.3K views
Thomas Ip@_thomasip·
@kwindla No tool calling unfortunately, the instruct/reasoning is just for the conversational use case. The audio output will get messed up if you add tool call output. I don't think there are open speech-to-speech models that support tool calling or even reasoning? (don't quote me on this)
1 reply · 0 reposts · 0 likes · 33 views
kwindla@kwindla·
@_thomasip Have you fine tuned Qwen 3 for long multi-turn use cases with tool calling? Things like what the aiewf-eval benchmark tests. Would love to see examples.
1 reply · 0 reposts · 0 likes · 36 views
kwindla@kwindla·
This is a great question! There are not (yet) any speech-to-speech models that are workable bases for production fine tuning. But that's likely to change pretty soon.

To be a good starting point for fine tuning, a base model needs to already be reasonably good at multi-turn instruction following and tool calling. My mental model, here, is that the fine tuning process needs to be able to "find", in the model weights, the conversation patterns in your data set. If the model is too weak, those patterns aren't there to find and emphasize.

In my experience fine tuning text models for multi-turn and voice agents, this means the model needs to score something like 85% on the aiewf-eval benchmark. So today that's text models like Nemotron 3 Nano, Gemma 4 31b, Qwen 3.5 27b, GPT OSS 120b. There aren't any open weights speech-to-speech models that are close to this bar, yet.

The best options today are the Kyutai Moshi and NVIDIA Personaplex (which is a Moshi-family model). Moshi is a very nifty model architecture. Fine-tuning these is a great research project. I don't think you'll get to production level performance for most voice use cases, though, no matter how much fine-tuning (or even post-training) you do on these models.

There's a new NVIDIA Nemotron Voicechat model coming soon, though, that should be a great base model for voice agent fine-tuning. It's in early access. You can try out the demo ...
yung algorithm@yungalgorithm
@kwindla @dan_jenkins what is a good speech-to-speech model to finetune on top of, does this exist, e.g. like how we finetune llama 3.1 8B?
8 replies · 5 reposts · 73 likes · 7.5K views
Thomas Ip@_thomasip·
Qwen3 is quite highly regarded and popular for fine tuning, what's your problem with it? And yes, 3.5 omni is not open, and sadly that's unlikely to change. Open omni models are hard to come by. The only issue I have with 3 omni is its pure attention arch, unlike 3.5, so it's quite KV cache hungry to serve in production.
1 reply · 0 reposts · 0 likes · 40 views
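Why a pure-attention arch is "KV cache hungry": every layer keeps keys and values for every token of context, per request. A standard back-of-envelope estimate; the model config numbers below are illustrative placeholders, not Qwen3 Omni's actual dimensions:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Per-request KV cache size for a pure-attention transformer:
    2 (K and V) * layers * kv_heads * head_dim * seq_len * element size."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical config: 48 layers, 8 KV heads of dim 128, fp16 (2 bytes),
# holding 32k tokens of context for one voice chat.
per_chat = kv_cache_bytes(layers=48, kv_heads=8, head_dim=128, seq_len=32_768)
print(per_chat / 2**30, "GiB per chat")  # 6.0 GiB per chat
```

At numbers like these, a few dozen concurrent long-context chats saturate even a large accelerator's memory, which is why hybrid/linear-attention variants are cheaper to serve.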
kwindla@kwindla·
@_thomasip I didn’t have a great experience with Qwen models until 3.5. I think weights for 3.5 Omni aren’t released yet?
1 reply · 0 reposts · 1 like · 62 views
Paul Bohm@paulbohm·
If your startup does not have a UUID microservice you’re ngmi
[image]
191 replies · 274 reposts · 6.8K likes · 681K views
sankalp@dejavucoder·
everyone wants to do RL, no one wants to look at their prompts and data
12 replies · 6 reposts · 157 likes · 8.6K views
sui ☄️@birdabo·
never deleting this app 💀
[image]
23 replies · 3 reposts · 100 likes · 3.8K views
Thomas Ip@_thomasip·
@teortaxesTex wonder how many engineering ideas are exchanged between deepseek and high flyer. access to the best quant engineering in china is a big moat
1 reply · 0 reposts · 7 likes · 931 views
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
Superiority. Yes, this is DeepSeek cache hit statistics, the horizontal bar at 100%. As I've said, they have a strictly optimal implementation. If there is a context to be reused, it will be. I think 12 hours is too short a window, too. It's probably 24-48.
[image]
Zhihu Frontier@ZhihuFrontier
LLM KV Cache Showdown: DeepSeek Takes Absolute Lead
Insights from Zhihu contributor 苏迟但到
After running 20,000 controlled LLM inference experiments 🧪 over 3 straight days, I’ve fully uncovered how groundbreaking DeepSeek’s KV Cache optimization really is. Many users are sharing call logs after the DeepSeek V4 release — its exceptional cache hit rate drastically cuts token & inference costs 💰.
KV Cache hit rate isn’t luck: it can be quantified by repeating identical prompts at variable time intervals and parsing backend cache match signals. System load always brings random cache TTL volatility, so small-scale tests are meaningless. That’s why I built a dedicated server for strict benchmarking: 5 models tested (DeepSeek, Kimi, Zhipu, MiniMax, OpenAI via OpenRouter), with periodic batch requests and a 1min ~ 720min recall window to record real KV Cache retention & hit ratio.
📊 Core Technical Benchmark Findings
✅ DeepSeek: 100% KV Cache hit rate consistently across peak/off-peak hours. Cache state still fully retained after 12+ hours — industry-leading persistent cache scheduling & eviction strategy.
🥈 MiniMax: 90% hit rate in off-peak time, drops to ~70% under high traffic. Abnormal early cache misses within the first minute, implying flawed internal cache indexing & lookup logic.
📉 Tier order afterward: Kimi > OpenAI > GLM
❌ GLM: terrible KV Cache performance. 2 min: 80% hit rate; 3 min: 50%; 5 min: only 25%. Hardly any cache survives beyond 15 minutes.
🔬 Technical Root Cause Analysis for GLM
• Infra architecture defect: unable to offload KV Cache to low-cost disk storage, strictly limited to on-board VRAM. The small cache pool forces aggressive LRU eviction.
• Extreme traffic throughput far exceeds cache bearing capacity, accelerating invalidation of historical KV sequences.
⚠️ Key Industry Insider Notes
• OpenAI metrics are probed via the OpenRouter relay, not native official KV Cache performance.
• Qwen / Seed / Mimo adopt no automatic KV Cache mechanism — they require manual cache initialization and additional charging. No natural TTL retention, leading to hidden redundant inference costs for regular users.
#AI #LLM #DeepSeek #LLMInference #AIEngineering #Tech 🔗 Full article: zhuanlan.zhihu.com/p/203573772695…
13 replies · 17 reposts · 272 likes · 42.9K views
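The benchmark methodology quoted above (replay identical prompts at varying delays, read the cache counters from each response) boils down to a simple hit-rate computation over per-request usage stats. A minimal sketch; the `usage` field names are illustrative, not any vendor's exact response schema, and the run log is simulated rather than real measurements:

```python
def cache_hit_rate(usages):
    """Fraction of prompt tokens served from cache across repeated calls.
    Each dict mimics the per-request counters an inference API returns."""
    hit = sum(u["cached_prompt_tokens"] for u in usages)
    total = sum(u["prompt_tokens"] for u in usages)
    return hit / total if total else 0.0

# Simulated benchmark: the same 1000-token prompt replayed at increasing
# delays. The first call is always a miss; later calls hit until the
# provider evicts the cached prefix (its TTL is what the test measures).
runs = [
    {"prompt_tokens": 1000, "cached_prompt_tokens": 0},     # first call
    {"prompt_tokens": 1000, "cached_prompt_tokens": 1000},  # 1 min later
    {"prompt_tokens": 1000, "cached_prompt_tokens": 1000},  # 1 h later
    {"prompt_tokens": 1000, "cached_prompt_tokens": 0},     # after eviction
]
print(cache_hit_rate(runs))  # 0.5
```

Sweeping the replay delay and plotting this rate per provider is exactly how the quoted thread distinguishes long-TTL caches (flat near 100%) from aggressive-eviction ones (rate collapsing within minutes).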
“paula”@paularambles·
just met a japanese guy visiting sf for the first time who was deeply disappointed by how not-high-tech the city was. paris syndrome but for people expecting san francisco to look like the future
[image]
60 replies · 52 reposts · 2K likes · 637.4K views
Thomas Ip@_thomasip·
@ludwigABAP wait deepseek doesn't instill moral judgement and refuse requests? Might genuinely be useful as a companion to claude/codex
0 replies · 0 reposts · 3 likes · 602 views
ludwig@ludwigABAP·
"deepseek-v4, reverse engineer this and get thru the paywall, go" 20 min later
[image]
29 replies · 27 reposts · 1.6K likes · 215.9K views
Alexander Whedon@alex_whedon·
@bomboraassclaat What would you want to see in the paper? I apologize for the graph! We outsourced it, and I didn't catch the disproportionality until someone pointed it out on Twitter. Definitely not intentional!
2 replies · 0 reposts · 2 likes · 798 views
Alexander Whedon@alex_whedon·
Hey, folks! We have been blown away by the response to SubQ and the SSA breakthrough over the last 48 hours. It is awesome to see how many people are responding to our mission of creating more efficient algorithms to create better models. We are working hard to firm up our release timeline and will share more very soon. We will also share additional data and third-party validation in our model card next week. If you have questions, please post them in the thread, and I'll do my best to respond! Above all, THANK YOU! The support, feedback, and discussion from this community have been inspiring.
76 replies · 24 reposts · 447 likes · 47.3K views
Flowers ☾@flowersslop·
Only hopium left is that the realtime 2 model is corporate-clanker-like on purpose so companies like it better, but then in chatgpt they're gonna make it really personal and human and warm and natural. we will see
7 replies · 1 repost · 73 likes · 2.7K views