Kenton Varda

6.2K posts

@KentonVarda

Tech lead for @Cloudflare Workers, https://t.co/bFUDCZ7BUc, https://t.co/Hjf1slmJqs, https://t.co/oIKxYZA4LW. 🦋 https://t.co/8QKY5gf1BK

Austin, TX · Joined November 2008
269 Following · 13.8K Followers
Pinned Tweet
Kenton Varda@KentonVarda·
Dynamic Workers are now in Open Beta, all paid Workers users have access. Secure sandboxes that start ~100x faster than a container and use 1/10 the memory, so you can start one up on-demand to handle one AI chat message and then throw it away. Agents should interact with the world by writing code, not tool calls. This makes that possible at "consumer scale", where millions of end users each have their own agent writing code. blog.cloudflare.com/dynamic-worker…
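The "one sandbox per chat message, then throw it away" lifecycle can be sketched in TypeScript. This is a hedged illustration only: `Sandbox`, `createSandbox`, and `handleChatMessage` are hypothetical names standing in for whatever the platform actually provides, with a fake in-memory sandbox so the shape of the pattern is runnable anywhere:

```typescript
// Hypothetical sandbox interface -- a stand-in, not the real Workers API.
interface Sandbox {
  run(code: string): Promise<string>;
  dispose(): void;
}

// Fake in-memory implementation so the lifecycle is demonstrable.
function createSandbox(): Sandbox {
  let disposed = false;
  return {
    async run(code: string): Promise<string> {
      if (disposed) throw new Error("sandbox already disposed");
      // A real sandbox would execute `code` in isolation; here we just echo it.
      return `ran: ${code}`;
    },
    dispose() {
      disposed = true;
    },
  };
}

// One sandbox per message: create, run the agent's generated code, discard.
async function handleChatMessage(generatedCode: string): Promise<string> {
  const sandbox = createSandbox();
  try {
    return await sandbox.run(generatedCode);
  } finally {
    sandbox.dispose(); // nothing persists between messages
  }
}
```

The point of the pattern is that disposal is unconditional: because startup is cheap, there is no incentive to pool or reuse sandboxes across users.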
Kenton Varda@KentonVarda·
@gryphendoor I'm specifically asking why the chatbot is not able to do things.
gryph@gryphendoor·
@KentonVarda You’re using the Gemini chatbot. Use other Google tools.
Kenton Varda@KentonVarda·
How is Google so far behind on agentic AI? This is the Gemini sidebar *embedded inside gmail*.
[screenshots attached]
Kenton Varda@KentonVarda·
@coreyward Yeah, just approvals for all side-effecting calls for now. And not letting it send email at all (yet).
Corey Ward@coreyward·
@KentonVarda Interesting. How are you dealing with prompt injection? Are you just constraining the set of tools available and requiring approval for every call?
Kenton Varda@KentonVarda·
Wow, a lot of people have takes on this. Some people are saying it's hard to do safely. No, it isn't. You just give it an approval step and/or an undo button. I literally built my own implementation of this (that works with Google products via OAuth APIs) as a side project...
Kenton Varda@KentonVarda·
The Google Docs one can't even fix my typos! I have written my own agent that can do this (operating on a Google Doc) but the one inside Google Docs can't. What are they spending their time on over there? This is like the #1 thing every SaaS needed to do LAST YEAR.
Kenton Varda@KentonVarda·
@phillipstewart They could have you approve each action, and/or have an undo button. Pretty standard stuff.
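The approve-each-action-plus-undo pattern described above can be sketched generically. Everything here (`ApprovalGate`, `Action`, the email example) is an illustrative invention for this sketch, not any product's real API:

```typescript
type Approver = (description: string) => Promise<boolean>;

// A side-effecting operation paired with its inverse, so it can be undone.
interface Action {
  description: string;
  execute: () => Promise<void>;
  undo: () => Promise<void>;
}

// Runs side-effecting actions only after user approval, keeping an undo stack.
class ApprovalGate {
  private undoStack: Action[] = [];
  constructor(private approve: Approver) {}

  async run(action: Action): Promise<boolean> {
    // Ask the user first; refuse to execute if they say no.
    if (!(await this.approve(action.description))) return false;
    await action.execute();
    this.undoStack.push(action);
    return true;
  }

  // Revert the most recently approved action, if any.
  async undoLast(): Promise<boolean> {
    const action = this.undoStack.pop();
    if (!action) return false;
    await action.undo();
    return true;
  }
}
```

Usage under this sketch: wrap "archive email" as an `Action` whose `undo` moves the message back to the inbox, and route every agent tool call through `gate.run(...)` so nothing irreversible happens silently.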
Phillip.png ⚡🐢@phillipstewart·
@KentonVarda We know LLMs can make huge mistakes like deleting all your emails... so perhaps it is not ready for that yet. But of course just archiving emails is not a huge risk.
Kenton Varda@KentonVarda·
@CaptYums Obviously the Gemini team itself is doing great work. What I'm saying is that *the rest of Google* is failing to keep up.
Kenton Varda@KentonVarda·
To be really clear:
* I am not using Claude Code. This is my own harness.
* It's not just hanging. I see reasoning text streaming in. It just gets stuck thinking for excessively long. Eventually degrades to gibberish. Really weird.
Kenton Varda@KentonVarda·
Anyone else seeing Claude (both Sonnet and Opus) going into excessive reasoning loops today? I'm just sitting here watching it spend 10 minutes generating reasoning text for a problem it normally solves in 45 seconds. (I use the same prompt a lot as a test of my own harness...)
Kenton Varda@KentonVarda·
@banshanlu25 If my network was down I wouldn't see the reasoning tokens streaming in.
orlie@sunglassesface·
@KentonVarda Legit question: why do you use models hosted by Anthropic when you can use KIMI K2.5 hosted on Cloudflare?
Kenton Varda retweeted
Matt 'TK' Taylor@MattieTK·
It's my first @Cloudflare blog, and it's a big one. We're rebuilding WordPress as if it were built today. It's end to end TypeScript, works as an Astro plugin, and has secure plugin execution in dynamic workers. It's called EmDash, try it now ⤵️ blog.cloudflare.com/emdash-wordpre…
Emmanuel LD@its_hebilicious·
@KentonVarda @simonw Can you connect all the GPUs in your house and try to run something bigger like kimi2.5?
Simon Willison@simonw·
Georgi on why it's still hard to get great coding agent performance from local models: "Note that the main issues that people currently unknowingly face with local models mostly revolve around the harness and some intricacies around model chat templates and prompt construction"
Georgi Gerganov@ggerganov

I think the consensus is that Qwen3.5 is a step change, so atm I would recommend exploring that, given that it covers a range of sizes suitable for all devices.

Note that the main issues that people currently unknowingly face with local models mostly revolve around the harness and some intricacies around model chat templates and prompt construction. Sometimes there are even pure inference bugs. From typing the task in the client to the actual result, there is a long chain of components that atm are not only fragile but are also developed by different parties. So it's difficult to consolidate the entire stack, and you have to keep in mind that what you are currently observing is with very high probability still broken in some subtle way along that chain. But things are improving on all levels and everything will become better across the board soon.

Best way to evaluate things IMO:
- Start with full-quality models that fit on your hardware.
- Make sure you know what your harness actually does. F.ex. don't expect to hook Claude Code or Codex to some local model and the magic to happen. The developers of CC don't care (yet) if it is compatible with Qwen3.5. Best is to write your own harness so you know what happens every step of the way. Or use llama-server's webui (we now have MCP support out of the box).
- When things start to click, look for optimizations to make it faster. Here is where you can start quantizing for speed or look for advice in the community on optimal parameters.

So I can just say that on the low-level inference side, we will ship the right solution for sure. We still need to make the user-facing stack work better with local models; I'm hoping this will happen, though I feel less capable to control that.

And to answer your question more straightforwardly, I've experimented with the following models and have found useful applications (mostly around chat, MCP and coding) with all of them:
- gpt-oss-120b
- Qwen3-Coder-30B
- GLM-4.7-Flash
- MiniMax-M2.5
- Qwen3.5-35B-A3B

With the exception of gpt-oss-120b and MiniMax-M2.5, I've used Q8_0 variants to keep most of the original quality. Unfortunately, I am not familiar with tool calling benchmarks specifically, so I cannot recommend one. From my PoV, as long as we make sure the fundamental inference computation is correct, tool calling efficiency will depend just on:
- Model intelligence (something we do not control)
- Chat template parsing (something we are still actively improving on our end in llama.cpp)

Kenton Varda@KentonVarda·
Ehh I don't think so. I have a coding harness I wrote myself that's not optimized for any particular model and has made it really easy to do apples-to-apples comparisons between models. Claude, ChatGPT, Gemini are all in one league and smaller models that I run with ollama are very clearly in a different league. (I can run models up to about 80B with my setup FWIW.)
Simon Willison@simonw·
This sounds right to me - I've tinkered with running local models against Claude Code and Codex and been disappointed, but I've not put the work in yet to try and find the right harness+model combination given how many tiny details might produce disappointing results
sukru tikves@stikves·
@KentonVarda I would not disagree, since I did not write that code. But don't get me wrong. It would definitely *seem* they would need less code upfront. And once the first design iteration is checked in, it is very difficult to go back and change.
Kenton Varda@KentonVarda·
If gRPC had been based on WebSocket instead of full-duplex HTTP, it would have vastly broader support on the web today. We wouldn't need gRPC-web, we could just speak gRPC in browsers. We wouldn't need special support for gRPC in proxies since they all support WebSocket already.
Kenton Varda@KentonVarda·
@stikves
> I believe it is more likely they chose the one with less amount of additional code required

Based on my experience having written multiplexing RPC protocols and HTTP implementations, I'm fairly certain that reusing HTTP/2 for this must have resulted in far more code, not less.
sukru tikves@stikves·
> You seem to be name-dropping Superroot, Mixers, and Twiddlers to suggest you know something I don't, but I was on the Superroot team for 5 years...

Quite the opposite. Anyway:

> I suspect the choice to use HTTP/2 was more because someone thought it would be beautiful and elegant to use the new standard -- not a practical engineering decision.

I believe it is more likely they chose the one with the least amount of additional code required. Which brings us to...

> HTTP/2 multiplexing isn't magic. It's fairly easy to implement multiplexing in an RPC protocol, starting from a single bidirectional message stream.

That would require implementing multiplexing logic in that layer. I'm sure you have been at high-level decision meetings:

"We have two options. One of them, HTTP/2.0, has multiplexing built in. The other requires us to maintain a separate toolkit we have to write from scratch."
"Are there any benefits to choosing WebSockets, though?"
"Not so much. And we were also instrumental in HTTP/2.0's design."
"Great, let's go with your recommendation, then."

There is nothing magical, but there is a lot of practicality.
Kenton Varda@KentonVarda·
HTTP/2 multiplexing isn't magic. It's fairly easy to implement multiplexing in an RPC protocol, starting from a single bidirectional message stream. Stubby did it long before HTTP/2 existed. So did Cap'n Proto.

For an RPC protocol that isn't presenting as HTTP to the application, using HTTP/2 multiplexing under the hood is not a win -- you just lose a lot of control while taking on a lot of HTTP baggage you don't need.

You are trying to make the argument that gRPC's design decisions were required for "Google scale", but the argument doesn't hold. The choice to use HTTP/2 didn't benefit "Google scale" at all; it provided no advantage over what Stubby already did. I suspect the choice to use HTTP/2 was more because someone thought it would be beautiful and elegant to use the new standard -- not a practical engineering decision.

You seem to be name-dropping Superroot, Mixers, and Twiddlers to suggest you know something I don't, but I was on the Superroot team for 5 years...
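The "multiplexing from a single bidirectional message stream" idea is simple enough to sketch: tag every frame with a stream id and route on receive. This is a toy text framing invented for illustration (`Frame`, `Demultiplexer` are not Stubby's or Cap'n Proto's actual wire format, which are binary):

```typescript
// Each logical RPC stream gets an id; frames from many streams interleave
// on one underlying connection (e.g. a single WebSocket or TCP socket).
interface Frame {
  streamId: number;
  payload: string;
}

// Encode a frame as an id-prefixed line. A real protocol would use
// length-prefixed binary framing instead of text.
function encodeFrame(f: Frame): string {
  return `${f.streamId}:${f.payload}\n`;
}

// Routes incoming frames to per-stream handlers by stream id.
class Demultiplexer {
  private handlers = new Map<number, (payload: string) => void>();

  openStream(id: number, onMessage: (payload: string) => void): void {
    this.handlers.set(id, onMessage);
  }

  // Feed raw data read from the single underlying connection.
  receive(data: string): void {
    for (const line of data.split("\n")) {
      if (!line) continue;
      const sep = line.indexOf(":");
      const id = Number(line.slice(0, sep));
      this.handlers.get(id)?.(line.slice(sep + 1));
    }
  }
}
```

The design point in the tweet is that this routing layer is small and entirely under the protocol's control, whereas delegating it to HTTP/2 pulls in the rest of HTTP's semantics along with it.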
sukru tikves@stikves·
Yes, there is a split between gRPC and the internal one, just like K8 and BCL. However, they are still designed to sell Google services, and they wanted (and failed) to sell to internal customers as well.

"Hey, I have found 100 A100 GPUs in the Search GCP pool. Can I use it?"
"Nope, we have to use Borg Cell XX-Y. For reason Xyz..."

The problem is, once again, gRPC (internal) depends heavily on side channels; that is how the authentication tokens M*** go through, among other things. And they really use multiplexing (your SR and Mixers, Twiddlers, ... will talk a lot, and Google is keen on saving every microsecond out there).

Anyway, it is nice to discuss what-ifs, but given how bazel, K8, TensorFlow, ABSL, BigQuery, ... turned out, it is very much on brand for Google to have the public versions replicate the internal structures.