Kenton Varda

6.2K posts

@KentonVarda

Tech lead for @Cloudflare Workers, https://t.co/bFUDCZ7BUc, https://t.co/Hjf1slmJqs, https://t.co/oIKxYZA4LW. 🦋 https://t.co/8QKY5gf1BK

Austin, TX · Joined November 2008

269 Following · 13.8K Followers

Pinned Tweet

Kenton Varda@KentonVarda·24 Mar

Dynamic Workers are now in Open Beta, all paid Workers users have access. Secure sandboxes that start ~100x faster than a container and use 1/10 the memory, so you can start one up on-demand to handle one AI chat message and then throw it away. Agents should interact with the world by writing code, not tool calls. This makes that possible at "consumer scale", where millions of end users each have their own agent writing code. blog.cloudflare.com/dynamic-worker…

Kenton Varda@KentonVarda·10h

@gryphendoor I'm specifically asking why the chatbot is not able to do things.

gryph@gryphendoor·11h

@KentonVarda You’re using Gemini chatbot. Use other google tools.

Kenton Varda@KentonVarda·1d

How is Google so far behind on agentic AI? This is the Gemini sidebar *embedded inside gmail*.

Kenton Varda@KentonVarda·17h

@coreyward Yeah just approvals for all side effecting calls for now. And not letting it send email at all (yet).

Corey Ward@coreyward·19h

@KentonVarda Interesting, how are you dealing with prompt injection? Are you just constraining the set of tools available and requiring approval for every call?

Kenton Varda@KentonVarda·19h

Wow a lot of people have takes on this. Some people saying it's hard to do safely. No it isn't. You just give it an approval step and/or undo button. I literally built my own implementation of this (that works with Google products via OAuth APIs) as a side project...
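A minimal sketch of the "approval step and/or undo button" idea, not the actual side project described above. Every name here (`Action`, `ApprovalGate`, the toy inbox) is hypothetical, and the `approve` callback stands in for whatever prompt a real UI would show the user.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Action:
    """A proposed tool call from the agent (hypothetical shape)."""
    description: str
    apply: Callable[[], None]
    undo: Optional[Callable[[], None]] = None
    side_effecting: bool = True


@dataclass
class ApprovalGate:
    approve: Callable[[Action], bool]  # assumed: asks the user, returns yes/no
    history: List[Action] = field(default_factory=list)

    def run(self, action: Action) -> bool:
        # Read-only calls pass straight through; side effects need approval.
        if action.side_effecting and not self.approve(action):
            return False
        action.apply()
        if action.undo is not None:
            self.history.append(action)  # remember how to reverse it
        return True

    def undo_last(self) -> bool:
        if not self.history:
            return False
        self.history.pop().undo()
        return True


# Toy mailbox standing in for a real mail API.
inbox = {"msg-1", "msg-2"}
archived = set()


def archive(msg_id: str) -> None:
    inbox.discard(msg_id)
    archived.add(msg_id)


def unarchive(msg_id: str) -> None:
    archived.discard(msg_id)
    inbox.add(msg_id)


# This stand-in "user" approves archiving and rejects everything else.
gate = ApprovalGate(approve=lambda a: a.description.startswith("archive"))
ok = gate.run(Action("archive msg-1",
                     lambda: archive("msg-1"),
                     lambda: unarchive("msg-1")))
denied = gate.run(Action("delete everything", inbox.clear))
gate.undo_last()  # the undo button: msg-1 goes back to the inbox
```

The destructive call never runs because the gate sits between the model's proposal and the side effect, and the approved call stays reversible because it carries its own undo.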

Kenton Varda@KentonVarda·1d

The Google Docs one can't even fix my typos! I have written my own agent that can do this (operating on a Google Doc) but the one inside Google Docs can't. What are they spending their time on over there? This is like the #1 thing every SaaS needed to do LAST YEAR.

Kenton Varda@KentonVarda·20h

@Francesco2714 Used the Google APIs directly.

Francesco27@Francesco2714·20h

@KentonVarda How did you do it? The MCP Portal is read only

Kenton Varda@KentonVarda·1d

@phillipstewart They could have you approve each action, and/or have an undo button. Pretty standard stuff.

Phillip.png ⚡🐢@phillipstewart·1d

@KentonVarda We know LLMs can make huge mistakes like deleting all your emails... so perhaps it is not ready for that yet. But of course just archiving emails is not a huge risk.

Kenton Varda@KentonVarda·1d

@CaptYums Obviously the Gemini team itself is doing great work. What I'm saying is that *the rest of Google* is failing to keep up.

Jonathan Poczatek (e/wombat)@CaptYums·1d

@KentonVarda Lol deep research was the first agentic AI product. Get real dude

Kenton Varda@KentonVarda·3d

To be really clear:
* I am not using Claude Code. This is my own harness.
* It's not just hanging. I see reasoning text streaming in. It just gets stuck thinking for excessively long. Eventually degrades to gibberish. Really weird.

Kenton Varda@KentonVarda·3d

Anyone else seeing Claude (both Sonnet and Opus) going into excessive reasoning loops today? I'm just sitting here watching it spend 10 minutes generating reasoning text for a problem it normally solves in 45 seconds. (I use the same prompt a lot as a test of my own harness...)

Kenton Varda@KentonVarda·3d

@banshanlu25 If my network was down I wouldn't see the reasoning tokens streaming in.

Banshan@banshanlu25·3d

@KentonVarda check your network

Kenton Varda@KentonVarda·3d

@sunglassesface I test against all the good models in my harness, including Kimi.

orlie@sunglassesface·3d

@KentonVarda Legit question: why do you use models hosted by Anthropic when you can use KIMI K2.5 hosted on Cloudflare?

Kenton Varda@KentonVarda·3d

@dillon_mulroy "Make a collaborative whiteboard app."

Dillon Mulroy@dillon_mulroy·3d

@KentonVarda curious what prompt you use for testing your harness 👀

Kenton Varda retweeted

Matt 'TK' Taylor@MattieTK·3d

It's my first @Cloudflare blog, and it's a big one. We're rebuilding WordPress as if it were built today. It's end to end TypeScript, works as an Astro plugin, and has secure plugin execution in dynamic workers. It's called EmDash, try it now ⤵️ blog.cloudflare.com/emdash-wordpre…

Kenton Varda@KentonVarda·5d

@its_hebilicious @simonw Honestly all the GPUs in my house only adds to like 250GB, so no. 🙁

Emmanuel LD@its_hebilicious·5d

@KentonVarda @simonw Can you connect all the GPUs in your house and try to run something bigger like kimi2.5 ?

Simon Willison@simonw·5d

Georgi on why it's still hard to get great coding agent performance from local models: "Note that the main issues that people currently unknowingly face with local models mostly revolve around the harness and some intricacies around model chat templates and prompt construction"

Georgi Gerganov@ggerganov

I think the consensus is that Qwen3.5 is a step change so atm I would recommend explore that, given that it covers a range of sizes suitable for all devices.

Note that the main issues that people currently unknowingly face with local models mostly revolve around the harness and some intricacies around model chat templates and prompt construction. Sometimes there are even pure inference bugs. From typing the task in the client to the actual result, there is a long chain of components that atm are not only fragile - are also developed by different parties. So it's difficult to consolidate the entire stack and you have to keep in mind that what you are currently observing is with very high probability still broken in some subtle way along that chain. But things are improving on all levels and everything will become better across the board soon.

Best way to evaluate things IMO:
- Start with full quality models that you fit on your hardware
- Make sure you know what your harness actually does. F.ex. don't expect to hook Claude Code or Codex to some local model and the magic to happen. The developers of CC don't care (yet) if it is compatible with Qwen3.5. Best is to write your own harness so you know what happens every step of the way. Or use llama-server's webui (we now have MCP support out of the box)
- When things start to click, look for optimizations to make it faster. Here is where you can start quantizing for speed or looks for some advice in the community for optimal parameters

So I can just say that on the low-level inference side, we will ship the right solution for sure. We still need to make the user-facing stack work better with local models - I'm hoping this will happen, though I feel less capable to control that.

And to answer your question more straightforward, I've experimented with the following models and have found useful applications (mostly around chat, MCP and coding) with all of them:
- gpt-oss-120b
- Qwen3-Coder-30B
- GLM-4.7-Flash
- MiniMax-M2.5
- Qwen3.5-35B-A3B

With the exception of gpt-oss-120b and MiniMax-M2.5, I've used Q8_0 variants to keep most of the original quality.

Unfortunately, I am not familiar with tool calling benchmarks specifically, so I cannot recommend. From my PoV, as long as we make sure the fundamental inference computation is correct, tool calling efficiency will depend just on:
- Model intelligence (something we do not control)
- Chat template parsing (something we are still actively improving on our end in llama.cpp)

Kenton Varda@KentonVarda·5d

Ehh I don't think so. I have a coding harness I wrote myself that's not optimized for any particular model and has made it really easy to do apples-to-apples comparisons between models. Claude, ChatGPT, Gemini are all in one league and smaller models that I run with ollama are very clearly in a different league. (I can run models up to about 80B with my setup FWIW.)

Simon Willison@simonw·5d

This sounds right to me - I've tinkered with running local models against Claude Code and Codex and been disappointed, but I've not put the work in yet to try and find the right harness+model combination given how many tiny details might produce disappointing results

Kenton Varda@KentonVarda·5d

@stikves Fair enough, that's a plausible explanation.

sukru tikves@stikves·5d

@KentonVarda I would not disagree, since I did not write that code. But don't get me wrong. It would definitely *seem* they would need less code upfront. And once the first design iteration is checked in, it is very difficult to go back and change.

Kenton Varda@KentonVarda·6d

If gRPC had been based on WebSocket instead of full-duplex HTTP, it would have vastly broader support on the web today. We wouldn't need gRPC-web, we could just speak gRPC in browsers. We wouldn't need special support for gRPC in proxies since they all support WebSocket already.

Kenton Varda@KentonVarda·5d

@stikves
> I believe it is more likely they chose the one with less amount of additional code required

Based on my experience having written multiplexing RPC protocols and HTTP implementations, I'm fairly certain that reusing HTTP/2 for this must have resulted in far more code required.

sukru tikves@stikves·5d

> You seem to be name-dropping Superroot, Mixers, and Twiddlers to suggest you know something I don't, but I was on the Superroot team for 5 years...

Quite the opposite

Anyway

> I suspect the choice to use HTTP/2 was more because someone thought it would be beautiful and elegant to use the new standard -- not a practical engineering decision.

I believe it is more likely they chose the one with less amount of additional code required

Which brings to...

> HTTP/2 multiplexing isn't magic. It's fairly easy to implement multiplexing in an RPC protocol, starting from a single bidirectional message stream.

That would require implementing a multiplexing logic in that layer. I'm sure you have been at high level decision meetings

"We have two options, one of them, HTTP/2.0 has multiplexing built in. The other one requires us to maintain a separate toolkit we have to write from scratch"
"Are there any benefits to choosing WebSockets, though?"
"Not so much. And we were also instrumental in HTTP/2.0 design"
"Great, let's go with your recommendation, then"

There is nothing magical, but there is a lot of practicality.

Kenton Varda@KentonVarda·5d

HTTP/2 multiplexing isn't magic. It's fairly easy to implement multiplexing in an RPC protocol, starting from a single bidirectional message stream. Stubby did it long before HTTP/2 existed. So did Cap'n Proto. For an RPC protocol that isn't presenting as HTTP to the application, using HTTP/2 multiplexing under the hood is not a win -- you just lose a lot of control while taking on a lot of HTTP baggage you don't need.

You are trying to make the argument that gRPC's design decisions were required for "Google scale", but the argument doesn't hold. The choice to use HTTP/2 didn't benefit "Google scale" at all; it provided no advantage over what Stubby already did. I suspect the choice to use HTTP/2 was more because someone thought it would be beautiful and elegant to use the new standard -- not a practical engineering decision.

You seem to be name-dropping Superroot, Mixers, and Twiddlers to suggest you know something I don't, but I was on the Superroot team for 5 years...
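To illustrate the claim that multiplexing on top of a single bidirectional message stream is a small amount of code, here is a rough sketch. It is not the wire format of Stubby, Cap'n Proto, or gRPC. The JSON framing, the `MuxClient` and `serve` names, and the pair of `asyncio.Queue` objects standing in for a WebSocket are all assumptions made for the example.

```python
import asyncio
import itertools
import json


class MuxClient:
    """Runs many concurrent calls over one ordered message stream."""

    def __init__(self, send, recv):
        self._send = send            # async callable: send one message
        self._recv = recv            # async callable: receive one message
        self._ids = itertools.count(1)
        self._pending = {}           # call id -> Future awaiting its reply
        self.reply_order = []        # ids in the order replies arrived

    async def call(self, method, arg):
        call_id = next(self._ids)
        fut = asyncio.get_running_loop().create_future()
        self._pending[call_id] = fut
        await self._send(json.dumps({"id": call_id, "method": method, "arg": arg}))
        return await fut

    async def read_loop(self):
        # Replies may arrive in any order; the id routes each to its caller.
        while True:
            msg = json.loads(await self._recv())
            self.reply_order.append(msg["id"])
            self._pending.pop(msg["id"]).set_result(msg["result"])


async def serve(send, recv):
    """Toy server: handles each request in its own task, replies when done."""
    tasks = set()

    async def handle(req):
        await asyncio.sleep(req["arg"])  # pretend work; slow calls reply later
        await send(json.dumps({"id": req["id"], "result": req["arg"]}))

    while True:
        req = json.loads(await recv())
        task = asyncio.create_task(handle(req))
        tasks.add(task)
        task.add_done_callback(tasks.discard)


async def main():
    # Two queues play the role of one bidirectional connection.
    to_server, to_client = asyncio.Queue(), asyncio.Queue()
    client = MuxClient(to_server.put, to_client.get)
    background = [
        asyncio.create_task(serve(to_client.put, to_server.get)),
        asyncio.create_task(client.read_loop()),
    ]
    results = await asyncio.gather(
        client.call("sleep", 0.05),  # slow call, issued first
        client.call("sleep", 0.01),  # fast call, issued second
    )
    for task in background:
        task.cancel()
    return results, client.reply_order


results, reply_order = asyncio.run(main())
```

The fast call's reply overtakes the slow one on the shared stream, yet each caller still gets its own result, because the call id is the only piece of state the multiplexing layer needs. Flow control, cancellation, and error replies are left out of this sketch.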

sukru tikves@stikves·5d

Yes, there is a split between gRPC and internal one, just like K8 and BCL. However, they are still designed to sell Google services, and they wanted to (and failed) to sell to internal customers as well.

"Hey, I have found 100 A100 GPUs in Search GCP pool. Can I use it?"
"Nope, we have to use Borg Cell XX-Y. For reason Xyz..."

The problem is, once again, gRPC (internal) depends heavily on side channels, that is how the authentication tokens M*** go through, along other things. And they really use multiplexing (Your SR and Mixers, Twidlers, ... will talk a lot, and Google is keen in saving every microsecond out there)

Anyway, it is nice to discuss what-ifs, but given how bazel, K8, TensorFlow, ABSL, BigQuery, ... turned out, it is very much on brand for Google to have the public versions replicate the internal structures.

Kenton Varda@KentonVarda·5d

@antonycourtney @dcolascione That critique is, in fact, pretty well-known. It shows up on HN like once a year and every time I have to go paste my reply... news.ycombinator.com/item?id=451405…

Antony Courtney@antonycourtney·6d

@dcolascione @KentonVarda 💯. I only came across this old critique recently, which is fairly detailed and damning; I really wish it more were broadly known. reasonablypolymorphic.com/blog/protos-ar…
