Chris E
@cte

102 posts

Helping build @roo_code.

Los Altos Hills, CA · Joined April 2007
273 Following · 747 Followers
Chris E retweeted
GosuCoder@GosuCoder·
September 2025 evals featuring GPT 5, Grok Code, Claude 4 Sonnet, Claude 4 Opus, and Qwen 3 Coder are now uploading. This was by far the largest test run I've done to date, which is a big part of why I need to figure out ways to automate as much of this as possible. 1. Some crazy upsets, in my opinion. 2. Claude Code continues to fall in the overall ranking, which is concerning... 3. Grok Code Fast shows some promise but seems to get off track easily, so I'm wondering how it would perform in an existing large codebase. For example, it would be nearing completion of an eval, see a terminal error, and then go down a rabbit hole making things worse trying to fix it. In the real world, though, the programmer should catch that and redirect it. Video should drop in about an hour, once it's done uploading and processing.
GosuCoder tweet media
27
10
135
8.4K
Roo Code@roocode·
Noticed a spike in @Zai_org's new GLM-4.5 model usage on @openrouter, so we ran our own eval using Roo Code. ✅ Scored 86, slightly better than Qwen3-Coder 💸 Cost us about $27 to run 📊 Solid value for the performance Are you seeing similar results?
Roo Code tweet media
22
23
263
27.6K
Chris E retweeted
Roo Code@roocode·
Roo Code 3.20.0 | THIS IS A BIG ONE!! Marketplace for extensions AND modes, concurrent file edits and reads, and numerous other improvements and bug fixes. docs.roocode.com/update-notes/v…
7
16
140
6.1K
Chris E@cte·
Your take isn't spicy enough!😂 My sense is that there are trade-offs with all of these tools and in the long run I wouldn't bet against giving these models more tools and letting them judge which is most appropriate given the constraints. It would be nice to have some eval data backing these takes (we're working on that).
0
0
1
60
GosuCoder@GosuCoder·
Fascinating take. I disagree overall, and have a hard time seeing where this is coming from: RAG (vector-embedding based) has its place. 1. It can lower costs - very important. 2. It can be worse at some searches and better at others. In fact, different AI models work with these systems better than others. 3. It can be faster than reading code like a human. AI doesn't need to operate like a human - why limit how we navigate code to what a human does? 4. The security argument could be made about anything: source control, sending data to the LLM itself, etc. Local embedding is arguably safer. 5. Large context windows also aren't coming fast enough, and even when we do have them, recall isn't great. And the cost of using all that context is nuts. 6. It feels like the author isn't taking into account time and token count, which also equate to real value. 7. I also think completely ignoring RAG is a mistake, as none of us know where advancements will go. Is it perfect? No, but it does have a place. Just look at Augment Code, which has the best coding context engine on the market, and see how much people love it.
Cline@cline

Cline doesn't index your codebase. No RAG, no embeddings, no vector databases. This isn't a limitation -- it's a deliberate design choice. As context windows increase, this approach enhances Cline's ability to understand your code. Here's why. 🧵

4
2
12
1.4K
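The thread above debates embedding-based RAG for code search. Stripped to its core, the retrieval step is just nearest-neighbor search over vectors. Here is a toy sketch of that mechanic, with hand-made 3-d vectors standing in for a real embedding model; the file names and vector values are invented purely for illustration:

```python
import math

# Toy "embeddings": in a real system these come from an embedding model;
# here they are hypothetical 3-d vectors keyed by file path.
code_index = {
    "auth/login.py":    [0.9, 0.1, 0.0],
    "db/migrations.py": [0.1, 0.8, 0.2],
    "ui/button.tsx":    [0.0, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, index, top_k=1):
    """Rank indexed files by similarity to the query embedding."""
    ranked = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [path for path, _ in ranked[:top_k]]

# A query embedded near the "auth" direction should surface auth/login.py.
print(search([0.8, 0.2, 0.1], code_index))  # -> ['auth/login.py']
```

The cost and speed arguments in the thread come from exactly this step: ranking a pre-built index is cheap compared to feeding whole files through the model's context window.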
OpenRouter@OpenRouter·
Gemini caching is now live! 75% off prompt token prices, and we made the API extremely simple: Just set *cache_control* on a message. Exactly the same as Anthropic, and others to come👇
OpenRouter tweet media
18
46
546
47.6K
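OpenRouter's announcement says caching is enabled by setting cache_control on a message, mirroring Anthropic's content-block format. A minimal sketch of what such a request body and the "75% off" discount math could look like; the model name, token count, and per-token price below are hypothetical placeholders, not confirmed values:

```python
import json

# Sketch of a chat request using Anthropic-style prompt caching: mark a
# large, rarely-changing block with cache_control so repeat requests can
# bill it at the discounted cached rate.
payload = {
    "model": "google/gemini-2.5-pro-preview",  # placeholder model name
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "<large, rarely-changing project context to cache>",
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Summarize the cached context."},
    ],
}
body = json.dumps(payload)  # this JSON would be POSTed to the chat completions endpoint

# Back-of-envelope for the "75% off" claim: cached prompt tokens bill at 25%.
prompt_tokens = 100_000
price_per_token = 1.25e-6          # hypothetical $1.25 per million tokens
full_cost = prompt_tokens * price_per_token
cached_cost = full_cost * 0.25     # 75% discount on cached prompt tokens
print(round(full_cost - cached_cost, 6))  # -> 0.09375 saved per repeat request
```

The practical upshot for an agent like Roo Code is that the big system prompt and file context only bill at full price once per cache window, which matters most on long multi-turn sessions.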
Chris E@cte·
@pingToven 😬 Agreed... I'm trying to implement caching for the Gemini provider in Roo Code and I'm sure that I'll get it wrong on the first attempt.
0
0
3
67
Toven@pingToven·
@OfficialLoganK not gonna lie all this variable pricing stuff is really rough
Toven tweet media
1
0
2
279
Logan Kilpatrick@OfficialLoganK·
Context caching updates in the Gemini API: - ✅ Added support for 2.0 Flash - ✅ Added support for 2.5 Pro Preview - 📉 Reduced min context size from 32K down to 4K Much more to come still, please send any feedback on the experience!
150
65
1.7K
117.6K
Chris E@cte·
@soyhenryxyz @GosuCoder This is amazing. The next big push on evals is going to be testing various orchestration configurations and showing some data that backs our intuition about it, so I'd love to help.
1
0
1
23
GosuCoder@GosuCoder·
Finally releasing a test I've been running for almost a month now. The idea was simple: can you orchestrate tasks using RooCode to produce similar results with less capable models?
GosuCoder tweet media
1
0
11
280
Bindu Reddy@bindureddy·
o4-mini may be the REAL STORY. It ALSO beats Gemini 2.5 AND is about 2x cheaper! It also comes with prompt caching 💃 A great model to throw into the mix when you are building god-tier AI 😎 Our launch will be on Friday!
Bindu Reddy tweet media
39
57
648
71.1K
Sacha Sayan@sachasayan·
brb, fixing roo code again
Sacha Sayan tweet media
1
0
5
2K
GosuCoder@GosuCoder·
@bindureddy Can you describe more of what this benchmark entails? What coding benchmarks are you running? Early tests don't make it feel like o4-mini High is anywhere near Gemini 2.5 Pro. What languages, etc.?
3
1
15
2.1K
Chris E@cte·
@soyhenryxyz @roocode Totally agree; I’d love to come up with a new set of benchmarks that are designed to show off the strengths of subtasks.
1
0
2
24
Henry Moran@soyhenryxyz·
@roocode Would love to see some evals that are somehow based on subtasks from Orchestrator
1
0
1
217
Roo Code@roocode·
We're continuing to flesh out roocode.com/evals to give a better sense of the performance and cost of different popular models. As we expected, GPT 4.1 is on par with Gemini 2.5 Pro Preview, but GPT 4.1 Mini's performance was a surprise: it has an extremely good price-to-performance ratio, even beating out DeepSeek V3. What do you think? Does this match your experience with the new GPT 4.1 models so far?
Roo Code tweet media
4
8
60
4K
Universe@setkyarwalar·
@roocode Mini is not bad. Nano is useless.
1
0
2
334
Chris E@cte·
@cdossman @roocode I think it's on pace to be slightly below Sonnet 3.7 and Gemini 2.5. The price to intelligence ratio of 4.1 mini seems to be trending really well...
0
0
2
97
Chris@cdossman·
@roocode Better than Gemini 2.5 Pro?
1
0
1
116
Chris E@cte·
@mattpocockuk The Aider polyglot benchmarks are a good start. I wired up the Cursor-like product I’m working on to run the benchmarks and see how it compares to the publicly available scores.
0
0
0
29
Matt Pocock@mattpocockuk·
Since AI tools like Cursor/Windsurf basically have no moat... ...how do you get a sense for which is better? Tempted to make a set of evals for testing the performance and latency of each AI editor
54
5
175
38.6K
Chris E retweeted
Louis Virtel@louisvirtel·
I love how Mitt Romney reappears once every three months to outshine the entire Republican Party by doing the absolute least.
638
12.7K
136K
0
Chris E retweeted
Carol Leonnig@CarolLeonnig·
Real journalists provide facts to inform the public, not to lead them astray or put them in danger.
18
155
454
0
Chris E retweeted
Pete Buttigieg@PeteButtigieg·
Amazingly, the chyron is not the most foolish thing about this picture. To get ahead of a potential refugee crisis caused by great suffering in Central America, it would make sense to use our resources to help reduce that suffering. This is self-defeating.
Pete Buttigieg tweet media
1.1K
4.8K
27.1K
0