Chris E
@cte

102 posts

Helping build @roo_code.

Los Altos Hills, CA · Joined April 2007
273 Following · 747 Followers
Chris E retweeted
GosuCoder@GosuCoder·
September 2025 evals featuring GPT 5, Grok Code, Claude 4 Sonnet, Claude 4 Opus, and Qwen 3 Coder are now uploading. This was by far the largest test run I've done to date, which is a big part of why I need to figure out ways to automate as much of this as possible. 1. Some crazy upsets, in my opinion. 2. Claude Code continues to fall in the overall ranking, which is concerning... 3. Grok Code Fast shows some promise but seems to get off track easily, so I'm wondering how it would perform in an existing large codebase. For example, it would be nearing completion of an eval, see a terminal error, and then go down a rabbit hole making things worse trying to fix it. In the real world, though, the programmer should catch that and redirect it. Video should drop in about an hour, once it's done uploading and processing.
GosuCoder tweet media
27
10
135
8.4K
Roo Code@roocode·
Noticed a spike in @Zai_org's new GLM-4.5 model usage on @openrouter, so we ran our own eval using Roo Code. ✅ Scored 86, slightly better than Qwen3-Coder 💸 Cost us about $27 to run 📊 Solid value for the performance Are you seeing similar results?
Roo Code tweet media
22
23
263
27.6K
Chris E retweeted
Roo Code@roocode·
Roo Code 3.20.0 | THIS IS A BIG ONE!! Marketplace for extensions AND modes, concurrent file edits and reads, and numerous other improvements and bug fixes. docs.roocode.com/update-notes/v…
7
16
140
6.1K
Chris E@cte·
Your take isn't spicy enough!😂 My sense is that there are trade-offs with all of these tools and in the long run I wouldn't bet against giving these models more tools and letting them judge which is most appropriate given the constraints. It would be nice to have some eval data backing these takes (we're working on that).
0
0
1
60
GosuCoder@GosuCoder·
Fascinating take. I disagree overall, and have a hard time seeing where this is coming from: RAG (vector-embedding based) has its place. 1. It can lower costs - very important. 2. It can be worse at some searches and better at others. In fact, different AI models work with these systems better than others. 3. It can be faster than reading code like a human. AI doesn't need to operate like a human - why limit how we navigate code to what a human does? 4. The security argument could be made about anything: source control, sending data to the LLM itself, etc. Local embedding is arguably safer. 5. Large context windows also aren't coming fast enough, and even when we do have them, recall isn't great. And the cost of using all that context is nuts. 6. It feels like the author isn't taking into account time and token count, which also equate to real value. 7. I also think completely ignoring RAG is a mistake, as none of us know where advancements will go. Is it perfect? No, but it does have a place. Just look at Augment Code, which has the best coding context engine on the market, and see how much people love it.
Cline@cline

Cline doesn't index your codebase. No RAG, no embeddings, no vector databases. This isn't a limitation -- it's a deliberate design choice. As context windows increase, this approach enhances Cline's ability to understand your code. Here's why. 🧵

4
2
12
1.4K
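The thread above debates embedding-based RAG for code search. Stripped to its core, the retrieval step is just nearest-neighbor search over vectors. Here is a toy sketch of that mechanic, with hand-made 3-d vectors standing in for a real embedding model; the file names and vector values are invented purely for illustration:

```python
import math

# Toy "embeddings": in a real system these come from an embedding model;
# here they are hypothetical 3-d vectors keyed by file path.
code_index = {
    "auth/login.py":    [0.9, 0.1, 0.0],
    "db/migrations.py": [0.1, 0.8, 0.2],
    "ui/button.tsx":    [0.0, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, index, top_k=1):
    """Rank indexed files by similarity to the query embedding."""
    ranked = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [path for path, _ in ranked[:top_k]]

# A query embedded near the "auth" direction should surface auth/login.py.
print(search([0.8, 0.2, 0.1], code_index))  # -> ['auth/login.py']
```

The cost and speed arguments in the thread come from exactly this step: ranking a pre-built index is cheap compared to feeding whole files through the model's context window.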
OpenRouter@OpenRouter·
Gemini caching is now live! 75% off prompt token prices, and we made the API extremely simple: Just set *cache_control* on a message. Exactly the same as Anthropic, and others to come👇
OpenRouter tweet media
18
46
546
47.6K
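OpenRouter's announcement says caching is enabled by setting cache_control on a message, mirroring Anthropic's content-block format. A minimal sketch of what such a request body and the "75% off" discount math could look like; the model name, token count, and per-token price below are hypothetical placeholders, not confirmed values:

```python
import json

# Sketch of a chat request using Anthropic-style prompt caching: mark a
# large, rarely-changing block with cache_control so repeat requests can
# bill it at the discounted cached rate.
payload = {
    "model": "google/gemini-2.5-pro-preview",  # placeholder model name
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "<large, rarely-changing project context to cache>",
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Summarize the cached context."},
    ],
}
body = json.dumps(payload)  # this JSON would be POSTed to the chat completions endpoint

# Back-of-envelope for the "75% off" claim: cached prompt tokens bill at 25%.
prompt_tokens = 100_000
price_per_token = 1.25e-6          # hypothetical $1.25 per million tokens
full_cost = prompt_tokens * price_per_token
cached_cost = full_cost * 0.25     # 75% discount on cached prompt tokens
print(round(full_cost - cached_cost, 6))  # -> 0.09375 saved per repeat request
```

The practical upshot for an agent like Roo Code is that the big system prompt and file context only bill at full price once per cache window, which matters most on long multi-turn sessions.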
Chris E@cte·
@pingToven 😬 Agreed... I'm trying to implement caching for the Gemini provider in Roo Code and I'm sure that I'll get it wrong on the first attempt.
0
0
3
67
Toven@pingToven·
@OfficialLoganK not gonna lie all this variable pricing stuff is really rough
Toven tweet media
1
0
2
279
Logan Kilpatrick@OfficialLoganK·
Context caching updates in the Gemini API: - ✅ Added support for 2.0 Flash - ✅ Added support for 2.5 Pro Preview - 📉 Reduced min context size from 32K down to 4K Much more to come still, please send any feedback on the experience!
150
65
1.7K
117.6K
Chris E@cte·
@soyhenryxyz @GosuCoder This is amazing. The next big push on evals is going to be testing various orchestration configurations and showing some data that backs our intuition about it, so I'd love to help.
1
0
1
23
GosuCoder@GosuCoder·
Finally releasing a test I've been running for almost a month now. The idea was simple: can you orchestrate tasks using RooCode to produce similar results with less capable models?
GosuCoder tweet media
1
0
11
280
Bindu Reddy@bindureddy·
o4-mini may be the REAL STORY. It ALSO beats Gemini 2.5 AND is about 2x cheaper! It also comes with prompt caching 💃 A great model to throw into the mix when you are building god-tier AI 😎 Our launch will be on Friday!
Bindu Reddy tweet media
39
57
648
71.1K
Sacha Sayan@sachasayan·
brb, fixing roo code again
Sacha Sayan tweet media
1
0
5
2K
GosuCoder@GosuCoder·
@bindureddy Can you describe more of what this benchmark entails? What coding benchmarks are you running? Early tests don't make it feel like o4-mini High is anywhere near Gemini 2.5 Pro. What languages, etc.?
3
1
15
2.1K
Chris E@cte·
@soyhenryxyz @roocode Totally agree; I’d love to come up with a new set of benchmarks that are designed to show off the strengths of subtasks.
1
0
2
24
Henry Moran@soyhenryxyz·
@roocode Would love to see some evals that are somehow based on subtasks from Orchestrator
1
0
1
217
Roo Code@roocode·
We're continuing to flesh out roocode.com/evals to give a better sense of the performance and cost of different popular models. As we expected, GPT 4.1 is on par with Gemini 2.5 Pro Preview, but GPT 4.1 Mini's performance was a surprise: it has an extremely good price-to-performance ratio, even beating out DeepSeek V3. What do you think? Does this match your experience with the new GPT 4.1 models so far?
Roo Code tweet media
4
8
60
4K
Universe@setkyarwalar·
@roocode Mini is not bad. Nano is useless.
1
0
2
334
Chris E@cte·
@cdossman @roocode I think it's on pace to be slightly below Sonnet 3.7 and Gemini 2.5. The price to intelligence ratio of 4.1 mini seems to be trending really well...
0
0
2
97
Chris@cdossman·
@roocode Better than Gemini 2.5 Pro?
1
0
1
116
Chris E@cte·
@mattpocockuk The Aider polyglot benchmarks are a good start. I wired up the Cursor-like product I’m working on to run the benchmarks and see how it compares to the publicly available scores.
0
0
0
29
Matt Pocock@mattpocockuk·
Since AI tools like Cursor/Windsurf basically have no moat... ...how do you get a sense for which is better? Tempted to make a set of evals for testing the performance and latency of each AI editor
54
5
175
38.6K
Chris E retweeted
Louis Virtel@louisvirtel·
I love how Mitt Romney reappears once every three months to outshine the entire Republican Party by doing the absolute least.
638
12.7K
136K
0
Chris E retweeted
Carol Leonnig@CarolLeonnig·
Real journalists provide facts to inform the public, not to lead them astray or put them in danger.
18
155
454
0
Chris E retweeted
Pete Buttigieg@PeteButtigieg·
Amazingly, the chyron is not the most foolish thing about this picture. To get ahead of a potential refugee crisis caused by great suffering in Central America, it would make sense to use our resources to help reduce that suffering. This is self-defeating.
Pete Buttigieg tweet media
1.1K
4.8K
27.1K
0