16x Eval

40 posts

@16xEval

The Simplest Way to Test Models and Prompts. 16x Eval is your personal workspace for prompt engineering.

Joined April 2025
2 Following · 128 Followers
Pinned Tweet
16x Eval
16x Eval@16xEval·
16x Eval is the simplest way to create and run evals on prompts and models. Download and try it out for free: eval.16x.engineer
16x Eval
16x Eval@16xEval·
Added support for custom model cost configuration to track costs for custom models and OpenRouter models.
16x Eval
16x Eval@16xEval·
16x Eval 0.0.71 released. Highlights: added support for background evaluation, so evaluations run without blocking the UI.
16x Eval reposted
Latteant 👾
Latteant 👾@latteant·
the swe-bench leakage where agents can see the future commit isn't a surprise. our eval culture over-indexes on SOTA-chasing. we aren't training robust agents. we're training expert benchmark hackers.
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
I just shipped the biggest update to 16x Eval: Custom JavaScript code as evaluation function. Now you can write your own custom evaluation function in JavaScript to evaluate the model responses.
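The custom JavaScript evaluation function announced above could be used along these lines. This is a hypothetical sketch: the actual function name, argument, and scoring contract of 16x Eval are not shown in these posts.

```javascript
// Hypothetical custom evaluation function: the name, argument, and
// return convention are illustrative, not 16x Eval's documented API.
function evaluate(response) {
  // Score a model response on two simple checks: it should contain
  // code and stay within a length budget.
  const mentionsCode = response.includes("function");
  const withinBudget = response.length < 4000;
  return (mentionsCode ? 0.5 : 0) + (withinBudget ? 0.5 : 0);
}
```

A function like this turns a subjective "does the response look right?" check into a repeatable score that can be compared across models.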
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
It's working! Run custom JavaScript code as evaluation function inside @16xEval.
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
Alright. I see everyone is building eval products now and getting ahead of @16xEval in some ways. It's time to focus back on the product instead of just marketing. Running custom JavaScript code for automated evals will be the next priority feature I'm shipping.
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
After 2 hours of refactoring and fixing, I finally got my eval app @16xEval to automatically recognize different reasoning effort and display them as separate columns in benchmarks. Now it is time to test GPT-5 high reasoning on my coding tasks!
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
New landing page for @16xEval
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
Got the unique model id working by adding a new release date field to the model, to differentiate between DeepSeek V3 and DeepSeek V3.1, which share the same model id `deepseek-chat`. @16xEval now supports DeepSeek V3.1 & DeepSeek V3.1 (Thinking Mode) via the DeepSeek API.
Zhu Liang@paradite_

I spent the entire day thinking about how to handle different DeepSeek models (V3, V3 new, V3.1) that use the same model id `deepseek-chat` in the API for my eval app @16xEval. This is worse than supporting provider-only syntax for OpenRouter. I haven't thought of a good solution yet.

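The release-date approach described above can be sketched like this. Field names and dates are illustrative placeholders, not 16x Eval's actual schema:

```javascript
// Two DeepSeek models share the API id "deepseek-chat"; a release
// date field makes each record unique. Dates here are placeholders.
const models = [
  { apiId: "deepseek-chat", name: "DeepSeek V3", releaseDate: "2024-12-01" },
  { apiId: "deepseek-chat", name: "DeepSeek V3.1", releaseDate: "2025-08-01" },
];

// Combine the API id with the release date to get a unique key.
const uniqueKey = (m) => `${m.apiId}@${m.releaseDate}`;
```

The key stays stable even when a provider silently reuses one API id for several model generations.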
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
Prompt engineering and evals are more important than ever. New models like GPT-5 have their own quirks that require experimentation and prompt tuning. @16xEval was built exactly for that. Prompt engineering and model evaluation all in one place: eval.16x.engineer
edwin@edwinarbus

good advice from @__ruiters: GPT-5 isn't broken. Your prompts are.

A lot of folks, myself included, expected GPT-5 to be fungible in the sense that you could drop it right into your existing workflows and it would "just work." But the depth of the GPT-5 prompt guide makes it clear: this is a major change; a major version bump, if you will.

Even the Cursor team, who led pilot adoption for the new model, called it out:

> "GPT-5 is one of the most steerable models I've used. I've needed to be more explicit about what I was trying to accomplish for some tasks. Leaving things vague ended up with the model taking a different direction than I expected. When I was more specific, I was surprised by how smart the model was."

Users' frustration is Hyrum's Law in action:

> With a sufficient number of users of an API, it doesn't matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

This basically means that OpenAI is big enough that, no matter what changes they make or how much better the models are, people will always complain, because they have built their systems to depend on the behavior of the models at the time.

The OpenAI team has said it themselves: GPT-5 is extremely steerable. This is both feature and bug:
* It will do what you tell it to
* But you have to know what you want it to do, or at least articulate your intent better

At HYBRD, I've been using GPT-5 and finding it works quite well with appropriate prompting, and quite poorly without. If you're shipping on GPT-5:
1. Treat prompts like code: version, test, and review them. -- THIS IS THE MOST IMPORTANT ONE!
2. Read the prompting guide and understand how this model actually works (link in comments)
3. Run your prompts through the OpenAI Prompt Optimizer (link in comments)
4. Ask it to plan its approach before touching any code.

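Point 1 above ("treat prompts like code") can be made concrete with a versioned prompt store plus a lint-style test. Everything here is a hypothetical sketch, not part of any quoted tool:

```javascript
// Hypothetical versioned prompt store. Keeping a changelog per version
// lets prompt changes be reviewed like code changes.
const prompts = {
  "summarize@v2": {
    text: "You are a concise assistant. Summarize the input in 3 bullet points.",
    changelog: "v2: made the output format explicit for GPT-5 steerability",
  },
};

// A lint-style prompt test: steerable models reward explicit
// instructions, so require the prompt to state a concrete output format.
function lintPrompt(prompt) {
  return /\d+ bullet points?/.test(prompt.text);
}
```

Running checks like this in CI catches a prompt regression before it silently changes model behavior in production.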
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
Updated @16xEval to support GPT-5 models. It is time to test and do our own evals!
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
Finished my Horizon Alpha coding evaluation via @openrouter. An interesting model with some unique characteristics. Horizon Alpha beats Kimi K2, but still behind top models like Claude Sonnet 4 and GPT-4.1.
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
Nice. @16xEval blog is confirmed to be indexed by Grok xAI.
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
Seeing more prompt engineering and model eval products launching. Great validation that @16xEval is solving a real and emerging problem. 16x Eval is a simple desktop eval app that has no subscription, no login. 👉 eval.16x.engineer
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
Finished testing Grok 4 on my personal eval set: Solid model across coding, writing and image analysis, but the drawback is slow response time. Key results:
[image: key results table]