16x Eval

40 posts

@16xEval

The Simplest Way to Test Models and Prompts. 16x Eval is your personal workspace for prompt engineering.

Joined April 2025
2 Following · 128 Followers
Pinned Tweet
16x Eval
16x Eval@16xEval·
16x Eval is the simplest way to create and run evals on prompts and models. Download and try it out for free: eval.16x.engineer
16x Eval
16x Eval@16xEval·
Added support for custom model cost configuration to track costs for custom models and OpenRouter models.
16x Eval
16x Eval@16xEval·
16x Eval 0.0.71 released. Highlights: added support for background evaluation, so evaluations run without blocking the UI.
16x Eval reposted
Latteant 👾
Latteant 👾@latteant·
the swe-bench leakage where agents can see the future commit isn't a surprise. our eval culture over-indexes on SOTA-chasing. we aren't training robust agents. we're training expert benchmark hackers.
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
I just shipped the biggest update to 16x Eval: Custom JavaScript code as evaluation function. Now you can write your own custom evaluation function in JavaScript to evaluate the model responses.
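The custom JavaScript evaluation function announced above could be used along these lines. This is a hypothetical sketch: the actual function name, argument, and scoring contract of 16x Eval are not shown in these posts.

```javascript
// Hypothetical custom evaluation function: the name, argument, and
// return convention are illustrative, not 16x Eval's documented API.
function evaluate(response) {
  // Score a model response on two simple checks: it should contain
  // code and stay within a length budget.
  const mentionsCode = response.includes("function");
  const withinBudget = response.length < 4000;
  return (mentionsCode ? 0.5 : 0) + (withinBudget ? 0.5 : 0);
}
```

A function like this turns a subjective "does the response look right?" check into a repeatable score that can be compared across models.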
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
It's working! Run custom JavaScript code as evaluation function inside @16xEval.
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
Alright. I see everyone is building eval products now and getting ahead of @16xEval in some ways. It's time to focus back on the product instead of just marketing. Running custom JavaScript code for automated evals will be the next priority feature I'm shipping.
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
After 2 hours of refactoring and fixing, I finally got my eval app @16xEval to automatically recognize different reasoning effort and display them as separate columns in benchmarks. Now it is time to test GPT-5 high reasoning on my coding tasks!
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
New landing page for @16xEval
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
Got the unique model id working by adding a new release date field to the model, to differentiate between DeepSeek V3 and DeepSeek V3.1, which share the same model id `deepseek-chat`. @16xEval now supports DeepSeek V3.1 & DeepSeek V3.1 (Thinking Mode) via the DeepSeek API.
Zhu Liang@paradite_

I spent the entire day thinking about how to handle different DeepSeek models (V3, V3 new, V3.1) that use the same model id `deepseek-chat` in the API for my eval app @16xEval. This is worse than supporting provider-only syntax for OpenRouter. I haven't thought of a good solution yet.

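The release-date approach described above can be sketched like this. Field names and dates are illustrative placeholders, not 16x Eval's actual schema:

```javascript
// Two DeepSeek models share the API id "deepseek-chat"; a release
// date field makes each record unique. Dates here are placeholders.
const models = [
  { apiId: "deepseek-chat", name: "DeepSeek V3", releaseDate: "2024-12-01" },
  { apiId: "deepseek-chat", name: "DeepSeek V3.1", releaseDate: "2025-08-01" },
];

// Combine the API id with the release date to get a unique key.
const uniqueKey = (m) => `${m.apiId}@${m.releaseDate}`;
```

The key stays stable even when a provider silently reuses one API id for several model generations.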
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
Prompt engineering and evals are more important than ever. New models like GPT-5 have their own quirks that require experimentation and prompt tuning. @16xEval was built exactly for that. Prompt engineering and model evaluation all in one place: eval.16x.engineer
edwin@edwinarbus

good advice from @__ruiters: GPT-5 isn't broken. Your prompts are.

A lot of folks, myself included, expected GPT-5 to be fungible in the sense that you could drop it right into your existing workflows and it would "just work." But the depth of the GPT-5 prompt guide makes it clear: this is a major change; a major version bump, if you will.

Even the Cursor team, who led pilot adoption for the new model, called it out:

> "GPT-5 is one of the most steerable models I've used. I've needed to be more explicit about what I was trying to accomplish for some tasks. Leaving things vague ended up with the model taking a different direction than I expected. When I was more specific, I was surprised by how smart the model was."

Users' frustration is Hyrum's Law in action:

> With a sufficient number of users of an API, it doesn't matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

This basically means that OpenAI is big enough that, no matter what changes they make or how much better the models are, people will always complain, because they have built their systems to depend on the behavior of the models at the time.

The OpenAI team has said it themselves: GPT-5 is extremely steerable. This is both feature and bug:
* It will do what you tell it to
* But you have to know what you want it to do, or at least articulate your intent better

At HYBRD, I've been using GPT-5 and finding it works quite well with appropriate prompting, and quite poorly without. If you're shipping on GPT-5:
1. Treat prompts like code: version, test, and review them. -- THIS IS THE MOST IMPORTANT ONE!
2. Read the prompting guide and understand how this model actually works (link in comments)
3. Run your prompts through the OpenAI Prompt Optimizer (link in comments)
4. Ask it to plan its approach before touching any code.

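Point 1 above ("treat prompts like code") can be made concrete with a versioned prompt store plus a lint-style test. Everything here is a hypothetical sketch, not part of any quoted tool:

```javascript
// Hypothetical versioned prompt store. Keeping a changelog per version
// lets prompt changes be reviewed like code changes.
const prompts = {
  "summarize@v2": {
    text: "You are a concise assistant. Summarize the input in 3 bullet points.",
    changelog: "v2: made the output format explicit for GPT-5 steerability",
  },
};

// A lint-style prompt test: steerable models reward explicit
// instructions, so require the prompt to state a concrete output format.
function lintPrompt(prompt) {
  return /\d+ bullet points?/.test(prompt.text);
}
```

Running checks like this in CI catches a prompt regression before it silently changes model behavior in production.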
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
Updated @16xEval to support GPT-5 models. It is time to test and do our own evals!
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
Finished my Horizon Alpha coding evaluation via @openrouter. An interesting model with some unique characteristics. Horizon Alpha beats Kimi K2, but still behind top models like Claude Sonnet 4 and GPT-4.1.
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
Nice. @16xEval blog is confirmed to be indexed by Grok xAI.
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
Seeing more prompt engineering and model eval products launching. Great validation that @16xEval is solving a real and emerging problem. 16x Eval is a simple desktop eval app that has no subscription, no login. 👉 eval.16x.engineer
16x Eval reposted
Zhu Liang
Zhu Liang@paradite_·
Finished testing Grok 4 on my personal eval set: Solid model across coding, writing and image analysis, but the drawback is slow response time. Key results:
[image: key results table]