Fiction.live

449 posts

@ficlive

Read and control interactive stories. Talk to writers. Suggest your own ideas and debate with other fans. Vote for what happens next.

Joined November 2012
36 Following · 968 Followers
Fiction.live@ficlive·
qwen-3.5 plus and qwen3-max-thinking and opus 4.6
Fiction.live tweet media
Fiction.live@ficlive·
Added glm-5 and minimax-m2.5
Fiction.live tweet media
jehrjd@jehrjd45963·
@ficlive thank you!!!! would be really cool if you could also test 5.2 medium. also what reasoning level is 5.2 pro?
jehrjd@jehrjd45963·
@ficlive it's really important that you mention the reasoning levels on the models that have different possibilities. like for the GPT-5.2 models it's entirely different models based on the different reasoning levels so please mention that
Fiction.live@ficlive·
gemini-3-pro-preview improves upon the strong results of gemini-2.5-pro and is now neck and neck with gpt-5.2 on top in the "almost perfect" tier.
Fiction.live@ficlive·
Long context eval. Huge improvement since last year. The frontier models went from poor to great. An exciting standout is kimi-k2.5. It made impressive progress without (presumably) a new architecture, putting up the gemini-2.5-pro numbers that impressed us all last year.
Fiction.live tweet media
Fiction.live@ficlive·
@k0tovsk1y Been working on a better one for the past few months, hope to get it out soon. But at the same time these models are just now good IMO, and you'll start seeing that in the real world in terms of agentic workflows starting to work frfr.
Kotovskiy@k0tovsk1y·
@ficlive GPT-5.2 and Gemini-3-Pro basically maxed out this benchmark. There is a need for something more difficult
Fiction.live@ficlive·
claude-opus-4-5 fixed claude's long context performance; it is now good where previously it was a laggard. claude-sonnet-4-5 had a regression compared to sonnet 4… Same tier as grok-4.
Fiction.live@ficlive·
Kimi-k2.5 is now the Chinese/Open-source leader! Minimax??? gpt-5.2 improves on gpt-5's near perfection and is now very close to perfect. gpt-5.2-pro did surprisingly poorly.
Fiction.live@ficlive·
@teortaxesTex I might, but my bench is saturated by gpt-5. It's meaningfully better than gemini 2.5 and the bench did not reflect that. I will be back with a better eval.
Fiction.live@ficlive·
Fiction.LiveBench for Long Context Deep Comprehension adds: deepseek-v3.2-exp [reasoning: high], deepseek-v3.2-exp, nemotron-nano-9b-v2:free, qwen-max, qwen3-next-80b-a3b-instruct.
Fiction.live tweet media
Fiction.live retweeted
Sam Paech@sam_paech·
Some updates to Spiral Bench:
- A more detailed rubric for protective vs delusion-reinforcing behaviours
- Responses evaluated by a judge ensemble: sonnet-4.5, gpt-5 & kimi-k2
- New models evaluated: qwen3-235b, glm-4.6, grok-4-fast, mistral-medium-3.1
Sam Paech tweet media (3 images)
Fiction.live@ficlive·
@DavidSZD1 Didn't finish the entire benchmark but there was no change from previous results for flash.
DavidSZD@DavidSZD1·
@ficlive Cool. All that's missing is September's Gemini Flash and flash lite and it's perfect.
Fiction.live@ficlive·
@dhtikna Sometimes the reasoning puts it over the token limit and the call fails.
Fiction.live@ficlive·
@gusarich Yes it's surprisingly low for DeepSeek 3.2, I guess you have to pay the piper somewhere for the sparsity.
Fiction.live@ficlive·
Thoughts: Interesting that we see an improvement for deepseek's reasoning mode but no improvement for the non-reasoning. It has high scores on the easier questions but very low scores on the hard ones. grok-4-fast is fairly close to sonoma-sky-alpha while still being free.