Fiction.live
449 posts

Fiction.live
@ficlive
Read and control interactive stories. Talk to writers. Suggest your own ideas and debate with other fans. Vote for what happens next.
Joined November 2012
36 Following 968 Followers

Had to add gemini-3-flash-preview to the results.
It dominates.
Clearly the top model on this benchmark.
Hopefully we can get a v2 of this bench out sometime soon.

Fiction.live @ficlive
Long context eval. Huge improvement since last year. The frontier models went from poor to great. An exciting standout is kimi-2.5. It made impressive progress without (presumably) a new architecture, putting up gemini-2.5-pro numbers which we were all impressed by last year.

@ficlive @scaling01 Opus and sonnet at 0? Is this a glitch or something?

@k0tovsk1y Been working on a better one for the past few months, hope to get it out soon. But at the same time these models are just now good IMO, and you'll start seeing that in the real world in terms of agentic workflows starting to work frfr.

@teortaxesTex I might, but my bench is saturated by gpt-5. It's meaningfully better than gemini 2.5 and the bench did not reflect that. I will be back with a better eval.
Fiction.live retweeted

@DavidSZD1 Didn't finish the entire benchmark but there was no change from previous results for flash.

@dhtikna Sometimes the reasoning puts it over the token limit and the call fails.

@gusarich Yes it's surprisingly low for DeepSeek 3.2, I guess you have to pay the piper somewhere for the sparsity.













