Fiction.live

449 posts

@ficlive

Read and control interactive stories. Talk to writers. Suggest your own ideas and debate with other fans. Vote for what happens next.

Joined November 2012

36 Following · 968 Followers

Fiction.live @ficlive · Feb 16

fiction.live/stories/Fictio…

Fiction.live @ficlive · Feb 16

qwen-3.5 plus and qwen3-max-thinking and opus 4.6

Fiction.live @ficlive · Feb 12

fiction.live/stories/Fictio…

Fiction.live @ficlive · Feb 12

Added glm-5 and minimax-m2.5

Fiction.live @ficlive · Feb 2

@jehrjd45963 I had to double check but pro was medium.

jehrjd @jehrjd45963 · Jan 31

@ficlive thank you!!!! would be really cool if you could also test 5.2 medium. also what reasoning level is 5.2 pro?

Fiction.live @ficlive · Jan 31

Had to add gemini-3-flash-preview to the results. It dominates. Clearly the top model on this benchmark. Hopefully we can get a v2 of this bench out sometime soon.

Fiction.live@ficlive

Long context eval. Huge improvement since last year. The frontier models went from poor to great. An exciting standout is kimi-k2.5. It made impressive progress without (presumably) a new architecture, putting up gemini-2.5-pro numbers, which we were all impressed by last year.

Fiction.live @ficlive · Jan 31

@jehrjd45963 Yes, gpt-5.2 is xhigh.

jehrjd @jehrjd45963 · Jan 31

@ficlive it's really important that you mention the reasoning levels on the models that have different possibilities. like for the GPT-5.2 models it's entirely different models based on the different reasoning levels so please mention that

Fiction.live @ficlive · Jan 30

gemini-3-pro-preview improves upon the strong results of gemini-2.5-pro and is now neck and neck with gpt-5.2 on top in the "almost perfect" tier.

Fiction.live @ficlive · Jan 30

Fiction.live @ficlive · Jan 30

@xzenova @scaling01 I think the reasoning ran out of context.

Xzenova @xzenova · Jan 30

@ficlive @scaling01 Opus and sonnet at 0? Is this a glitch or something

Fiction.live @ficlive · Jan 30

@k0tovsk1y Been working on a better one for the past few months, hope to get it out soon. But at the same time these models are just now good IMO, and you'll start seeing that in the real world in terms of agentic workflows starting to work frfr.

Kotovskiy @k0tovsk1y · Jan 30

@ficlive GPT-5.2 and Gemini-3-Pro basically maxed out this benchmark. There is a need for something more difficult.

Fiction.live @ficlive · Jan 30

claude-opus-4-5 fixed claude's long context performance, it is now good when previously it was a laggard. claude-sonnet-4-5 had a regression compared to sonnet 4… Same tier as grok-4.

Fiction.live @ficlive · Jan 30

Kimi-k2.5 now the Chinese/Open-source leader! Minimax??? gpt-5.2 improves on almost perfection in gpt-5 to now very close to perfect. gpt-5.2-pro did surprisingly poorly.

Fiction.live @ficlive · Jan 30

fiction.live/stories/Fictio…

Fiction.live @ficlive · Dec 12

@teortaxesTex I might, but my bench is saturated by gpt-5. It's meaningfully better than gemini 2.5 and the bench did not reflect that. I will be back with a better eval.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex · Dec 11

@ficlive Update with V3.2?

Fiction.live @ficlive · Sep 29

Fiction.LiveBench for Long Context Deep Comprehension adds: deepseek-v3.2-exp [reasoning: high], deepseek-v3.2-exp, nemotron-nano-9b-v2:free, qwen-max, qwen3-next-80b-a3b-instruct.

Fiction.live retweeted

Sam Paech @sam_paech · Oct 12

Some updates to Spiral Bench:
- A more detailed rubric for protective vs delusion-reinforcing behaviours
- Responses evaluated by a judge ensemble: sonnet-4.5, gpt-5 & kimi-k2
- New models evaluated: qwen3-235b, glm-4.6, grok-4-fast, mistral-medium-3.1

Fiction.live @ficlive · Sep 30

@DavidSZD1 Didn't finish the entire benchmark but there was no change from previous results for flash.

DavidSZD @DavidSZD1 · Sep 29

@ficlive Cool. All that's missing is September's Gemini Flash and flash lite and it's perfect.

Fiction.live @ficlive · Sep 30

@dhtikna Sometimes the reasoning puts it over the token limit and the call fails.

Ankith 🐋/acc @dhtikna · Sep 30

@ficlive why didn't the V3.2 reasoner high have a score for 120k?

Fiction.live @ficlive · Sep 30

@gusarich Yes it's surprisingly low for DeepSeek 3.2, I guess you have to pay the piper somewhere for the sparsity.

Daniil Sedov @Gusarich · Sep 29

@ficlive what's up with that 0/400/1k result?

Fiction.live @ficlive · Sep 29

Thoughts: Interesting that we see an improvement for deepseek's reasoning mode but no improvement for the non-reasoning mode. It has high scores on the easier questions but very low scores on the hard ones. grok-4-fast is fairly close to sonoma-sky-alpha while still being free.
