Generative History

286 posts

@HistoryGPT

Exploring how historians can engage with Generative AI

Waterloo · Joined March 2023
105 Following · 1.7K Followers
Pinned Tweet
Generative History@HistoryGPT·
Gemini 3 Flash is as good at reading handwriting as the average human (Pro is at expert-human level). It is much better than both GPT-5.2 and Opus 4.5, with a character-level error rate of 1.43% and a word-level error rate of 2.74%. That is a 47-63% improvement over 2.5 Flash, the same leap we saw with Pro. At a fraction of a cent per page, this is a big deal. Read more about the Gemini models on handwriting: generativehistory.substack.com/p/gemini-3-sol…
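The character- and word-level error rates quoted throughout this thread are standard HTR metrics: edit distance between the model's transcription and a ground-truth transcription, normalized by reference length. A minimal sketch of how CER and WER are typically computed (the benchmark's exact scoring code is not shown here, so treat this as an illustration, not the author's implementation):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    # Classic dynamic-programming formulation, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def wer(reference, hypothesis):
    """Word error rate: edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

# One substituted character out of 20 gives a CER of 5%.
print(cer("the quick brown fox!", "the quick brawn fox!"))
```

A 1.43% CER therefore means roughly one character error per seventy characters of reference text; WER is always higher because a single character error usually ruins a whole word.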
Generative History@HistoryGPT·
@tszzl Had that realization today. Prototyped an agentic document segmentation pipeline in a morning, spent two weeks “perfecting” it…and it’s marginally better. Ugh.
roon@tszzl·
common pattern: have an idea, make a horrifically messy implementation of it, see that it’s promising, and then spend twice as much time “cleaning it up” and “doing it right” only to realize you got 80% of what you were going to get out of it the first time
Generative History@HistoryGPT·
Not sure how one sustains such an argument in the face of co-work. The sad thing is that we desperately need critical voices to help shape the debate around how we are going to adapt to AI, but those voices will be ignored if they aren’t engaging with what is actually happening in the world. And that cedes all the power to the AI companies.
Kevin Roose@kevinroose·
judging from responses to the AI vs. human writing quiz, twitter appears to be in the bargaining/depression stage of the Kübler-Ross process, while bluesky is firmly in the denial/anger stage.
Generative History@HistoryGPT·
This was a really tough batch: the middle pages of a few letters were quite ambiguous as they began in a similar way and could plausibly fit in a few places. To figure it out you had to find clues in the text. /2
Generative History@HistoryGPT·
The frontier continues to be jagged. The new GPT-5.4 scores far behind Gemini 3 Pro and Opus 4.6 on our Historical Handwritten Text Recognition benchmark: a modified WER of 11.14% and CER of 7.72%. Gemini 3 Pro scores 1.33% and 0.69%, while Opus 4.6 scores 4.2% and 2.29%.
Generative History@HistoryGPT·
Another interesting point: their best model on handwriting by our measure to date was GPT-4.5. Suggests they went hard at visual before GPT-5 and then changed course.
Generative History@HistoryGPT·
OpenAI models have always been well behind on this task while excelling at others. It is almost certainly a training-data and emphasis effect, but interesting nonetheless.
Generative History@HistoryGPT·
Same thing on transcription and the clock test: you can see them start to second-guess themselves, rationalize, and then go all in on errors. It makes sense: on snap-judgement tasks, where something is either right or wrong, thinking budgets force the model to keep going over an essentially solved problem.
Chubby♨️@kimmonismus·
BullshitBench v2, created by Peter Gostev, is a benchmark that does something refreshingly different: it tests whether AI models can detect and reject nonsensical prompts instead of confidently rolling with them. Only Anthropic's Claude models and Alibaba's Qwen 3.5 score meaningfully above 60% on nonsense detection. OpenAI and Google? Stuck, and not improving. Even more surprising: reasoning models that "think harder" actually perform worse; they use their extra compute to rationalize the nonsense rather than reject it.
Peter Gostev (SF: 29 Mar - 3 Apr)@petergostev

V2 has 100 questions and 70+ model variants tested (model + reasoning levels) - Anthropic and Qwen 3.5 are only models that are much above 60%.

Nate Silver@NateSilver538·
@jachiam0 I like this observation but from the outside view it feels more like the transition between the early game and the midgame.
Joshua Achiam@jachiam0·
People in the AI community have for years casually described the AGI development time period as having an "early game," a "midgame," and an "endgame"; I think the DoW-Anthropic rupture is as clear a demarcation as one could expect between midgame and endgame.
Johannes Boehm@johannesmboehm·
@HistoryGPT On the Feb 19 blog post, you write that you got best results with temperature at 0.0, not with 0.3. Could you clarify? (and thanks for all the work!)
Generative History@HistoryGPT·
On handwriting recognition, Gemini 3.1 Pro Preview is slightly worse than 3.0, which reflects the small drop in multimodal scores @GoogleDeepMind reported today. Excluding ambiguous capitalization and punctuation, it has 0.92% character / 2.15% word level error rates. (1/5)
Generative History@HistoryGPT·
@kimmonismus Labs really need to work with non-STEM fields too. If programming is essentially becoming properly bounded problem articulation in natural language, that often requires a different set of skills.
Chubby♨️@kimmonismus·
Demis Hassabis: The best modern inventions arise from the intersection of 2+ subjects. Think DeepMind: neuroscience, engineering, and machine learning. Or Isomorphic: ML, chemistry, and biology. Become an expert in multiple fields and find connections.
Generative History@HistoryGPT·
@rohanpaul_ai Interesting idea, but not sure you would have the necessary mass of pre-1911 data to train a frontier model. And no synthetic data either.
Rohan Paul@rohanpaul_ai·
Demis Hassabis’s “Einstein test” for defining AGI: Train a model on all human knowledge but cut it off at 1911, then see if it can independently discover general relativity (as Einstein did by 1915); if yes, it’s AGI.
Generative History@HistoryGPT·
I know you know this, but the Industrial Revolution also played out over several stages and hundreds of years. People had time to adapt and it *still* led to massive upheaval and political changes. What happens when you compress a similar level of change into a single generation? It might well be explosive.
Ethan Mollick@emollick·
I would add that when imagining backlash, people think of Dune’s Butlerian Jihad or the Luddites. But what those fights actually looked like during the previous Industrial Revolutions was regulation, redistribution, nationalization, unions & safety nets. Could expect similar
Nate Silver@NateSilver538

If AI produces unprecedented levels of technological disruption on time scales that are an order of magnitude or two faster than anything in human history, it's going to be an unprecedented political fight. And FWIW, the timelines potentially line up with the 2028 U.S. election.

Generative History@HistoryGPT·
I think it still depends on the use case. For systems meant to prioritize auditability and transparency, deterministic approaches might be preferable. Truly “modern” agentic systems are more difficult to audit because they often write code on the fly, so the specific methods used to complete a given task vary. It depends on whether the goal is complete AI automation (the former) or AI augmentation (the latter).
Ethan Mollick@emollick·
All those products where building an "AI agent" meant defining a series of basic prompts linked together deterministically through a flowchart with separate RAG inputs are looking pretty dated right about now (yes, that is basically every agent product released in 2025)
Generative History@HistoryGPT·
Why the difference? Not sure but it’s what we’ve seen with other models too (Claude and OpenAI) as they are optimized more for thinking based pipelines. Remember: increasing thinking budgets always decreases accuracy on handwriting recognition. (4/5)
Generative History@HistoryGPT·
For comparison, Gemini 3.0 Pro scores a modified CER of 0.69% and WER of 1.33%. Its strict scores are a CER of 1.67% and WER of 4.42%. Same API settings. (3/5)
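The gap between modified and strict scores comes from normalizing away capitalization and punctuation before scoring, since both are genuinely ambiguous in period handwriting. The benchmark's exact normalization rules aren't given here, so this is only an illustrative sketch of that kind of preprocessing:

```python
import string

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace so that
    ambiguous capitalization/punctuation choices are not scored as errors."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

reference  = "Dear Sir, I remain, your obedient Servant."
hypothesis = "Dear Sir I remain your obedient servant"

# A strict comparison counts every case and punctuation difference;
# after normalization the two transcriptions match exactly.
print(normalize(reference) == normalize(hypothesis))  # True
```

Running CER/WER on the normalized strings rather than the raw ones is what turns a strict 1.67% CER into a modified 0.69%: the remaining errors are genuine misreadings, not disagreements over commas and capitals.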