Generative History

286 posts

@HistoryGPT

Exploring how historians can engage with Generative AI

Waterloo · Joined March 2023
105 Following · 1.7K Followers
Pinned Tweet
Generative History@HistoryGPT·
Gemini 3 Flash is as good at reading handwriting as the average human (Pro is at expert-human level). It is much better than both GPT-5.2 and Opus 4.5, with a character-level error rate of 1.43% and a word-level error rate of 2.74%. That is a 47-63% improvement over 2.5 Flash, the same leap we saw with Pro. At a fraction of a cent per page, this is a big deal. Read more about the Gemini models on handwriting: generativehistory.substack.com/p/gemini-3-sol…
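The character- and word-level error rates quoted throughout this thread are standard HTR metrics: edit distance between the model's transcription and a ground-truth transcription, normalized by reference length. A minimal sketch of how CER and WER are typically computed (the benchmark's exact scoring code is not shown here, so treat this as an illustration, not the author's implementation):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    # Classic dynamic-programming formulation, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def wer(reference, hypothesis):
    """Word error rate: edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

# One substituted character out of 20 gives a CER of 5%.
print(cer("the quick brown fox!", "the quick brawn fox!"))
```

A 1.43% CER therefore means roughly one character error per seventy characters of reference text; WER is always higher because a single character error usually ruins a whole word.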
Generative History@HistoryGPT·
@tszzl Had that realization today. Prototyped an agentic document segmentation pipeline in a morning, spent two weeks “perfecting” it…and it’s marginally better. Ugh.
roon@tszzl·
common pattern: have an idea, make a horrifically messy implementation of it, see that it’s promising, and then spend twice as much time “cleaning it up” and “doing it right” only to realize you got 80% of what you were going to get out of it the first time
Generative History@HistoryGPT·
Not sure how one sustains such an argument in the face of co-work. The sad thing is that we desperately need critical voices to help shape the debate around how we are going to adapt to AI, but those voices will be ignored if they aren’t engaging with what is actually happening in the world. And that cedes all the power to the AI companies.
Kevin Roose@kevinroose·
judging from responses to the AI vs. human writing quiz, twitter appears to be in the bargaining/depression stage of the Kübler-Ross process, while bluesky is firmly in the denial/anger stage.
Generative History@HistoryGPT·
This was a really tough batch: the middle pages of a few letters were quite ambiguous as they began in a similar way and could plausibly fit in a few places. To figure it out you had to find clues in the text. /2
Generative History@HistoryGPT·
The frontier continues to be jagged. The new GPT-5.4 scores far behind Gemini 3 Pro and Opus 4.6 on our Historical Handwritten Text Recognition benchmark: a modified WER of 11.14% and CER of 7.72%. Gemini 3 Pro scores 1.33% and 0.69%, while Opus 4.6 scores 4.2% and 2.29%.
Generative History@HistoryGPT·
Another interesting point: their best model on handwriting by our measure to date was GPT-4.5. Suggests they went hard at visual before GPT-5 and then changed course.
Generative History@HistoryGPT·
OpenAI models have always been well behind on this task while excelling at others. It is almost certainly a training-data and emphasis effect, but interesting nonetheless.
Generative History@HistoryGPT·
Same thing on transcription and the clock test: you can see them start to second-guess themselves, rationalize, and then go all in on errors. It makes sense: on snap-judgement tasks, where something is either right or wrong, thinking budgets force the model to keep going over an essentially solved problem.
Chubby♨️@kimmonismus·
BullshitBench v2, created by Peter Gostev, is a benchmark that does something refreshingly different: it tests whether AI models can detect and reject nonsensical prompts instead of confidently rolling with them. Only Anthropic's Claude models and Alibaba's Qwen 3.5 score meaningfully above 60% on nonsense detection. OpenAI and Google? Stuck, and not improving. Even more surprising: reasoning models that "think harder" actually perform worse; they use their extra compute to rationalize the nonsense rather than reject it.
Peter Gostev (SF: 29 Mar - 3 Apr)@petergostev

V2 has 100 questions and 70+ model variants tested (model + reasoning levels) - Anthropic and Qwen 3.5 are only models that are much above 60%.

Nate Silver@NateSilver538·
@jachiam0 I like this observation but from the outside view it feels more like the transition between the early game and the midgame.
Joshua Achiam@jachiam0·
People in the AI community have for years casually described the AGI development time period as having an "early game," a "midgame," and an "endgame"; I think the DoW-Anthropic rupture is as clear a demarcation as one could expect between midgame and endgame.
Johannes Boehm@johannesmboehm·
@HistoryGPT On the Feb 19 blog post, you write that you got best results with temperature at 0.0, not with 0.3. Could you clarify? (and thanks for all the work!)
Generative History@HistoryGPT·
On handwriting recognition, Gemini 3.1 Pro Preview is slightly worse than 3.0, which reflects the small drop in multimodal scores @GoogleDeepMind reported today. Excluding ambiguous capitalization and punctuation, it has 0.92% character / 2.15% word level error rates. (1/5)
Generative History@HistoryGPT·
@kimmonismus Labs really need to work with non-STEM fields too. If programming is essentially becoming properly bounded problem articulation in natural language, that often requires a different set of skills.
Chubby♨️@kimmonismus·
Demis Hassabis: The best modern inventions arise from the intersection of 2+ subjects. Think DeepMind: neuroscience, engineering, and machine learning. Or Isomorphic: ML, chemistry, and biology. Become an expert in multiple fields and find connections.
Generative History@HistoryGPT·
@rohanpaul_ai Interesting idea, but not sure you would have the necessary mass of pre-1911 data to train a frontier model. And no synthetic data either.
Rohan Paul@rohanpaul_ai·
Demis Hassabis’s “Einstein test” for defining AGI: Train a model on all human knowledge but cut it off at 1911, then see if it can independently discover general relativity (as Einstein did by 1915); if yes, it’s AGI.
Generative History@HistoryGPT·
I know you know this, but the Industrial Revolution also played out over several stages and hundreds of years. People had time to adapt and it *still* led to massive upheaval and political changes. What happens when you compress a similar level of change into a single generation? It might well be explosive.
Ethan Mollick@emollick·
I would add that when imagining backlash, people think of Dune’s Butlerian Jihad or the Luddites. But what those fights actually looked like during the previous Industrial Revolutions was regulation, redistribution, nationalization, unions & safety nets. Could expect similar
Nate Silver@NateSilver538

If AI produces unprecedented levels of technological disruption on time scales that are an order of magnitude or two faster than anything in human history, it's going to be an unprecedented political fight. And FWIW, the timelines potentially line up with the 2028 U.S. election.

Generative History@HistoryGPT·
I think it still depends on the use case. For systems meant to prioritize auditability and transparency, deterministic approaches might be preferable. Truly “modern” agentic systems are more difficult to audit because they often write code on the fly, so the specific methods used to complete a given task vary. It depends on whether the goal is complete AI automation (the former) or AI augmentation (the latter).
Ethan Mollick@emollick·
All those products where building an "AI agent" meant defining a series of basic prompts linked together deterministically through a flowchart with separate RAG inputs are looking pretty dated right about now (yes, that is basically every agent product released in 2025)
Generative History@HistoryGPT·
Why the difference? Not sure but it’s what we’ve seen with other models too (Claude and OpenAI) as they are optimized more for thinking based pipelines. Remember: increasing thinking budgets always decreases accuracy on handwriting recognition. (4/5)
Generative History@HistoryGPT·
For comparison, Gemini 3.0 Pro scores a modified CER of 0.69% and WER of 1.33%. Its strict scores are a CER of 1.67% and WER of 4.42%. Same API settings. (3/5)
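The gap between modified and strict scores comes from normalizing away capitalization and punctuation before scoring, since both are genuinely ambiguous in period handwriting. The benchmark's exact normalization rules aren't given here, so this is only an illustrative sketch of that kind of preprocessing:

```python
import string

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace so that
    ambiguous capitalization/punctuation choices are not scored as errors."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

reference  = "Dear Sir, I remain, your obedient Servant."
hypothesis = "Dear Sir I remain your obedient servant"

# A strict comparison counts every case and punctuation difference;
# after normalization the two transcriptions match exactly.
print(normalize(reference) == normalize(hypothesis))  # True
```

Running CER/WER on the normalized strings rather than the raw ones is what turns a strict 1.67% CER into a modified 0.69%: the remaining errors are genuine misreadings, not disagreements over commas and capitals.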