Hafsteinn

1.3K posts


@hafsteinn

Associate Professor in CS at the University of Iceland, research scientist at deCODE (views here are my own) 🇮🇸🏳️‍🌈, he/him.

Iceland · Joined March 2008
1.8K Following · 640 Followers
Akari Asai@AkariAsai·
Thrilled to share: OpenScholar - our work on scientific deep research agents for reliable literature synthesis - has been accepted to Nature! 🎉 Huge thanks to collaborators across institutions who made this possible!
33 replies · 227 reposts · 1.3K likes · 126.1K views
Lilian Weng@lilianweng·
I’ve been telling people this a lot today: I really enjoy working with people who care about what they are building and about craftsmanship. It is a privilege to have a chance to work on something I’m passionate about, beyond making a living. I cherish it and don’t take it for granted.
66 replies · 66 reposts · 1.6K likes · 178.1K views
Harjass Gambhir@harjassgambhir·
@hafsteinn 😂 didn't expect Icelandic to be even more apt, ig you could call my timeline since yesterday Sama Sori
1 reply · 0 reposts · 1 like · 15 views
Hafsteinn@hafsteinn·
@harjassgambhir @teortaxesTex The technical report states that they paraphrased the pretraining corpus multiple times. The aim was to learn the information it contains without memorizing it verbatim.
1 reply · 0 reposts · 2 likes · 35 views
Harjass Gambhir@harjassgambhir·
@teortaxesTex is Kimi's pretraining data different, or are they doing post-training with high taste?
4 replies · 0 reposts · 31 likes · 6K views
Crémieux@cremieuxrecueil·
Should I get a gene therapy done to become hyper-muscular with zero effort?
158 replies · 10 reposts · 260 likes · 173.9K views
Hafsteinn@hafsteinn·
Pretty interesting to see this result, and it’s not very surprising given that Kimi-K2 was trained on several rephrased versions of the pretraining data. That approach likely lets it memorize facts better without memorizing the pretraining data verbatim. But I also wonder about the importance of these kinds of benchmarks when you could instead prompt the models to look up the facts online before doing the required work. How would the performance change then?
Florian Brand@xeophon

After thinking about this problem for months, I am so happy to finally introduce DetailBench! It answers a simple question: How good are current LLMs at finding small errors, when they are *not* explicitly asked to do so? (Yes, the graph is right!)

0 replies · 0 reposts · 1 like · 116 views
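Two of the tweets above mention training on "several rephrased versions of the pretraining data." A minimal sketch of that augmentation idea follows; the `paraphrase` function here is a hypothetical stand-in (a real pipeline would call an LLM paraphraser), so only the corpus-mixing logic is illustrative:

```python
import random

def paraphrase(text: str, seed: int) -> str:
    # Hypothetical stand-in for an LLM paraphraser: deterministically
    # shuffles sentence order so the example runs offline.
    sentences = [s for s in text.split(". ") if s]
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return ". ".join(sentences)

def rephrase_corpus(corpus: list[str], n_variants: int = 3) -> list[str]:
    # Keep each original document and add n_variants paraphrases of it,
    # so a model sees the same facts in several surface forms without
    # seeing any single wording many times.
    augmented = []
    for doc in corpus:
        augmented.append(doc)
        for k in range(n_variants):
            augmented.append(paraphrase(doc, seed=k))
    return augmented

corpus = [
    "Reykjavik is the capital of Iceland. "
    "It is the northernmost capital of a sovereign state."
]
augmented = rephrase_corpus(corpus, n_variants=3)
print(len(augmented))  # 1 original + 3 paraphrases = 4
```

The intent, as the tweets describe it, is that repeated exposure to rephrasings strengthens factual recall while weakening verbatim memorization of any one surface form.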
Hafsteinn@hafsteinn·
The pendulum of tool design always swings too far. First we automate the task. Then we automate the thinking. Then we automate the judgment of when to think. The best tools preserve human agency at the meta level - not "do this for me" but "match my intensity." A hammer doesn't decide how hard to strike.
Andrej Karpathy@karpathy

I'm noticing that due to (I think?) a lot of benchmarkmaxxing on long horizon tasks, LLMs are becoming a little too agentic by default, a little beyond my average use case. For example in coding, the models now tend to reason for a fairly long time, they have an inclination to start listing and grepping files all across the entire repo, they do repeated web searches, they over-analyze and over-think little rare edge cases even in code that is knowingly incomplete and under active development, and often come back ~minutes later even for simple queries. This might make sense for long-running tasks but it's less of a good fit for more "in the loop" iterated development that I still do a lot of, or if I'm just looking for a quick spot check before running a script, just in case I got some indexing wrong or made some dumb error. So I find myself quite often stopping the LLMs with variations of "Stop, you're way overthinking this. Look at only this single file. Do not use any tools. Do not over-engineer", etc. Basically as the default starts to slowly creep into the "ultrathink" super agentic mode, I feel a need for the reverse, and more generally good ways to indicate or communicate intent / stakes, from "just have a quick look" all the way to "go off for 30 minutes, come back when absolutely certain".

0 replies · 0 reposts · 1 like · 103 views
Hafsteinn@hafsteinn·
@jxmnop Doesn’t it just raise the expectations of what you could do in the span of a hackathon then 💁🏼‍♂️
1 reply · 0 reposts · 64 likes · 2.3K views
dr. jack morris@jxmnop·
i haven't heard it discussed yet but AI basically killed hackathons. pretty much anything you could possibly make at a hackathon in 2019 can be built better and faster by AI in 2025
217 replies · 75 reposts · 2.7K likes · 206.5K views
Pedro Duarte@peduarte·
is opus significantly better than sonnet?
157 replies · 4 reposts · 409 likes · 88.1K views
Hafsteinn@hafsteinn·
What makes O3 different? While impressive, it's not perfect. O3 fails on 40×40 mazes in both languages. This might suggest OpenAI is incorporating training approaches related to embodied intelligence or spatial reasoning tasks, though the exact methods remain unclear.
1 reply · 0 reposts · 0 likes · 49 views
Hafsteinn@hafsteinn·
🧵 Can LLMs escape a maze? We tested 8 models on simple navigation tasks. Perhaps not so surprisingly, performance varies dramatically based on the language of instructions.
1 reply · 0 reposts · 1 like · 86 views
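The maze thread above tests whether models can navigate simple grids from textual instructions. A minimal verification harness for that kind of eval might look like the following; the maze encoding ('#' wall, '.' open, 'S' start, 'E' exit) and the move vocabulary are assumptions for illustration, since the thread's actual task format is not shown in the feed:

```python
# Check whether a model-proposed move sequence escapes a grid maze.
# Encoding (assumed): '#' wall, '.' open cell, 'S' start, 'E' exit.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def escapes(maze: list[str], moves: list[str]) -> bool:
    # Locate the start cell.
    r, c = next((i, j) for i, row in enumerate(maze)
                for j, ch in enumerate(row) if ch == "S")
    for m in moves:
        dr, dc = MOVES[m]
        r, c = r + dr, c + dc
        if not (0 <= r < len(maze) and 0 <= c < len(maze[0])):
            return False          # walked off the grid
        if maze[r][c] == "#":
            return False          # hit a wall
        if maze[r][c] == "E":
            return True           # reached the exit
    return False                  # ran out of moves before the exit

maze = ["S.#",
        "#.#",
        "#.E"]
print(escapes(maze, ["right", "down", "down", "right"]))  # True
```

Grading maze runs this way requires no model-specific logic, which makes it easy to compare many models (and many instruction languages) on identical grids.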