Håvard Ihle (@htihle) - Twitter Profile

Pinned Tweet

WeirdML v2 is now out! The update includes a bunch of new tasks (now 19 tasks total, up from 6) and results from all the latest models. We now also track API costs and other metadata, which give more insight into the different models. The new results are shown in these two figures. The first shows an overview of the overall results and the results on individual tasks, along with various metadata. The second shows cost vs. performance, with a clear scaling: better results at higher costs. We also have a very varied Pareto frontier, with 11 models from 6 different companies having the best accuracy for a given cost over at least some of the cost range. Grok 3, Claude Opus 4 and GPT 4.5 are the ones that underperform for their costs, while Gemini Pro and o3 pro have the best results at the highest costs. Qwen3 30B-A3B, Grok 3 mini and DeepSeek R1 also each represent a good chunk of the Pareto frontier.

Håvard Ihle@htihle

Excited to share the results from WeirdML - a benchmark testing LLMs' ability to solve weird and unusual machine learning tasks by writing working PyTorch code and iteratively learning from feedback.

9

18

167

430.1K
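The "best accuracy for a given cost" idea in the pinned tweet can be illustrated with a small sketch. This is not WeirdML's own code: the model names, costs, and accuracies below are made up for illustration, and `pareto_frontier` is a hypothetical helper. A model is on the frontier if no model that costs the same or less reaches a higher accuracy.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    model: str       # placeholder name, not a real benchmark entry
    cost: float      # cost per run in USD (illustrative value)
    accuracy: float  # fraction of tasks solved, in [0, 1] (illustrative value)


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Keep results that beat every cheaper-or-equal-cost result on accuracy."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, try the most accurate first.
    for r in sorted(results, key=lambda r: (r.cost, -r.accuracy)):
        if r.accuracy > best_accuracy:
            frontier.append(r)
            best_accuracy = r.accuracy
    return frontier


results = [
    Result("model-a", 0.01, 0.30),
    Result("model-b", 0.05, 0.45),
    Result("model-c", 0.08, 0.40),  # costs more than model-b but scores lower
    Result("model-d", 0.50, 0.60),
    Result("model-e", 2.00, 0.55),  # costs more than model-d but scores lower
]
frontier = pareto_frontier(results)
print([r.model for r in frontier])
```

With these made-up numbers, model-c and model-e fall off the frontier because a cheaper model already does better, which is the sense in which a model "underperforms for its cost".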

Håvard Ihle retweeted

Michaël Trazzi@MichaelTrazzi·16h

I organized the biggest AI Safety protest in US History! Nearly 200 people marched from Anthropic to OpenAI to xAI with one demand: commit to pausing if the others do too

43

48

333

13.2K

Håvard Ihle retweeted

Nathan Lambert@natolambert·1d

The answer (~44:40) to Noam's question on @NoPriorsPod:
@karpathy: Well, I was there for a while, right? And I did re-enter. So to some extent I agree. And I think that there are many ways to slice this question. It's a very loaded question a little bit. Um, I will say that... I feel very good about what people can contribute and their impact outside of the frontier labs, obviously. Not in the industry, but also in like more, like ecosystem-level roles. So your role, for example, is more ecosystem level. My role currently is also kind of more on an ecosystem level, and I feel very good about the impact that people can have in those kinds of roles. I think conversely there's... there are definite problems in my mind for, um, for basically aligning yourself way too much with the frontier labs too. So fundamentally I mean you're, you have a huge financial incentive to, uh, with these frontier labs. And by your own admission, the uh, the AIs are going to like really change humanity and society in very dramatic ways, and here you are basically like building that technology and benefiting from it like and being like very allied to it through financial means. Like this was a conundrum that was in, um... at the heart of, you know, how OpenAI was started in the beginning, like this was the conundrum that we were trying to solve. Um, and so you know, that... so it's kind of...
@saranormous: It's still not resolved.
Andrej Karpathy: The conundrum is still not like fully resolved. So that's number one. You're not a completely free agent and you can't actually like be part of that conversation in a fully autonomous, um, free way. Like if you're inside one of the frontier labs. Like there are certain things that you can't say, uh, and conversely there are certain things that the organization wants you to say. And you know, they're not going to twist your arm, but you feel the pressure of like what you should be saying, you know? Cause like, obviously. Otherwise it's like really awkward conversations, strange side-eyes, like what are you doing, you know? So you can't like really be an independent agent, and I feel like a bit more aligned with humanity in a certain sense outside of a frontier lab, because uh, I don't, I'm not subject to those pressures almost, right? And I can say whatever I want. So those are like some sources of misalignment I think, to some extent.
I will say that like, in one way I do agree a lot with that sentiment that, um, I do feel like the labs, for better or worse, they're opaque and a lot of work is there, and they're kind of like at the edge of capability and what's possible, and they're working on what's coming down the line. And I think if you're outside of that frontier lab, uh, your, your judgment fundamentally will start to drift, because you're not part of the, you know, what's coming down the line. And so I feel like my judgment will inevitably start to drift as well. And uh, I won't actually have an understanding of how these systems actually work under the hood. That's an opaque system. Uh, I won't have a good understanding of how it's going to develop and etc. And so I do think that in that sense I agree and it's something I'm nervous about. I think it's worth basically being in touch with what's actually happening and actually being in a frontier lab. And if some of the frontier labs would have me come for, you know, some amount of time and do really good work for them and then maybe come in and out...
Sarah Guo: Guys, he's looking for a job, this is super exciting!
Andrej Karpathy: (Laughs) Then I think that's maybe a good setup. Because I kind of feel like it kind of, um... you know, um, maybe that's like one way uh to, to actually be connected to what's actually happening but also not feel like you're necessarily fully controlled by those entities. So I think honestly in my mind like, uh, Noam can probably do extremely good work at OpenAI, but also I think his most, um, impactful work could very well be outside of OpenAI.
Sarah Guo: Noam, that's a call to be an independent researcher, if you got auto-research.
Andrej Karpathy: Yeah, there's many things to do on the outside and it's a... and I think ultimately I think the ideal solution maybe is like yeah, going back and forth, uh, or um, yeah, and I think fundamentally you can have really amazing impact in both places. So very complicated, I don't know, it's a very loaded question a little bit, but um, I mean I joined the frontier lab and I'm outside, and then maybe in the future I'll want to join again, and I think um, uh, that's kind of like how I look at it.

Noam Brown@polynoamial

@saranormous @karpathy @NoPriorsPod Why is he not at a frontier AI lab at the most pivotal time in human history since at least the industrial revolution?

9

16

251

62.7K

Håvard Ihle retweeted

Nate Soares ⏹️@So8res·1d

From @neiltyson: "that branch of AI is lethal. We gotta do something about that. Nobody should build it. And everyone needs to agree to that by treaty."

39

53

314

48.5K

Håvard Ihle retweeted

Rob Bensinger ⏹️@robbensinger·2d

x.com/i/article/2035…

3

7

127

28.2K

Håvard Ihle retweeted

Peter Gostev@petergostev·3d

LLM sceptics have predicted the last 7 of 0 walls

21

39

585

32K

Håvard Ihle@htihle·3d

GPT 5.4 mini/nano (high) score 60.3%/49.2% on WeirdML, both reasonable results, but mini uses a ton of tokens (54k output on average) and costs roughly twice as much as GPT 5 (high) for the same score. GPT 5.4 nano is at the frontier for cost/accuracy, but does not push it much. I also ran them without thinking, and they both scored 38%, comparable to GPT 4.1 mini. These models are probably best used with reasoning enabled.

1

0

18

552

Håvard Ihle retweeted

Rob Bensinger ⏹️@robbensinger·3d

Modern ML being much more Yudkowskian than Hansonian also matches up with what you actually see from Yudkowsky and Hanson. Hanson has been downplaying LLMs for as long as LLMs have existed. He regularly predicts that AI will hit a wall, and that we have 80+ years left before AGI. He's the Gary Marcus of the rationalist community, surprised by each new SotA step. Meanwhile, Yudkowsky afaik mostly completed his update to "deep learning will plausibly go all the way" ten years ago, in the wake of AlphaGo.

4

5

73

2K

Håvard Ihle retweeted

Rob Bensinger ⏹️@robbensinger·4d

Some examples of Chinese belligerence on AI risk, making it clear that there's no point in the USG broaching talks with the CCP about a coordinated halt:
Zhang Jun, Chinese UN ambassador: "The potential impacts of AI may exceed human cognitive boundaries. To ensure that this technology always benefits humanity, it is necessary to take people-oriented and AI for good as the basic principles to regulate the development of AI and to prevent this technology from turning into a runaway wild horse. [...] The international community needs to [...] ensure that risks beyond human control do not occur[....] We need to strengthen the detection and evaluation of the entire life cycle of AI, ensuring that mankind has the ability to press the stop button at critical moments."
Chinese Premier Li Qiang: "We should strengthen coordination to form a global AI governance framework that has broad consensus as soon as possible."
The Economist: "More clues to Mr Xi’s thinking come from the study guide prepared for party cadres, which he is said to have personally edited. China should 'abandon uninhibited growth that comes at the cost of sacrificing safety', says the guide. Since AI will determine 'the fate of all mankind', it must always be controllable, it goes on."
Xiao Qian, Deputy Director of Tsinghua University's Center for International Security and Strategy: "Just as US-Soviet nuclear arms control has mattered for world stability since the 1970s, ensuring humanity's effective control over these rapidly evolving AI systems will depend on the degree of US-China cooperation in AI; this concerns the very foundation of tomorrow's world's survival."
Chinese Vice Premier Ding Xuexiang: "If we allow this reckless competition among countries to continue, then we will see a ‘gray rhino’ [...] We stand ready, under the framework of the United Nations and its core, to actively participate in including all the relevant international organizations and all countries to discuss the formulation of robust rules to ensure that AI technology will become an 'Ali Baba’s treasure cave' instead of a 'Pandora’s Box.'"

5

27

144

5.7K

Håvard Ihle@htihle·4d

@teortaxesTex If it matches gemini-3.1-pro on WeirdML, I will be extremely surprised. I will run it when we get an independent provider on openrouter. I’m assuming it’s based on the same base model as before and that the weights are open.

0

1

139

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·4d

If that is true, we should see a big uptick in WeirdML but somehow I don't believe it @htihle you?

AiBattle@AiBattle_

Building a self-evolving intelligent agent model - MiniMax M2.7
"M2.7 is our first model which deeply participated in its own evolution"
- We believe that future AI self-evolution will gradually transition towards full autonomy, coordinating data construction, model training, inference architecture, evaluation, and other stages without human involvement
- To this end, we conducted preliminary exploratory tests in low-resource scenarios. We had M2.7 participate in 22 machine learning competitions at the MLE Bench Lite level open-sourced by OpenAI
- These competitions can be run on a single A30 GPU, yet they cover virtually all stages of machine learning research.
- We designed and implemented a simple harness to guide the agent in autonomous optimization. The core modules include three components: short-term memory, self-feedback, and self-optimization
- Specifically, after each iteration round, the agent generates a short-term memory markdown file and simultaneously performs self-criticism on the current round's results, thereby providing potential optimization directions for the next round
- The next round then conducts further self-optimization based on the memory and self-feedback chain from all previous rounds
- We ran a total of three trials, each with 24 hours for iterative evolution. From the figure below, one can see that the ML models trained by M2.7 continuously achieved higher performance over time
- In the end, the best run achieved 9 gold medals, 5 silver medals, and 1 bronze medal. The average medal rate across the three runs was 66.6%, a result second only to Opus-4.6 (75.7%) and GPT-5.4 (71.2%), tying with Gemini-3.1 (66.6%)

4

0

36

4K
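The harness described in the quoted MiniMax post (short-term memory, self-feedback, self-optimization) can be sketched roughly as a loop. This is only an illustration under stated assumptions, not MiniMax's implementation: `Harness`, `Round`, and the three callables are hypothetical names, and the toy callables stand in for real model calls and training runs.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Round:
    score: float
    memory: str    # short-term memory note written after the round
    critique: str  # self-criticism of the round's result


@dataclass
class Harness:
    # All three callables are hypothetical stand-ins for agent calls.
    attempt: Callable[[List[Round]], float]  # one round, conditioned on the full history
    summarize: Callable[[float], str]        # writes the memory note
    criticize: Callable[[float], str]        # writes the self-feedback
    history: List[Round] = field(default_factory=list)

    def run(self, n_rounds: int) -> float:
        """Run n_rounds (n_rounds >= 1) and return the best score seen."""
        for _ in range(n_rounds):
            # Each round sees the memory and critique chain of all earlier rounds.
            score = self.attempt(self.history)
            self.history.append(
                Round(score, self.summarize(score), self.criticize(score))
            )
        return max(r.score for r in self.history)


# Toy stand-ins: the score simply grows with the number of earlier rounds.
harness = Harness(
    attempt=lambda history: min(1.0, 0.25 * (len(history) + 1)),
    summarize=lambda score: f"score was {score}",
    criticize=lambda score: "try a larger model" if score < 1.0 else "done",
)
best = harness.run(3)
print(best, len(harness.history))
```

The design point the sketch tries to capture is that every round receives the whole history, not just the previous round, matching the post's "memory and self-feedback chain from all previous rounds".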

Håvard Ihle retweeted

Nate Soares ⏹️@So8res·5d

Neil deGrasse Tyson ended tonight's debate with an impassioned plea for an international treaty to ban creating the sort of superintelligent AI that could kill us all.

39

51

535

29.4K

Håvard Ihle retweeted

Jeffrey Ladish@JeffLadish·5d

Please consider donating to Palisade! We have 900k of SFF matching that runs out in 14 days. We are quite funding constrained and donations now will both help free up my time and help us expand our comms team.

2

23

173

28.9K

Håvard Ihle retweeted

Maria Curi@m_ccuri·6d

Hundreds of companies across the entire tech sector are now supporting Anthropic in its lawsuit against the Pentagon. Blacklisting a company in such a way would make procurement "contingent on political favor" rather than the rule of law, industry groups argue.

9

106

678

99.7K

Håvard Ihle@htihle·16 Mar

@teortaxesTex Based on this, it would not surprise me if Nvidia creates the leading Western open models going forward.

0

4

158

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·16 Mar

Nvidia did a solid job with Nemotron. It's clear they're not at the frontier yet as far as post-training data is concerned. I guess they're a bit behind latest Qwens (I'd say Qwen-3.5-27B should be around 40%), StepFun, MiMo – not tested. But, this is Chinese open frontier <200B.

Håvard Ihle@htihle

Nemotron 3 Super scores 38.0% on WeirdML, a solid score, and ahead of (the original) qwen3-235b (thinking). I ran it locally through Ollama with q4_K_M quantization, so full precision might do even better. The price assumes $0.1/$0.5 per million tokens.

8

0

33

4.6K

Håvard Ihle@htihle·16 Mar

@teortaxesTex Here it is! x.com/htihle/status/…

0

24

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·11 Mar

Nemotron Super (120B, 10B active, ≈"3B" class speed) destroys all Qwens, gpt-oss, matches Kimi K2.5, exceeds Step 3.5 Flash and V3.2, and to my knowledge is only beaten by two open models (309B MiMo and Speciale) on the most interesting benchmark today. @htihle pls test

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Well, seems we're not getting DeepSeek V4 today but we're getting what amounts to its lite version runnable on normal hardware. New architecture, fast, 1M context… …and it's a bit weaker than the equivalent Qwen 3.5.

10

11

194

21.4K

Håvard Ihle@htihle·16 Mar

Nemotron 3 Super scores 38.0% on WeirdML, a solid score, and ahead of (the original) qwen3-235b (thinking). I ran it locally through Ollama with q4_K_M quantization, so full precision might do even better. The price assumes $0.1/$0.5 per million tokens.

4

3

28

6.9K
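For a locally run model, the reported price has to come from an assumed rate. A minimal sketch of that arithmetic follows, reading the tweet's 0.1/0.5 as USD per million input and output tokens respectively (that input/output reading is an assumption). The token counts are invented for illustration and are not WeirdML measurements, and `run_cost_usd` is a hypothetical helper.

```python
def run_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float = 0.1,   # USD per million input tokens (assumed reading)
    output_price_per_m: float = 0.5,  # USD per million output tokens (assumed reading)
) -> float:
    """Cost of one run under per-million-token pricing."""
    return (
        input_tokens / 1_000_000 * input_price_per_m
        + output_tokens / 1_000_000 * output_price_per_m
    )


# Illustrative token counts only.
cost = run_cost_usd(input_tokens=10_000, output_tokens=20_000)
print(f"${cost:.4f}")
```

Because output tokens are priced five times higher than input tokens under this reading, a model that writes long responses can cost noticeably more per run than its headline rate suggests.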

Håvard Ihle retweeted

Dean W. Ball@deanwball·15 Mar

If you think Anthropic models constitute a foreign-adversary level threat to national security because of their potential to “think for themselves” based on inscrutable reasoning and potentially override the military, I have bad news for you: This threat model applies to all AI systems. There is nothing unique about Claude or Anthropic in that regard. If you think this risk constitutes an imminent national security threat that must be stopped at once, you will find you have more in common with Eliezer Yudkowsky (and indeed, with the founders of Anthropic) than you might believe, especially if you fancy yourself an “accelerationist.” I suspect when some so-called accelerationists understand what is happening, they will experience an urge to run crying toward “ban it all!” But that is neither feasible nor wise, in my view. The alternative is to not set one’s hair on fire but instead to treat these as important and serious but ultimately solvable problems. This requires that you not dismiss the notions of “safety” and “alignment” but instead see them as essential parts of achieving an accelerated future. So for instance, you might fund research into AI alignment, control, and interpretability specifically within the military, along with, say, about 90 other actionable things. What a novel idea!

Robert J Salvador@RobertJSalvador

@deanwball My background provides the point. These off-base AI opinions are making Heritage look bad, so it’s better I just say no than call out how you clearly don’t understand the differences in model pre-training and post-training that have the DoW thinking differently about Claude.

17

22

281

24.5K

Håvard Ihle retweeted

Dean W. Ball@deanwball·13 Mar

A hypothetical:
1. In the 2028 election, a Democrat has won. Say that it is Kamala Harris.
2. Using frontier AI systems contracted by the Department of Homeland Security, President Harris orders the creation of a new program for AI to monitor social media and notify the social media platform about posts spreading “misinformation” that “harms homeland and national security by spreading dangerous falsehoods.”
3. Many Republicans see this “misinformation” as core policy positions of their political party.
4. The AI-generated monitoring and notification system described in (2) is designed to conform to the pattern of jawboning exhibited by the Biden Administration in Murthy v. Missouri, where the Supreme Court ruled that people whose social media posts were taken down due to government pressure have no standing to sue.
5. The social media platforms create AI agents that receive the government’s AI generated requests and make decisions in seconds about whether to take down posts, deboost them, deplatform the user, etc.
6. According to very recent Supreme Court precedents, everything I have described falls into “lawful use” of an AI system by all parties involved. A person whose speech was deleted by a social media platform at the request of government does not have standing to sue the government, so long as the government did not threaten policy retaliation against the social media company. And a social media company’s content moderation policies are protected expression. Thus a person whose speech rights were harmed in this context currently has no legal recourse.
7. This is “America’s national security agencies using AI within the bounds of all lawful use.” It is also a wholly automated censorship regime.
This is barely a hypothetical. Much of it already happened *under the Biden admin.* The only difference is the use of AI. In the world where this happens, I’d be curious to know whether thoughtful people like @Indian_Bronson would object. If xAI were one of the companies used by the government for the social media monitoring, would you encourage the company to cancel their business with the government? Or would you say they have an obligation to provide their services to the national security apparatus of USG for all lawful use?
If you would encourage xAI to cancel their contract with the government, on what principle (not qualitative judgment, but universal and timeless principle!) would you distinguish between the DoW’s current insistence on “all lawful use regardless of a private party’s qualms” and xAI’s hypothetical future insistence on “all lawful use regardless of a private party’s qualms”?

33

56

642

62.2K

Håvard Ihle@htihle·13 Mar

@scaling01 Much more importantly, now the default in claude-code!

0

6

1.5K

Lisan al Gaib@scaling01·13 Mar

Anthropic no longer charges extra for longer context windows

Claude@claudeai

1 million context window: Now generally available for Claude Opus 4.6 and Claude Sonnet 4.6.

33

55

1.7K

220.2K

Håvard Ihle retweeted

Nathan 🔎@NathanpmYoung·12 Mar

Anthropic Effortpost! What is going to happen to Anthropic by May 1st? 43% that designations will still be in force. 91% Anthropic will still be de facto banned from federal agencies.

1

5

60

6.7K

Håvard Ihle@htihle·13 Mar

Grok 4.20 beta scores 52.3% on WeirdML, a solid advancement over Grok 4 at 45.7%, but well behind the leading models. The model is very fast and writes simple and compact code.

3

2

35

5K
