Oren Sultan

683 posts

@oren_sultan

AI Research @Meta, @AIatMeta (FAIR), CS PhD Candidate @HebrewU, @HyadataLab | Past: @Lightricks @TU_Muenchen @UniMelb

Tel Aviv, Israel · Joined August 2021
813 Following · 1.1K Followers
Oren Sultan reposted
Asaf Yehudai @AsafYehudai
What’s next in AI? Is AGI around the corner? We don’t know. But we believe general-purpose agents will play an important role. What are general-purpose agents? And why do we believe in them? Let’s dive in 🧵👇
Oren Sultan @oren_sultan
Tel Aviv Marathon 2026 – 10K 🏃‍♂️✅ First race. Great experience.
Oren Sultan reposted
Eliahu Horwitz @EliahuHorwitz
Cool paper by Tzachor et al. asking: what if your multimodal LLM already has better video embeddings than trained video models? VidVec finds strong video–text representations in intermediate layers, and with text-only "in-context" optimization achieves SoTA across MSR-VTT/MSVD/VATEX/DiDeMo.
Oren Sultan reposted
Peter O'Hearn @PeterOHearn12
(1/5) In our recent work on the halting problem, we found LLMs to be competitive with symbolic tools on symbolic reasoning's home turf (SV-COMP), in contrast to SAT or STRIPS planning, say. This illustrates issues I've been meaning to discuss about symbols, neurons, and static analysis.

I think of it like this. Symbolic reasoning is like a tower: give it the exact formal problem (SAT, STRIPS planning) and it can be superhuman, but step off the roof and capability drops to zero. This can also happen for neural systems: AlphaGo is a towering neural achievement, but the system itself is only for Go.
Oren Sultan reposted
Peter O'Hearn @PeterOHearn12
LLMs vs. the Halting Problem (why, what, and where it's going). We recently released a paper on this; link to follow. A few comments here for context.

Why? With the excitement around LLM "reasoning", we thought: why not try LLMs on the first-ever code-reasoning task, the halting problem. Turing's proof of undecidability established fundamental limits. Fun bit: no matter how "superintelligent" AI becomes, this is a problem it can never perfectly solve.

Where to get data to measure? SV-COMP. Verification researchers have, through insight and hard work, curated several thousand example C programs, and they run dedicated tools over this dataset in an annual competition. This is, in a sense, the home turf of symbolic methods. We didn't know how LLMs would do, and in particular we were aware of results from @rao2z, @RishiHazra95, and others showing that LLMs trail symbolic tools on "easier" decidable problems (SAT, propositional planning).

The surprise: LLMs are competitive on halting, where they often trail on "easier" problems. Why? Hypothesis: LLMs are heuristic approximators, and under undecidability, heuristic approximation isn't just a workaround; it's often the only way forward.

Broader context: Penrose claimed undecidability proved AI is impossible (but didn't show humans can solve the undecidable). Turning the tables: undecidability is an ideal target for heuristic LLMs. Instead of using "already crushed" logic problems to show LLM limits, let's look at uncrushed problems where LLMs might actually help.
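The "heuristic approximation is the only way forward" point can be sketched with the crudest possible termination heuristic: run the program under a step budget and answer only "Terminates" or "Unknown". This is my own toy illustration, not the method from the paper or anything SV-COMP tools actually do; it only shows why any practical answer to an undecidable question must be an approximation.

```python
# Toy heuristic "halting oracle": iterate a transition function under a
# step budget. It can soundly confirm termination, but can never soundly
# answer "non-terminating" -- it must abstain with "UNKNOWN".

def runs_within_budget(step_fn, state, budget=10_000):
    """Iterate step_fn until it signals termination (returns None)
    or the step budget runs out."""
    for _ in range(budget):
        state = step_fn(state)
        if state is None:
            return "TERMINATES"
    return "UNKNOWN"  # out of fuel: could be slow, could loop forever

# Example programs expressed as single-step transition functions.
def countdown(n):               # terminates for every n
    return n - 1 if n > 0 else None

def collatz(n):                 # termination open in general
    if n == 1:
        return None
    return n // 2 if n % 2 == 0 else 3 * n + 1

print(runs_within_budget(countdown, 50))   # TERMINATES
print(runs_within_budget(collatz, 27))     # TERMINATES (within budget)
print(runs_within_budget(lambda s: s, 0))  # UNKNOWN (an infinite loop)
```

Real verifiers and the LLM approach discussed in the thread are far more sophisticated (ranking functions, non-termination witnesses), but they face the same fundamental trade-off: some inputs must end in "Unknown".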
Oren Sultan @oren_sultan
We call this program termination prediction, and it's not our coinage: it's a category in the International Competition on Software Verification (SV-COMP) 2025. Historically, only dedicated verification systems have competed; our work is the first to bring LLMs into this setting. I hope that's clearer now. Please read the thread.
Oren Sultan @oren_sultan
Of course 🙂 Undecidability is not in question. What we really care about is how well termination can be approximated in practice. Verification systems have tackled this for decades, and it’s interesting to see that reasoning-enabled LLMs are now surprisingly competitive in SV-COMP 2025, though with clear limitations.
Chaked @The_SLM_Guy
Fun fact: the first thing I ever tried with LLMs was a variant of the halting problem. My thesis was about program equivalence, where I researched how to create automatic regression-verification proofs for unbalanced recursions. Think of two variants of Fibonacci that advance at different paces. Since the programs might not terminate, this problem is undecidable. ChatGPT (3.5 back then!) was VERY bad at it; it couldn't determine even the basic examples. Since then, I'd been under the impression that LLMs are cool but irrelevant to the real, hardcore CS problems. Curious to see where this research takes you!
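The "two Fibonacci variants that advance at different paces" example can be made concrete. Below is my own sketch (not from the thesis): one variant peels off one index per call, the other is unrolled via the identity fib(n) = 2·fib(n−2) + fib(n−3), so their recursion trees don't align step-for-step. Proving them equal for all n is the unbalanced regression-verification problem; here we can only check a finite prefix.

```python
# Two semantically equivalent Fibonacci programs whose recursions
# advance at different paces -- the unbalanced-recursion setting.

def fib_one_step(n):
    """Classic recursion: each call reduces the index by 1 or 2."""
    if n < 2:
        return n
    return fib_one_step(n - 1) + fib_one_step(n - 2)

def fib_two_step(n):
    """Unrolled variant: each call reduces the index by 2 or 3,
    using fib(n) = 2*fib(n-2) + fib(n-3) for n >= 3."""
    if n < 2:
        return n
    if n == 2:
        return 1
    return 2 * fib_two_step(n - 2) + fib_two_step(n - 3)

# Testing agrees on a finite prefix, but a *proof* for all n needs a
# coupling invariant relating calls at different recursion depths.
print([fib_two_step(n) for n in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

The difficulty the tweet alludes to is exactly that mismatch: an equivalence prover can't pair up calls one-to-one and must invent how many steps of one program correspond to one step of the other.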
Oren Sultan @oren_sultan
@BeamFlamengo @SSanesti @_vatsadev @redtachyon As you can see, the F1 score on the non-termination (NT) class is much lower. It is also much lower under our test-time scaling (TTS), because we require consensus on the label across 10 randomly sampled generations; if there is no consensus, we predict "Unknown", which counts as an error in the F1 score.
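The consensus-voting scheme described in the tweet can be sketched in a few lines. The exact agreement rule (unanimous vs. a looser threshold) is my assumption, and the function name is illustrative, not from the paper:

```python
# Sketch of test-time scaling by consensus voting: sample N labels,
# keep one only if agreement meets the threshold, else abstain.
from collections import Counter

def consensus_label(samples, required_agreement=1.0):
    """Return the most common label if its share of the samples meets
    the agreement threshold (1.0 = unanimous), else 'Unknown'."""
    label, count = Counter(samples).most_common(1)[0]
    if count / len(samples) >= required_agreement:
        return label
    return "Unknown"

print(consensus_label(["T"] * 10))               # T
print(consensus_label(["T"] * 9 + ["NT"]))       # Unknown (no unanimity)
print(consensus_label(["T"] * 9 + ["NT"], 0.8))  # T (relaxed threshold)
```

This makes the F1 penalty in the tweet concrete: abstaining with "Unknown" trades recall for precision, so a strict threshold can lower the NT class's F1 even as it avoids confident mistakes.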
Oren Sultan @oren_sultan
@BeamFlamengo @SSanesti @_vatsadev @redtachyon That’s true, and that’s exactly why we use F1 rather than accuracy 🙂 BTW, in the competition dataset ~66% of programs terminate, but in real-world code, the skew toward non-termination is much stronger.
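Why F1 rather than accuracy matters on a ~66%/34% split can be shown with made-up numbers matching the stated skew (this is my illustration, not data from the paper): a degenerate classifier that always predicts "Terminates" looks passable on accuracy but scores zero F1 on the minority class.

```python
# Accuracy vs. per-class F1 under the ~66% "Terminates" skew.
def f1(tp, fp, fn):
    """F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# 100 programs: 66 terminate (T), 34 don't (NT); predict T always.
accuracy = 66 / 100            # 0.66: looks passable
f1_nt = f1(tp=0, fp=0, fn=34)  # 0.0: the NT class is never found
print(accuracy, f1_nt)
```

Per-class F1 exposes exactly the failure mode the thread discusses: weak performance on the non-termination class that overall accuracy would hide.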
Oren Sultan reposted
Paul Snively @JustDeezGuy
One important difference: we don’t have evidence humans CAN’T solve the halting problem, but we DO have evidence no computational approach to date can. To be clear, this is exactly why asking the question of LLMs is interesting.
[Quoting Peter O'Hearn @PeterOHearn12's "LLMs vs the Halting Problem" thread, above.]
Oren Sultan @oren_sultan
@redtachyon In conclusion, in a setting where all methods must rely on approximation, LLMs prove to be unexpectedly effective approximators.
Oren Sultan @oren_sultan
@redtachyon They often fail to produce valid witnesses (e.g., automaton proofs for non-termination). Performance also degrades on longer and more complex programs, highlighting the limits of current reasoning. But overall they remain very competitive with SOTA verification systems.
Oren Sultan @oren_sultan
In a domain where everyone is forced to approximate, LLMs turn out to be surprisingly strong approximators. Undecidability isn’t a dead end; it’s where heuristics stop being a hack and become the only game in town!
[Quoting Peter O'Hearn @PeterOHearn12's "LLMs vs the Halting Problem" thread, above.]