Oren Sultan

683 posts

@oren_sultan

AI Research @Meta, @AIatMeta (FAIR), CS PhD Candidate @HebrewU, @HyadataLab | Past: @Lightricks @TU_Muenchen @UniMelb

Tel Aviv, Israel · Joined August 2021
813 Following · 1.1K Followers
Oren Sultan reposted
Asaf Yehudai @AsafYehudai
What’s next in AI? Is AGI around the corner? We don’t know. But we believe general-purpose agents will play an important role. What are general-purpose agents? And why do we believe in them? Let’s dive in 🧵👇
Oren Sultan @oren_sultan
Tel Aviv Marathon 2026 – 10K 🏃‍♂️✅ First race. Great experience.
Oren Sultan reposted
Eliahu Horwitz @EliahuHorwitz
Cool paper by Tzachor et al. asking: what if your multimodal LLM already has better video embeddings than trained video models? VidVec finds strong video–text representations in intermediate layers, and with text-only "in-context" optimization achieves SoTA across MSR-VTT/MSVD/VATEX/DiDeMo.
Oren Sultan reposted
Peter O'Hearn @PeterOHearn12
(1/5) In our recent work on the halting problem, we found LLMs to be competitive with symbolic tools on symbolic reasoning's home turf (SV-COMP), in contrast to SAT or STRIPS planning, say. This illustrates issues I've been meaning to discuss about symbols, neurons, and static analysis.

I think of it like this. Symbolic reasoning is like a tower: give it the exact formal problem (SAT, STRIPS planning) and it can be superhuman, but step off the roof and capability drops to zero. This can also happen for neural systems: AlphaGo is a towering neural achievement, but the system itself is only for Go.
Oren Sultan reposted
Peter O'Hearn @PeterOHearn12
LLMs vs. the Halting Problem (why, what, and where it's going). We recently released a paper on this; link to follow. A few comments here for context.

Why? With the excitement around LLM "reasoning", we thought: why not try LLMs on the first-ever code-reasoning task, the halting problem. Turing's proof of undecidability established fundamental limits. Fun bit: no matter how "superintelligent" AI becomes, this is a problem it can never perfectly solve.

Where to get data to measure? SV-COMP. Verification researchers have, through insight and hard work, curated several thousand example C programs, and they run dedicated tools over this dataset in an annual competition. This is, in a sense, the home turf of symbolic methods. We didn't know how LLMs would do, and in particular we were aware of results from @rao2z, @RishiHazra95, and others showing that LLMs trail symbolic tools on "easier" decidable problems (SAT, propositional planning).

The surprise: LLMs are competitive on halting, where they often trail on "easier" problems. Why? Hypothesis: LLMs are heuristic approximators, and under undecidability, heuristic approximation isn't just a workaround; it's often the only way forward.

Broader context: Penrose claimed undecidability proved AI is impossible (but didn't show humans can solve the undecidable). Turning the tables: undecidability is an ideal target for heuristic LLMs. Instead of using "already crushed" logic problems to show LLM limits, let's look at uncrushed problems where LLMs might actually help.
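The "heuristic approximation is the only way forward" point can be sketched with the crudest possible termination heuristic: run the program under a step budget and answer only "Terminates" or "Unknown". This is my own toy illustration, not the method from the paper or anything SV-COMP tools actually do; it only shows why any practical answer to an undecidable question must be an approximation.

```python
# Toy heuristic "halting oracle": iterate a transition function under a
# step budget. It can soundly confirm termination, but can never soundly
# answer "non-terminating" -- it must abstain with "UNKNOWN".

def runs_within_budget(step_fn, state, budget=10_000):
    """Iterate step_fn until it signals termination (returns None)
    or the step budget runs out."""
    for _ in range(budget):
        state = step_fn(state)
        if state is None:
            return "TERMINATES"
    return "UNKNOWN"  # out of fuel: could be slow, could loop forever

# Example programs expressed as single-step transition functions.
def countdown(n):               # terminates for every n
    return n - 1 if n > 0 else None

def collatz(n):                 # termination open in general
    if n == 1:
        return None
    return n // 2 if n % 2 == 0 else 3 * n + 1

print(runs_within_budget(countdown, 50))   # TERMINATES
print(runs_within_budget(collatz, 27))     # TERMINATES (within budget)
print(runs_within_budget(lambda s: s, 0))  # UNKNOWN (an infinite loop)
```

Real verifiers and the LLM approach discussed in the thread are far more sophisticated (ranking functions, non-termination witnesses), but they face the same fundamental trade-off: some inputs must end in "Unknown".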
Oren Sultan @oren_sultan
We call this program termination prediction, and it's not our coinage: it's a category in the International Competition on Software Verification (SV-COMP) 2025. Historically, only dedicated verification systems have competed; our work is the first to bring LLMs into this setting. I hope that's clearer now. Please read the thread.
Oren Sultan @oren_sultan
Of course 🙂 Undecidability is not in question. What we really care about is how well termination can be approximated in practice. Verification systems have tackled this for decades, and it’s interesting to see that reasoning-enabled LLMs are now surprisingly competitive in SV-COMP 2025, though with clear limitations.
Chaked @The_SLM_Guy
Fun fact: the first thing I ever tried with LLMs was a variant of the halting problem. My thesis was about program equivalence, where I researched how to create automatic regression-verification proofs for unbalanced recursions. Think of two variants of Fibonacci that advance at different paces. Since the programs might not terminate, this problem is undecidable. ChatGPT (3.5 back then!) was VERY bad at it; it couldn't determine even the basic examples. Since then, I'd been under the impression that LLMs are cool but irrelevant to the real, hardcore CS problems. Curious to see where this research takes you!
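The "two Fibonacci variants that advance at different paces" example can be made concrete. Below is my own sketch (not from the thesis): one variant peels off one index per call, the other is unrolled via the identity fib(n) = 2·fib(n−2) + fib(n−3), so their recursion trees don't align step-for-step. Proving them equal for all n is the unbalanced regression-verification problem; here we can only check a finite prefix.

```python
# Two semantically equivalent Fibonacci programs whose recursions
# advance at different paces -- the unbalanced-recursion setting.

def fib_one_step(n):
    """Classic recursion: each call reduces the index by 1 or 2."""
    if n < 2:
        return n
    return fib_one_step(n - 1) + fib_one_step(n - 2)

def fib_two_step(n):
    """Unrolled variant: each call reduces the index by 2 or 3,
    using fib(n) = 2*fib(n-2) + fib(n-3) for n >= 3."""
    if n < 2:
        return n
    if n == 2:
        return 1
    return 2 * fib_two_step(n - 2) + fib_two_step(n - 3)

# Testing agrees on a finite prefix, but a *proof* for all n needs a
# coupling invariant relating calls at different recursion depths.
print([fib_two_step(n) for n in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

The difficulty the tweet alludes to is exactly that mismatch: an equivalence prover can't pair up calls one-to-one and must invent how many steps of one program correspond to one step of the other.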
Oren Sultan @oren_sultan
@BeamFlamengo @SSanesti @_vatsadev @redtachyon As you can see, the F1 score on the non-termination (NT) class is much lower. It is also much lower under our test-time scaling (TTS), because we require consensus on the label across 10 randomly sampled generations; if there is no consensus, we predict "Unknown", which counts as an error in the F1 score.
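The consensus-voting scheme described in the tweet can be sketched in a few lines. The exact agreement rule (unanimous vs. a looser threshold) is my assumption, and the function name is illustrative, not from the paper:

```python
# Sketch of test-time scaling by consensus voting: sample N labels,
# keep one only if agreement meets the threshold, else abstain.
from collections import Counter

def consensus_label(samples, required_agreement=1.0):
    """Return the most common label if its share of the samples meets
    the agreement threshold (1.0 = unanimous), else 'Unknown'."""
    label, count = Counter(samples).most_common(1)[0]
    if count / len(samples) >= required_agreement:
        return label
    return "Unknown"

print(consensus_label(["T"] * 10))               # T
print(consensus_label(["T"] * 9 + ["NT"]))       # Unknown (no unanimity)
print(consensus_label(["T"] * 9 + ["NT"], 0.8))  # T (relaxed threshold)
```

This makes the F1 penalty in the tweet concrete: abstaining with "Unknown" trades recall for precision, so a strict threshold can lower the NT class's F1 even as it avoids confident mistakes.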
Oren Sultan @oren_sultan
@BeamFlamengo @SSanesti @_vatsadev @redtachyon That’s true, and that’s exactly why we use F1 rather than accuracy 🙂 BTW, in the competition dataset ~66% of programs terminate, but in real-world code, the skew toward non-termination is much stronger.
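Why F1 rather than accuracy matters on a ~66%/34% split can be shown with made-up numbers matching the stated skew (this is my illustration, not data from the paper): a degenerate classifier that always predicts "Terminates" looks passable on accuracy but scores zero F1 on the minority class.

```python
# Accuracy vs. per-class F1 under the ~66% "Terminates" skew.
def f1(tp, fp, fn):
    """F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# 100 programs: 66 terminate (T), 34 don't (NT); predict T always.
accuracy = 66 / 100            # 0.66: looks passable
f1_nt = f1(tp=0, fp=0, fn=34)  # 0.0: the NT class is never found
print(accuracy, f1_nt)
```

Per-class F1 exposes exactly the failure mode the thread discusses: weak performance on the non-termination class that overall accuracy would hide.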
Oren Sultan reposted
Paul Snively @JustDeezGuy
One important difference: we don’t have evidence humans CAN’T solve the halting problem, but we DO have evidence no computational approach to date can. To be clear, this is exactly why asking the question of LLMs is interesting.
[Quoting Peter O'Hearn @PeterOHearn12's "LLMs vs the Halting Problem" thread, above.]
Oren Sultan @oren_sultan
@redtachyon In conclusion, in a setting where all methods must rely on approximation, LLMs prove to be unexpectedly effective approximators.
Oren Sultan @oren_sultan
@redtachyon They often fail to produce valid witnesses (e.g., automaton proofs for non-termination). Performance also degrades on longer and more complex programs, highlighting the limits of current reasoning. But overall they remain very competitive with SOTA verification systems.
Oren Sultan @oren_sultan
In a domain where everyone is forced to approximate, LLMs turn out to be surprisingly strong approximators. Undecidability isn’t a dead end; it’s where heuristics stop being a hack and become the only game in town!
[Quoting Peter O'Hearn @PeterOHearn12's "LLMs vs the Halting Problem" thread, above.]