Pinned Tweet
Alon Jacoby
73 posts


Alex Cui@alexcdot
Okay so, we just found that over 50 papers published at @Neurips 2025 have AI hallucinations. I don't think people realize how bad the slop is right now. It's not just that researchers from @GoogleDeepMind, @Meta, @MIT, @Cambridge_Uni are using AI - they allowed LLMs to generate hallucinations in their papers and didn't notice at all. It's insane that these made it through peer review👇

@yoavgo Where can I read about this? (first problem, I agree about the second one)
Alon Jacoby retweeted

Why does a 90% benchmark score often feel like 50% in production? 📉
We are researching the anatomy of the "Vibe Check" - formalizing the gap between static metrics and actual "feel".
Help us turn intuition into data with this quick 7-minute survey!
forms.gle/HqE6R9Vevq9zzk…

Alon Jacoby retweeted

LLM CoT reasoning looks smart but can be logically flawed or... just made up. It's time to hold reasoning accountable!
We built VeriCoT to do just that. VeriCoT extracts the core argument of the CoT using well-formed symbolic notions of logical support. It formalizes every CoT step into first-order logic and finds the exact premise it's built on. This gives us two superpowers:
🤖Automated Proof: Solvers can automatically verify if the logic is valid.
🧑‍🔬Human-Readable Audits: Natural language premises let you pinpoint ungrounded leaps or fallacies.
Best of all, these can all be used as signals to learn more verifiable models!
To our knowledge, VeriCoT is the first neuro-symbolic validator of CoT traces in non-math/code domains.
📄 Paper: arxiv.org/pdf/2511.04662

Alon Jacoby retweeted

It's also a good reminder that even really impressive models can be surprisingly susceptible to very simple surface-level perturbations.
The original FlenQA paper is here:
arxiv.org/abs/2402.14848

The Phi 4 Reasoning technical report is a good reminder that current models still suffer massive performance degradation when reasoning tasks get longer - even at just 3K tokens!
They use FlenQA (w/ @mosh_levy) to show their model improves here massively.
arxiv.org/abs/2504.21318

Alon Jacoby retweeted

If you specify '-s' when running the script, an LLM will summarize the diff (3 models implemented, but you can easily add more). If this is useful to you because, like me, you need a worse version of git, check it out at github.com/alonj/pydift
or via
`pip install pydift`

@Alon_Jacoby You can check my fork here: github.com/mmarius/OLMo?t… It should still have all the information you are looking for.

This is one of a few neat ideas in @YuliSlavutsky's work on learning robust representations at @NeurIPSConf '24. Definitely worth reading if you're also interested in robustness: neurips.cc/virtual/2024/p…

We've come to expect LLMs to be generalist models that are accurate in zero-shot settings - such as QA in different domains, reasoning types, or even low-resource languages.
How can we ensure that models are accurate on samples from classes rarely seen in training?
Yuli Slavutsky@YuliSlavutsky
I'm on my way to #NeurIPS2024. On Friday I'm going to present my latest paper with @yuvalbenj. The gist is in the comments, and come chat with me to hear more!
Alon Jacoby retweeted

I'm on my way to #NeurIPS2024. On Friday I'm going to present my latest paper with @yuvalbenj.
The gist is in the comments, and come chat with me to hear more!


Evaluating LLMs is often a bit hand-wavy, and it's hard to agree on what exactly we're testing...
This kind of rigorous work, exposing inconsistencies in LLMs (consistency = assigning similar token probabilities in semantically comparable settings), is very much needed.
Yuli Slavutsky@YuliSlavutsky
Evaluating LLMs comes with two big challenges:
- No solid metric for "accuracy"
- Statistical analysis is often missing
1/4










