Alon Jacoby

73 posts

@Alon_Jacoby

PhD student @ Penn @cogcomp

Joined September 2021
146 Following · 132 Followers
Pinned Tweet
Alon Jacoby @Alon_Jacoby
New findings: We just evaluated Gemini 1.5 Pro on our recent benchmark, which tests the impact of context size on reasoning performance. It is much better than 1.0 in long contexts, though it still falls behind GPT-4. Also, CoT prompting now improves accuracy (unlike with 1.0). (1/4)
7
30
145
40.1K
Alon Jacoby @Alon_Jacoby
@nir_benz @yoavgo This post from Alex Cui x.com/alexcdot/statu…
Alex Cui @alexcdot

Okay so, we just found that over 50 papers published at @Neurips 2025 have AI hallucinations. I don't think people realize how bad the slop is right now. It's not just that researchers from @GoogleDeepMind, @Meta, @MIT, @Cambridge_Uni are using AI - they allowed LLMs to generate hallucinations in their papers and didn't notice at all. It's insane that these made it through peer review👇
0
0
0
13
Nir Ben-Zvi @nir_benz
@yoavgo Where can I read about this? (first problem, I agree about the second one)
1
0
0
796
(((ل()(ل() 'yoav))))👾
there is a much more fundamental problem than 50 accepted neurips papers with made up citations. and this problem is over 5200 accepted neurips papers.
11
13
220
21.6K
Alon Jacoby retweeted
Itay Itzhak @Itay_itzhak_
Why does a 90% benchmark score often feel like 50% in production? 📉 We are researching the anatomy of the "Vibe Check" - formalizing the gap between static metrics and actual "feel". Help us turn intuition into data with this quick 7-minute survey! forms.gle/HqE6R9Vevq9zzk…
1
5
21
3.7K
Alon Jacoby retweeted
Yuli Slavutsky @YuliSlavutsky
(1/2) Uncertainty estimation fails under distribution shifts. Why? Partly because in stats, even Bayesian stats, we treat x as given. But intuitively data makes different models plausible. For reliable uncertainty, we need to account for it explicitly. Come chat with me tomorrow!
1
1
3
182
Alon Jacoby retweeted
Yu Feng @AnnieFeng6
LLM CoT reasoning looks smart but can be logically flawed or... just made up. It's time to hold reasoning accountable! We built VeriCoT to do just that.

VeriCoT extracts the core argument of the CoT using well-formed symbolic notions of logical support. It formalizes every CoT step into first-order logic and finds the exact premise it's built on.

This gives us two superpowers:
🤖 Automated Proof: Solvers can automatically verify if the logic is valid.
🧑‍🔬 Human-Readable Audits: Natural language premises let you pinpoint ungrounded leaps or fallacies.

Best of all, all these can be used as signals to learn more verifiable models! To our knowledge, VeriCoT is the first neuro-symbolic validator of CoT traces in non-math/code domains.

📄 Paper: arxiv.org/pdf/2511.04662
2
11
26
6.8K
Alon Jacoby retweeted
Ai2 @allen_ai
LLMs power research, decision‑making, and exploration—but most benchmarks don’t test how well they stitch together evidence across dozens (or hundreds) of sources. Meet MoNaCo, our new eval for question-answering cross‑source reasoning. 👇
10
38
227
21.7K
Alon Jacoby retweeted
Ron Eliav @ron_eliav
🚨New Preprint! We propose CLATTER: Claim Localization and Attribution for Entailment Reasoning, to assess the faithfulness of LLM outputs to their sources. CLATTER boosts hallucination detection in reasoning models via decomposition and attribution steps. Summary below 👇
2
14
28
6K
Alon Jacoby @Alon_Jacoby
It's also a good reminder that even really impressive models can be surprisingly susceptible to very simple surface-level perturbations. The original FlenQA paper here - arxiv.org/abs/2402.14848
0
0
0
63
Alon Jacoby @Alon_Jacoby
The Phi 4 Reasoning technical report is a good reminder that current models still suffer massive performance degradation when reasoning tasks get longer - even at just 3K tokens! They use FlenQA (w/ @mosh_levy) to show their model improves here massively. arxiv.org/abs/2504.21318
1
5
13
1.7K
Alon Jacoby retweeted
AK @_akhaliq
RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation
1
52
133
17K
Alon Jacoby @Alon_Jacoby
Obviously, be sensible. If you're not willing to send your code to 3rd parties (OpenAI, Google, etc), don't use `-s` (or `--summary`). Everything else is done locally.
0
0
0
26
Alon Jacoby @Alon_Jacoby
If you specify `-s` when running the script, an LLM will summarize the diff (3 models implemented, but you can easily add more). If this is useful to you (because, like me, you need a worse version of git), check out github.com/alonj/pydift or install it via `pip install pydift`
1
0
0
30
Alon Jacoby @Alon_Jacoby
Sometimes I want to track small changes in code without too much hassle, so I made pydift: replace "python script.py" with "pydift script.py", and diffs from previous runs will be saved automatically.
1
0
1
40
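The idea behind the pydift thread above (save a diff against the previous run each time a script is launched) can be sketched roughly as follows. This is a minimal illustration built on the standard library's `difflib`, not pydift's actual implementation; the function name `diff_against_snapshot`, the `.last` snapshot suffix, and the store directory layout are all assumptions made for the example.

```python
import difflib
from pathlib import Path


def diff_against_snapshot(script: Path, store: Path) -> str:
    """Diff the script against its last saved snapshot, then update the snapshot.

    Hypothetical helper for illustration only. Returns an empty string on the
    first run or when nothing changed since the previous run.
    """
    store.mkdir(parents=True, exist_ok=True)
    snapshot = store / (script.name + ".last")  # assumed naming scheme

    new_text = script.read_text()
    old_text = snapshot.read_text() if snapshot.exists() else None

    # Always refresh the snapshot so the next run compares against this one.
    snapshot.write_text(new_text)

    if old_text is None or old_text == new_text:
        return ""

    diff_lines = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile="previous run",
        tofile="current run",
    )
    return "".join(diff_lines)
```

A wrapper in this spirit would call the helper first, write any non-empty diff to a log file, and only then execute the script. Everything here stays on the local machine, which matches the thread's note that only the summary option sends code to a third party.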
Alon Jacoby @Alon_Jacoby
How does one figure out exactly which samples were seen in training for a given OLMo checkpoint? Where is that information shared or stored? Also, there used to be a csv of checkpoints in the OLMo repo, but it's gone (guessing since OLMo 2)... Help will be appreciated
1
0
1
316
Alon Jacoby @Alon_Jacoby
Say we collected a multi-hop reasoning QA dataset. Inevitably, the samples will have some attributes that we didn't or can't control for (domain, text length, difficulty, etc.). Just as inevitably, if we take small enough sub-samples, a minority attribute will sometimes become the majority.
1
0
0
17
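The sub-sampling point above can be made concrete with a small calculation. The sketch below assumes sub-samples are drawn i.i.d. (with replacement), so the count of samples carrying an attribute is binomial; that modeling choice and the function name `prob_minority_becomes_majority` are assumptions for illustration and are not from the original thread.

```python
from math import comb


def prob_minority_becomes_majority(p: float, n: int) -> float:
    """Probability that an attribute with population share p holds a strict
    majority in a sub-sample of n i.i.d. draws (binomial model, an assumption).
    """
    threshold = n // 2 + 1  # smallest count that is a strict majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(threshold, n + 1)
    )
```

Under these assumptions, an attribute present in 30% of the data is the majority in a sub-sample of 3 with probability 0.216, and in a sub-sample of 5 with probability of about 0.163, so the effect shrinks as sub-samples grow but stays far from negligible at small sizes.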
Alon Jacoby @Alon_Jacoby
We've come to expect LLMs to be generalist models which are accurate in zero-shot settings - such as QA in different domains, reasoning types, or even low resource languages. How can we ensure that models are accurate on samples from classes rarely seen in training?
Yuli Slavutsky @YuliSlavutsky

I'm on my way to #NeurIPS2024. On Friday I'm going to present my latest paper with @yuvalbenj. The gist is in the comments, and come chat with me to hear more!
1
0
3
119
Alon Jacoby retweeted
Yuli Slavutsky @YuliSlavutsky
I'm on my way to #NeurIPS2024. On Friday I'm going to present my latest paper with @yuvalbenj. The gist is in the comments, and come chat with me to hear more!
2
3
14
651
Alon Jacoby @Alon_Jacoby
Another result is that there is some "preferred" order of decoding in MLMs (when there is more than one mask in the input). Seeing as MLMs are very relevant in retrieval, this is perhaps worth exploring and exploiting.
0
0
0
30
Alon Jacoby @Alon_Jacoby
One interesting result in the paper is that auto-regressive LLMs seem less consistent as their size increases, and the opposite is true with MLMs.
1
0
0
31