Alon Jacoby

73 posts

@Alon_Jacoby

PhD student @ Penn @cogcomp

Joined September 2021

146 Following, 132 Followers

Pinned Tweet

Alon Jacoby@Alon_Jacoby·17 Apr

New findings: We just evaluated Gemini 1.5 Pro on our recent benchmark that tests the impact of context size on reasoning performance - it is much better than 1.0 in long contexts, though it still falls behind GPT-4. Also, CoT prompting now improves accuracy (unlike with 1.0). (1/4)

Alon Jacoby@Alon_Jacoby·24 Jan

@nir_benz @yoavgo This post from Alex Cui x.com/alexcdot/statu…

Alex Cui@alexcdot

Okay so, we just found that over 50 papers published at @Neurips 2025 have AI hallucinations. I don't think people realize how bad the slop is right now. It's not just that researchers from @GoogleDeepMind, @Meta, @MIT, @Cambridge_Uni are using AI - they allowed LLMs to generate hallucinations in their papers and didn't notice at all. It's insane that these made it through peer review 👇

Nir Ben-Zvi@nir_benz·23 Jan

@yoavgo Where can I read about this? (first problem, I agree about the second one)

(((ل()(ل() 'yoav))))👾@yoavgo·23 Jan

there is a much more fundamental problem than 50 accepted neurips papers with made up citations. and this problem is over 5200 accepted neurips papers.

Alon Jacoby retweeted

Itay Itzhak@Itay_itzhak_·21 Dec

Why does a 90% benchmark score often feel like 50% in production? 📉 We are researching the anatomy of the "Vibe Check" - formalizing the gap between static metrics and actual "feel". Help us turn intuition into data with this quick 7-minute survey! forms.gle/HqE6R9Vevq9zzk…

Alon Jacoby retweeted

Yuli Slavutsky@YuliSlavutsky·3 Dec

(1/2) Uncertainty estimation fails under distribution shifts. Why? Partly because in stats, even Bayesian stats, we treat x as given. But intuitively data makes different models plausible. For reliable uncertainty, we need to account for it explicitly. Come chat with me tomorrow!

Alon Jacoby retweeted

Yu Feng@AnnieFeng6·7 Nov

LLM CoT reasoning looks smart but can be logically flawed or... just made up. It's time to hold reasoning accountable! We built VeriCoT to do just that.
VeriCoT extracts the core argument of the CoT using well-formed symbolic notions of logical support. It formalizes every CoT step into first-order logic and finds the exact premise it's built on.
This gives us two superpowers:
🤖 Automated Proof: Solvers can automatically verify if the logic is valid.
🧑‍🔬 Human-Readable Audits: Natural language premises let you pinpoint ungrounded leaps or fallacies.
Best of all, all these can be used as signals to learn more verifiable models! To our knowledge, VeriCoT is the first neuro-symbolic validator of CoT traces in non-math/code domains.
📄 Paper: arxiv.org/pdf/2511.04662

Alon Jacoby retweeted

Ai2@allen_ai·18 Aug

LLMs power research, decision‑making, and exploration—but most benchmarks don’t test how well they stitch together evidence across dozens (or hundreds) of sources. Meet MoNaCo, our new eval for question-answering cross‑source reasoning. 👇

Alon Jacoby retweeted

Ron Eliav@ron_eliav·6 Jun

🚨New Preprint! We propose CLATTER: Claim Localization and Attribution for Entailment Reasoning, to assess the faithfulness of LLM outputs to their sources. CLATTER boosts hallucination detection in reasoning models via decomposition and attribution steps. Summary below 👇

Alon Jacoby@Alon_Jacoby·7 May

It's also a good reminder that even really impressive models can be surprisingly susceptible to very simple surface-level perturbations. The original FlenQA paper here - arxiv.org/abs/2402.14848

Alon Jacoby@Alon_Jacoby·7 May

The Phi 4 Reasoning technical report is a good reminder that current models still suffer massive performance degradation when reasoning tasks get longer - even at just 3K tokens! They use FlenQA (w/ @mosh_levy) to show their model improves here massively. arxiv.org/abs/2504.21318

Alon Jacoby retweeted

AK@_akhaliq·25 Apr

RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation

Alon Jacoby@Alon_Jacoby·3 Feb

Obviously, be sensible. If you're not willing to send your code to 3rd parties (OpenAI, Google, etc), don't use `-s` (or `--summary`). Everything else is done locally.

Alon Jacoby@Alon_Jacoby·3 Feb

If you specify `-s` when running the script, an LLM will summarize the diff (3 models implemented, but you can easily add more). If this is useful to you because, like me, you need a worse version of git, check out github.com/alonj/pydift or install it via `pip install pydift`

Alon Jacoby@Alon_Jacoby·3 Feb

Sometimes I want to track small changes in code without too much hassle, so I made pydift: replace "python script.py" with "pydift script.py", and diffs from previous runs will be saved automatically.
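
A minimal sketch of the idea described above, assuming a snapshot-and-diff approach built on Python's standard `difflib`. This is not pydift's actual code: the function names `snapshot_diff` and `record_run`, the `.last` snapshot suffix, and the storage directory are all hypothetical, chosen only for illustration.

```python
import difflib
from pathlib import Path


def snapshot_diff(current_text: str, previous_text: str, name: str = "script.py") -> str:
    """Return a unified diff from the previous run's source to the current one."""
    diff_lines = difflib.unified_diff(
        previous_text.splitlines(keepends=True),
        current_text.splitlines(keepends=True),
        fromfile=f"{name} (previous run)",
        tofile=f"{name} (current run)",
    )
    return "".join(diff_lines)


def record_run(script_path: Path, store_dir: Path) -> str:
    """Diff the script against the last stored snapshot, then update the snapshot.

    Hypothetical layout: one '<script name>.last' file per tracked script.
    """
    store_dir.mkdir(parents=True, exist_ok=True)
    snapshot = store_dir / (script_path.name + ".last")
    current = script_path.read_text()
    # On the first run there is no snapshot, so the whole file shows up as added.
    previous = snapshot.read_text() if snapshot.exists() else ""
    diff = snapshot_diff(current, previous, script_path.name)
    snapshot.write_text(current)
    return diff
```

Calling `record_run(Path("script.py"), Path(".diffs"))` before each run would return the changes since the previous run, or an empty string if nothing changed.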

Alon Jacoby@Alon_Jacoby·16 Dec

@mariusmosbach thanks!

Marius Mosbach@mariusmosbach·15 Dec

@Alon_Jacoby You can check my fork here: github.com/mmarius/OLMo?t… It should still have all the information you are looking for.

Alon Jacoby@Alon_Jacoby·15 Dec

How does one figure out exactly which samples were seen in training for a given OLMo checkpoint? Where is that information shared or stored? Also, there used to be a csv of checkpoints in the OLMo repo, but it's gone (guessing since OLMo 2)... Help will be appreciated

Alon Jacoby@Alon_Jacoby·11 Dec

This is one of a few neat ideas in @YuliSlavutsky's work on learning robust representations, presented at @NeurIPSConf '24. Definitely worth reading if you're also interested in robustness: neurips.cc/virtual/2024/p…

Alon Jacoby@Alon_Jacoby·11 Dec

Say we collected a multi-hop reasoning QA dataset. Inevitably, the samples will have some attributes that we didn't/can't control for (domain, length of text, difficulty, etc). Just as inevitably, if we take small enough sub-samples, the minority attributes will sometimes become the majority.
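
To make the sub-sampling point concrete, here is a rough calculation, assuming samples are drawn independently (a binomial model) and that 30% of the full dataset carries some minority attribute. Both the independence assumption and the 30% figure are illustrative, not taken from the paper, and the function name is hypothetical.

```python
from math import comb


def prob_minority_becomes_majority(minority_fraction: float, sample_size: int) -> float:
    """Probability that a strict majority of an i.i.d. sub-sample has the minority attribute."""
    threshold = sample_size // 2 + 1  # smallest count that is a strict majority
    return sum(
        comb(sample_size, k)
        * minority_fraction**k
        * (1 - minority_fraction) ** (sample_size - k)
        for k in range(threshold, sample_size + 1)
    )


if __name__ == "__main__":
    # Illustrative only: 30% minority share, growing sub-sample sizes.
    for n in (1, 3, 5, 11, 51, 101):
        print(n, prob_minority_becomes_majority(0.3, n))
```

Under these assumptions the chance is 0.3 for a single sample and about 0.216 for a sub-sample of three, and it keeps shrinking as the sub-sample grows.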

Alon Jacoby@Alon_Jacoby·11 Dec

We've come to expect LLMs to be generalist models which are accurate in zero-shot settings - such as QA in different domains, reasoning types, or even low resource languages. How can we ensure that models are accurate on samples from classes rarely seen in training?

Yuli Slavutsky@YuliSlavutsky

I'm on my way to #NeurIPS2024. On Friday I'm going to present my latest paper with @yuvalbenj. The gist is in the comments, and come chat with me to hear more!

Alon Jacoby retweeted

Yuli Slavutsky@YuliSlavutsky·11 Dec

I'm on my way to #NeurIPS2024. On Friday I'm going to present my latest paper with @yuvalbenj. The gist is in the comments, and come chat with me to hear more!

Alon Jacoby@Alon_Jacoby·6 Oct

Another result is that there is some "preferred" order of decoding in MLMs (when there is more than one mask in the input). Seeing as MLMs are very relevant in retrieval, this is perhaps worth exploring and exploiting.

Alon Jacoby@Alon_Jacoby·6 Oct

One interesting result in the paper is that auto-regressive LLMs seem less consistent as their size increases, and the opposite is true with MLMs.

Alon Jacoby@Alon_Jacoby·6 Oct

Evaluating LLMs is often a bit hand-wavy and it's hard to agree on what exactly we're testing... this kind of rigorous work, exposing inconsistencies in LLMs (=assigning similar token probabilities in semantically comparable settings), is very much needed

Yuli Slavutsky@YuliSlavutsky

Evaluating LLMs comes with two big challenges:
- No solid metric for "accuracy"
- Statistical analysis is often missing
1/4
