Jonibek Mansurov

23 posts

@M_Jonibek

Joined May 2023
130 Following · 34 Followers
Jonibek Mansurov retweeted
Alham Fikri Aji (@AlhamFikri)
Should we treat LLM benchmarking like an annual Olympiad event? 🏆

With current benchmarks, it is too easy to overfit to tasks or manipulate settings. In some cases, people just cheat or narrow-tune their models to a specific benchmark (*cough* LLaMa-4).

What if we organized an annual, Olympiad-like event? The tasks must be sealed and unknown. Models cannot study for the test. They must be prepared for anything. We explain this in our new position paper.

I am an IOI alum from a long time ago. I practiced for years to master many algorithms. I wanted to be ready for whatever appeared on contest day. I believe general LLMs should face the same standard. If they are truly general, they should be ready for any use case.

We propose a flow similar to how we typically organize an Olympiad:
- Call for Task: We propose an open solicitation for challenging, high-quality tasks from the global research community.
- Organizing Committee: A dedicated team curates and improves these submissions. They verify task quality and diversity.
- Model Developers: Developers submit their systems blindly before the tasks are revealed. This prevents teams from iterative gaming or manual tuning once the exam starts.
- The Actual Olympiad: Evaluation happens in a synchronized, short window. The sealed tasks are released, and all models are tested simultaneously to maintain total integrity under the same setting.

Once it is done, everything will be released for reproducibility.

Read the full position paper here: arxiv.org/abs/2603.23292

We worked on this together with my student @jcblaisecruz

Let me know your thoughts!
Jonibek Mansurov retweeted
Blaise Cruz (@jcblaisecruz)
New position paper! 📄 "LLM Olympiad: Why Model Evaluation Needs a Sealed Exam" We argue that NLP needs an Olympiad-style event: seal the problems, freeze submissions, run one harness, release everything for audit. w/ @AlhamFikri Paper: arxiv.org/abs/2603.23292
Jonibek Mansurov retweeted
Alham Fikri Aji (@AlhamFikri)
As an advisor, my take is that a PhD student is not a paper-generating machine. If a professor thinks AI agents can replace students, they might have lost sight of what advising is truly about.

IMO, mentorship is about training an independent thinker, building a relationship, and cultivating lifelong bonds. The joy of raising a successful student and watching them flourish is something you simply can't get from prompting an LLM agent.
Quoting Sayash Kapoor (@sayashk):

In the last few months, I've spoken to many CS professors who asked me if we even need CS PhD students anymore. Now that we have coding agents, can't professors work directly with agents? My view is that equipping PhD students with coding agents will allow them to do work that is orders of magnitude more impressive than they otherwise could. And they can be *accountable* for their outcomes in a way agents can't (yet). For example, who checks the agent's outputs are correct? Who is responsible for mistakes or errors?

Jonibek Mansurov retweeted
pat ✈️ CVPR (@patrickamadeus_)
Excited to share that we have committed our paper “Vision-Language Models are Confused Tourists” to #CVPR2026 (Findings)! 🇺🇸🏔
Arxiv: arxiv.org/abs/2511.17004

We question whether current SOTA VLMs remain robust in simple cultural grounding QA when distracting contextual objects are present.

For example, if you eat chicken schnitzel with Mt. Fuji in the background, will the model fail to recognize it as Japanese katsu?

ConfusedTourists introduces:
👉 5k+ evaluation samples across 3 cultural item categories, comprising 243 unique cultural items from 57 countries and 11 sub-regions 🌍
👉 Evaluation of 14 VLMs across 12 data features 🤖
👉 Findings showing that simple concept mixing can cause a performance drop of up to 40% 📉

Special thanks to my co-authors @IkhlasulHanif0, @emthehunt, @gentaiscool, @FajriKoto, and my advisor @AlhamFikri for the valuable contributions along the way!

#multimodal #vlm #multicultural #robustness #evaluation #NLProc #ComputerVision
Jonibek Mansurov retweeted
Farid Adilazuarda (@faridlazuarda)
🚀🚨 Sparse-Frontier Major Updates! You can now evaluate Reasoning + Sparse models at speed, with Sparse-Frontier upgraded to the @vllm_project's v1 engine🔥 We still provide support for Tensor Parallelism and the original sparse attention baselines, but it now works cleanly with newer models, decoding strategies, and evaluation setups. Task coverage and model support were also expanded as part of this release. The config-based workflow stays the same. If you’re working on sparse decoding, reasoning models, or long-context evaluation, this update makes it easier to run consistent experiments across models, tasks, and attention methods⚡️ Really enjoyed working with @p_nawrot and @PontiEdoardo over the past months to get this release out!
Jonibek Mansurov retweeted
Sama🌪 (@SamaHadhod)
(1/9) Excited to share our new paper🥳, Idea First, Code Later: Disentangling Problem Solving from Code Generation in Evaluating LLMs for Competitive Programming.
Jonibek Mansurov retweeted
Blaise Cruz (@jcblaisecruz)
1/11 Proud to share our new paper: SENSIA (SENse-based Symmetric Interlingual Alignment) — a sense-based approach to multilingual adaptation. Goal: explicit representation-level alignment of meaning.
Jonibek Mansurov retweeted
Haryo (@haryoaw)
Most cultural benchmarks are static, which may lead to saturation and data leakage, making their scores an unreliable measure of LLM capability. So instead, we benchmark these LLMs by having them play a social deduction game!
Jonibek Mansurov retweeted
Alham Fikri Aji (@AlhamFikri)
🎉 Silver for @mbzuai at ACPC!! 🥈
We also topped the Gulf rankings.
Amazing performance by our 1st year UG students in our very first participation.
Next: ICPC World Finals 2026 🤞
Jonibek Mansurov retweeted
Alham Fikri Aji (@AlhamFikri)
I heard someone canceled their conference trip because of a threat over this. Juniors could also have their careers hurt for criticizing big-name authors.

Don't let curiosity get the better of you. Knowing who hates your paper can affect you for years.

And please DON'T harass anyone.
Quoting a post from ICLR (@iclr_conf)

pat ✈️ CVPR (@patrickamadeus_)
Personal update: I am starting my PhD at @mbzuai, where I look forward to working in the multimodal realm (interpretability, modality imbalance, eval & application) to address foundational gaps with @AlhamFikri and co.
Jonibek Mansurov retweeted
Yong Zheng-Xin (@yong_zhengxin)
These are incredible findings: a reproducibility crisis where baselines are not faithfully reproduced or reported (e.g., with a footnote indicating the performance difference) 🍎 In our work (arxiv.org/abs/2505.05408) we tried very hard to ensure apples-to-apples comparisons.
Quoting Shashwat Goel (@ShashwatGoel7):

Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the Pre-RL models and realized they were severely underreported across papers. We compiled discrepancies in a blog below🧵👇

Jonibek Mansurov retweeted
Yong Zheng-Xin (@yong_zhengxin)
Amidst the evaluation/reproducibility crisis for reasoning LLMs, it's great to see *concurrent independent work (with different models & benchmarks) aligns with our findings*! We reported the same fundamental trade-off: language forcing leads to ✅ compliance, ❌ accuracy!
Quoting Jirui Qi (@Jirui_Qi):

[1/]💡New Paper Large reasoning models (LRMs) are strong in English — but how well do they reason in your language? Our latest work uncovers their limitation and a clear trade-off: Controlling Thinking Trace Language Comes at the Cost of Accuracy 📄Link: arxiv.org/abs/2505.22888

Jonibek Mansurov retweeted
Sophia Yang, Ph.D. (@sophiamyang)
Can an AI trained in English solve math problems in other languages without extra training?
Jonibek Mansurov retweeted
Farid Adilazuarda (@faridlazuarda)
Can English-finetuned LLMs reason in other languages? Short Answer: Yes, thanks to “quote-and-think” + test-time scaling. You can even force them to reason in a target language! But: 🌐 Low-resource langs & non-STEM topics still tough. New paper: arxiv.org/abs/2505.05408
Quoting Yong Zheng-Xin (@yong_zhengxin):

📣 New paper! We observe that reasoning language models finetuned only on English data are capable of zero-shot cross-lingual reasoning through a "quote-and-think" pattern. However, this does not mean they reason the same way across all languages or in new domains. [1/N]

Jonibek Mansurov retweeted
Alham Fikri Aji (@AlhamFikri)
🚨 Multilingual LLMs, finetuned only on English reasoning data, can still reason when asked non-English questions, showing reasoning traces that go back & forth between languages.

I had so much fun working on this project. Please give our paper a read! arxiv.org/abs/2505.05408
Quoting Yong Zheng-Xin (@yong_zhengxin):

📣 New paper! We observe that reasoning language models finetuned only on English data are capable of zero-shot cross-lingual reasoning through a "quote-and-think" pattern. However, this does not mean they reason the same way across all languages or in new domains. [1/N]

Jonibek Mansurov retweeted
AK (@_akhaliq)
Crosslingual Reasoning through Test-Time Scaling TL;DR: show that scaling up thinking tokens of English-centric reasoning language models, such as s1 models, can improve multilingual math reasoning performance. Also analyze the language-mixing patterns, effects of different reasoning languages (controlled by our language forcing strategies), and cross-domain generalization (from STEM to domains such as social sciences and cultural benchmarks).
Jonibek Mansurov retweeted
Cohere Labs (@Cohere_Labs)
Reasoning language models are primarily trained on English data, but do they generalize well to multilingual settings in various domains? We show that test-time scaling can improve their zero-shot crosslingual reasoning performance! 🔥