Jonibek Mansurov

23 posts

@M_Jonibek

Joined May 2023

130 Following, 34 Followers

Jonibek Mansurov retweeted

Alham Fikri Aji@AlhamFikri·19 Apr

@akshay_pachaar The main (first, last/corresponding) authors are from MBZUAI. The UCL author is also visiting MBZUAI. Please credit them accordingly x.com/AlhamFikri/sta…

Alham Fikri Aji@AlhamFikri

"Researchers from UCL" but they are all from MBZUAI, even the first and corresponding authors. Give proper credit please. You don't have to go this low for clout.

3.4K

Jonibek Mansurov retweeted

Alham Fikri Aji@AlhamFikri·30 Mar

Should we treat LLM benchmarking like an annual Olympiad event?🏆

With current benchmarks, it is too easy to overfit tasks or manipulate settings. In some cases, people just cheat / being narrow-tuned to a specific benchmark (*cough* LLaMa-4)

What if we organized an annual, Olympiad-like event? The tasks must be sealed and unknown. Models cannot study for the test. They must be prepared for anything. We explain this in our new position paper.

I am an IOI alum long time ago. I practiced for years to master many algorithms. I wanted to be ready for whatever appeared on the contest day. I believe general LLMs should face the same standard. If they are truly general, they should be ready for whatever use cases.

We propose a flow similar to how we typically organize an Olympiad:
- Call for Task: We propose an open solicitation for challenging, high-quality tasks from the global research community.
- Organizing Committee: A dedicated team curates and improves these submissions. They verify task quality and diversity.
- Model Developers: Developers submit their systems blindly before the tasks are revealed. This prevents teams from iterative gaming or manual tuning once the exam starts.
- The Actual Olympiad: Evaluation happens in a synchronized, short window. The sealed tasks are released, and all models are tested simultaneously to maintain total integrity under the same setting. Once it is done, everything will be released for reproducibility.

Read the full position paper here: arxiv.org/abs/2603.23292

We worked on this together with my student @jcblaisecruz

Let me know your thoughts!
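One way to picture the "sealed tasks, frozen submissions" flow described above is a simple commit-and-reveal scheme. The sketch below is only an illustration under stated assumptions: the SHA-256 commitment, the `SealedOlympiad` class, and every name in it are hypothetical and are not taken from the position paper, which may organize the sealing differently.

```python
import hashlib
import json


def commit(tasks: list, salt: str) -> str:
    """Return a SHA-256 digest that commits to the task set without revealing it."""
    blob = json.dumps(tasks, sort_keys=True).encode("utf-8")
    return hashlib.sha256(salt.encode("utf-8") + blob).hexdigest()


def verify(tasks: list, salt: str, commitment: str) -> bool:
    """Check that the revealed tasks match the digest published in advance."""
    return commit(tasks, salt) == commitment


class SealedOlympiad:
    """Hypothetical organizer object: publish a commitment, accept blind
    submissions, then reveal the tasks and freeze the submission list."""

    def __init__(self, tasks: list, salt: str):
        self._tasks = tasks
        self._salt = salt
        # Published before the event so anyone can audit the reveal later.
        self.task_commitment = commit(tasks, salt)
        self.submissions = {}
        self.revealed = False

    def submit(self, team: str, model_id: str) -> None:
        # Developers submit before the tasks are revealed; late entries fail.
        if self.revealed:
            raise RuntimeError("submissions are frozen once tasks are revealed")
        self.submissions[team] = model_id

    def reveal(self):
        # Release tasks and salt together so the commitment can be checked.
        self.revealed = True
        return self._tasks, self._salt
```

In this sketch, a team that tries to call `submit` after `reveal` gets an error, and anyone holding the published `task_commitment` can call `verify` on the released tasks to confirm nothing was swapped after the fact.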

6.9K

Jonibek Mansurov retweeted

Blaise Cruz@jcblaisecruz·25 Mar

New position paper! 📄 "LLM Olympiad: Why Model Evaluation Needs a Sealed Exam" We argue that NLP needs an Olympiad-style event: seal the problems, freeze submissions, run one harness, release everything for audit. w/ @AlhamFikri Paper: arxiv.org/abs/2603.23292

2.7K

Jonibek Mansurov retweeted

Alham Fikri Aji@AlhamFikri·14 Mar

As an advisor, my take is that a PhD student is not a paper-generating machine. If a professor thinks AI agents can replace students, they might have lost sight of what advising is truly about. IMO, mentorship is about training an independent thinker, building a relationship, and cultivating lifelong bonds. The joy of raising a successful student and watching them flourish is something you simply can't get from prompting an LLM agent.

Sayash Kapoor@sayashk

In the last few months, I've spoken to many CS professors who asked me if we even need CS PhD students anymore. Now that we have coding agents, can't professors work directly with agents? My view is that equipping PhD students with coding agents will allow them to do work that is orders of magnitude more impressive than they otherwise could. And they can be *accountable* for their outcomes in a way agents can't (yet). For example, who checks the agent's outputs are correct? Who is responsible for mistakes or errors?

245

23K

Jonibek Mansurov retweeted

pat ✈️ CVPR@patrickamadeus_·15 Mar

Excited to share that we have committed our paper “Vision-Language Models are Confused Tourists” to #CVPR2026 (Findings)! 🇺🇸🏔
Arxiv: arxiv.org/abs/2511.17004

We question whether current SOTA VLMs remain robust in simple cultural grounding QA when distracting contextual objects are present. For example, if you eat chicken schnitzel with Mt. Fuji in the background, will the model fail to recognize it as Japanese katsu?

ConfusedTourists introduces:
👉 5k+ evaluation samples across 3 cultural item categories, comprising 243 unique cultural items from 57 countries and 11 sub-regions 🌍
👉 Evaluation of 14 VLMs across 12 data features 🤖
👉 Findings showing that simple concept mixing can cause up to a -40% drop in performance 📉

Special thanks to my co-authors @IkhlasulHanif0, @emthehunt, @gentaiscool, @FajriKoto, and my advisor @AlhamFikri for the valuable contributions along the way!

#multimodal #vlm #multicultural #robustness #evaluation #NLProc #ComputerVision

10.7K

Jonibek Mansurov retweeted

Farid Adilazuarda@faridlazuarda·29 Jan

🚀🚨 Sparse-Frontier Major Updates! You can now evaluate Reasoning + Sparse models at speed, with Sparse-Frontier upgraded to the @vllm_project's v1 engine🔥 We still provide support for Tensor Parallelism and the original sparse attention baselines, but it now works cleanly with newer models, decoding strategies, and evaluation setups. Task coverage and model support were also expanded as part of this release. The config-based workflow stays the same. If you’re working on sparse decoding, reasoning models, or long-context evaluation, this update makes it easier to run consistent experiments across models, tasks, and attention methods⚡️ Really enjoyed working with @p_nawrot and @PontiEdoardo over the past months to get this release out!

17.2K

Jonibek Mansurov retweeted

Sama🌪@SamaHadhod·20 Jan

(1/9) Excited to share our new paper🥳, Idea First, Code Later: Disentangling Problem Solving from Code Generation in Evaluating LLMs for Competitive Programming.

3.3K

Jonibek Mansurov retweeted

Blaise Cruz@jcblaisecruz·16 Jan

1/11 Proud to share our new paper: SENSIA (SENse-based Symmetric Interlingual Alignment) — a sense-based approach to multilingual adaptation. Goal: explicit representation-level alignment of meaning.

3.5K

Jonibek Mansurov retweeted

Haryo@haryoaw·15 Jan

Most culture benchmarks are static, which may lead to data saturation and leakage, making their scores unreliable for measuring the capabilities of LLMs. So we benchmark these LLMs by having them play a social deduction game!

3.3K

Jonibek Mansurov retweeted

Alham Fikri Aji@AlhamFikri·14 Dec

🎉Silver for @mbzuai at ACPC!!🥈 We also topped the Gulf rankings. Amazing performance by our 1st-year UG students in our very first participation. Next: ICPC World Finals 2026🤞

14.5K

Jonibek Mansurov retweeted

Alham Fikri Aji@AlhamFikri·28 Nov

I heard someone canceled their conference trip coz of a threat over this. Juniors could also get their careers hurt for criticizing big-name authors. Don't let curiosity get the better of you. Knowing who hates your paper can affect you for years. And please DON'T harass anyone.

ICLR@iclr_conf

4.6K

Jonibek Mansurov@M_Jonibek·25 Aug

@patrickamadeus_ @mbzuai @AlhamFikri Welcome to the club

109

pat ✈️ CVPR@patrickamadeus_·24 Aug

Personal update: I am starting my PhD @mbzuai, where I look forward to working in the multimodal realm (interpretability, modality imbalance, eval & application) to address foundational gaps with @AlhamFikri and co.

144

13.1K

Jonibek Mansurov retweeted

Yong Zheng-Xin@yong_zhengxin·30 May

These are incredible findings: a reproducibility crisis where baselines are not faithfully reproduced or reported (e.g., a footnote indicating a performance difference) 🍎 In our work (arxiv.org/abs/2505.05408) we tried so hard to ensure apples-to-apples comparison.

Shashwat Goel@ShashwatGoel7

Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the pre-RL models and realized they were severely underreported across papers. We compiled discrepancies in a blog below🧵👇

897

Jonibek Mansurov retweeted

Yong Zheng-Xin@yong_zhengxin·30 May

Amidst the evaluation/reproducibility crisis for reasoning LLMs, it's great to see *concurrent independent work (with different models & benchmarks) aligns with our findings*! We reported the same fundamental trade-off: language forcing leads to ✅ compliance, ❌ accuracy!

Jirui Qi@Jirui_Qi

[1/]💡New Paper Large reasoning models (LRMs) are strong in English — but how well do they reason in your language? Our latest work uncovers their limitation and a clear trade-off: Controlling Thinking Trace Language Comes at the Cost of Accuracy 📄Link: arxiv.org/abs/2505.22888

3.4K

Jonibek Mansurov retweeted

Sophia Yang, Ph.D.@sophiamyang·18 May

Can an AI trained in English solve math problems in other languages without extra training?

605

35.8K

Jonibek Mansurov retweeted

Farid Adilazuarda@faridlazuarda·10 May

Can English-finetuned LLMs reason in other languages? Short answer: Yes, thanks to “quote-and-think” + test-time scaling. You can even force them to reason in a target language! But: 🌐 Low-resource langs & non-STEM topics are still tough. New paper: arxiv.org/abs/2505.05408

Yong Zheng-Xin@yong_zhengxin

📣 New paper! We observe that reasoning language models finetuned only on English data are capable of zero-shot cross-lingual reasoning through a "quote-and-think" pattern. However, this does not mean they reason the same way across all languages or in new domains. [1/N]

6.4K

Jonibek Mansurov retweeted

Alham Fikri Aji@AlhamFikri·10 May

🚨Multilingual LLMs, finetuned only on English reasoning data, can still reason when asked non-English questions, showing reasoning traces that go back & forth between languages. I had so much fun working on this project. Please give our paper a read! arxiv.org/abs/2505.05408

Yong Zheng-Xin@yong_zhengxin

6.6K

Jonibek Mansurov retweeted

Genta Winata@gentaiscool·10 May

⭐️Reasoning LLMs trained on English data can think in other languages. Read our paper to learn more! Thank you @yong_zhengxin for leading the project and team! It was an exciting collab! @faridlazuarda @M_Jonibek @ruochenz_ @Muennighoff @CarstenEickhoff Julia Kreutzer @gentaiscool @stevebach @AlhamFikri

Yong Zheng-Xin@yong_zhengxin

2.1K

Jonibek Mansurov retweeted

AK@_akhaliq·9 May

Crosslingual Reasoning through Test-Time Scaling TL;DR: show that scaling up thinking tokens of English-centric reasoning language models, such as s1 models, can improve multilingual math reasoning performance. Also analyze the language-mixing patterns, effects of different reasoning languages (controlled by our language forcing strategies), and cross-domain generalization (from STEM to domains such as social sciences and cultural benchmarks).

15.3K

Jonibek Mansurov retweeted

Cohere Labs@Cohere_Labs·9 May

Reasoning language models are primarily trained on English data, but do they generalize well to multilingual settings in various domains? We show that test-time scaling can improve their zero-shot crosslingual reasoning performance! 🔥

8.5K
