pat

299 posts

@patrickamadeus_

PhD stud @mbzuai | Past: @sgsmu @ucdavis @itbofficial | multimoding in NLP and life

Joined November 2020
568 Following · 314 Followers
Pinned Tweet
pat@patrickamadeus_·
2026 goal: do whatever sorcery it takes to relive these moments✌️
[image]
3 replies · 0 reposts · 9 likes · 961 views
pat retweeted
Shenzhi Wang🌟@ShenzhiWang_THU·
When training Qwen3.5, we kept asking ourselves: 🧐 What kind of multimodal RLVR data actually leads to generalizable gains? 💡 We believe the answer may not lie only in data tightly tailored to specific benchmarks, but also in OOD proxy tasks that train the foundational abilities behind long-chain visual reasoning.

The motivation is simple: VLMs are still unreliable in long-CoT settings. Small mistakes in perception, reasoning, knowledge use, or grounding can compound across intermediate steps and eventually lead to much larger final errors. However, much of today's RLVR data still does not require complex reasoning chains grounded in visual evidence throughout, meaning these failure modes are often not sufficiently stressed during training.

🚀 Excited to share our new work from Qwen and Tsinghua LeapLab: HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning. This is also one of the training task sources used in Qwen3.5 VL RLVR.

To study this question, we propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data for RLVR training. The key idea is to build each query as a chain of logically dependent hops: earlier hops establish the instances, sets, or conditions needed for later hops, while the model must repeatedly return to the image for fresh visual grounding along the way. At the same time, each query ends with a specific, unambiguous numerical answer, making it naturally suitable for verifiable rewards.

Concretely, HopChain combines two complementary structures: perception-level hops and instance-chain hops. We require each synthesized example to involve both, so the model cannot simply continue reasoning from language inertia. Instead, it is forced to keep grounding intermediate steps in the image, maintain cross-step dependencies, and control error accumulation across long reasoning trajectories.

Our goal is not to mimic any specific downstream benchmark, but to strengthen the more fundamental abilities that long-CoT vision-language reasoning depends on. We add HopChain-synthesized data into RLVR training for Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, and evaluate on 24 benchmarks spanning diverse domains. Despite not being designed for any particular benchmark, HopChain improves 20 out of 24 benchmarks on both models, indicating broad and generalizable gains. We also find that fully chained multi-hop queries are crucial: replacing them with half-multi-hop or single-hop variants reduces performance substantially. Most notably, the gains are especially strong on long-CoT and ultra-long-CoT vision-language reasoning, peaking at more than 50 accuracy points in the ultra-long-CoT regime.

Our main takeaway is simple: beyond benchmark-aligned data, OOD proxy tasks that systematically train the core mechanics of long-chain visual reasoning can be a powerful and scalable source of RLVR supervision for VLMs, and can lead to more generalizable improvements.

🔗 huggingface.co/papers/2603.17…
[3 images]
2 replies · 50 reposts · 415 likes · 50.8K views
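The chained-hop construction described in the tweet above can be sketched in miniature. This is purely illustrative: the scene schema, the `Hop` class, and `run_chain` are hypothetical names invented here, not HopChain's actual interface or data format.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical scene annotation standing in for what perception would extract
# from an image (fields are illustrative, not HopChain's actual schema).
scene = {
    "objects": [
        {"type": "cup", "color": "red", "on": "table"},
        {"type": "cup", "color": "blue", "on": "table"},
        {"type": "cup", "color": "red", "on": "shelf"},
        {"type": "book", "color": "red", "on": "table"},
    ]
}

@dataclass
class Hop:
    """One step of the chain; each hop consumes the previous hop's output,
    so later hops logically depend on earlier ones."""
    question: str
    fn: Callable  # maps previous result -> new result, grounded in the scene

def run_chain(hops: List[Hop], state=None):
    for hop in hops:
        state = hop.fn(state)
    return state

# Hop 1 (perception-level): find all red objects in the scene.
# Hop 2 (instance-chain): of those instances, keep the ones on the table.
# Hop 3: count them, yielding a single unambiguous number.
hops = [
    Hop("Which objects are red?",
        lambda _: [o for o in scene["objects"] if o["color"] == "red"]),
    Hop("Of those, which are on the table?",
        lambda objs: [o for o in objs if o["on"] == "table"]),
    Hop("How many are there?", lambda objs: len(objs)),
]

gold = run_chain(hops)  # numeric ground truth for the verifiable reward

def reward(model_answer: int) -> float:
    """Binary verifiable reward, in the spirit of RLVR."""
    return 1.0 if model_answer == gold else 0.0
```

Because every hop's output feeds the next and the final answer is a single number, a mistake at any intermediate step changes the final answer, which is exactly the error-accumulation pressure the tweet describes.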
pat retweeted
Nunchi@nunchi·
Introducing Auto-Research Trading: a fully autonomous research loop that lets your AI trading agent teach itself to trade. Inspired by Karpathy's auto-research. Others use it to evolve how AI agents work together; we built it for trading. Now open-sourced on our GitHub.
81 replies · 131 reposts · 1.6K likes · 148.6K views
pat retweeted
OpenAI Newsroom@OpenAINewsroom·
We've reached an agreement to acquire Astral. After we close, OpenAI plans for @astral_sh to join our Codex team, with a continued focus on building great tools and advancing the shared mission of making developers more productive. openai.com/index/openai-t…
486 replies · 828 reposts · 7.3K likes · 4M views
pat retweeted
Sam Altman@sama·
I have so much gratitude to people who wrote extremely complex software character-by-character. It already feels difficult to remember how much effort it really took. Thank you for getting us to this point.
4.5K replies · 2.2K reposts · 35.9K likes · 5.5M views
pat retweeted
DailyPapers@HuggingPapers·
Visual-ERM: an 8B multimodal reward model that judges visual equivalence in the rendered space for charts, tables & SVGs. It beats 235B models on fine-grained discrepancy detection and improves vision-to-code RL by +8.4 points.
[image]
3 replies · 5 reposts · 35 likes · 2.2K views
pat retweeted
Alham Fikri Aji@AlhamFikri·
VLMs can easily get distracted by unrelated cultural cues. Happy to present our work on this soon at #CVPR2026🥳 Working on multilingual VLMs? Consider using our benchmark: 📜arxiv.org/pdf/2511.17004 🤗huggingface.co/datasets/patri… Amazing work by @patrickamadeus_ and colleagues!
[image]
pat@patrickamadeus_

Excited to share that we have committed our paper “Vision-Language Models are Confused Tourists” to #CVPR2026 (Findings)! 🇺🇸🏔

Arxiv: arxiv.org/abs/2511.17004

We question whether current SOTA VLMs remain robust in simple cultural grounding QA when distracting contextual objects are present. For example, if you eat chicken schnitzel with Mt. Fuji in the background, will the model fail to recognize it as Japanese katsu?

ConfusedTourists introduces:
👉 5k+ evaluation samples across 3 cultural item categories, comprising 243 unique cultural items from 57 countries and 11 sub-regions 🌍
👉 Evaluation of 14 VLMs across 12 data features 🤖
👉 Findings showing that simple concept mixing can cause up to a -40% drop in performance 📉

Special thanks to my co-authors @IkhlasulHanif0, @emthehunt, @gentaiscool, @FajriKoto, and my advisor @AlhamFikri for the valuable contributions along the way! #multimodal #vlm #multicultural #robustness #evaluation #NLProc #ComputerVision

2 replies · 18 reposts · 72 likes · 7.2K views
pat@patrickamadeus_·
[original post of the “Vision-Language Models are Confused Tourists” announcement quoted above]
[3 images]
5 replies · 14 reposts · 42 likes · 9.7K views
pat@patrickamadeus_·
@zmkzmkz Thank you!
0 replies · 0 reposts · 1 like · 85 views
pat retweeted
Hanif | AI NOT FOR PRODUCTIVITY@IkhlasulHanif0·
Check out our work! (CVPR Findings 2026)
pat@patrickamadeus_

[quoted: the “Vision-Language Models are Confused Tourists” announcement above]

0 replies · 1 repost · 6 likes · 229 views
pat@patrickamadeus_·
@erla_ndpg Yes! We found that the misattention pattern is very consistent with the wrongly guessed answer 🤔 Please take a look at our paper hook + more detailed threads here! x.com/i/status/20034…
[image]
pat@patrickamadeus_

Craving holiday-themed paper? Say less🎄 Turns out, Vision Language Models are Confused Tourists ✈️😵‍💫 We show that adversarially induced cultural scenes significantly impair VLM cultural comprehension and trigger potential bias #NLProc #multimodal #robustness /thread 🧵(1/8)

1 reply · 1 repost · 5 likes · 108 views
Edd@erla_ndpg·
Very interesting probes. What does the attention map look like when the model is given such probes?
pat@patrickamadeus_

[quoted: the “Vision-Language Models are Confused Tourists” announcement above]

1 reply · 0 reposts · 7 likes · 199 views
pat@patrickamadeus_·
@AlhamFikri Thank you master! 🫡
0 replies · 0 reposts · 2 likes · 148 views
pat@patrickamadeus_·
ch1: progress can take 2 forms, (horizontal / extensive / globalization) or (vertical / intensive / technology). Without new technology, globalization will just speed up catastrophe.

Startup thinking:
- question received ideas and rethink the venture from scratch
- big orgs move slowly; conversely, a lone genius can invent greatness but can't change a whole industry
- so work with other people to get stuff done, but stay small enough that you actually can
0 replies · 0 reposts · 0 likes · 29 views
pat@patrickamadeus_·
@naval zero to one - peter thiel (w/ blake masters)
1 reply · 0 reposts · 0 likes · 31 views
pat@patrickamadeus_·
@naval Happiness is emptiness of desire
0 replies · 0 reposts · 0 likes · 30 views
pat@patrickamadeus_·
@naval the almanack of naval ravikant - eric jorgenson
4 replies · 0 reposts · 0 likes · 86 views