Jaden Park

37 posts

@_jadenpark

CS Ph.D. student @UWMadison | prev. intern: @AdobeResearch; @Krafton_AI

Madison, WI · Joined October 2023
587 Following · 88 Followers
Pinned Tweet
Jaden Park @_jadenpark
Me: memorize past exams 📚💯 Also me: fail on a slight tweak 🤦‍♂️🤦‍♂️ Turns out, we can use the same method to 𝗱𝗲𝘁𝗲𝗰𝘁 𝗰𝗼𝗻𝘁𝗮𝗺𝗶𝗻𝗮𝘁𝗲𝗱 𝗩𝗟𝗠𝘀! 🧵(1/10) - Project Page: mm-semantic-perturbation.github.io
1 reply · 11 reposts · 28 likes · 5.5K views
Jaden Park @_jadenpark
@thaoshibe GPT told me a few times that it won't keep helping me if I keep being mean to it 😂
0 replies · 0 reposts · 1 like · 13 views
Thao Nguyen (Shibe) @thaoshibe
idk why but talking with codex is like talking with a senior dev / grumpy postdoc 🤣 he's knowledgeable but always very cold --- but claude (sonnet) always feels like he's your friend -- charismatic, sharp, and relatable 😊 🥲 sometimes i do open claude just to chitchat with him 🥲
2 replies · 0 reposts · 6 likes · 409 views
Jaden Park retweeted
Aniket Rege @wregss
🚨 New work with @Meta @RealityLabs: we introduce EGAgent, an agentic reasoning framework for very long video understanding, powered by entity scene graphs. Why? With long multimodal data streams, agents must search and reason across multiple modalities! 🧵 (1/n)
2 replies · 7 reposts · 17 likes · 1.4K views
Jaden Park retweeted
Harris Zhang @HyperStorm9682
New paper out! 🚨 Introducing STTS: Unified Spatio-Temporal Token Scoring for Efficient Video VLMs. We tackle the massive token bottleneck in video models by jointly identifying the tokens that actually matter. The overall figure below breaks down the core problem! 🧵👇
1 reply · 4 reposts · 15 likes · 2.7K views
Jaden Park retweeted
Aniket Rege @wregss
Hi ML Twitter! My Summer 2026 internship unfortunately fell through last minute 😵‍💫 If your team is looking for interns, I’d love to connect - RTs appreciated 🙏 My website: aniketrege.github.io
18 replies · 29 reposts · 267 likes · 30.9K views
Jaden Park @_jadenpark
There should be a meta-conference where reviewers are Claude Code. (1) Claude Code figures out how to run your code like TerminalBench. (2) Claude Code tries to run your code for 48 hours. If Claude Code can't beat your Table 1 in 2 days of vibe-research, it gets accepted ✅.
0 replies · 2 reposts · 3 likes · 138 views
Jaden Park retweeted
Hoyeon Chang @hoyeon_chang
1/ Pattern matching has been left ambiguous for so long. Our precise formalization pinpoints its predictive power on composition tasks! 📣 Check out our ICLR 2026 paper: 1️⃣ Sharp sample complexity bound for 2-hop task; 2️⃣ Some unexpectedly hard composition tasks: CoT won’t help.
1 reply · 11 reposts · 63 likes · 5.1K views
Jaden Park retweeted
Jongwon Jeong @jongwonjeong123
(1/7) 📈 3.5% → 29.5% average success on Sokoban! Even if the environment is perfectly known, LM agents can still (a) plan the wrong action or (b) execute a different action than intended. Under constraints, these errors compound, so how should we design LM agents? We propose TAPE 🧵
7 replies · 9 reposts · 31 likes · 1.8K views
Jaden Park @_jadenpark
Excited to share that our work on detecting data contamination in VLMs has been accepted to #ICLR2026! In v2 of our paper, we add - Detecting contamination with paraphrased data. - Detecting contamination in free-form QA. To learn more: x.com/_jadenpark/sta… See you in Rio🇧🇷
Jaden Park@_jadenpark

Me: memorize past exams 📚💯 Also me: fail on a slight tweak 🤦‍♂️🤦‍♂️ Turns out, we can use the same method to 𝗱𝗲𝘁𝗲𝗰𝘁 𝗰𝗼𝗻𝘁𝗮𝗺𝗶𝗻𝗮𝘁𝗲𝗱 𝗩𝗟𝗠𝘀! 🧵(1/10) - Project Page: mm-semantic-perturbation.github.io

0 replies · 0 reposts · 17 likes · 444 views
Ayoung Lee @o_cube01
Accepted to ICLR 2026!🎉 So grateful to my amazing collaborators 🫶 We introduce CLASH to evaluate value reasoning, revealing new failure modes in reasoning models and intriguing steerability results! 📰 Paper: arxiv.org/pdf/2504.10823
6 replies · 2 reposts · 62 likes · 4.7K views
Jaden Park @_jadenpark
@Kangwook_Lee Very insightful and cool (almost to a point of being counterintuitive) takeaway!
1 reply · 0 reposts · 2 likes · 286 views
Kangwook Lee @Kangwook_Lee
When should you NOT use LLM-as-a-judge at all? We have just uploaded an updated preprint on how to correctly report LLM-as-a-judge evaluations.

1. The key takeaway of the original preprint was -- there is no free lunch. Many people believe that since an LLM can tell whether a model's output is correct, one can use an "unlabeled" test set for evaluation; and since LLM inference is pretty cheap, one can use a huge unlabeled test set and not worry about estimation error. This is completely wrong. You canNOT avoid the need for a dataset that's guaranteed to be perfectly labeled. You either use this "labeled" test set to calibrate your LLM judge's error rates, or you use it to directly measure the accuracy of the model being tested.

2. I hope it's clear that one MUST have a labeled test set for one of those two purposes (direct accuracy evaluation or judge calibration). A natural follow-up question: "Which one is better?" Say I have 1,000 labeled test examples and infinitely many unlabeled test examples. Should I use the finite labeled set to directly measure accuracy, or use it as a judge calibration set and run the corrected LLM judge on the infinite unlabeled set?

Here's a very interesting intuition: it boils down to which parameter is easier to estimate. From basic statistics, anything closer to purely random is harder to estimate (it needs more samples). Accuracy is harder to estimate when it's close to 0.5, and the same holds for the judge's error rates: FPR and FNR are harder to estimate when they are close to 0.5.

This gives an interesting conclusion. If the model being tested has an accuracy close to 0.5, directly estimating its accuracy is very inefficient; in that case, if you have an LLM judge whose error rates are relatively easy to estimate, it's better to use the finite labeled set to calibrate the judge and report the corrected judge-based estimate. If the model's accuracy is close to 0 or 1, you're usually better off using the labeled set to directly estimate accuracy (unless the judge's error rates are far easier to estimate!).

This answers the very first question above -- when to use and when not to use LLM as a judge. See the diagram above: if your parameters fall in the gray region, use the labeled dataset to calibrate the LLM judge; otherwise, use it to directly estimate the accuracy. Enjoy :-)!

p.s. The updated preprint also has other new results, such as in-depth comparisons with PPI!
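The "closer to purely random is harder to estimate" point is just the Bernoulli variance formula; a minimal sketch (the sample size n = 1000 matches the thread's example, the 0.95 comparison point is an illustrative assumption):

```python
import math

def bernoulli_se(p, n):
    # Standard error of the empirical mean of n Bernoulli(p) samples.
    return math.sqrt(p * (1.0 - p) / n)

n = 1000
se_mid  = bernoulli_se(0.5,  n)   # hardest case: accuracy near 0.5
se_edge = bernoulli_se(0.95, n)   # easy case: accuracy near 0 or 1
```

With the same 1,000 labeled examples, the standard error at p = 0.5 (about 0.016) is more than double that at p = 0.95 (about 0.007), which is why a near-coin-flip model is exactly the case where spending the labeled set on judge calibration can pay off.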
14 replies · 45 reposts · 373 likes · 26.4K views
Jaden Park @_jadenpark
@peter9863 Very interesting work! (and great presentation btw! Love the poster)
0 replies · 0 reposts · 1 like · 165 views
Peter Lin @peter9863
Our research: Adversarial Flow Models (AF) arxiv.org/abs/2511.22475. AF unifies adversarial and flow models. Unlike GANs, AF learns optimal transport (stable). Unlike CMs, AF trains only on the timesteps it needs (saving capacity). We can train a super-deep 112-layer 1-NFE model! SOTA FIDs!
13 replies · 79 reposts · 564 likes · 41.8K views
Xueyan Zou @xyz2maureen
I will join Tsinghua University, College of AI, as an Assistant Professor in the coming month. I am actively looking for Spring 2026 interns and future PhDs (ping me if you are at #NeurIPS).

It has been an incredible journey of 10 years since I attended an activity organized by Tsinghua University and, inspired by one of my teammates, decided to change my undergraduate major from Economics to Computer Science. Over those 10 years, I met many wonderful researchers and professors whose support led me to continued growth. 🐿️

My research focus will continue to be AI & Robotics, with a specific emphasis on Interactive Embodied Intelligence. You can check my homepage to learn more: maureenzou.github.io/lab.html.

I am currently local to San Diego and will be attending #NeurIPS. Please ping me over WeChat or email if any old or new friends would like to have a coffee chat! (Really looking forward to meeting as many friends as possible at #NeurIPS.)

[The photo is one of the places I will miss a lot in the US.]
69 replies · 87 reposts · 1.1K likes · 111.1K views
Jaden Park retweeted
Kangwook Lee @Kangwook_Lee
LLM as a judge has become a dominant way to evaluate how good a model is at solving a task, since it works without a test set and handles cases where answers are not unique. But despite how widely it is used, almost all reported results are highly biased. Excited to share our preprint on how to properly use LLM as a judge. 🧵

=== So how do people actually use LLM as a judge? Most people just use the LLM as an evaluator and report the empirical probability that the LLM says the answer looks correct. When the LLM is perfect, this works fine and gives an unbiased estimator. If the LLM is not perfect, this breaks.

Consider a case where the LLM evaluates correctly 80 percent of the time: if the answer is correct, the LLM says "this looks correct" with 80 percent probability, and the same 80 percent applies when the answer is actually incorrect. In this situation, you should not report the empirical probability, because it is biased. Why? Let the true probability of the tested model being correct be p. Then the empirical probability q that the LLM says "correct" is

q = 0.8p + 0.2(1 - p) = 0.2 + 0.6p,

so the unbiased estimate should be (q - 0.2) / 0.6. Things get even more interesting if the error pattern is asymmetric or if you do not know these error rates a priori.

=== So what does this mean? First, follow the suggested guideline in our preprint. There is no free lunch: you cannot evaluate how good your model is unless your LLM judge is known to be perfect at judging it. Depending on how close it is to a perfect evaluator, you need a sufficiently large test set (= calibration set) to estimate the evaluator's error rates, and then you must correct for them.

Second, very unfortunately, many findings we have seen in papers over the past few years need to be revisited. Unless two papers used the exact same LLM as a judge, comparing results across them could have produced false claims. The improvement could simply come from changing the evaluation pipeline slightly. A rigorous meta-study is urgently needed.

=== tldr: (1) Almost all LLM-as-a-judge evaluations in the past few years were reported with a biased estimator. (2) It is easy to fix, so wait for our full preprint. (3) Many LLM-as-a-judge results should be taken with a grain of salt.

Full preprint coming in a few days, so stay tuned! Amazing work by my students and collaborators @chungpa_lee @tomzeng200 @jongwonjeong123 and @jysohn1108.
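The debiasing arithmetic in the thread inverts cleanly. A minimal Python sketch, assuming the thread's symmetric 80%-accurate judge (the true accuracy 0.7 and the sample size are illustrative assumptions):

```python
import random

def debias(q_hat, fpr, fnr):
    # Invert q = (1 - fnr) * p + fpr * (1 - p) to recover the true accuracy p.
    # For the symmetric 80% judge (fpr = fnr = 0.2): p = (q - 0.2) / 0.6.
    return (q_hat - fpr) / (1.0 - fpr - fnr)

random.seed(0)
p_true = 0.7           # true accuracy of the model being evaluated
n = 200_000            # unlabeled examples scored by the judge
says_correct = 0
for _ in range(n):
    correct = random.random() < p_true
    # symmetric judge: flips the verdict 20% of the time (FPR = FNR = 0.2)
    verdict = correct if random.random() < 0.8 else not correct
    says_correct += verdict

q_hat = says_correct / n         # naive estimate, biased toward 0.2 + 0.6 * 0.7 = 0.62
p_hat = debias(q_hat, 0.2, 0.2)  # corrected estimate, concentrates near 0.70
```

The naive report understates a 70%-accurate model as roughly 62%; the one-line correction recovers it. In practice FPR and FNR are unknown and must themselves be estimated from a labeled calibration set, which is the thread's point.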
47 replies · 174 reposts · 1.2K likes · 220.4K views
Jaden Park @_jadenpark
We also perform extensive ablation studies: (1) using real-world counterfactuals instead of synthetic perturbations (2) detecting contamination during pre-training (3) model sizes and much more. If this interests you, please check out our work: arxiv.org/abs/2511.03774 🧵(9/10)
1 reply · 0 reposts · 0 likes · 170 views