Harris Zhang

16 posts

@HyperStorm9682

PhD Student at UW-Madison in Computer Vision

Madison, WI, US · Joined August 2021
41 Following · 48 Followers
Harris Zhang @HyperStorm9682
📉 3. LoRA Finetune: Full multi-vector training is expensive. SMART acts as a highly efficient finetuning technique. By leveraging LoRA, you can convert ANY single-vector model into a multi-vector variant while saving at least 20% of compute! 🏆
1
1
3
257
Harris Zhang @HyperStorm9682
🚨 Your Embedding Model is SMARTer Than You Think! Single-vector models actually hide powerful multi-vector capabilities in their frozen hidden states. We introduce SMART, a framework that unlocks this ability for SoTA multimodal retrieval. 🧵👇 🔗 huggingface.co/papers/2605.24…
1
18
75
16K
Harris Zhang retweeted
Jaden Park @_jadenpark
We all knew LLM agents struggle to explore, but we had to eyeball it 👀. We couldn't measure exploration errors. Until now. 🗺️🤖 We built a policy-agnostic metric to quantify exploration and exploitation errors in LLM agents. Spoiler: Exploration error is what kills 📉 agent performance in our setting 👇🧵 (1/8)
1
17
31
2K
Harris Zhang @HyperStorm9682
@baifeng_shi Great paper, Baifeng! I also have a recent paper, Spatio-Temporal Token Scoring (arxiv.org/abs/2603.18004), where we prune tokens in both the ViT and the LLM. I'm astounded by how many tokens you can save! I've learned a lot from this work.
1
1
2
149
Harris Zhang @HyperStorm9682
Paper link: arxiv.org/abs/2603.18004 Huge thanks to the PRIOR team at Ai2! This paper would not have been possible without you all!
0
0
1
161
Harris Zhang @HyperStorm9682
The final pruning figure shows the result—static, redundant background tokens are dropped, while key actions are perfectly preserved. ✂️ By filtering out the noise, STTS significantly speeds up inference while maintaining high performance. Code is open-sourced! 🔥
1
0
0
245
Harris Zhang @HyperStorm9682
New paper out! 🚨 Introducing STTS: Unified Spatio-Temporal Token Scoring for Efficient Video VLMs. We tackle the massive token bottleneck in video models by jointly identifying the tokens that actually matter. The overall figure below breaks down the core problem! 🧵👇
1
4
18
4.6K
Harris Zhang retweeted
Zhengzhong Tu @_vztu
Dear @NeurIPSConf PCs, I don't understand why we still need reviewers and area chairs if PCs are finally going to take over and overturn the AC decision without providing any reason, whereby our weeks of effort spent on rebuttals (both authors and reviewers) have been ignored.
7
25
225
30.6K
Harris Zhang retweeted
Yong Jae Lee @yong_jae_lee
Here is the final decision for one of our NeurIPS D&B ACs-accepted-but-PCs-rejected papers, with the vague message mentioning some kind of ranking. Why was the ranking necessary? Venue capacity? If so, this sets a concerning precedent. @NeurIPSConf
Yong Jae Lee @yong_jae_lee

@yuyinzhou_cs @NeurIPSConf I have two D&B papers in the same situation: ACs recommended accept, but PCs overruled and rejected with the same exact vague reason that you got. They should at least provide a proper reason.

1
5
46
8.5K
Harris Zhang retweeted
Mu Cai @MuCai7
(1/N) Are current large multimodal models like #GPT4o really good at video understanding? 🚀 We are thrilled to introduce TemporalBench to examine temporal dynamics understanding for LMMs! TemporalBench reveals that even the SOTA LMM #GPT4o achieves only 38.5, far below human performance of 67.9. With high-quality human annotations, TemporalBench investigates (1) action order (changing the order); (2) action frequency (one time vs. two times); (3) action type (put vs. pull); (4) motion magnitude (slightly vs. intensively); (5) motion direction/orientation (forward vs. backward, circular vs. back-and-forth); (6) action effector (cutting with the left hand vs. cutting with the right hand). Explore TemporalBench: temporalbench.github.io
1
15
59
25.5K
Harris Zhang retweeted
Mu Cai @MuCai7
(1/N) All current video models understand videos poorly, even when the videos are less than 10 seconds long! The best model, GPT4o, achieves 35.0 while humans get 90.0 in group score. Existing LMMs severely struggle to distinguish temporal differences in Vinoground: vinoground.github.io
2
27
128
16.5K