Harris Zhang

16 posts

@HyperStorm9682

PhD Student at UW-Madison in Computer Vision

Madison, WI, US · Joined August 2021

41 Following · 48 Followers

Harris Zhang@HyperStorm9682·1d

💻 Stop discarding the fine-grained local evidence in your token sequences! SMART gives you the efficiency of a single-vector retriever with the richness of multi-vector. Code and weights are fully open-sourced: github.com/HanSolo9682/SM… huggingface.co/collections/Ha…

Harris Zhang@HyperStorm9682·1d

📉 3. LoRA Finetune: Full multi-vector training is expensive. SMART acts as a highly efficient finetuning technique. By leveraging LoRA, you can convert ANY single-vector model into a multi-vector variant while saving at least 20% of compute! 🏆

Harris Zhang@HyperStorm9682·1d

🚨 Your Embedding Model is SMARTer Than You Think! Single-vector models actually hide powerful multi-vector capabilities in their frozen hidden states. We introduce SMART, a framework that unlocks this ability for SoTA multimodal retrieval. 🧵👇 🔗 huggingface.co/papers/2605.24…

Harris Zhang retweeted

Jaden Park@_jadenpark·17 Apr

We all knew LLM agents struggle to explore, but we had to eyeball it 👀. We couldn't measure exploration errors. Until now. 🗺️🤖 We built a policy-agnostic metric to quantify exploration and exploitation errors in LLM agents. Spoiler: Exploration error is what kills 📉 agent performance in our setting 👇🧵 (1/8)

Harris Zhang@HyperStorm9682·25 Mar

@baifeng_shi Great paper, Baifeng! I also have a recent paper, Spatio-Temporal Token Scoring (arxiv.org/abs/2603.18004), where we likewise prune tokens in both the ViT and the LLM. I'm astounded by how many tokens you can save! I've learned a lot from this work.

Baifeng@baifeng_shi·24 Mar

Humans can see in high-res, high-FPS in real-time. Why can't VLMs? Introducing AutoGaze: ViTs/VLMs "gaze" only at key video regions! Up to 4-100x token savings, 19x speedup, and enables scaling to 4K-res 1K-frame videos. 📄 arxiv.org/abs/2603.12254 🌐 autogaze.github.io 🤗 huggingface.co/collections/bf… (1/n)🧵

Harris Zhang@HyperStorm9682·19 Mar

Paper link: arxiv.org/abs/2603.18004 Huge thanks to everyone on the PRIOR team at Ai2! This paper would not have been possible without you all!

Harris Zhang@HyperStorm9682·19 Mar

The final pruning figure shows the result—static, redundant background tokens are dropped, while key actions are perfectly preserved. ✂️ By filtering out the noise, STTS significantly speeds up inference while maintaining high performance. Code is open-sourced! 🔥

Harris Zhang@HyperStorm9682·19 Mar

New paper out! 🚨 Introducing STTS: Unified Spatio-Temporal Token Scoring for Efficient Video VLMs. We tackle the massive token bottleneck in video models by jointly identifying the tokens that actually matter. The overall figure below breaks down the core problem! 🧵👇

Harris Zhang@HyperStorm9682·17 Dec

Super glad to be a part of the Molmo2 project! Was able to train a couple of variants and experiment with modeling along the way. What a great effort from our team!

Ai2@allen_ai

Molmo 2 doesn't just answer questions about clips—it searches & points. The model returns coordinates & timestamps over videos + images, powering QA, counting, dense captioning, artifact detection, & subtitle-aware analysis. You can see exactly how it reasoned.

Harris Zhang retweeted

Zhengzhong Tu@_vztu·18 Sep

Dear @NeurIPSConf PCs, I don't understand why we still need reviewers and area chairs if PCs are finally going to take over and overturn the AC decision without providing any reason, whereby our weeks of effort spent on rebuttals (both authors and reviewers) have been ignored.

Harris Zhang retweeted

Yong Jae Lee@yong_jae_lee·19 Sep

Here is the final decision for one of our NeurIPS D&B ACs-accepted-but-PCs-rejected papers, with the vague message mentioning some kind of ranking. Why was the ranking necessary? Venue capacity? If so, this sets a concerning precedent. @NeurIPSConf

Yong Jae Lee@yong_jae_lee

@yuyinzhou_cs @NeurIPSConf I have two D&B papers in the same situation: ACs recommended accept, but PCs overruled and rejected with the same exact vague reason that you got. They should at least provide a proper reason.

Harris Zhang retweeted

Mu Cai@MuCai7·15 Oct

(1/N) Are current large multimodal models like #GPT4o really good at video understanding? 🚀 We are thrilled to introduce TemporalBench to examine temporal dynamics understanding for LMMs! Our TemporalBench reveals that even the SOTA LMM #GPT4o achieves only 38.5, far from the human performance of 67.9. With high-quality human annotations, our TemporalBench investigates (1) action order (change the order); (2) action frequency (one time vs. two times); (3) action type (put vs. pull); (4) motion magnitude (slightly vs. intensively); (5) motion direction/orientation (forward vs. backward, circular vs. back-and-forth); (6) action effector (cutting with left hand vs. cutting with right hand). Explore TemporalBench: temporalbench.github.io

Harris Zhang retweeted

Mu Cai@MuCai7·4 Oct

(1/N) All current video models understand videos poorly, even when the videos are less than 10 seconds long! The best model, GPT4o, achieves 35.0 while humans get 90.0 in group score. Existing LMMs struggle severely to distinguish temporal differences in Vinoground: vinoground.github.io
