Haoxuan You

12 posts

@XyouH

Research Scientist @ Apple AI/ML. Prev CS Ph.D. @Columbia University.

New York · Joined December 2018
50 Following · 395 Followers
Haoxuan You retweeted
Jiahui Yu @jhyuxm
Happy to share Muse Spark, a natively multimodal reasoning model w/ tool-use, visual chain of thought, and multi-agent orchestration! It’s been a fulfilling journey not just building the model, but the team and culture behind it. Now live in product. ai.meta.com/blog/introduci…
18
52
445
44.1K
Haoxuan You retweeted
Aran Komatsuzaki @arankomatsuzaki
Apple presents Manzano: Simple & scalable unified multimodal LLM • Hybrid vision tokenizer (continuous ↔ discrete) cuts task conflict • SOTA on text-rich benchmarks, competitive in gen vs GPT-4o/Nano Banana • One model for both understanding & generation • Joint recipe: pretrain + refine + SFT • Scales cleanly (300M → 30B) with consistent gains
4
63
317
50.6K
Haoxuan You @XyouH
@cihangxie Cihang, our team (Foundation Models at Apple) is hiring 2025 interns. Please see my post if they are interested!
2
0
3
470
Cihang Xie @cihangxie
Two of my 4th-year PhD students, Xianhang Li (xhl-video.github.io/xianhangli/) and Zeyu Wang (zw615.github.io), are seeking (possibly their last) internship opportunities. They are talented, hardworking and accomplished researchers who have been the leading contributors to several important works from our lab over the past few years, such as CLIPA, Recap-DataComp, and AdvXL. Their interests include multimodal LLMs, image/video generation, and AI safety. If you have related positions, please consider them 😉 You'll be amazed!
1
2
19
2.2K
Haoxuan You @XyouH
Looking for a 2025 summer research intern in the Foundation Model Team at Apple AI/ML, with a focus on Multimodal LLM / Vision-Language. PhD preferred. Apply through jobs.apple.com/en-us/details/… Also email your resume to me at haoxuanyou@gmail.com! 😊
17
69
429
56.8K
Haoxuan You retweeted
Zhe Gan @zhegan4
🚀🚀 Thrilled to share MM1.5! MM1.5 is a significant upgrade of MM1. With one single set of weights, MM1.5 excels at (1) reading your charts, tables, and any text-rich images, (2) understanding visual prompts like points and boxes and providing grounded outputs, and (3) multi-image reasoning. 🔥🔥 We also introduce two variants: (1) MM1.5-UI to understand your iPhone screen 📱, and (2) MM1.5-Video for video inputs 🎥. As a research study, we also share the detailed ablations that guided our research process (🧵)
5
29
180
27.6K
Haoxuan You retweeted
Ruoming Pang @ruomingpang
As Apple Intelligence is rolling out to our beta users today, we are proud to present a technical report on our Foundation Language Models that power these features on devices and cloud: machinelearning.apple.com/research/apple…. 🧵
13
185
700
161.1K
Harold @LiLiunian
Thank you and I am honored to receive the fellowship! Many thanks to Kai-Wei and my amazing collaborators and lab mates! Thank @GoogleAI for the generous support!
Kai-Wei Chang @kaiwei_chang

Congrats to @LiLiunian for winning the Google PhD Fellowship! 🎉🥳🎊 Harold led pioneering efforts in vision-language research, including developing notable models such as VisualBERT, GLIP, and the recently introduced Desco. He will be on the market this year! @uclanlp @UCLAengineering

9
2
58
8.5K
Haoxuan You @XyouH
Excited to present my summer internship work🍎! Ferret is a new multimodal LLM that can accurately understand any region in an image no matter how you refer to it, and precisely localize open-vocabulary descriptions in its output! It can beat GPT-4V very often in the above tasks!
Zhe Gan @zhegan4

🚀🚀Introducing Ferret, a new MLLM that can refer and ground anything anywhere at any granularity. 📰arxiv.org/abs/2310.07704 1⃣ Ferret enables referring of an image region at any shape 2⃣ It often shows better precise understanding of small image regions than GPT-4V (sec 5.6)

2
4
22
3K
Haoxuan You retweeted
Zhe Gan @zhegan4
🚀🚀Introducing Ferret, a new MLLM that can refer and ground anything anywhere at any granularity. 📰arxiv.org/abs/2310.07704 1⃣ Ferret enables referring of an image region at any shape 2⃣ It often shows better precise understanding of small image regions than GPT-4V (sec 5.6)
9
104
445
110.2K