Weiming Ren
@wmren993
CS PhD student @UWaterloo @UWCheritonCS
89 posts
Joined November 2023
145 Following · 132 Followers

Pinned Tweet
Weiming Ren @wmren993
1/ 🚀 We’re excited to share Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation! Tuna-2 is a native unified multimodal model that supports visual understanding, text-to-image generation, and image editing directly from pixel embeddings. 🐟✨
📄 Paper: arxiv.org/abs/2604.24763
🌐 Project: tuna-ai.org/tuna-2
💻 Code: github.com/facebookresear…
Most unified multimodal models still rely on pretrained vision encoders, which add architectural complexity and can create representation mismatches between understanding and generation. Tuna-2 asks a simple question: Do we still need vision encoders? 👀
Our answer is No! Tuna-2 has a completely encoder-free architecture, where images are processed directly by a unified transformer together with text tokens.
Take a glimpse at what our model can generate ↓ 🎨🖼️
[image]
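To make the "encoder-free" idea concrete, here is a minimal PyTorch sketch of a patch-embedding front end that feeds raw pixels straight into a transformer. This is not the Tuna-2 code; the patch size, hidden dimension, and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PixelPatchEmbed(nn.Module):
    """Hypothetical encoder-free front end: raw pixels -> patch tokens.

    No pretrained vision encoder and no VAE; a single linear projection
    (implemented as a strided convolution) turns pixel patches into tokens.
    """
    def __init__(self, patch_size=16, dim=1024):
        super().__init__()
        # Conv with kernel == stride is equivalent to patchify + linear.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images):               # (B, 3, H, W)
        x = self.proj(images)                # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)

# The resulting image tokens would be concatenated with text token embeddings
# and processed by one unified transformer for understanding and generation.
tokens = PixelPatchEmbed()(torch.randn(1, 3, 256, 256))
print(tokens.shape)  # torch.Size([1, 256, 1024])
```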
Weiming Ren @wmren993
@paranioar @Jacoed @andrew_n_carr Hi Haiwen, congrats on the SenseNova-U1 release! We started Tuna-2 last December and I believe my coauthor also shared some of our encoder-free UMM insights with you before NEO-Unify’s blog was out. Looking forward to NEO-Unify's tech report and happy to cite it in our revision!
Haiwen Diao @paranioar
@Jacoed @andrew_n_carr That is NEOv1, an understanding-only model. NEO-Unify (March 5) is our encoder-free unified model. Details: huggingface.co/blog/sensenova… However, NEO-Unify is not mentioned in Tuna-2 at all, even though I discussed some insights with the first author after NEO-Unify’s release…
Andrew Carr 🤸 @andrew_n_carr
history doesn't repeat, but lots of rhyming going on here
[images]
Weiming Ren retweeted
DailyPapers @HuggingPapers
Meta releases Tuna-2: an encoder-free multimodal model. It understands and generates images from raw pixels alone. No VAE. No vision encoder. Just patch embeddings. And it beats encoder-based models on fine-grained perception benchmarks.
[image]
Luc @lucrbvi
@wmren993 Thanks for your work, it's really cool! I was thinking about this as a "big and beautiful unified transformer" for all modalities
Weiming Ren @wmren993
6/ We further visualize attention maps to understand what Tuna-2 learns from end-to-end pixel-space training. Compared with encoder-based LMMs and previous Tuna variants, Tuna-2 shows more accurate and stable cross-modal alignment.
[image]
Rosinality @rosinality
Pixel-based unified understanding and generation model using JiT. Uses MAE for representation learning.
[image]
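For readers unfamiliar with MAE, here is a rough sketch of the masking step it refers to: hide most patch tokens and train the model to reconstruct them. The 75% mask ratio and helper name are assumptions, not details from the paper.

```python
import torch

def random_mask(tokens, mask_ratio=0.75):
    """MAE-style masking: keep a random subset of patch tokens per sample;
    the model is then trained to reconstruct the masked-out patches."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    order = torch.rand(B, N).argsort(dim=1)  # random permutation per sample
    keep_idx = order[:, :n_keep]             # indices of the visible tokens
    visible = tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep_idx                 # reconstruct the rest from these

visible, idx = random_mask(torch.randn(2, 256, 1024))
print(visible.shape)  # torch.Size([2, 64, 1024])
```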
Weiming Ren retweeted
Yuren Cong @CongYuren
1/ 🚀 Excited to announce Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation! We built an omni model that uses direct patch embedding layers on raw image inputs and achieves SOTA in multimodal understanding AND generation.
Paper: huggingface.co/papers/2604.24…
Code: github.com/facebookresear…
Thanks to all the co-authors! @__Johanan, @wmren993, @xiaoke_shawn_h, @ShoufaChen, @TianhongLi6, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, Tao Xiang, @WenhuChen, Ping Luo, @LukeZettlemoyer!
[image]
Weiming Ren retweeted
Cong Wei @CongWei1230
🚀 Introducing RationalRewards, a reasoning-based reward model that improves image generation (T2I and editing) quality at both training and test time.
🧠 We taught reward models to think before they score, leading to more reliable feedback and reduced reward hacking during RL fine-tuning.
🔁 At test time, the same RationalRewards model performs prompt tuning by iteratively generating, scoring with reasoning, and rewriting prompts.
Here’s the surprising part 👇 test-time prompt tuning matches, and sometimes beats, full RL fine-tuning, which costs 400 GPU hours. 🔥
💡 This suggests an alternative to parameter tuning: your diffusion model may already have strong capabilities; prompt tuning matters.
---
🧵 Fully open-source.
✅ Code + training recipes
✅ RationalRewards-8B model + data
✅ Full benchmarks + GCR loop
🔗 Website: tiger-ai-lab.github.io/RationalReward…
🔗 GitHub: github.com/TIGER-AI-Lab/R…
🤗 Models & Data: huggingface.co/collections/TI…
📄 Paper: huggingface.co/papers/2604.11…
---
#ImageGeneration #AIGC #RewardModels #RL #DiffusionModels #TestTimeScaling
[image]
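A sketch of the test-time loop described above (generate, score with reasoning, rewrite the prompt). The three callables are hypothetical stand-ins, not the released RationalRewards API.

```python
def test_time_prompt_tuning(prompt, generate, score_with_reasoning,
                            rewrite_prompt, steps=5):
    """Iteratively refine a prompt with a reasoning reward model.

    generate:             frozen diffusion model, prompt -> image (hypothetical)
    score_with_reasoning: reward model that critiques before scoring (hypothetical)
    rewrite_prompt:       edits the prompt using the critique (hypothetical)
    """
    best_image, best_score = None, float("-inf")
    for _ in range(steps):
        image = generate(prompt)
        score, critique = score_with_reasoning(image, prompt)
        if score > best_score:
            best_image, best_score = image, score
        prompt = rewrite_prompt(prompt, critique)  # no parameter updates anywhere
    return best_image, best_score
```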
Weiming Ren retweeted
Yuntian Deng @yuntiandeng
🚀 Launching ProgramAsWeights (PAW)! Define functions in English → PAW compiles them into tiny neural programs → Run locally like normal Python functions. A neural program combines discrete text + continuous LoRA to adapt a fixed small interpreter. 🔗 programasweights.com
[image]
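An illustration of the "discrete text + continuous LoRA" pairing the tweet describes. This is not the PAW API; the class and method names are invented purely for exposition.

```python
class NeuralProgram:
    """Hypothetical shape of a PAW-style neural program: an English spec
    (discrete) plus a small LoRA adapter (continuous) that together
    specialize one fixed, shared interpreter model."""

    def __init__(self, spec_text, lora_weights, interpreter):
        self.spec = spec_text            # the function's English definition
        self.lora = lora_weights         # tiny adapter learned at "compile" time
        self.interpreter = interpreter   # fixed small model shared by all programs

    def __call__(self, *args):
        # Apply the adapter, put the spec in context, call like a Python function.
        # `with_adapter` and `run` are invented names for illustration only.
        return self.interpreter.with_adapter(self.lora).run(self.spec, args)
```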
Weiming Ren retweeted
Dongfu Jiang @DongfuJiang
🚨 Introducing ClawBench: a benchmark for evaluating whether AI agents can actually complete everyday online tasks on the real web.
💡 We move beyond static HTML and sandbox replicas to 153 realistic tasks across 144 live websites, from booking flights and filling forms to submitting applications and completing purchases. The goal is simple: measure the gap between benchmark success and real-world usefulness.
📉 That gap is large: models that look strong on traditional web-agent benchmarks drop sharply on ClawBench. Claude Sonnet 4.6 gets 33.3%, and GPT-5.4 gets 6.5%.
🧪 Thanks to @ReacherZhang’s great work, we also make it easy to run:
uv pip install clawbench-eval
clawbench
🔒 Under the hood, agents interact with real websites, while we intercept only the final submission request to prevent real-world side effects.
🤗 HF Paper: huggingface.co/papers/2604.08…
🌐 Website: claw-bench.com
⚙️ GitHub: github.com/reacher-z/Claw…
🧵 More on what makes it hard, how evaluation works, and where current agents fail 👇:
#AI #Agents #WebAgents #Benchmark #LLM #OpenSource #Claw
[image]
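A toy illustration of the interception idea above: catch only the final submission for grading and let everything else reach the live site. The helper names are assumptions, not ClawBench internals.

```python
def guarded_send(request, send_live, is_final_submission):
    """Sketch of ClawBench-style interception: the agent browses the real
    web, but the one request with real-world side effects is captured for
    grading instead of being sent. Both callables are hypothetical."""
    if is_final_submission(request):
        return {"status": "intercepted", "payload": request}  # graded, never sent
    return send_live(request)  # ordinary browsing proceeds against the live site
```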
Weiming Ren retweeted
Zhuofeng Li @zhuofengli96475
🚀 The OpenResearcher paper is finally released! 🔥 We explore how to synthesize long-horizon research trajectories for deep-research agents: fully offline, scalable, and low-cost, without relying on live web APIs.
📄 huggingface.co/papers/2603.20…
🧩 Two key ideas:
Offline Corpus: one-time bootstrapping seeds 10K gold passages + a 15M-doc FineWeb corpus. 📚
Explicit Browsing Primitives: just 3 ops, search / open / find. The agent learns not just what to retrieve, but how to inspect docs and localize evidence at multiple scales. 🔎
📊 Results: 54.8% on BrowseComp-Plus with our 30B-A3B, #1 open-source under the same search-engine setup, beating much larger models like GPT-4.1, Claude-Opus-4, Gemini-2.5-Pro, and DeepSeek-R1.
💡 Insights: beyond accuracy, we dissect deep-research pipeline design, from data filtering and agent configuration to retrieval-accuracy dynamics (RQ1-RQ5).
Try it yourself:
🛠️ Code: github.com/TIGER-AI-Lab/O…
🤗 Models & data: huggingface.co/collections/TI…
🚀 Demo: huggingface.co/spaces/OpenRes…
#llms #agentic #deepresearch #tooluse #opensource #retrieval #SFT
[images]
Quoting Dongfu Jiang @DongfuJiang:
🚀 Introducing OpenResearcher: a fully offline pipeline for synthesizing 100+ turn deep-research trajectories. No search/scrape APIs, no rate limits, no nondeterminism.
💡 We use GPT-OSS-120B + a local retriever + a 10T-token corpus to generate long-horizon tool-use traces (search → open → find) that look like real browsing, but are free + reproducible.
📈 The payoff: SFT on these trajectories takes Nemotron-3-Nano-30B-A3B from 20.8% → 54.8% accuracy on BrowseComp-Plus (+34.0).
🧩 What makes it work?
🔎 Offline corpus = 15M FineWeb docs + 10K “gold” passages (bootstrapped once)
🧰 Explicit browsing primitives = better evidence-finding than “retrieve-and-read”
🎯 Rejection sampling = keep only successful long-horizon traces
🧵 And we’re releasing everything:
✅ code + search engine + corpus recipe
✅ 96K-ish trajectories + eval logs
✅ trained models + live demo
👨‍💻 GitHub: github.com/TIGER-AI-Lab/O…
🤗 Models & data: huggingface.co/collections/TI…
🚀 Demo: huggingface.co/spaces/OpenRes…
🔎 Eval logs: huggingface.co/datasets/OpenR…
#llms #agentic #deepresearch #tooluse #opensource #retrieval #SFT
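To ground the three primitives, here is a toy offline browser exposing search / open / find over a local corpus. The retriever interface and snippet logic are assumptions for illustration, not the released OpenResearcher code.

```python
class OfflineBrowser:
    """Toy version of the three browsing ops over a fully local corpus."""

    def __init__(self, corpus, retriever):
        self.corpus = corpus        # doc_id -> full document text
        self.retriever = retriever  # any local ranker, e.g. BM25 (assumed interface)

    def search(self, query, k=5):
        # Rank documents locally; no live web APIs, so runs are reproducible.
        return self.retriever.top_k(query, k)

    def open(self, doc_id):
        # Inspect a whole document from the offline corpus.
        return self.corpus[doc_id]

    def find(self, doc_id, keyword, window=200):
        # Localize evidence inside a document at a finer granularity.
        text = self.corpus[doc_id]
        pos = text.find(keyword)
        return "" if pos < 0 else text[max(0, pos - window): pos + window]
```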