Weiming Ren
@wmren993
CS PhD student @UWaterloo @UWCheritonCS
89 posts
Joined November 2023
145 Following · 132 Followers

Pinned Tweet
Weiming Ren @wmren993
1/ 🚀 We’re excited to share Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation! Tuna-2 is a native unified multimodal model that supports visual understanding, text-to-image generation, and image editing directly from pixel embeddings. 🐟✨
📄 Paper: arxiv.org/abs/2604.24763
🌐 Project: tuna-ai.org/tuna-2
💻 Code: github.com/facebookresear…
Most unified multimodal models still rely on pretrained vision encoders, which add architectural complexity and can create representation mismatches between understanding and generation. Tuna-2 asks a simple question: Do we still need vision encoders? 👀
Our answer is No! Tuna-2 has a completely encoder-free architecture, where images are processed directly by a unified transformer together with text tokens.
Take a glimpse at what our model can generate ↓ 🎨🖼️
[image]
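To make the "encoder-free" idea concrete, here is a minimal PyTorch sketch of a patch-embedding front end that feeds raw pixels straight into a transformer. This is not the Tuna-2 code; the patch size, hidden dimension, and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PixelPatchEmbed(nn.Module):
    """Hypothetical encoder-free front end: raw pixels -> patch tokens.

    No pretrained vision encoder and no VAE; a single linear projection
    (implemented as a strided convolution) turns pixel patches into tokens.
    """
    def __init__(self, patch_size=16, dim=1024):
        super().__init__()
        # Conv with kernel == stride is equivalent to patchify + linear.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images):               # (B, 3, H, W)
        x = self.proj(images)                # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)

# The resulting image tokens would be concatenated with text token embeddings
# and processed by one unified transformer for understanding and generation.
tokens = PixelPatchEmbed()(torch.randn(1, 3, 256, 256))
print(tokens.shape)  # torch.Size([1, 256, 1024])
```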
Weiming Ren @wmren993
@paranioar @Jacoed @andrew_n_carr Hi Haiwen, congrats on the SenseNova-U1 release! We started Tuna-2 last December and I believe my coauthor also shared some of our encoder-free UMM insights with you before NEO-Unify’s blog was out. Looking forward to NEO-Unify's tech report and happy to cite it in our revision!
Haiwen Diao @paranioar
@Jacoed @andrew_n_carr That is NEOv1, an understanding-only model. NEO-Unify (March 5) is our encoder-free unified model. Details: huggingface.co/blog/sensenova… However, NEO-Unify is not mentioned in Tuna-2 at all, even though I discussed some insights with the first author after NEO-Unify’s release…
Andrew Carr 🤸 @andrew_n_carr
history doesn't repeat, but lots of rhyming going on here
[images]
Weiming Ren retweeted
DailyPapers @HuggingPapers
Meta releases Tuna-2: an encoder-free multimodal model. It understands and generates images from raw pixels alone. No VAE. No vision encoder. Just patch embeddings. And it beats encoder-based models on fine-grained perception benchmarks.
[image]
Luc @lucrbvi
@wmren993 Thanks for your work, it's really cool! I was thinking about this as a "big and beautiful unified transformer" for all modalities
Weiming Ren @wmren993
6/ We further visualize attention maps to understand what Tuna-2 learns from end-to-end pixel-space training. Compared with encoder-based LMMs and previous Tuna variants, Tuna-2 shows more accurate and stable cross-modal alignment.
[image]
Rosinality @rosinality
Pixel-based unified understanding and generation model using JiT. Uses MAE for representation learning.
[image]
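For readers unfamiliar with MAE, here is a rough sketch of the masking step it refers to: hide most patch tokens and train the model to reconstruct them. The 75% mask ratio and helper name are assumptions, not details from the paper.

```python
import torch

def random_mask(tokens, mask_ratio=0.75):
    """MAE-style masking: keep a random subset of patch tokens per sample;
    the model is then trained to reconstruct the masked-out patches."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    order = torch.rand(B, N).argsort(dim=1)  # random permutation per sample
    keep_idx = order[:, :n_keep]             # indices of the visible tokens
    visible = tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep_idx                 # reconstruct the rest from these

visible, idx = random_mask(torch.randn(2, 256, 1024))
print(visible.shape)  # torch.Size([2, 64, 1024])
```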
Weiming Ren retweeted
Yuren Cong @CongYuren
1/ 🚀 Excited to announce Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation! We built an omni model that uses direct patch embedding layers on raw image inputs and achieves SOTA in multimodal understanding AND generation.
Paper: huggingface.co/papers/2604.24…
Code: github.com/facebookresear…
Thanks to all the co-authors! @__Johanan, @wmren993, @xiaoke_shawn_h, @ShoufaChen, @TianhongLi6, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, Tao Xiang, @WenhuChen, Ping Luo, @LukeZettlemoyer!
[image]
Weiming Ren retweeted
Cong Wei @CongWei1230
🚀 Introducing RationalRewards, a reasoning-based reward model that improves image generation (T2I and editing) quality at both training and test time.
🧠 We taught reward models to think before they score, leading to more reliable feedback and reduced reward hacking during RL fine-tuning.
🔁 At test time, the same RationalRewards model performs prompt tuning by iteratively generating, scoring with reasoning, and rewriting prompts.
Here’s the surprising part 👇 test-time prompt tuning matches, and sometimes beats, full RL fine-tuning, which costs 400 GPU hours. 🔥
💡 This suggests an alternative to parameter tuning: your diffusion model may already have strong capabilities; prompt tuning matters.
---
🧵 Fully open-source.
✅ Code + training recipes
✅ RationalRewards-8B model + data
✅ Full benchmarks + GCR loop
🔗 Website: tiger-ai-lab.github.io/RationalReward…
🔗 GitHub: github.com/TIGER-AI-Lab/R…
🤗 Models & Data: huggingface.co/collections/TI…
📄 Paper: huggingface.co/papers/2604.11…
---
#ImageGeneration #AIGC #RewardModels #RL #DiffusionModels #TestTimeScaling
[image]
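A sketch of the test-time loop described above (generate, score with reasoning, rewrite the prompt). The three callables are hypothetical stand-ins, not the released RationalRewards API.

```python
def test_time_prompt_tuning(prompt, generate, score_with_reasoning,
                            rewrite_prompt, steps=5):
    """Iteratively refine a prompt with a reasoning reward model.

    generate:             frozen diffusion model, prompt -> image (hypothetical)
    score_with_reasoning: reward model that critiques before scoring (hypothetical)
    rewrite_prompt:       edits the prompt using the critique (hypothetical)
    """
    best_image, best_score = None, float("-inf")
    for _ in range(steps):
        image = generate(prompt)
        score, critique = score_with_reasoning(image, prompt)
        if score > best_score:
            best_image, best_score = image, score
        prompt = rewrite_prompt(prompt, critique)  # no parameter updates anywhere
    return best_image, best_score
```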
Weiming Ren retweeted
Yuntian Deng @yuntiandeng
🚀 Launching ProgramAsWeights (PAW)! Define functions in English → PAW compiles them into tiny neural programs → Run locally like normal Python functions. A neural program combines discrete text + continuous LoRA to adapt a fixed small interpreter. 🔗 programasweights.com
[image]
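An illustration of the "discrete text + continuous LoRA" pairing the tweet describes. This is not the PAW API; the class and method names are invented purely for exposition.

```python
class NeuralProgram:
    """Hypothetical shape of a PAW-style neural program: an English spec
    (discrete) plus a small LoRA adapter (continuous) that together
    specialize one fixed, shared interpreter model."""

    def __init__(self, spec_text, lora_weights, interpreter):
        self.spec = spec_text            # the function's English definition
        self.lora = lora_weights         # tiny adapter learned at "compile" time
        self.interpreter = interpreter   # fixed small model shared by all programs

    def __call__(self, *args):
        # Apply the adapter, put the spec in context, call like a Python function.
        # `with_adapter` and `run` are invented names for illustration only.
        return self.interpreter.with_adapter(self.lora).run(self.spec, args)
```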
Weiming Ren retweeted
Dongfu Jiang @DongfuJiang
🚨 Introducing ClawBench: a benchmark for evaluating whether AI agents can actually complete everyday online tasks on the real web.
💡 We move beyond static HTML and sandbox replicas to 153 realistic tasks across 144 live websites, from booking flights and filling forms to submitting applications and completing purchases. The goal is simple: measure the gap between benchmark success and real-world usefulness.
📉 That gap is large: models that look strong on traditional web-agent benchmarks drop sharply on ClawBench. Claude Sonnet 4.6 gets 33.3%, and GPT-5.4 gets 6.5%.
🧪 Thanks to @ReacherZhang’s great work, we also make it easy to run:
uv pip install clawbench-eval
clawbench
🔒 Under the hood, agents interact with real websites, while we intercept only the final submission request to prevent real-world side effects.
🤗 HF Paper: huggingface.co/papers/2604.08…
🌐 Website: claw-bench.com
⚙️ GitHub: github.com/reacher-z/Claw…
🧵 More on what makes it hard, how evaluation works, and where current agents fail 👇:
#AI #Agents #WebAgents #Benchmark #LLM #OpenSource #Claw
[image]
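A toy illustration of the interception idea above: catch only the final submission for grading and let everything else reach the live site. The helper names are assumptions, not ClawBench internals.

```python
def guarded_send(request, send_live, is_final_submission):
    """Sketch of ClawBench-style interception: the agent browses the real
    web, but the one request with real-world side effects is captured for
    grading instead of being sent. Both callables are hypothetical."""
    if is_final_submission(request):
        return {"status": "intercepted", "payload": request}  # graded, never sent
    return send_live(request)  # ordinary browsing proceeds against the live site
```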
Weiming Ren retweeted
Zhuofeng Li @zhuofengli96475
🚀 The OpenResearcher paper is finally released! 🔥 We explore how to synthesize long-horizon research trajectories for deep-research agents: fully offline, scalable, and low-cost, without relying on live web APIs.
📄 huggingface.co/papers/2603.20…
🧩 Two key ideas:
Offline Corpus: one-time bootstrapping seeds 10K gold passages + a 15M-doc FineWeb corpus. 📚
Explicit Browsing Primitives: just 3 ops, search / open / find. The agent learns not just what to retrieve, but how to inspect docs and localize evidence at multiple scales. 🔎
📊 Results: 54.8% on BrowseComp-Plus with our 30B-A3B, #1 open-source under the same search-engine setup, beating much larger models like GPT-4.1, Claude-Opus-4, Gemini-2.5-Pro, and DeepSeek-R1.
💡 Insights: beyond accuracy, we dissect deep-research pipeline design, from data filtering and agent configuration to retrieval-accuracy dynamics (RQ1-RQ5).
Try it yourself:
🛠️ Code: github.com/TIGER-AI-Lab/O…
🤗 Models & data: huggingface.co/collections/TI…
🚀 Demo: huggingface.co/spaces/OpenRes…
#llms #agentic #deepresearch #tooluse #opensource #retrieval #SFT
[images]
Quoting Dongfu Jiang @DongfuJiang:
🚀 Introducing OpenResearcher: a fully offline pipeline for synthesizing 100+ turn deep-research trajectories. No search/scrape APIs, no rate limits, no nondeterminism.
💡 We use GPT-OSS-120B + a local retriever + a 10T-token corpus to generate long-horizon tool-use traces (search → open → find) that look like real browsing, but are free + reproducible.
📈 The payoff: SFT on these trajectories takes Nemotron-3-Nano-30B-A3B from 20.8% → 54.8% accuracy on BrowseComp-Plus (+34.0).
🧩 What makes it work?
🔎 Offline corpus = 15M FineWeb docs + 10K “gold” passages (bootstrapped once)
🧰 Explicit browsing primitives = better evidence-finding than “retrieve-and-read”
🎯 Rejection sampling = keep only successful long-horizon traces
🧵 And we’re releasing everything:
✅ code + search engine + corpus recipe
✅ 96K-ish trajectories + eval logs
✅ trained models + live demo
👨‍💻 GitHub: github.com/TIGER-AI-Lab/O…
🤗 Models & data: huggingface.co/collections/TI…
🚀 Demo: huggingface.co/spaces/OpenRes…
🔎 Eval logs: huggingface.co/datasets/OpenR…
#llms #agentic #deepresearch #tooluse #opensource #retrieval #SFT
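To ground the three primitives, here is a toy offline browser exposing search / open / find over a local corpus. The retriever interface and snippet logic are assumptions for illustration, not the released OpenResearcher code.

```python
class OfflineBrowser:
    """Toy version of the three browsing ops over a fully local corpus."""

    def __init__(self, corpus, retriever):
        self.corpus = corpus        # doc_id -> full document text
        self.retriever = retriever  # any local ranker, e.g. BM25 (assumed interface)

    def search(self, query, k=5):
        # Rank documents locally; no live web APIs, so runs are reproducible.
        return self.retriever.top_k(query, k)

    def open(self, doc_id):
        # Inspect a whole document from the offline corpus.
        return self.corpus[doc_id]

    def find(self, doc_id, keyword, window=200):
        # Localize evidence inside a document at a finer granularity.
        text = self.corpus[doc_id]
        pos = text.find(keyword)
        return "" if pos < 0 else text[max(0, pos - window): pos + window]
```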