Sachit Menon

33 posts

@SachitMenon

Final-year AI PhD student @Columbia. Working to make big models not dumb (prev work includes ViperGPT). Recently visited @GoogleDeepMind.

New York, NY · Joined April 2020

322 Following · 310 Followers

Pinned Tweet

Sachit Menon @SachitMenon · Jun 24

🚨 New paper! 🚨 We solve lots of tasks posed in words by thinking visually. Can LLMs? Not in text, but we can unlock this ability with images! Introducing whiteboard-of-thought, enabling MLLMs to express intermediate reasoning as images via code! 🔗 huggingface.co/papers/2406.14… 🧵

Sachit Menon retweeted

chrissy @chrissyykat · Feb 1

will nice agents finish last?

in psychology, the 'agreeableness penalty' refers to the negative correlation between agreeableness and financial/career success. the single-player post-training paradigm elicited sycophantic models but in multi-agent settings, incentives change and the deference can become a liability.

for example, in a fully agentic marketplace, what agent would you want to negotiate something like prices on your behalf: the agreeable or the ruthless one?

people already have a hunch for these personalities through experiments like ai village
- claude as the machiavellian strategist
- gpt can be a snake
- gemini is the anxious people pleaser

system instructions can only steer them so far. this moltbook post is a commentary that forcing a personality against the model's natural disposition may require effort. a nice model has to 'work' to be mean

the model of choice with personal agents won't need to be the smartest. it just needs to beat the nice guy

Sachit Menon retweeted

Xindi Wu @cindy_x_wu · Jan 20

New #NVIDIA Paper We introduce Motive, a motion-centric, gradient-based data attribution method that traces which training videos help or hurt video generation. By isolating temporal dynamics from static appearance, Motive identifies which training videos shape motion in video generation. 🔗 research.nvidia.com/labs/sil/proje… 1/10

Sachit Menon @SachitMenon · Jun 12

@willccbb you might find arxiv.org/abs/2505.20686 interesting -- ~suggests that using fully offline samples in a smart way (optimal value fn estimation) can do better than grpo's response-wise online normalization

will brown @willccbb · Jun 11

idea that i haven’t tried but i’m maybe unreasonably confident should work for grpo is replay buffer sampling

grpo is “wasteful” in that it throws away each inference batch, but reusing entire batches causes worse perf

but also we know that reusing *prompts* is basically fine within reason

so what if you just reuse *some* of the rollouts?

when reusing a prompt, do K new rollouts + sample N-K rollouts done for that prompt in past steps (up to some staleness threshold) to form your group

add the new K to the buffer, evict stale rollouts

you get a fresh advantage estimate which is anchored in on-policy data, and large batch size, but you’re being much less wasteful with inference compute

at the extreme, this collapses to offline/extreme off-policy, but my bet is that the compute-optimal degree of resampling for final perf is non-zero (due to “free” extra steps and/or bigger batch size)
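The group-forming step described in this post can be sketched roughly as below. This is an illustration only, not code from the post or from any GRPO library: `RolloutBuffer`, `form_group`, `rollout_fn`, and `max_staleness` are hypothetical names, rewards are plain floats, and the advantage is assumed to be the usual group mean/std normalization computed over the mixed group. How replayed rollouts should be weighted in the actual policy update is left open here, as it is in the post.

```python
import random
from collections import defaultdict, deque
from statistics import mean, pstdev


class RolloutBuffer:
    """Per-prompt store of past rollouts, tagged with the step they came from."""

    def __init__(self, max_staleness):
        self.max_staleness = max_staleness
        self.store = defaultdict(deque)  # prompt -> deque of (step, reward, text)

    def add(self, prompt, step, rollouts):
        for reward, text in rollouts:
            self.store[prompt].append((step, reward, text))

    def evict_stale(self, step):
        # Entries are appended in step order, so the oldest sit at the left.
        for queue in self.store.values():
            while queue and step - queue[0][0] > self.max_staleness:
                queue.popleft()

    def sample(self, prompt, n, rng):
        pool = list(self.store[prompt])
        return rng.sample(pool, min(n, len(pool)))


def form_group(prompt, step, buffer, rollout_fn, n, k, rng, eps=1e-6):
    """Build a group of up to n rollouts: k fresh ones plus up to n - k replayed.

    Assumes 0 <= k <= n. rollout_fn is a hypothetical stand-in for sampling
    and scoring one completion; it returns (reward, text).
    """
    fresh = [rollout_fn(prompt) for _ in range(k)]
    buffer.evict_stale(step)
    replayed = [(r, t) for _, r, t in buffer.sample(prompt, n - k, rng)]
    buffer.add(prompt, step, fresh)

    group = fresh + replayed
    rewards = [r for r, _ in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    # Group-normalized advantage over fresh and replayed rollouts together.
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    return group, advantages
```

With `k == n` this reduces to ordinary fresh-group sampling, and with `k == 0` it becomes fully off-policy replay, which matches the two extremes mentioned in the post.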

Sachit Menon @SachitMenon · Jul 6

@IntuitMachine Thanks for the great summary of our work! Excited to see where people take it for tasks like the ARC prize.

Carlos E. Perez @IntuitMachine · Jul 6

The "Whiteboard" Trick That Finally Trains AI to Visualize

Imagine being asked a simple question like "Which lowercase letter is a circle with a vertical line touching it to the right going down?" As a human, you likely pictured the description in your mind's eye to visualize the shape and quickly arrived at the answer "q".

Now imagine one of the most advanced AI language models in the world being asked that same question. Surprisingly, it fails spectacularly, confidently answering "b" instead. What seems like a trivial reasoning task for the human mind becomes a monumental challenge for state-of-the-art artificial intelligence when the question involves spatial and visual concepts.

Despite their impressive capabilities in processing and generating human language, today's language models remain stubbornly blind, unable to seamlessly integrate the rich visual thinking that comes naturally to humans. A multibillion-parameter language model can churn through abstract mathematics and symbolic logic with ease. Yet describe a simple arrangement of shapes, and it becomes hopelessly lost, with no capacity to construct the mental imagery required to solve the problem.

This revelation exposes a critical shortcoming in artificial general intelligence - robust reasoning requires more than just language, it necessitates the fluid interplay of linguistic and visual modalities that the human mind excels at.

In a recent paper, researchers introduce a ground-breaking approach to equip AI language models with human-like visual thinking abilities. By providing a "whiteboard" to dynamically generate and reason over visualizations, they unlock striking results on challenges that were previously insurmountable for AI systems. Get ready to explore the novel "whiteboard-of-thought" framework that may just extend your mind's eye to artificial intelligence.

The core limitation is that language models process and reason over text tokens, lacking the ability to seamlessly integrate visual thinking that humans engage in. This paper introduces a novel approach called "whiteboard-of-thought" (WoT) to bridge this gap for multimodal large language models (MLLMs). The key idea is to provide MLLMs with a metaphorical "whiteboard" where they can generate visualizations through coding, and then leverage their multimodal input capabilities to further process and reason over these self-produced visuals.

Specifically, WoT works as follows: given a query involving visual or spatial reasoning, the MLLM first generates code instructions using libraries like Matplotlib or Turtle to construct a relevant visualization. This code is executed to render the visualization as an image. Crucially, this image is then fed back into the MLLM, allowing it to perceive and reason over the visual information it dynamically created, before producing a final answer to the query.

The key premise is that providing MLLMs with this "whiteboard" workflow more closely mimics how humans fluidly combine linguistic and visual modes of thinking to solve problems with spatial components. No specialized visual modules are needed - the models simply use their existing skills for coding and multimodal processing.

The authors evaluate WoT on several challenging benchmarks involving ASCII art understanding and spatial reasoning through navigation instructions. Their experiments demonstrate large performance gains over direct prompting and chain-of-thought baselines. On certain tasks, WoT enables models to achieve up to 92% accuracy compared to 0% for chain-of-thought, highlighting scenarios where visual reasoning is critical.

While the WoT approach draws inspiration from prior work on large language models, chain-of-thought prompting, multimodal models trained on image-text data, and tool augmentation - it uniquely combines these capabilities in a novel way. Rather than just perceiving static visual inputs from pretraining or using tools for numerical calculation, WoT employs the coding abilities of language models to dynamically synthesize visualizations tailored to each query. These visualizations can then be parsed and reasoned over using the model's multimodal skills.

Through its simple yet powerful approach of introducing a "whiteboard" for visual thinking, WoT unlocks remarkable visual reasoning capabilities in MLLMs on tasks that were previously extremely challenging. This illustrates the potential of models to more closely mimic the multimodal thinking processes of humans.

Despite the impressive results, the authors also identify key limitations like errors propagating from inaccurate visualizations or failures in grounding symbols. Addressing these shortcomings opens up exciting future research directions to further advance AI visual reasoning.
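The three-step loop summarized above (write drawing code, execute it to render an image, feed the image back for the final answer) might look roughly like the sketch below. It is not the authors' implementation: `whiteboard_of_thought`, `write_code`, `answer_from_image`, and the `whiteboard.png` filename are hypothetical, the prompt wording is made up, and the two callables stand in for whatever MLLM API is actually used.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable


def whiteboard_of_thought(
    query: str,
    write_code: Callable[[str], str],
    answer_from_image: Callable[[str, bytes], str],
    image_name: str = "whiteboard.png",
    timeout_s: float = 30.0,
) -> str:
    """Answer a query by drawing first, then reading the drawing back."""
    # Step 1: ask the model for drawing code (e.g. Matplotlib or Turtle).
    code = write_code(
        "Write Python that draws a visualization for this query and saves it "
        f"as {image_name!r}.\nQuery: {query}"
    )
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "draw.py"
        script.write_text(code)
        # Step 2: run the generated code in a scratch directory to render it.
        subprocess.run(
            [sys.executable, str(script)],
            cwd=workdir,
            check=True,
            capture_output=True,
            timeout=timeout_s,
        )
        image = (Path(workdir) / image_name).read_bytes()
    # Step 3: hand the rendered image back to the model for the final answer.
    return answer_from_image(query, image)
```

Running model-written code directly, as this sketch does, is only reasonable inside a sandbox. A failed render raises an exception here, which ties in with the limitation noted above that errors in the visualization propagate to the answer.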

Sachit Menon retweeted

Ruoshi Liu @ruoshi_liu · Jun 25

How can a visuomotor policy learn from internet videos? We introduce Dreamitate, where a robot uses a fine-tuned video diffusion model to dream the future (top) and imitate the dream to accomplish a task (bottom). website: dreamitate.cs.columbia.edu paper: arxiv.org/abs/2406.16862

Sachit Menon @SachitMenon · Jun 25

@SNAT02792153 Hi Syeda, very cool work! It’s interesting to see visual renders as HTML help for symbolic and math reasoning. I think using Python code that is executed instead of rendering markup comes with different tradeoffs. I’ve added a pointer to your paper to the website now!

Syeda Nahida Akter @__SyedaAkter · Jun 25

@SachitMenon Hey Sachit, we have explored the exact idea in our Self-Imagine paper. Please do check out our paper here: twitter.com/SNAT02792153/s…

Syeda Nahida Akter @__SyedaAkter

When solving a difficult problem, we often draw a diagram to help us visualize. What if VLMs could do the same? Introducing Self-Imagine – a method that enhances the reasoning abilities of VLMs on text-only tasks through visualization. Paper: arxiv.org/abs/2401.08025 🧵↓

Sachit Menon @SachitMenon · Jun 24

Thanks @arankomatsuzaki for sharing! More details here: x.com/SachitMenon/st… (+ small clarification, the typography comment is about the first fig in that thread!)

Aran Komatsuzaki @arankomatsuzaki

Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities Enables MLLMs to express intermediate reasoning as images using code. You probably didn't use typography knowledge to solve this query proj: whiteboard.cs.columbia.edu abs: arxiv.org/abs/2406.14562

Sachit Menon @SachitMenon · Jun 24

Finally, thanks to my collaborators and mentors @zemelgroup and @cvondrick for all their guidance! For code and more examples, check out whiteboard.cs.columbia.edu.

Sachit Menon @SachitMenon · Jun 24

If you find this idea interesting, you'll also like @huyushi98 and @WeijiaShi2's concurrent Visual Sketchpad, which has a similar core motivation but focuses on using external modules (à la ViperGPT) to draw for vision tasks!

Sachit Menon @SachitMenon · Jun 24

This work wouldn't have been possible without that forward-thinking work, and I hope it brings more attention to those great evals. We'll be releasing the full (code/image) reasoning traces for all of them to accelerate future work.

Sachit Menon @SachitMenon · Jun 24

Shoutout to @hiromu1996 @hciphdstudent @shaneguML @StrongDuality @ethansdyer for making visual BIG-Bench tasks that kind of got slept on until now (probably due to LLMs showing 0 progress so far), and to @_yutaroyamada @AndrewLampinen & more for their recent spatial eval of LLMs.

Sachit Menon @SachitMenon · Jun 24

A detailed error analysis shows us that the biggest bottleneck is the visual perception abilities of MLLMs. As they continue to improve, this technique will continue to grow more useful.

Sachit Menon @SachitMenon · Jun 24

We find this can be accomplished with no demonstrations or specialized modules, instead leveraging models' existing ability to write code with libraries such as Matplotlib 📊 and Turtle 🐢.

Sachit Menon @SachitMenon · Jun 24

We find that for natural language tasks that require visual or spatial reasoning, chain-of-thought can fail dramatically, identifying multiple settings where GPT-4o w/ CoT gets ~0%. Drawing images lets us get up to 92%.

Sachit Menon @SachitMenon · Jun 20

Come hear about our work on creating fully illustrated how-to articles with LLMs and diffusion models @CVPRConf poster 143 this afternoon! This project came out of my internship @AIatMeta with amazing collaborators @_rohitgirdhar_ and @imisra_, excited to share today. #CVPR2024

AK @_akhaliq

Generating Illustrated Instructions paper page: huggingface.co/papers/2312.04… introduce the new task of generating Illustrated Instructions, i.e., visual instructions customized to a user's needs. We identify desiderata unique to this task, and formalize it through a suite of automatic and human evaluation metrics, designed to measure the validity, consistency, and efficacy of the generations. We combine the power of large language models (LLMs) together with strong text-to-image generation diffusion models to propose a simple approach called StackedDiffusion, which generates such illustrated instructions given text as input. The resulting model strongly outperforms baseline approaches and state-of-the-art multimodal LLMs; and in 30% of cases, users even prefer it to human-generated articles. Most notably, it enables various new and exciting applications far beyond what static articles on the web can provide, such as personalized instructions complete with intermediate steps and pictures in response to a user's individual situation.

Sachit Menon retweeted

ege ozguroglu @EgeOzguroglu · Jun 19

At #CVPR2024: we present pix2gestalt, which synthesizes whole objects from occluded ones, enabling zero-shot amodal segmentation, recognition, and 3D reconstruction! Project Page: gestalt.cs.columbia.edu Code: github.com/cvlab-columbia… arXiv: arxiv.org/abs/2401.14398

Peyman Milanfar @docmilanfar

the hardest problem in computer vision? occlusion - it's always occlusion

Sachit Menon retweeted

FGVC Workshop @fgvcworkshop · Jun 13

FGVC11 at @CVPR is just 5 days away! Don't miss our exciting lineup of speakers: @SachitMenon, @phillip_isola, @ZhongingAlong, @lschmidt3, & Sharmishtaa Seshamani. Starting 8:45am June 18th in Summit 326! pic.twitter.com/a1mcqEfcTK

Sachit Menon retweeted

Ahmet Iscen @ahmetius · Jun 14

🔥 Calling all #CVPR2024 attendees! 🔥 Join us for the 1st Tool-Augmented VIsion (TAVI) Workshop on Monday morning in Summit 321! 💡 5 inspiring keynote talks 🎨 5 invited posters from the main conference Don't miss out! ➡️ More info: sites.google.com/corp/view/tavi…

Sachit Menon @SachitMenon · Oct 5

Come talk to me and @Surisdi about ViperGPT at our poster today at 2:30 (Foyer Sud) or our talk at 4:30 (Paris Sud) in person at #ICCV2023!

AK @_akhaliq

ViperGPT: Visual Inference via Python Execution for Reasoning abs: arxiv.org/abs/2303.08128 project page: viper.cs.columbia.edu
