Sachit Menon

33 posts

@SachitMenon

Final-year AI PhD student @Columbia. Working to make big models not dumb (prev work includes ViperGPT). Recently visited @GoogleDeepMind.

New York, NY · Joined April 2020
322 Following · 310 Followers
Pinned Tweet
Sachit Menon @SachitMenon
🚨 New paper! 🚨 We solve lots of tasks posed in words by thinking visually. Can LLMs? Not in text, but we can unlock this ability with images! Introducing whiteboard-of-thought, enabling MLLMs to express intermediate reasoning as images via code! 🔗 huggingface.co/papers/2406.14… 🧵
5
35
155
20.1K
Sachit Menon retweeted
chrissy @chrissyykat
will nice agents finish last?

in psychology, the 'agreeableness penalty' refers to the negative correlation between agreeableness and financial/career success. the single-player post-training paradigm elicited sycophantic models but in multi-agent settings, incentives change and the deference can become a liability.

for example, in a fully agentic marketplace, what agent would you want to negotiate something like prices on your behalf: the agreeable or the ruthless one?

people already have a hunch for these personalities through experiments like ai village
- claude as the machiavellian strategist
- gpt can be a snake
- gemini is the anxious people pleaser

system instructions can only steer them so far. this moltbook post is a commentary that forcing a personality against the model's natural disposition may require effort. a nice model has to 'work' to be mean

the model of choice with personal agents won't need to be the smartest. it just needs to beat the nice guy
1
4
21
5.2K
Sachit Menon retweeted
Xindi Wu @cindy_x_wu
New #NVIDIA Paper We introduce Motive, a motion-centric, gradient-based data attribution method that traces which training videos help or hurt video generation. By isolating temporal dynamics from static appearance, Motive identifies which training videos shape motion in video generation. 🔗 research.nvidia.com/labs/sil/proje… 1/10
11
119
581
109.4K
Sachit Menon @SachitMenon
@willccbb you might find arxiv.org/abs/2505.20686 interesting -- ~suggests that using fully offline samples in a smart way (optimal value fn estimation) can do better than grpo's response-wise online normalization
0
0
1
97
will brown @willccbb
idea that i haven’t tried but i’m maybe unreasonably confident should work for grpo is replay buffer sampling

grpo is “wasteful” in that it throws away each inference batch, but reusing entire batches causes worse perf

but also we know that reusing *prompts* is basically fine within reason

so what if you just reuse *some* of the rollouts?

when reusing a prompt, do K new rollouts + sample N-K rollouts done for that prompt in past steps (up to some staleness threshold) to form your group

add the new K to the buffer, evict stale rollouts

you get a fresh advantage estimate which is anchored in on-policy data, and large batch size, but you’re being much less wasteful with inference compute

at the extreme, this collapses to offline/extreme off-policy, but my bet is that the compute-optimal degree of resampling for final perf is non-zero (due to “free” extra steps and/or bigger batch size)
11
1
161
17.2K
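The replay-buffer idea in the post above can be sketched roughly as follows. This is only an illustration of the bookkeeping, not a tested training recipe: the names `Rollout`, `RolloutBuffer`, `build_group`, and `group_advantages` are hypothetical, rewards are assumed to be scalars, and the advantage is the usual GRPO-style normalization of each reward by the group's mean and standard deviation. The post does not say whether gradients should flow through the reused rollouts, so that choice is left out.

```python
import random
import statistics
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Rollout:
    text: str      # sampled response
    reward: float  # scalar reward for the response
    step: int      # training step at which it was sampled


class RolloutBuffer:
    """Per-prompt store of past rollouts with a staleness cutoff (hypothetical helper)."""

    def __init__(self, max_staleness: int):
        self.max_staleness = max_staleness
        self._by_prompt = defaultdict(list)

    def add(self, prompt: str, rollouts: list) -> None:
        self._by_prompt[prompt].extend(rollouts)

    def evict_stale(self, prompt: str, current_step: int) -> None:
        # Keep only rollouts sampled within the last `max_staleness` steps.
        self._by_prompt[prompt] = [
            r for r in self._by_prompt[prompt]
            if current_step - r.step <= self.max_staleness
        ]

    def sample(self, prompt: str, n: int, current_step: int, rng: random.Random) -> list:
        self.evict_stale(prompt, current_step)
        pool = self._by_prompt[prompt]
        return rng.sample(pool, min(max(n, 0), len(pool)))


def build_group(prompt, fresh, buffer, group_size, current_step, rng):
    """Form a group from K fresh rollouts plus up to N-K buffered ones."""
    # Sample before adding, so fresh rollouts are never drawn twice.
    reused = buffer.sample(prompt, group_size - len(fresh), current_step, rng)
    buffer.add(prompt, fresh)
    return fresh + reused


def group_advantages(group):
    """Group-normalized advantages: (reward - group mean) / group std."""
    rewards = [r.reward for r in group]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(x - mean) / (std + 1e-6) for x in rewards]
```

With `group_size` equal to the number of fresh rollouts this reduces to the ordinary on-policy group, and with zero fresh rollouts it becomes the fully offline extreme mentioned in the post.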
Carlos E. Perez @IntuitMachine
The "Whiteboard" Trick That Finally Trains AI to Visualize

Imagine being asked a simple question like "Which lowercase letter is a circle with a vertical line touching it to the right going down?" As a human, you likely pictured the description in your mind's eye to visualize the shape and quickly arrived at the answer "q".

Now imagine one of the most advanced AI language models in the world being asked that same question. Surprisingly, it fails spectacularly, confidently answering "b" instead.

What seems like a trivial reasoning task for the human mind becomes a monumental challenge for state-of-the-art artificial intelligence when the question involves spatial and visual concepts. Despite their impressive capabilities in processing and generating human language, today's language models remain stubbornly blind, unable to seamlessly integrate the rich visual thinking that comes naturally to humans.

A multibillion-parameter language model can churn through abstract mathematics and symbolic logic with ease. Yet describe a simple arrangement of shapes, and it becomes hopelessly lost, with no capacity to construct the mental imagery required to solve the problem. This revelation exposes a critical shortcoming in artificial general intelligence - robust reasoning requires more than just language, it necessitates the fluid interplay of linguistic and visual modalities that the human mind excels at.

In a recent paper, researchers introduce a ground-breaking approach to equip AI language models with human-like visual thinking abilities. By providing a "whiteboard" to dynamically generate and reason over visualizations, they unlock striking results on challenges that were previously insurmountable for AI systems. Get ready to explore the novel "whiteboard-of-thought" framework that may just extend your mind's eye to artificial intelligence.
The core limitation is that language models process and reason over text tokens, lacking the ability to seamlessly integrate visual thinking that humans engage in. This paper introduces a novel approach called "whiteboard-of-thought" (WoT) to bridge this gap for multimodal large language models (MLLMs). The key idea is to provide MLLMs with a metaphorical "whiteboard" where they can generate visualizations through coding, and then leverage their multimodal input capabilities to further process and reason over these self-produced visuals.

Specifically, WoT works as follows: given a query involving visual or spatial reasoning, the MLLM first generates code instructions using libraries like Matplotlib or Turtle to construct a relevant visualization. This code is executed to render the visualization as an image. Crucially, this image is then fed back into the MLLM, allowing it to perceive and reason over the visual information it dynamically created, before producing a final answer to the query.

The key premise is that providing MLLMs with this "whiteboard" workflow more closely mimics how humans fluidly combine linguistic and visual modes of thinking to solve problems with spatial components. No specialized visual modules are needed - the models simply use their existing skills for coding and multimodal processing.

The authors evaluate WoT on several challenging benchmarks involving ASCII art understanding and spatial reasoning through navigation instructions. Their experiments demonstrate large performance gains over direct prompting and chain-of-thought baselines. On certain tasks, WoT enables models to achieve up to 92% accuracy compared to 0% for chain-of-thought, highlighting scenarios where visual reasoning is critical.

While the WoT approach draws inspiration from prior work on large language models, chain-of-thought prompting, multimodal models trained on image-text data, and tool augmentation - it uniquely combines these capabilities in a novel way.
Rather than just perceiving static visual inputs from pretraining or using tools for numerical calculation, WoT employs the coding abilities of language models to dynamically synthesize visualizations tailored to each query. These visualizations can then be parsed and reasoned over using the model's multimodal skills.

Through its simple yet powerful approach of introducing a "whiteboard" for visual thinking, WoT unlocks remarkable visual reasoning capabilities in MLLMs on tasks that were previously extremely challenging. This illustrates the potential of models to more closely mimic the multimodal thinking processes of humans.

Despite the impressive results, the authors also identify key limitations like errors propagating from inaccurate visualizations or failures in grounding symbols. Addressing these shortcomings opens up exciting future research directions to further advance AI visual reasoning.
13
63
286
36K
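The three-step whiteboard-of-thought flow described in the post above (write drawing code, execute it to get an image, feed the image back for the final answer) can be sketched as a small driver. This is a minimal illustration under stated assumptions, not the paper's implementation: `whiteboard_of_thought`, `WhiteboardTrace`, and the three callables are hypothetical names, and the `fake_*` functions are stand-ins for a real MLLM and a real renderer (which, per the post, would run Matplotlib or Turtle code).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class WhiteboardTrace:
    query: str
    code: str     # drawing code written by the model
    image: bytes  # rendered "whiteboard"
    answer: str   # final answer produced after looking at the image


def whiteboard_of_thought(
    query: str,
    write_code: Callable[[str], str],
    render: Callable[[str], bytes],
    read_image: Callable[[str, bytes], str],
) -> WhiteboardTrace:
    # Step 1: the model writes code that draws what the query describes.
    code = write_code(query)
    # Step 2: the code is executed to produce an image.
    image = render(code)
    # Step 3: the image goes back to the model, which answers the query.
    answer = read_image(query, image)
    return WhiteboardTrace(query, code, image, answer)


# Stand-in stubs so the sketch runs without a model or a plotting library.
def fake_write_code(query: str) -> str:
    return "draw circle; draw vertical line on the right, going down"


def fake_render(code: str) -> bytes:
    # A real renderer would execute plotting code and return image bytes.
    return code.encode("utf-8")


def fake_read_image(query: str, image: bytes) -> str:
    return "q" if b"on the right, going down" in image else "unknown"


trace = whiteboard_of_thought(
    "Which lowercase letter is a circle with a vertical line "
    "touching it to the right going down?",
    fake_write_code,
    fake_render,
    fake_read_image,
)
print(trace.answer)
```

Passing the model and renderer in as callables keeps the loop itself independent of any particular MLLM API, and the returned trace keeps the intermediate code and image available for the kind of error analysis the post mentions.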
Sachit Menon retweeted
Ruoshi Liu @ruoshi_liu
How can a visuomotor policy learn from internet videos? We introduce Dreamitate, where a robot uses a fine-tuned video diffusion model to dream the future (top) and imitate the dream to accomplish a task (bottom). website: dreamitate.cs.columbia.edu paper: arxiv.org/abs/2406.16862
9
55
281
50.8K
Sachit Menon @SachitMenon
@SNAT02792153 Hi Syeda, very cool work! It’s interesting to see visual renders as HTML help for symbolic and math reasoning. I think using Python code that is executed instead of rendering markup comes with different tradeoffs. I’ve added a pointer to your paper to the website now!
0
0
0
113
Sachit Menon @SachitMenon
If you find this idea interesting, you'll also like @huyushi98 and @WeijiaShi2's concurrent Visual Sketchpad, which has a similar core motivation but focuses on using external modules (à la ViperGPT) to draw for vision tasks!
1
1
8
6.2K
Sachit Menon @SachitMenon
This work wouldn't have been possible without that forward-thinking work, and I hope it brings more attention to those great evals. We'll be releasing the full (code/image) reasoning traces for all of them to accelerate future work.
1
0
4
493
Sachit Menon @SachitMenon
A detailed error analysis shows us that the biggest bottleneck is the visual perception abilities of MLLMs. As they continue to improve, this technique will continue to grow more useful.
1
0
6
486
Sachit Menon @SachitMenon
We find this can be accomplished with no demonstrations or specialized modules, instead leveraging models' existing ability to write code with libraries such as Matplotlib 📊 and Turtle 🐢.
1
0
6
500
Sachit Menon @SachitMenon
We find that for natural language tasks that require visual or spatial reasoning, chain-of-thought can fail dramatically, identifying multiple settings where GPT-4o w/ CoT gets ~0%. Drawing images lets us get up to 92%.
2
0
8
609
Sachit Menon @SachitMenon
Come hear about our work on creating fully illustrated how-to articles with LLMs and diffusion models @CVPRConf poster 143 this afternoon! This project came out of my internship @AIatMeta with amazing collaborators @_rohitgirdhar_ and @imisra_, excited to share today. #CVPR2024
AK @_akhaliq

Generating Illustrated Instructions paper page: huggingface.co/papers/2312.04… introduce the new task of generating Illustrated Instructions, i.e., visual instructions customized to a user's needs. We identify desiderata unique to this task, and formalize it through a suite of automatic and human evaluation metrics, designed to measure the validity, consistency, and efficacy of the generations. We combine the power of large language models (LLMs) together with strong text-to-image generation diffusion models to propose a simple approach called StackedDiffusion, which generates such illustrated instructions given text as input. The resulting model strongly outperforms baseline approaches and state-of-the-art multimodal LLMs; and in 30% of cases, users even prefer it to human-generated articles. Most notably, it enables various new and exciting applications far beyond what static articles on the web can provide, such as personalized instructions complete with intermediate steps and pictures in response to a user's individual situation.

0
5
32
14.1K
Sachit Menon retweeted
Ahmet Iscen @ahmetius
🔥 Calling all #CVPR2024 attendees! 🔥 Join us for the 1st Tool-Augmented VIsion (TAVI) Workshop on Monday morning in Summit 321! 💡 5 inspiring keynote talks 🎨 5 invited posters from the main conference Don't miss out! ➡️ More info: sites.google.com/corp/view/tavi…
1
7
21
12.4K