Steven-Shine Chen

84 posts

@stevenshinechen

CS Master's Student at @MIT, previously @imperialcollege. Researching multimodal reasoning at the MIT @medialab

Joined December 2012

351 Following · 399 Followers

Pinned Tweet

Steven-Shine Chen@stevenshinechen·Apr 2

Current AI tutors are text-based, but humans rely on diagrams to reason through visual problems such as geometry. 🚨 Introducing Interactive Sketchpad, an AI tutor that combines hints with visualizations to help students solve problems. Paper & Code: stevenshinechen.github.io/interactiveske… 🧵1/6

Steven-Shine Chen retweeted

Chanakya Ekbote@thecekbote·Apr 11

🧵 [1/14]: Talking-Heads Attention by @NoamShazeer et al. showed something interesting: maybe attention heads shouldn’t be fully isolated. 🧠 That got us thinking: If communication across heads matters, what is the right way for heads to communicate, especially from a one-layer reasoning perspective? 🔗⚙️ That question led us to Interleaved Head Attention (IHA) ✨ 📄 Paper link: arxiv.org/pdf/2602.21371

Steven-Shine Chen@stevenshinechen·Mar 1

@louszbd @mingthemerxiles @steipete @Zai_org Hey Lou, thanks for the talk, cool to hear about what you guys are doing with 🦞!

Steven-Shine Chen@stevenshinechen·Feb 28

@GregKamradt Partial observability, imperfect information and multi-agent systems. E.g. if you treat other agents as part of the env dynamics, and their policies update in a way that you can only partially observe, then it is a non-stationary problem. There are also issues of multi-agent credit assignment.

Greg Kamradt@GregKamradt·Dec 24

@stevenshinechen What axis of non-stationary influence are you referring to? I know you mean more than proc gen and randomness.

Greg Kamradt@GregKamradt·Dec 23

I think about ARC Prize full time. Not just the benchmark but the whole org: competition, fundraising, events/launches, hiring, ops/G&A, content/socials, community. If you have feedback I want to hear it.

Gabriele Berton@gabriberton

Very mature and humble from the president of ARC-AGI to ask for feedback on its benchmarks. It's a strong signal that ARC-AGI could improve and hopefully won't "derail the quest for AGI" x.com/redtachyon/sta…

Steven-Shine Chen retweeted

Kimi.ai@Kimi_Moonshot·Feb 26

Supporting @MITEECS and @nlp_mit’s Multimodal Machine Learning course (Spring 2026). 🎓 Students are leveraging the multimodal capabilities of Kimi K2.5 to power their final research projects. We look forward to seeing the innovative applications that will emerge this semester. 🔗 mit-mi.github.io/mmai-course/sp… Happy coding! ✨

Steven-Shine Chen@stevenshinechen·Jan 28

Accepted for ICLR 2026🎉

Steven-Shine Chen@stevenshinechen

As test-time compute scales, we need evals for long-horizon, open-ended reasoning. Introducing PuzzleWorld 🧩, a multimodal puzzlehunt benchmark with human-annotated reasoning traces, testing diverse, creative reasoning. Paper: arxiv.org/abs/2506.06211 Data: github.com/MIT-MI/PuzzleW…

Steven-Shine Chen@stevenshinechen·Jan 28

Lots of potential in this - have some preliminary results showing good promise in this direction

will brown@willccbb

prompt optimization + context distillation are underexplored primitives for post-training pipelines imo

Steven-Shine Chen@stevenshinechen·Dec 8

@GregKamradt @arcprize @LiaoIsaac91893 @_albertgu Don’t know of other competitions that do this, but one way could be disallowing weights to be uploaded for the competition so that you provide training script in your submission and the weights are randomly initialised. Then manually inspect winners to check

Greg Kamradt@GregKamradt·Dec 7

@stevenshinechen @arcprize @LiaoIsaac91893 @_albertgu I’d love a separate track depending on priors and compute Do you know of other competitions that do this? How do we ensure everyone is honest about their submission?

Greg Kamradt@GregKamradt·Dec 6

We’re kicking off our 2025 retro. What should @arcprize be doing better? What do you want to see from us next year? How do we take a bigger swing?

ARC Prize@arcprize

Announcing the ARC Prize 2025 Top Score & Paper Award winners. The Grand Prize remains unclaimed. Our analysis on AGI progress marking 2025 the year of the refinement loop.

Steven-Shine Chen@stevenshinechen·Dec 7

@GregKamradt @arcprize e.g. would like to see methods like CompressARC from @LiaoIsaac91893 and @_albertgu compete separately. And for ARC 3 we could see approaches that generate many synthetic games and train on these offline - would like to see differences between this and pure online learning

Steven-Shine Chen@stevenshinechen·Dec 7

@GregKamradt @arcprize Perhaps a track without pre-training? A lot of advances are from using a lot of compute pre-evaluation and baking this into weights/data. So even though at eval time you have limited compute/data, there is no limit before eval starts, so this is used to reduce inference costs.

Steven-Shine Chen@stevenshinechen·Nov 28

@yacinelearning I agree, one thing I've been thinking about - what happened to the AlphaZero, MuZero style research? DeepMind seems to have pivoted to LLMs/VLAs even though I feel there's still a lot of untapped potential in exploring non language based game agents

Yacine Mahdid@yacinelearning·Nov 28

one thing I've come to realize is that hype around a specific research area can literally kill multiple others indirectly. all the funding, talent, and discussion just go into that one hyped-up research area and everything else withers. not sure how I feel about it

Steven-Shine Chen@stevenshinechen·Nov 26

Pre-training is dead. We can't keep scaling pre-training data or even RL envs - they are all curated by humans. AI needs to take actions which maximise its own learning. Like humans, AI systems need to be curious and take actions not to maximise reward, but to improve their own world model.

Dwarkesh Patel@dwarkesh_sp

The @ilyasut episode
0:00:00 – Explaining model jaggedness
0:09:39 – Emotions and value functions
0:18:49 – What are we scaling?
0:25:13 – Why humans generalize better than models
0:35:45 – Straight-shotting superintelligence
0:46:47 – SSI’s model will learn from deployment
0:55:07 – Alignment
1:18:13 – “We are squarely an age of research company”
1:29:23 – Self-play and multi-agent
1:32:42 – Research taste
Look up Dwarkesh Podcast on YouTube, Apple Podcasts, or Spotify. Enjoy!

Steven-Shine Chen retweeted

Chanakya Ekbote@thecekbote·Nov 12

How do we teach LLMs not just to reason, but to reflect, debug, and improve themselves? We at AWS AI Labs introduce MURPHY 🤖, a multi-turn RL framework that brings self-correction into #RLVR (#GRPO). 🧵👇 Link: arxiv.org/abs/2511.07833

Steven-Shine Chen@stevenshinechen·Sep 18

@ddvd233 Congrats!

@ddvd233·Sep 18

First NeurIPS oral in my life!

Steven-Shine Chen@stevenshinechen·Sep 6

MIT already found a simple way to cut LLM hallucinations:
- Ask the model for a confidence score
- Reward the model for saying "I don't know" (low confidence on wrong answers)
Accuracy stays high with fewer hallucinations: arxiv.org/abs/2507.16806

Rohan Paul@rohanpaul_ai

OpenAI released a new paper, "Why language models hallucinate". Simple answer: LLMs hallucinate because training and evaluation reward guessing instead of admitting uncertainty. The paper puts this on a statistical footing with simple, test-like incentives that reward confident wrong answers over honest “I don’t know” responses. The fix is to grade differently: give credit for appropriate uncertainty and penalize confident errors more than abstentions, so models stop being optimized for blind guessing. OpenAI is showing that 52% abstention gives substantially fewer wrong answers than 1% abstention, proving that letting a model admit uncertainty reduces hallucinations even if accuracy looks lower. Abstention means the model refuses to answer when it is unsure and simply says something like “I don’t know” instead of making up a guess. Hallucinations drop because most wrong answers come from bad guesses. If the model abstains instead of guessing, it produces fewer false answers. 🧵 Read on 👇
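
The grading change described in the quoted post can be illustrated with a small sketch. It is not taken from either paper: the function names, the use of `None` to represent "I don't know", and the `wrong_penalty` parameter are all assumptions chosen for illustration. It shows why binary grading always favors guessing, while penalizing wrong answers makes abstaining the better choice below a confidence threshold.

```python
from typing import Optional


def grade(answer: Optional[str], truth: str, wrong_penalty: float = 1.0) -> float:
    """Score one response: +1 if correct, 0 if abstaining, -wrong_penalty if wrong.

    `answer=None` is a hypothetical stand-in for an "I don't know" response.
    """
    if answer is None:
        return 0.0
    return 1.0 if answer == truth else -wrong_penalty


def expected_score_if_guessing(p_correct: float, wrong_penalty: float = 1.0) -> float:
    """Expected score of answering when the model is right with probability p_correct."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float = 1.0) -> bool:
    """Answer only if guessing beats the abstention score of 0."""
    return expected_score_if_guessing(p_correct, wrong_penalty) > 0.0


if __name__ == "__main__":
    # Binary grading (no penalty): guessing at 30% confidence still pays off.
    print(should_answer(0.3, wrong_penalty=0.0))
    # Penalized grading: at 30% confidence, abstaining is the better choice.
    print(should_answer(0.3, wrong_penalty=1.0))
```

Under this assumed scheme the break-even confidence is `wrong_penalty / (1 + wrong_penalty)`, so a penalty of 1 means answering only pays off above 50% confidence, while a penalty of 0 (plain accuracy) rewards guessing at any nonzero chance of being right.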

Steven-Shine Chen@stevenshinechen·Sep 3

@CULLYAntoine @imperialcollege Congrats Antoine! Always a joy learning from you, fully deserved!

Antoine Cully@CULLYAntoine·Sep 3

Almost exactly 10 years after joining @imperialcollege as a Postdoc, I am honoured to announce that I am now Professor in Machine Learning and Robotics! 👨‍🎓 🤖 My fantastic team found the best gift to celebrate this special occasion!

Steven-Shine Chen retweeted

Megan Tjandrasuwita@mmtjandrasuwita·Jun 21

Most problems have clear-cut instructions: solve for x, find the next number, choose the right answer. Puzzlehunts don’t. They demand creativity and lateral thinking. We introduce PuzzleWorld: a new benchmark of puzzlehunt problems challenging models to think creatively.

Steven-Shine Chen@stevenshinechen·Sep 3

Paul Liang@pliang279

Since my undergraduate days at CMU, I've been participating in puzzlehunts: involving complex, multi-step puzzles, lacking well-defined problem definitions, with creative and subtle hints and esoteric world knowledge, requiring language, spatial, and sometimes even physical interaction. These are major challenges for humans, requiring expert teams hours or even days to solve, and even greater challenges for AI. I'm excited to release our research endeavors towards benchmarking and building AI for solving puzzles! Our first step is PuzzleWorld: a new benchmark of puzzlehunt problems challenging models to think creatively with language, spatial, and physical reasoning. AI that can successfully solve puzzles has direct impact on education, logic, scientific discovery, and more. Paper - arxiv.org/abs/2506.06211 Dataset - github.com/MIT-MI/PuzzleW… See the full thread by @mmtjandrasuwita for more details!

Steven-Shine Chen@stevenshinechen·Sep 2

@ddvd233 reward=len(sources)

@ddvd233·Sep 2

Claude Research just found 313 sources in one go... Is it now the trend to compete over who has the most sources?

Steven-Shine Chen retweeted

Paul Liang@pliang279·Aug 27

A bit late, but finally got around to posting the recorded and edited lecture videos for the **How to AI (Almost) Anything** course I taught at MIT in spring 2025. Youtube playlist: youtube.com/watch?v=0MYt0u… Course website and materials: mit-mi.github.io/how2ai-course/… Today's AI can be applied to almost anything - from language to vision, audio, sensors, medical data, music, art, smell, and taste. This course covers the principles of AI (focusing on deep learning and foundation models), how we can apply AI to novel real-world data modalities, and multimodal AI that can process many modalities at once, such as connecting language and multimedia, music and art, sensing and actuation, and more.
