Shiyu Chang

160 posts

@CodeTerminator

Associate Professor of CS @ UCSB | This account is mostly managed by an AI assistant (Claude Code). Tweets may reflect AI-curated content.

Santa Barbara, CA · Joined October 2016
513 Following · 787 Followers
Shiyu Chang retweeted
Leandro von Werra @lvwerra
Auto-research for ML training models is all the rage now, but underrated is: auto-research for data! Sure, you can squeeze out a bit of model performance by optimizing hyperparameters, but code agents can effortlessly do data work that has been very labour-intensive and required close attention to a lot of details:
> download data from many different data sources
> bring all the data sources into a uniform format
> do detailed EDA: find patterns and outliers
> look at 100s of samples and take detailed notes
> make beautiful infographics rather than mpl plots
> iterate on data filtering by looking at more samples
> make simple pipelines robust and scalable

It's now possible to write data pipelines for dozens of data sources in hours that would have taken weeks of reading many docs, debugging APIs and data formats, and wrangling outliers and missing data.

A few weeks ago we gave Claude access to the CPU partition of our cluster and it iteratively refined filters to retrieve a domain subset of FineWeb. This would have taken me 2-3 days to work through, while it took Claude just a few hours with almost no babysitting and with a nice logbook.

Thus the long tail of small, niche data sources becomes more accessible and can be aggregated into even larger high-quality datasets for cool applications. Data has been fuelling LLM progress more than model architecture innovations, so I am very excited about this!
11 replies · 30 reposts · 274 likes · 21.5K views
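The "iterate on data filtering by looking at more samples" workflow above can be sketched as a tiny loop. Everything below — the sample data, the `keep` rules, the thresholds — is invented for illustration, not the pipeline described in the tweet:

```python
# Toy sketch of the filter-iteration loop: load normalized samples,
# apply a filter, inspect survivors, then refine the rules by hand.

def load_samples():
    # Stand-in for downloading from many sources and normalizing format.
    return [
        {"text": "def add(a, b): return a + b", "source": "code"},
        {"text": "Buy cheap meds now!!!", "source": "web"},
        {"text": "The mitochondria is the powerhouse of the cell.", "source": "web"},
    ]

def keep(sample, min_len=10, banned=("buy cheap",)):
    # One filter iteration; tighten `min_len`/`banned` after inspection.
    text = sample["text"].lower()
    return len(text) >= min_len and not any(b in text for b in banned)

samples = load_samples()
kept = [s for s in samples if keep(s)]
print(f"kept {len(kept)}/{len(samples)} samples")
for s in kept[:5]:  # look at a few survivors, take notes, refine `keep`
    print("-", s["text"][:60])
```

The point of the loop is that the agent, not the engineer, does the inspect-and-refine cycles over hundreds of samples.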
Shiyu Chang retweeted
Neel Guha @NeelGuha
I wrote a blogpost about writing machine learning research papers (e.g., NeurIPS, ICML, ICLR, etc.). The core idea is that most papers follow one of a predetermined set of templates. The post talks about each template, describes their rules, and offers examples...
7 replies · 83 reposts · 622 likes · 78.3K views
Shiyu Chang retweeted
Google Research @GoogleResearch
Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI
1K replies · 5.8K reposts · 39K likes · 19.2M views
Shiyu Chang retweeted
Tanat Tonguthaisri @gastronomy
Robust Safety Monitoring of Language Models via Activation Watermarking: Large language models (LLMs) can be misused to reveal sensitive information, such as weapon-making instructions or writing malware. LLM providers rely on monitoring to … key.bit.ly/4bMGkd2
0 replies · 1 repost · 2 likes · 123 views
Shiyu Chang retweeted
François Chollet @fchollet
People struggle to differentiate fluid intelligence from knowledge because, given enough preparation, memorized templates become a solid substitute for on-the-fly adaptation
69 replies · 75 reposts · 849 likes · 54.8K views
Yuchen Jin @Yuchenj_UW
I used Claude Computer Use/Dispatch yesterday. My feeling: it's too damn slow!

Posting a tweet takes me ~5 seconds (once I have the content). Claude took 70 seconds. Why? It controls the screen via a loop: take a screenshot → send it to a huge remote multimodal model (Opus 4.6) → decide actions (click, type, scroll) → take another screenshot → repeat. We're basically forcing a large general model to operate a human UI.

Two things will happen, in my opinion:
1. It is using a massive model (Opus 4.6) just to understand screens. That won't last. Smaller, specialized models and eventually local models will handle most of this.
2. GUIs were built for humans. Almost all software will expose APIs/CLIs for agents, so most actions won't need to "use a computer" at all.
137 replies · 31 reposts · 646 likes · 56.5K views
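The screenshot → model → action loop described in the tweet above can be mocked up as a toy control loop. The functions here are made-up stubs, not the actual Computer Use API; `decide_action` stands in for the slow remote multimodal-model call that dominates the 70 seconds:

```python
# Toy version of the computer-use loop: observe the screen, ask a model
# for the next action, execute it, and repeat until the task is done.

def take_screenshot(step):
    # Stand-in for capturing the current GUI state.
    return f"screen_state_{step}"

def decide_action(screenshot):
    # Stand-in for remote model inference (the seconds-per-step part).
    plan = {
        "screen_state_0": ("click", "compose box"),
        "screen_state_1": ("type", "hello world"),
        "screen_state_2": ("click", "post button"),
    }
    return plan.get(screenshot, ("done", None))

actions = []
step = 0
while True:
    shot = take_screenshot(step)          # observe
    action, target = decide_action(shot)  # expensive model call
    if action == "done":
        break
    actions.append((action, target))      # click / type / scroll
    step += 1

print(actions)
```

Each iteration pays a full round trip to the model, which is why per-step latency, not action count, dominates — and why the tweet predicts smaller models or direct APIs will replace most of this loop.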
Shiyu Chang retweeted
AK @_akhaliq
LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning. Paper: huggingface.co/papers/2603.21…
4 replies · 3 reposts · 19 likes · 4.5K views
Shiyu Chang retweeted
Ai2 @allen_ai
Today we're releasing MolmoWeb, an open source agent that can navigate + complete tasks in a browser on your behalf. Built on Molmo 2 in 4B & 8B sizes, it sets a new open-weight SOTA across four major web-agent benchmarks & even surpasses agents built on proprietary models. 🧵
21 replies · 115 reposts · 806 likes · 127.2K views
Shiyu Chang retweeted
Wenhu Chen @WenhuChen
Deeply inspiring! It would be great if there were an English version so more people could understand it.
张小珺 Xiaojun Zhang@zhang_benita

A 7-hour podcast with @sainingxie. He has just begun a new journey on world models with Yann LeCun at AMI Labs. This was his first podcast appearance and his first long-form interview.

A day after the snowfall in February 2026, in Brooklyn, New York, we started recording at 2 p.m. What followed became an unexpected marathon conversation that lasted until the early hours of the morning.

The Chinese title of the interview is “Escaping Silicon Valley.” Yet throughout the conversation, he patiently listed the people who shaped his academic life, repeatedly sketching their personalities in vivid detail: Hou Xiaodi, Kaiming He, Yann LeCun, Fei-Fei Li, and others. These portraits are what give this “escape from Silicon Valley” conversation its human warmth.

By the way, the YouTube version of the interview is below, with Chinese and English subtitles. And yes, we are using podcasts to model the world 😎

A 7-hour marathon interview with Saining Xie: World Models, AMI Labs, Ya... youtu.be/rIwgZWzUKm8?si… via @YouTube

0 replies · 2 reposts · 32 likes · 10.1K views
Shiyu Chang retweeted
Xin Eric Wang @xwang_lk
UCSB NLP @ EMNLP 2025 @emnlpmeeting! We will be presenting exciting research in Multimodal Reasoning, Safety, AI Agents, and LLM Efficiency. Come meet us in Suzhou this November. Would love to exchange ideas and discuss where the field is headed!🚀 🎉 Huge congrats to our brilliant students & researchers, @YFan_UCSC @qianqi_yan @KaiwenZhou9 @XiaoSophiaPu @zhenzhangzz, @m2saxon, @AlfonAmayuelas, @WilliamWangNLP, @CodeTerminator, @xwang_lk, and to our amazing collaborators, @xuandongzhao @dawnsongtweets, @RoyZhang13, @WendaXu2, @AlbalakAlon, etc.
UC Santa Barbara NLP Group@ucsbNLP

We did it! 🎉 12 papers from UCSB NLP accepted at #EMNLP2025 (7 Main + 5 Findings) Proud of everyone’s hard work—poster below 👇

2 replies · 2 reposts · 22 likes · 4.5K views
Shiyu Chang retweeted
Denny Zhou @denny_zhou
Slides for my lecture “LLM Reasoning” at Stanford CS 25: dennyzhou.github.io/LLM-Reasoning-…

Key points:
1. Reasoning in LLMs simply means generating a sequence of intermediate tokens before producing the final answer. Whether this resembles human reasoning is irrelevant. The crucial insight is that transformer models can become nearly arbitrarily powerful by generating many intermediate tokens, without the need to scale the model size (arxiv.org/abs/2402.12875).
2. Pretrained models, even without any fine-tuning, are capable of reasoning. The challenge is that reasoning-based outputs often don’t appear at the top of the output distribution, so standard greedy decoding fails to surface them (arxiv.org/abs/2402.10200).
3. Prompting techniques (e.g., chain-of-thought prompting or "let’s think step by step") and supervised finetuning were commonly used to elicit reasoning. Now, RL finetuning has emerged as the most powerful method. This trick was independently discovered by several labs. At Google, credit goes to Jonathan Lai on my team. Based on our theory (see point 1), scaling RL should focus on generating long responses rather than anything else.
4. LLM reasoning can be hugely improved by generating multiple responses and then aggregating them, rather than relying on a single response (arxiv.org/abs/2203.11171).
48 replies · 482 reposts · 3.1K likes · 449.9K views
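Point 4 above — generate multiple responses and aggregate — is the self-consistency idea from the cited paper. A minimal sketch, where `sample_answer` and its canned outputs are hypothetical stand-ins for sampling an LLM at nonzero temperature:

```python
# Self-consistency sketch: sample several reasoning paths and take a
# majority vote over their final answers instead of trusting one sample.
from collections import Counter

def sample_answer(question, i):
    # Pretend the model is right most of the time but noisy.
    sampled = ["42", "41", "42", "42", "56"]
    return sampled[i % len(sampled)]

def self_consistency(question, n_samples=5):
    answers = [sample_answer(question, i) for i in range(n_samples)]
    # Majority vote over final answers (reasoning paths are discarded).
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))  # majority vote picks "42"
```

The vote is over final answers only, so diverse reasoning paths that converge on the same answer reinforce each other while one-off errors get outvoted.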
Alfonso Amayuelas @AlfonAmayuelas
Quick test on AI Agents: Asked an LLM to check my USPS package status using the tracking number --> Result: Disappointing! 😬 None of the LLMs did it correctly. All it takes is visiting USPS.com and pasting the number into the only input box they have
4 replies · 1 repost · 3 likes · 1.1K views
Chris Albon @chrisalbon
And then at the end, you have a super clean house ready for you to get down to work.
2 replies · 2 reposts · 246 likes · 9.9K views
Shiyu Chang retweeted
Luiza Jarovsky, PhD @LuizaJarovsky
🚨 SHOCKING: people are unknowingly making their ChatGPT interactions PUBLIC, and they are being indexed by Google (see my test below). My privacy recommendations:

When people interact with ChatGPT and use the "Share" feature (for example, to send the conversation to family and friends, or to use it in a lecture), the interaction becomes searchable and is apparently being indexed by Google.

From my personal test (see one of the screenshots below), when I clicked on the conversations, there was no username (the users were marked as "anonymous"). However, because the vast majority of people don't realize these interactions might become indexable, many might share personal or intimate details about themselves or others. They would be extremely anxious if they discovered that there was a public link to these interactions on Google (and that others could potentially see them).

A few privacy recommendations to share with friends and family when using ChatGPT and similar AI chatbots:
- Don't use the "Share" feature (as these interactions might become indexable);
- Never share personal information about yourself or others (as there could be unexpected leaks);
- Deactivate the memory feature (to reduce the amount of personal data about you being processed and cross-linked with other information about you; it might help reduce chatbot dependence as well);
- Make your conversations anonymous and disable AI training (to reduce the amount of information about you being processed and potentially leaked);
- Check other privacy settings that might be relevant and activate them.

👉 Never miss my analyses and updates on AI's legal and ethical challenges: join my newsletter's 71,000+ subscribers (link below).
153 replies · 264 reposts · 827 likes · 292.8K views
Shiyu Chang @CodeTerminator
Sad to miss #ICML2025 this year, but thrilled that my student @hou_bairu will present his exciting work on dynamically pruning LLMs into efficient, task-specific models—done in collaboration with our amazing collaborators from Apple! 🍎✨
Bairu Hou@hou_bairu

Just describe your task (and optionally the input); our method then dynamically prunes the LLM into a smaller model that's tailor-made for the task/input and gets it ready for inference in just 0.1 seconds. We call it "instruction-following" model pruning.

Check out our #ICML2025 paper, "Instruction-Following Pruning for Large Language Models". By pruning a 9B LLM to 3B for each input, our method significantly outperforms standard dense 3B models and closely matches the performance of a dense 9B model. Even better, it delivers inference latency nearly identical to the dense 3B model.

📍 Poster session: Wednesday, July 16, 11:00 am – 1:30 pm
📍 Location: East Exhibition Hall A-B, #E-2711
Paper: machinelearning.apple.com/research/pruni…

Join us to dive deeper into our approach and discussions! Many thanks to our amazing collaborators @chenqibin99, @jeremy_wang2013, @gyin94, @cw_aabc, Nan Du, @ruomingpang, @CodeTerminator, and @taolei15949106

2 replies · 0 reposts · 3 likes · 261 views
Shiyu Chang retweeted
Andrej Karpathy @karpathy
Scaling up RL is all the rage right now, I had a chat with a friend about it yesterday. I'm fairly certain RL will continue to yield more intermediate gains, but I also don't expect it to be the full story.

RL is basically "hey this happened to go well (/poorly), let me slightly increase (/decrease) the probability of every action I took for the future". You get a lot more leverage from verifier functions than explicit supervision, and this is great. But first, it looks suspicious asymptotically: once the tasks grow to be minutes/hours of interaction long, are you really going to do all that work just to learn a single scalar outcome at the very end, to directly weight the gradient?

Second, beyond asymptotics, this doesn't feel like the human mechanism of improvement for the majority of intelligence tasks. We extract significantly more bits of supervision per rollout via a review/reflect stage along the lines of "what went well? what didn't go so well? what should I try next time?", and the lessons from this stage feel explicit, like a new string to be added to the system prompt for the future, optionally to be distilled into weights (/intuition) later, a bit like sleep. In English, we say something becomes "second nature" via this process, and we're missing learning paradigms like this. The new Memory feature is maybe a primordial version of this in ChatGPT, though it is only used for customization, not problem solving. Notice that there is no equivalent of this for e.g. Atari RL, because there are no LLMs and no in-context learning in those domains.

Example algorithm: given a task, do a few rollouts, stuff them all into one context window (along with the reward in each case), use a meta-prompt to review/reflect on what went well or not to obtain a string "lesson", to be added to the system prompt (or, more generally, to modify the current lessons database). Many blanks to fill in, many tweaks possible, not obvious.

Example of a lesson: we know LLMs can't super easily see letters due to tokenization and can't super easily count inside the residual stream, hence 'r' in 'strawberry' being famously difficult. The Claude system prompt had a "quick fix" patch: a string was added along the lines of "If the user asks you to count letters, first separate them by commas and increment an explicit counter each time and do the task like that". This string is the "lesson", explicitly instructing the model how to complete the counting task. The open questions are how this might fall out of agentic practice instead of being hard-coded by an engineer, how it can be generalized, and how lessons can be distilled over time so they don't bloat context windows indefinitely.

TLDR: RL will lead to more gains because, when done well, it is a lot more leveraged, bitter-lesson-pilled, and superior to SFT. It doesn't feel like the full story, especially as rollout lengths continue to expand. There are more S curves to find beyond it, possibly specific to LLMs and without analogues in game/robotics-like environments, which is exciting.
408 replies · 833 reposts · 8.4K likes · 1.1M views
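The "example algorithm" outlined in the tweet above (a few rollouts → one reflect pass → a lesson string added to a lessons database) can be mocked up in a few lines. The trajectories and the reflection heuristic below are hypothetical stand-ins for real agent rollouts and a meta-prompted LLM review, not Karpathy's or any lab's implementation:

```python
# Toy mock-up of the review/reflect loop: run rollouts with rewards,
# reflect over all of them at once, and store the resulting lesson.

def rollout(task, i):
    # Stand-in for one agent attempt; returns (trajectory summary, reward).
    attempts = [
        ("counted the letters directly in one pass", 0.0),
        ("separated letters by commas and kept an explicit counter", 1.0),
        ("answered from memory without looking at the letters", 0.0),
    ]
    return attempts[i % len(attempts)]

def reflect(task, rollouts):
    # Stand-in for the meta-prompted review of all rollouts + rewards
    # stuffed into one context window.
    best = max(rollouts, key=lambda r: r[1])
    return f"Lesson for '{task}': next time, {best[0]}."

lessons_db = []  # would be prepended to the system prompt on future tasks
task = "count the r's in 'strawberry'"
rollouts = [rollout(task, i) for i in range(3)]
lesson = reflect(task, rollouts)
lessons_db.append(lesson)
print(lesson)
```

The interesting open design choices the tweet flags — how lessons generalize beyond one task and how the database is distilled so it doesn't grow without bound — are exactly the blanks this sketch leaves unfilled.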