Yao Liu

56 posts

Yao Liu

@yaoliucs

Research Scientist at AWS AI. Opinions are my own. Previously @AIforHI @StanfordAILab

Bellevue, WA · Joined June 2018
197 Following · 312 Followers
Yao Liu retweeted
Ke Yang @EmpathYang
Excited to announce that our web agent paper, AgentOccam, has been accepted to ICLR 2025! 🏂🏂🏂 Huge thanks to all collaborators! 😊 Special thanks to my brilliant and considerate mentor, Yao @yaoliucs, for your constant guidance and encouragement! Sapana @Sapana_007 and Rasool @rasoolfa, your insightful support has been invaluable. Huzefa, your unwavering support as our manager has been instrumental in our success. Pratik and George, your invaluable suggestions have greatly enriched our work. 📸: a recent photo capturing some sense of the Chinese phrase "大隐隐于市" (roughly, "the greatest hermits hide in the bustling city"). #ICLR2025 #webagent
[photo attached]
Quoted tweet from Ke Yang @EmpathYang: the AgentOccam announcement (full text in the next item below).
0 replies · 6 reposts · 16 likes · 1.3K views
Yao Liu retweeted
Ke Yang @EmpathYang
👾 Introducing AgentOccam: Automating Web Tasks with LLMs! 🌐 AgentOccam showcases the impressive power of Large Language Models (LLMs) on web tasks, without any in-context examples, new agent roles, online feedback, or search strategies. 🏄🏄🏄
🧙 Link: arxiv.org/abs/2410.13825
🧐 By refining the observation and action spaces, AgentOccam achieves groundbreaking zero-shot performance, outperforming previous methods on the WebArena benchmark. This simple yet effective approach underlines the importance of aligning these spaces closely with LLM capabilities for enhanced efficiency. 📈
✨ Highlights:
- AgentOccam leads with a 29.4% improvement over the state-of-the-art method SteP, and a 161% boost in success rate compared to the vanilla agent. 🤖
- These gains come without complicating the process with additional examples or strategies. 🚫
- All our replication work, prompts, and evaluator error rectifications are transparently shared in the appendix. 📚
🌟 Special thanks to my super brilliant and considerate mentor Yao and Rasool, our supportive manager Huzefa, and the invaluable suggestions and contributions from Sapana, Pratik, and George. Your guidance and support have been pivotal in this journey! #AgentOccam #LLM #WebAutomation #AI
[4 images attached]
3 replies · 27 reposts · 60 likes · 11.1K views
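The recipe the tweet describes (prune the observation, restrict the action vocabulary, prompt zero-shot) can be sketched roughly as below. This is a minimal illustration, not the paper's code; the node fields and action strings are assumptions.

```python
# Minimal sketch of AgentOccam-style observation/action simplification.
# Illustrative only: the node structure and action vocabulary here are
# assumptions, not the paper's actual implementation.

INTERACTIVE_ROLES = {"link", "button", "textbox", "combobox", "checkbox"}

# A reduced action space: a handful of primitives the LLM emits as text.
ACTIONS = ["click [id]", "type [id] [text]", "scroll [up|down]",
           "go_back", "note [text]", "stop [answer]"]

def simplify_observation(ax_nodes, max_text_len=80):
    """Prune an accessibility-tree dump to interactive or text-bearing
    nodes and truncate long text, yielding a compact prompt string."""
    lines = []
    for node in ax_nodes:  # each node assumed: {"id", "role", "text"}
        text = (node.get("text") or "").strip()
        if node["role"] in INTERACTIVE_ROLES or text:
            lines.append(f'[{node["id"]}] {node["role"]} "{text[:max_text_len]}"')
    return "\n".join(lines)

def build_prompt(task, observation):
    """Zero-shot prompt: task + compact observation + allowed actions."""
    return (f"Task: {task}\n\nPage:\n{observation}\n\n"
            f"Allowed actions: {', '.join(ACTIONS)}\nNext action:")
```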
Yao Liu retweeted
Allen Nie (🇺🇦☮️) @allenainie
Training and deploying RL in real life is tough! Our report on making an RL system for 2nd-4th grade math is finally out. 🚀 After interacting with just 269 students, our RL policy can significantly improve their learning outcomes. 🎓
🔥 However, results from a controlled study reveal a twist: only students with lower entrance exam scores significantly benefit from it, while high-scoring students do fine with basic practice. RL systems like ours can close the gap between low performers and high performers. 🏆
More surprisingly, when we use behavior cloning and offline RL evaluation to learn a distilled new policy from the logged dataset, the new policy generalizes to new geographical communities! This means the RL policy can adapt & serve students far beyond where we initially trained 🌍.
🫤 The training and development were done quite a few years ago, before LLMs existed, so we asked real teachers to write the chat messages. Now we wonder: could today's LLMs scale these systems to new heights?
The work was done by the amazing @dr123sr (Sherry Ruan), William Steenbergen, @yaoliucs, advised by @landay and @EmmaBrunskill. 📰 Read more: link.springer.com/article/10.100…
[3 images attached]
1 reply · 9 reposts · 45 likes · 8.7K views
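A minimal sketch of the behavior-cloning step mentioned in the tweet, assuming a discrete set of tutoring actions and a simple vector state encoding (both illustrative choices, not the paper's actual setup):

```python
# Minimal behavior-cloning sketch: fit a policy to logged (state, action)
# pairs from the deployed tutor. Purely illustrative; the state encoding,
# action set, and network architecture are assumptions.
import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, state):
        return self.net(state)  # logits over tutoring actions

def train_bc(policy, states, actions, epochs=10, lr=1e-3):
    """states: (N, state_dim) float tensor; actions: (N,) long tensor."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(policy(states), actions)
        loss.backward()
        opt.step()
    return policy
```

As the thread describes, the distilled policy would then be vetted with offline (off-policy) evaluation on the logged data before being trusted in a new community.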
Yao Liu retweeted
Dr. Dawn Wright + @deepseadawn.bsky.social 🇺🇦
“Their intellectual humility lies in their openness to the possibility, indeed strong likelihood, that nobody is in possession of the full truth, and that others, too, may have insights, ideas and evidence that should be taken into account when forming their own best judgments.”
0 replies · 2 reposts · 14 likes · 2.4K views
Yao Liu retweeted
Banghua Zhu @BanghuaZ
I'll be at #NeurIPS2023, and on the academic job market this year! RT will be greatly appreciated!
I work on statistics and information theory, with applications in robust statistics, offline RL, game theory, human-AI interactions and LLMs. Recently I've been working on better fine-tuning and serving of LLMs.
I'd love to chat about the theoretical formulations of RLHF, better reward training and policy fine-tuning algorithms, the creation of Nectar, Starling-7B and NexusRaven-13B, and watermarking / caching / model multiplexing / speculative decoding / quantization / S-LoRA / fairness in serving. I'll also be at Booth 423 of Nexusflow!
Some papers I'll be presenting are listed below.
1. On Optimal Caching and Model Multiplexing for Large Model Inference: arxiv.org/abs/2306.02003
2. Doubly Robust Self-Training: arxiv.org/abs/2306.00265
3. NexusRaven: A Commercially-Permissive Language Model for Function Calling: openreview.net/forum?id=5lcPe…
4. Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment: arxiv.org/abs/2310.00212
5. Towards Optimal Statistical Watermarking: openreview.net/forum?id=Fc2Fa…
6. A Theoretical Explanation of Deep RL Performance in Stochastic Environments: openreview.net/forum?id=KzR07…
7. Towards the Fundamental Limits of Knowledge Transfer over Finite Domains: arxiv.org/abs/2310.07838
8. Efficient Prompt Caching for Large Language Model Inference via Embedding Similarity: t.ly/Qgfq4
[2 images attached]
0 replies · 18 reposts · 68 likes · 14.6K views
Yao Liu @yaoliucs
@ethayarajh It's really impressive, Kawin! Not only the KTO result; I'm even more surprised by the dummy offline PPO result. I remember when I saw your earlier tweet and wondered "how stable is this PPO implementation?" x.com/ethayarajh/sta… Can't wait to check if my PPO implementation is stable enough.
Quoted tweet from Kawin Ethayarajh @ethayarajh:

@sebgehr @rajammanabrolu It really depends on your implementation. Most implementations of PPO are really unstable. I rolled my own PPO over the summer and found that even without a learned reward model, it outperformed the closed form loss functions that have come out recently.

1 reply · 0 reposts · 2 likes · 525 views
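For context, the objective both tweets are debating is PPO's clipped surrogate; the instability typically comes from implementation details (advantage normalization, value clipping, KL control) rather than from the objective itself:

```latex
% PPO clipped surrogate objective (Schulman et al., 2017)
L^{\mathrm{CLIP}}(\theta) =
  \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\;
  \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}.
```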
Kawin Ethayarajh @ethayarajh
📢The problem in model alignment no one talks about — the need for preference data, which costs $$$ and time! Enter Kahneman-Tversky Optimization (KTO), which matches or exceeds DPO without paired preferences. And with it, the largest-ever suite of feedback-aligned LLMs. 🧵
[image attached]
18 replies · 126 reposts · 678 likes · 169.9K views
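For reference, KTO trains on unpaired "desirable/undesirable" labels instead of preference pairs. Paraphrasing the paper's construction from memory (treat the exact notation as approximate):

```latex
% KTO (Kahneman-Tversky Optimization), paraphrased sketch:
r_\theta(x,y) = \log\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
\qquad
z_0 = \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)

% Prospect-theoretic value: gains and losses measured against the
% reference point z_0, with separate weights for each label type.
v(x,y) =
\begin{cases}
\lambda_D\,\sigma\big(\beta\,(r_\theta(x,y) - z_0)\big) & \text{if } y \text{ is desirable} \\
\lambda_U\,\sigma\big(\beta\,(z_0 - r_\theta(x,y))\big) & \text{if } y \text{ is undesirable}
\end{cases}

\mathcal{L}_{\mathrm{KTO}}(\theta) = \mathbb{E}_{(x,y)}\big[\lambda_y - v(x,y)\big]
```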
Yao Liu retweeted
Banghua Zhu @BanghuaZ
Excited to see Starling-7B-alpha is (slightly) more preferred than other 7B models! Actually I expected the other way around. Attaching my favorite example below. Starling-alpha is for sure slightly over-RLHFed to maximize GPT-4 preference rather than human preference and can be very verbose. It will hopefully become better after we mitigate this issue in the next release.
Some more detailed analysis: in terms of win rate against GPT-3.5-Turbo in human eval, Starling is at 49% while its base model, OpenChat 3.5, is at 41%. But when we directly compare Starling vs OpenChat, the win rate is only 49% (likely due to verbose output). In contrast, when we use GPT-4 to compare, the win rate of Starling is 70%! This suggests a significant discrepancy between GPT-4-based eval and human eval!
At this stage, it seems that the gap between small and large models primarily lies in hallucination. Starling hallucinates much more than 30B+ models, which also leads to a lower human-eval score.
Besides the hot DPO vs PPO debate, I think there's a lot of other exciting research here:
1. How to evaluate a chat model with minimal human effort, when MT-Bench and Alpaca Eval may not be a very correlated proxy. This is particularly important when we need to select from 10+ checkpoints trained from SFT / RL.
2. How to train a better reward model that is less biased towards lengthy or sycophantic output.
3. How to make smaller models hallucinate less, and understand their knowledge boundary.
4. How to achieve a balance between helpfulness and harmlessness without under-alignment or over-alignment.
5. How to evaluate and compare reward models.
6. How to learn from both human preference data (smaller bias but larger variance) and synthetic data (larger bias but smaller variance).
I believe the dataset Nectar's potential is definitely constrained by the model size, and both DPO and PPO should bring good improvements for models of larger size. But unfortunately we don't have enough resources to scale beyond 7B...
[image attached]
Quoted tweet from Arena.ai @arena:

Exciting Arena Leaderboard Updates! Six new models: - Tulu-2-DPO-70B and Yi-34B-Chat are the new SoTA open models - Mistral-based 7B models (OpenChat, OpenHermes-2.5, Starling-7B) are stronger than ever Big congrats to the OSS AI community! Learn more lmsys.org/blog/2023-12-0…

5 replies · 21 reposts · 73 likes · 28.2K views
Yao Liu @yaoliucs
Interestingly, this finding also overlaps with the deadly triad and off-policyness: we show that the role the data distribution plays depends on its interplay with the other components of the optimization. Off-policy data do not always give you a worse contraction factor. 5/N
1 reply · 0 reposts · 0 likes · 236 views
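To unpack the claim with a standard result (stated from memory, not quoted from the thread's paper): the Bellman operator itself contracts regardless of the data; it is the composition with a data-weighted projection whose contraction factor depends on the distribution:

```latex
% The Bellman evaluation operator is a \gamma-contraction in sup norm,
% independent of any data distribution:
\|T^{\pi} Q_1 - T^{\pi} Q_2\|_\infty \le \gamma\,\|Q_1 - Q_2\|_\infty.

% With function approximation, updates follow \Pi_d T^{\pi}, where \Pi_d
% projects onto the function class under data distribution d. For on-policy
% data, d = d_\pi, the composition is still a \gamma-contraction in
% \|\cdot\|_{d_\pi}; for off-policy d, its contraction factor depends
% jointly on d, the function class, and \gamma -- it can be better or
% worse, not uniformly worse.
```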
Yao Liu @yaoliucs
One common misconception about (deep) RL is that it works like supervised learning: first define some empirical loss as the objective, then derive the update rules by gradient descent (GD). This is NOT the case for popular RL algorithms like policy gradient or TD-based methods. 1/N
1 reply · 2 reposts · 13 likes · 2K views
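TD(0) is the textbook example: its update drops the gradient through the bootstrapped target, so it is a "semi-gradient" rather than the gradient of any fixed loss:

```latex
% TD(0) update with function approximation:
\theta \leftarrow \theta + \alpha\,\delta_t\,\nabla_\theta V_\theta(s_t),
\qquad
\delta_t = r_t + \gamma V_\theta(s_{t+1}) - V_\theta(s_t).

% The true gradient of the squared Bellman residual \tfrac{1}{2}\delta_t^2
% would instead be
\nabla_\theta \tfrac{1}{2}\delta_t^2
  = \delta_t\big(\gamma\,\nabla_\theta V_\theta(s_{t+1})
  - \nabla_\theta V_\theta(s_t)\big),

% i.e., TD discards the \gamma\,\nabla_\theta V_\theta(s_{t+1}) term.
```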
Yao Liu @yaoliucs
This new operator applies the max_{a \in A} operation at most B times, while the vanilla Bellman operator applies it H (or 1/(1-\gamma)) times. We test the online off-policy RL algorithms TD3 and SAC with this change; it beats the offline RL SOTA on the AntMaze tasks (the hardest in D4RL) and the MuJoCo tasks. 4/N
1 reply · 0 reposts · 0 likes · 203 views
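One way to read "at most B maxes" (an illustrative reconstruction, not necessarily the exact operator in the thread's paper) is to compose the greedy backup only B times and the behavior-policy evaluation backup for the remaining steps:

```latex
% Vanilla optimality backup: every composition adds one max over actions.
(T^{*} Q)(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'}\big[\max_{a' \in A} Q(s',a')\big]

% Behavior-policy evaluation backup: no max.
(T^{\mu} Q)(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s',\,a' \sim \mu}\big[Q(s',a')\big]

% Illustrative "at most B maxes" operator over horizon H:
T_B = (T^{*})^{B} \circ (T^{\mu})^{H-B},
% limiting how many times the learned policy can deviate greedily from
% the data, and hence how much extrapolation error the maxes compound.
```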
Yao Liu @yaoliucs
Offline RL is much harder than online RL or imitation learning, as it needs to solve a sequence of counterfactual reasoning problems. That often gives an error of (1+\delta)^H, where \delta is the one-step divergence of the policy (or the extrapolation error of Q) and H is the horizon. 1/N
1 reply · 2 reposts · 24 likes · 2.6K views
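To see how quickly this compounds, a quick worked bound (simple algebra, not from the paper):

```latex
(1+\delta)^H = e^{H \ln(1+\delta)} \approx e^{\delta H}
\quad \text{for small } \delta.

% Example: \delta = 0.05, H = 100 gives
(1.05)^{100} \approx 131 \approx e^{4.9},
% so a 5\% per-step divergence can inflate value errors by two orders
% of magnitude over a 100-step horizon.
```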