Yao Liu

56 posts

Yao Liu

@yaoliucs

Research Scientist at AWS AI. Opinions are my own. Previously @AIforHI @StanfordAILab

Bellevue, WA · Joined June 2018
197 Following · 312 Followers
Yao Liu retweeted
Ke Yang @EmpathYang
Excited to announce that our web agent paper, AgentOccam, has been accepted to ICLR 2025! 🏂🏂🏂 Huge thanks to all collaborators! 😊 Special thanks to my brilliant and considerate mentor, Yao @yaoliucs, for your constant guidance and encouragement! Sapana @Sapana_007 and Rasool @rasoolfa, your insightful support has been invaluable. Huzefa, your unwavering support as our manager has been instrumental in our success. Pratik and George, your invaluable suggestions have greatly enriched our work. 📸: a recent photo capturing some sense of the Chinese phrase "大隐隐于市" (roughly, "the greatest hermits hide in the bustling city"). #ICLR2025 #webagent
[photo attached]
Quoted tweet from Ke Yang @EmpathYang: the AgentOccam announcement (full text in the next item below).
0 replies · 6 reposts · 16 likes · 1.3K views
Yao Liu retweeted
Ke Yang @EmpathYang
👾 Introducing AgentOccam: Automating Web Tasks with LLMs! 🌐 AgentOccam showcases the impressive power of Large Language Models (LLMs) on web tasks, without any in-context examples, new agent roles, online feedback, or search strategies. 🏄🏄🏄
🧙 Link: arxiv.org/abs/2410.13825
🧐 By refining the observation and action spaces, AgentOccam achieves groundbreaking zero-shot performance, outperforming previous methods on the WebArena benchmark. This simple yet effective approach underlines the importance of aligning these spaces closely with LLM capabilities for enhanced efficiency. 📈
✨ Highlights:
- AgentOccam leads with a 29.4% improvement over the state-of-the-art method SteP, and a 161% boost in success rate compared to the vanilla agent. 🤖
- These gains come without complicating the process with additional examples or strategies. 🚫
- All our replication work, prompts, and evaluator error rectifications are transparently shared in the appendix. 📚
🌟 Special thanks to my super brilliant and considerate mentor Yao and Rasool, our supportive manager Huzefa, and the invaluable suggestions and contributions from Sapana, Pratik, and George. Your guidance and support have been pivotal in this journey! #AgentOccam #LLM #WebAutomation #AI
[4 images attached]
3 replies · 27 reposts · 60 likes · 11.1K views
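The recipe the tweet describes (prune the observation, restrict the action vocabulary, prompt zero-shot) can be sketched roughly as below. This is a minimal illustration, not the paper's code; the node fields and action strings are assumptions.

```python
# Minimal sketch of AgentOccam-style observation/action simplification.
# Illustrative only: the node structure and action vocabulary here are
# assumptions, not the paper's actual implementation.

INTERACTIVE_ROLES = {"link", "button", "textbox", "combobox", "checkbox"}

# A reduced action space: a handful of primitives the LLM emits as text.
ACTIONS = ["click [id]", "type [id] [text]", "scroll [up|down]",
           "go_back", "note [text]", "stop [answer]"]

def simplify_observation(ax_nodes, max_text_len=80):
    """Prune an accessibility-tree dump to interactive or text-bearing
    nodes and truncate long text, yielding a compact prompt string."""
    lines = []
    for node in ax_nodes:  # each node assumed: {"id", "role", "text"}
        text = (node.get("text") or "").strip()
        if node["role"] in INTERACTIVE_ROLES or text:
            lines.append(f'[{node["id"]}] {node["role"]} "{text[:max_text_len]}"')
    return "\n".join(lines)

def build_prompt(task, observation):
    """Zero-shot prompt: task + compact observation + allowed actions."""
    return (f"Task: {task}\n\nPage:\n{observation}\n\n"
            f"Allowed actions: {', '.join(ACTIONS)}\nNext action:")
```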
Yao Liu retweeted
Allen Nie (🇺🇦☮️) @allenainie
Training and deploying RL in real life is tough! Our report on making an RL system for 2nd-4th grade math is finally out. 🚀 After interacting with just 269 students, our RL policy can significantly improve their learning outcomes. 🎓
🔥 However, results from a controlled study reveal a twist: only students with lower entrance exam scores significantly benefit from it, while high-scoring students do fine with basic practice. RL systems like ours can close the gap between low performers and high performers. 🏆
More surprisingly, when we use behavior cloning and offline RL evaluation to learn a distilled new policy from the logged dataset, the new policy generalizes to new geographical communities! This means the RL policy can adapt & serve students far beyond where we initially trained 🌍.
🫤 The training and development were done quite a few years ago, before LLMs existed, so we asked real teachers to write the chat messages. Now we wonder: could today's LLMs scale these systems to new heights?
The work was done by the amazing @dr123sr (Sherry Ruan), William Steenbergen, @yaoliucs, advised by @landay and @EmmaBrunskill. 📰 Read more: link.springer.com/article/10.100…
[3 images attached]
1 reply · 9 reposts · 45 likes · 8.7K views
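A minimal sketch of the behavior-cloning step mentioned in the tweet, assuming a discrete set of tutoring actions and a simple vector state encoding (both illustrative choices, not the paper's actual setup):

```python
# Minimal behavior-cloning sketch: fit a policy to logged (state, action)
# pairs from the deployed tutor. Purely illustrative; the state encoding,
# action set, and network architecture are assumptions.
import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, state):
        return self.net(state)  # logits over tutoring actions

def train_bc(policy, states, actions, epochs=10, lr=1e-3):
    """states: (N, state_dim) float tensor; actions: (N,) long tensor."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(policy(states), actions)
        loss.backward()
        opt.step()
    return policy
```

As the thread describes, the distilled policy would then be vetted with offline (off-policy) evaluation on the logged data before being trusted in a new community.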
Yao Liu retweeted
Dr. Dawn Wright + @deepseadawn.bsky.social 🇺🇦
“Their intellectual humility lies in their openness to the possibility, indeed strong likelihood, that nobody is in possession of the full truth, and that others, too, may have insights, ideas and evidence that should be taken into account when forming their own best judgments.”
0 replies · 2 reposts · 14 likes · 2.4K views
Yao Liu retweeted
Banghua Zhu @BanghuaZ
I'll be at #NeurIPS2023, and on the academic job market this year! RT will be greatly appreciated!
I work on statistics and information theory, with applications in robust statistics, offline RL, game theory, human-AI interactions and LLMs. Recently I've been working on better fine-tuning and serving of LLMs.
I'd love to chat about the theoretical formulations of RLHF, better reward training and policy fine-tuning algorithms, the creation of Nectar, Starling-7B and NexusRaven-13B, and watermarking / caching / model multiplexing / speculative decoding / quantization / S-LoRA / fairness in serving. I'll also be at Booth 423 of Nexusflow!
Some papers I'll be presenting are listed below.
1. On Optimal Caching and Model Multiplexing for Large Model Inference: arxiv.org/abs/2306.02003
2. Doubly Robust Self-Training: arxiv.org/abs/2306.00265
3. NexusRaven: A Commercially-Permissive Language Model for Function Calling: openreview.net/forum?id=5lcPe…
4. Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment: arxiv.org/abs/2310.00212
5. Towards Optimal Statistical Watermarking: openreview.net/forum?id=Fc2Fa…
6. A Theoretical Explanation of Deep RL Performance in Stochastic Environments: openreview.net/forum?id=KzR07…
7. Towards the Fundamental Limits of Knowledge Transfer over Finite Domains: arxiv.org/abs/2310.07838
8. Efficient Prompt Caching for Large Language Model Inference via Embedding Similarity: t.ly/Qgfq4
[2 images attached]
0 replies · 18 reposts · 68 likes · 14.6K views
Yao Liu @yaoliucs
@ethayarajh It's really impressive, Kawin! Not only the KTO result; I'm even more surprised by the dummy offline PPO result. I remember when I saw your earlier tweet and wondered "how stable is this PPO implementation?" x.com/ethayarajh/sta… Can't wait to check if my PPO implementation is stable enough.
Quoted tweet from Kawin Ethayarajh @ethayarajh:

@sebgehr @rajammanabrolu It really depends on your implementation. Most implementations of PPO are really unstable. I rolled my own PPO over the summer and found that even without a learned reward model, it outperformed the closed form loss functions that have come out recently.

1 reply · 0 reposts · 2 likes · 525 views
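For context, the objective both tweets are debating is PPO's clipped surrogate; the instability typically comes from implementation details (advantage normalization, value clipping, KL control) rather than from the objective itself:

```latex
% PPO clipped surrogate objective (Schulman et al., 2017)
L^{\mathrm{CLIP}}(\theta) =
  \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\;
  \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}.
```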
Kawin Ethayarajh @ethayarajh
📢The problem in model alignment no one talks about — the need for preference data, which costs $$$ and time! Enter Kahneman-Tversky Optimization (KTO), which matches or exceeds DPO without paired preferences. And with it, the largest-ever suite of feedback-aligned LLMs. 🧵
[image attached]
18 replies · 126 reposts · 678 likes · 169.9K views
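For reference, KTO trains on unpaired "desirable/undesirable" labels instead of preference pairs. Paraphrasing the paper's construction from memory (treat the exact notation as approximate):

```latex
% KTO (Kahneman-Tversky Optimization), paraphrased sketch:
r_\theta(x,y) = \log\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
\qquad
z_0 = \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)

% Prospect-theoretic value: gains and losses measured against the
% reference point z_0, with separate weights for each label type.
v(x,y) =
\begin{cases}
\lambda_D\,\sigma\big(\beta\,(r_\theta(x,y) - z_0)\big) & \text{if } y \text{ is desirable} \\
\lambda_U\,\sigma\big(\beta\,(z_0 - r_\theta(x,y))\big) & \text{if } y \text{ is undesirable}
\end{cases}

\mathcal{L}_{\mathrm{KTO}}(\theta) = \mathbb{E}_{(x,y)}\big[\lambda_y - v(x,y)\big]
```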
Yao Liu retweeted
Banghua Zhu @BanghuaZ
Excited to see Starling-7B-alpha is (slightly) more preferred than other 7B models! Actually I expected the other way around. Attaching my favorite example below. Starling-alpha is for sure slightly over-RLHFed to maximize GPT-4 preference rather than human preference and can be very verbose. It will hopefully become better after we mitigate this issue in the next release.
Some more detailed analysis: in terms of win rate against GPT-3.5-Turbo in human eval, Starling is at 49% while its base model, OpenChat 3.5, is at 41%. But when we directly compare Starling vs OpenChat, the win rate is only 49% (likely due to verbose output). In contrast, when we use GPT-4 to compare, the win rate of Starling is 70%! This suggests a significant discrepancy between GPT-4-based eval and human eval!
At this stage, it seems that the gap between small and large models primarily lies in hallucination. Starling hallucinates much more than 30B+ models, which also leads to a lower human-eval score.
Besides the hot DPO vs PPO debate, I think there's a lot of other exciting research here:
1. How to evaluate a chat model with minimal human effort, when MT-Bench and Alpaca Eval may not be a very correlated proxy. This is particularly important when we need to select from 10+ checkpoints trained from SFT / RL.
2. How to train a better reward model that is less biased towards lengthy or sycophantic output.
3. How to make smaller models hallucinate less, and understand their knowledge boundary.
4. How to achieve a balance between helpfulness and harmlessness without under-alignment or over-alignment.
5. How to evaluate and compare reward models.
6. How to learn from both human preference data (smaller bias but larger variance) and synthetic data (larger bias but smaller variance).
I believe the dataset Nectar's potential is definitely constrained by the model size, and both DPO and PPO should bring good improvements for models of larger size. But unfortunately we don't have enough resources to scale beyond 7B...
[image attached]
Quoted tweet from Arena.ai @arena:

Exciting Arena Leaderboard Updates! Six new models: - Tulu-2-DPO-70B and Yi-34B-Chat are the new SoTA open models - Mistral-based 7B models (OpenChat, OpenHermes-2.5, Starling-7B) are stronger than ever Big congrats to the OSS AI community! Learn more lmsys.org/blog/2023-12-0…

5 replies · 21 reposts · 73 likes · 28.2K views
Yao Liu @yaoliucs
Interestingly, this finding also overlaps with the deadly triad and off-policyness: we show that the role the data distribution plays depends on its interplay with the other components of the optimization. Off-policy data do not always give you a worse contraction factor. 5/N
1 reply · 0 reposts · 0 likes · 236 views
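To unpack the claim with a standard result (stated from memory, not quoted from the thread's paper): the Bellman operator itself contracts regardless of the data; it is the composition with a data-weighted projection whose contraction factor depends on the distribution:

```latex
% The Bellman evaluation operator is a \gamma-contraction in sup norm,
% independent of any data distribution:
\|T^{\pi} Q_1 - T^{\pi} Q_2\|_\infty \le \gamma\,\|Q_1 - Q_2\|_\infty.

% With function approximation, updates follow \Pi_d T^{\pi}, where \Pi_d
% projects onto the function class under data distribution d. For on-policy
% data, d = d_\pi, the composition is still a \gamma-contraction in
% \|\cdot\|_{d_\pi}; for off-policy d, its contraction factor depends
% jointly on d, the function class, and \gamma -- it can be better or
% worse, not uniformly worse.
```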
Yao Liu @yaoliucs
One common misconception about (deep) RL is that it works like supervised learning: first define some empirical loss as the objective, then derive the update rules by gradient descent (GD). This is NOT the case for popular RL algorithms like policy gradient or TD-based methods. 1/N
1 reply · 2 reposts · 13 likes · 2K views
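TD(0) is the textbook example: its update drops the gradient through the bootstrapped target, so it is a "semi-gradient" rather than the gradient of any fixed loss:

```latex
% TD(0) update with function approximation:
\theta \leftarrow \theta + \alpha\,\delta_t\,\nabla_\theta V_\theta(s_t),
\qquad
\delta_t = r_t + \gamma V_\theta(s_{t+1}) - V_\theta(s_t).

% The true gradient of the squared Bellman residual \tfrac{1}{2}\delta_t^2
% would instead be
\nabla_\theta \tfrac{1}{2}\delta_t^2
  = \delta_t\big(\gamma\,\nabla_\theta V_\theta(s_{t+1})
  - \nabla_\theta V_\theta(s_t)\big),

% i.e., TD discards the \gamma\,\nabla_\theta V_\theta(s_{t+1}) term.
```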
Yao Liu @yaoliucs
This new operator applies the max_{a \in A} operation at most B times, while the vanilla Bellman operator applies it H (or 1/(1-\gamma)) times. We test the online off-policy RL algorithms TD3 and SAC with this change; it beats the offline RL SOTA on the AntMaze tasks (the hardest in D4RL) and the MuJoCo tasks. 4/N
1 reply · 0 reposts · 0 likes · 203 views
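One way to read "at most B maxes" (an illustrative reconstruction, not necessarily the exact operator in the thread's paper) is to compose the greedy backup only B times and the behavior-policy evaluation backup for the remaining steps:

```latex
% Vanilla optimality backup: every composition adds one max over actions.
(T^{*} Q)(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'}\big[\max_{a' \in A} Q(s',a')\big]

% Behavior-policy evaluation backup: no max.
(T^{\mu} Q)(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s',\,a' \sim \mu}\big[Q(s',a')\big]

% Illustrative "at most B maxes" operator over horizon H:
T_B = (T^{*})^{B} \circ (T^{\mu})^{H-B},
% limiting how many times the learned policy can deviate greedily from
% the data, and hence how much extrapolation error the maxes compound.
```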
Yao Liu @yaoliucs
Offline RL is much harder than online RL or imitation learning, as it needs to solve a sequence of counterfactual reasoning problems. That often gives an error of (1+\delta)^H, where \delta is the one-step divergence of the policy (or the extrapolation error of Q) and H is the horizon. 1/N
1 reply · 2 reposts · 24 likes · 2.6K views
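To see how quickly this compounds, a quick worked bound (simple algebra, not from the paper):

```latex
(1+\delta)^H = e^{H \ln(1+\delta)} \approx e^{\delta H}
\quad \text{for small } \delta.

% Example: \delta = 0.05, H = 100 gives
(1.05)^{100} \approx 131 \approx e^{4.9},
% so a 5\% per-step divergence can inflate value errors by two orders
% of magnitude over a 100-step horizon.
```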