Geoffrey Cideron

20 posts

@CdrGeo

Research Engineer at Google DeepMind. Spent time at FAIR London, INRIA Lille, and Instadeep.

Joined March 2019
410 Following · 229 Followers
Geoffrey Cideron@CdrGeo·
Happy to introduce our new paper "Diversity-Rewarded CFG Distillation". We combine distillation, a novel diversity reward, and model merging to improve the quality-diversity tradeoff of MusicLM. arxiv: arxiv.org/abs/2410.06084 More info:
Alexandre Ramé@ramealexandre

An AI will win a Nobel prize someday✨. Yet currently, alignment reduces creativity. Our new @GoogleDeepMind paper "diversity-rewarded CFG distillation" improves quality AND diversity for music, via distillation of test-time compute, RL with a diversity reward, and model merging. arxiv: arxiv.org/abs/2410.06084 website: google-research.github.io/seanet/musiclm…

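As a rough illustration of the model-merging ingredient above: merging two finetuned checkpoints can be as simple as linearly interpolating their weights, with the mixing coefficient steering the quality-diversity tradeoff at deployment time. A minimal PyTorch-style sketch, where the function and the choice of alpha are illustrative rather than the paper's exact procedure:

import copy

def merge_models(model_quality, model_diversity, alpha=0.5):
    # Linearly interpolate the weights of two finetuned checkpoints.
    # alpha=1.0 keeps the quality-focused model, alpha=0.0 the
    # diversity-focused one; values in between trade the two off.
    merged = copy.deepcopy(model_quality)
    q_state = model_quality.state_dict()
    d_state = model_diversity.state_dict()
    merged_state = {
        name: alpha * q_state[name] + (1.0 - alpha) * d_state[name]
        for name in q_state
    }
    merged.load_state_dict(merged_state)
    return merged

Because the merge happens purely in weight space, a single pair of checkpoints can serve a whole family of quality-diversity operating points without retraining.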
Geoffrey Cideron retweeted
Robert Dadashi@robdadashi·
I am very happy to announce that Gemma 1.1 Instruct 2B and “7B” are out! Here are a few details about the new models: 1/11
Geoffrey Cideron retweeted
Robert Dadashi@robdadashi·
I am so proud to see Gemma released today! I have had a fantastic time working on post-training and RLHF with an amazing team. Cannot wait to see what the community builds with these models!
Google DeepMind@GoogleDeepMind

Introducing Gemma: a family of lightweight, state-of-the-art open models for developers and researchers to build with AI. 🌐 We’re also releasing tools to support innovation and collaboration - as well as to guide responsible use. Get started now. → dpmd.ai/3UJu1Y1

Geoffrey Cideron retweeted
Johan Ferret@johanferret·
Online feedback is crucial for alignment, so we propose a simple recipe to make any direct alignment method (think DPO / IPO / SLiC-HF) online using AI feedback 🧙‍♂️ In human evals, online methods yield on avg 66% wins, 28% ties and 6% losses vs offline methods (on TL;DR) 👀
AK@_akhaliq

Direct Language Model Alignment from Online AI Feedback
paper page: huggingface.co/papers/2402.04…

Direct alignment from preferences (DAP) methods, such as DPO, have recently emerged as efficient alternatives to reinforcement learning from human feedback (RLHF) that do not require a separate reward model. However, the preference datasets used in DAP methods are usually collected ahead of training and never updated, thus the feedback is purely offline. Moreover, responses in these datasets are often sampled from a language model distinct from the one being aligned, and since the model evolves over training, the alignment phase is inevitably off-policy. In this study, we posit that online feedback is key and improves DAP methods. Our method, online AI feedback (OAIF), uses an LLM as annotator: on each training iteration, we sample two responses from the current model and prompt the LLM annotator to choose which one is preferred, thus providing online feedback. Despite its simplicity, we demonstrate via human evaluation in several tasks that OAIF outperforms both offline DAP and RLHF methods. We further show that the feedback leveraged in OAIF is easily controllable, via instruction prompts to the LLM annotator.

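The recipe in the abstract maps onto a short training loop: sample two responses from the current policy, ask an LLM annotator which one it prefers, and apply a direct alignment loss such as DPO to that fresh pair. A hedged sketch, where policy.generate, annotator_prefers, and dpo_loss are hypothetical placeholders rather than the paper's code:

def oaif_step(policy, optimizer, prompt, annotator_prefers, dpo_loss):
    # 1. Sample two candidate responses from the *current* model,
    #    so the preference pair is always on-policy.
    response_a = policy.generate(prompt)
    response_b = policy.generate(prompt)

    # 2. Ask the LLM annotator for an online preference label.
    #    annotator_prefers(prompt, a, b) -> True if a is preferred over b.
    if annotator_prefers(prompt, response_a, response_b):
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a

    # 3. One direct-alignment update (DPO, IPO, SLiC-HF, ...) on the fresh pair.
    loss = dpo_loss(policy, prompt, chosen, rejected)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

The key difference from offline DAP is step 1: because both responses come from the model being trained, the preference data never goes stale as the policy moves.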
Geoffrey Cideron retweeted
AK@_akhaliq·
Google presents MusicRL: Aligning Music Generation to Human Preferences
paper page: huggingface.co/papers/2402.04…

We propose MusicRL, the first music generation system finetuned from human feedback. Appreciation of text-to-music models is particularly subjective, since the concept of musicality as well as the specific intention behind a caption are user-dependent (e.g. a caption such as "upbeat work-out music" can map to a retro guitar solo or a techno pop beat). Not only does this make supervised training of such models challenging, it also calls for integrating continuous human feedback into their post-deployment finetuning. MusicRL is a pretrained autoregressive MusicLM (Agostinelli et al., 2023) model of discrete audio tokens finetuned with reinforcement learning to maximise sequence-level rewards. We design reward functions related specifically to text adherence and audio quality with the help of selected raters, and use those to finetune MusicLM into MusicRL-R. We deploy MusicLM to users and collect a substantial dataset comprising 300,000 pairwise preferences. Using Reinforcement Learning from Human Feedback (RLHF), we train MusicRL-U, the first text-to-music model that incorporates human feedback at scale. Human evaluations show that both MusicRL-R and MusicRL-U are preferred to the baseline. Ultimately, MusicRL-RU combines the two approaches and results in the best model according to human raters. Ablation studies shed light on the musical attributes influencing human preferences, indicating that text adherence and quality only account for part of it. This underscores the prevalence of subjectivity in musical appreciation and calls for further involvement of human listeners in the finetuning of music generation models.
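The core finetuning step described here, maximising a sequence-level reward with RL over discrete audio tokens, can be pictured as a plain REINFORCE update. A sketch under assumptions: sample_with_log_probs and reward_fn are hypothetical interfaces, and MusicRL's actual algorithm and regularisation are not reproduced:

def reinforce_step(policy, optimizer, prompt, reward_fn):
    # Sample a full audio-token sequence and keep its per-token log-probs.
    # (sample_with_log_probs is a hypothetical policy interface.)
    tokens, log_probs = policy.sample_with_log_probs(prompt)

    # One scalar reward for the whole sequence, e.g. from a learned
    # text-adherence or audio-quality reward model.
    reward = reward_fn(prompt, tokens)

    # Policy gradient: raise the log-probability of high-reward sequences.
    loss = -reward * log_probs.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward

Because the reward is only available once the sequence is complete, the update is necessarily sequence-level rather than token-level, which is exactly the setting RLHF-style finetuning targets.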
Geoffrey Cideron retweeted
Johan Ferret@johanferret·
Our #ACL2023 paper "Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback" is now on arXiv! tl;dr - we improve the factuality of summaries via RL, without human feedback! 📜 arxiv.org/abs/2306.00186 Thread (1/10) 👇
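The entailment-feedback idea is easy to prototype: score each candidate summary by how strongly an off-the-shelf NLI model says the source document entails it, and use that score as the RL reward. A small sketch with Hugging Face transformers, where the checkpoint choice is an assumption and not necessarily the paper's setup:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Off-the-shelf NLI model (an assumption; any entailment model would do).
MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
nli_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def entailment_reward(document: str, summary: str) -> float:
    # Probability that the document entails the summary, used as the RL reward.
    inputs = tokenizer(document, summary, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    probs = logits.softmax(dim=-1)[0]
    return probs[nli_model.config.label2id["ENTAILMENT"]].item()

Note that no human labels appear anywhere in this loop; the NLI model alone supplies the factuality signal, which is the point of the tl;dr above.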
Geoffrey Cideron retweeted
ëugene kharitonov 🏴‍☠️
We* are looking for a Student Researcher** to work with us on a project at the intersection of modeling/generating speech/audio, NLP, and representation learning. *AudioLM team @ Google Research (@zalanborsos, @neilzegh, myself and many others!) **not-last-year PhD student
Geoffrey Cideron retweeted
Olivier Bachem@OlivierBachem·
A common belief is that text autoencoders produce badly structured latent spaces with holes. We were surprised to find that using round-trip translations (e.g. en->de->en) one can obtain nicely structured latent spaces. Check out arxiv.org/pdf/2209.06792….
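Round-trip translation itself is simple to reproduce with a pair of translation models; the paper's finding concerns the structure of the latent space such an auto-encoder induces. A toy sketch using public MarianMT checkpoints, chosen here purely for illustration:

from transformers import MarianMTModel, MarianTokenizer

# Public MarianMT checkpoints standing in for the paper's translation models.
tok_fwd = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model_fwd = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")
tok_bwd = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
model_bwd = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-de-en")

def round_trip(sentence: str) -> str:
    # en -> de -> en; the intermediate representation in the middle of this
    # round trip is what the paper studies as a latent space.
    de_ids = model_fwd.generate(**tok_fwd(sentence, return_tensors="pt"))
    german = tok_fwd.batch_decode(de_ids, skip_special_tokens=True)[0]
    en_ids = model_bwd.generate(**tok_bwd(german, return_tensors="pt"))
    return tok_bwd.batch_decode(en_ids, skip_special_tokens=True)[0]

print(round_trip("Round-trip translation gives surprisingly smooth latent spaces."))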
Geoffrey Cideron retweeted
Johan Ferret@johanferret·
Excited to announce that our #AAMAS2022 paper "Lazy-MDPs: Towards Interpretable RL by Learning When to Act" is on arXiv! 🦥 tl;dr - we introduce lazy-MDPs, modified MDPs that allow agents to defer decision-making to a third-party policy 📜 arxiv.org/abs/2203.08542 🧵👇
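The lazy-MDP construction augments the action space with one extra "lazy" action: taking it hands control to a default policy, while acting yourself incurs a small penalty, so the agent learns to intervene only in states where it matters. A minimal Gymnasium-style wrapper, assuming a discrete action space; the penalty value and interfaces are illustrative, not the paper's exact formulation:

import gymnasium as gym

class LazyMDP(gym.Wrapper):
    # Action n (one past the base action space) means "defer to the default
    # policy"; any other action is executed directly but costs a penalty.
    def __init__(self, env, default_policy, act_penalty=0.1):
        super().__init__(env)
        self.default_policy = default_policy
        self.act_penalty = act_penalty
        self.lazy_action = env.action_space.n
        self.action_space = gym.spaces.Discrete(env.action_space.n + 1)
        self._last_obs = None

    def reset(self, **kwargs):
        self._last_obs, info = self.env.reset(**kwargs)
        return self._last_obs, info

    def step(self, action):
        if action == self.lazy_action:
            action = self.default_policy(self._last_obs)  # defer, no penalty
            penalty = 0.0
        else:
            penalty = self.act_penalty  # price of taking control yourself
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._last_obs = obs
        return obs, reward - penalty, terminated, truncated, info

One appealing side effect of this setup is interpretability: the states in which the trained agent chooses not to be lazy are exactly the states it considers decision-critical.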
Geoffrey Cideron@CdrGeo·
It was great to work with @AmartyaSanyal, @_rockt, and @egrefen at FAIR London. This line of research is fascinating! Thank you for the opportunity! Additional gratitude to @RCalandra for the support and advice.
Edward Grefenstette@egrefen

I've been thinking a lot about this work recently, esp. the fascinating ML problems that emerge when you want to solve it without generating doc/env variants. Ongoing work on this with @AmartyaSanyal+@CdrGeo who I had the pleasure of remotely hosting as interns this year. [3/14]
