Sergej Chicherin

564 posts

@chroneus

carpe diem

Joined April 2008
59 Following, 39 Followers
Al Dragon@aldragon_net·
A druid waits for the ray of sunlight that has passed between the megaliths to fall on the central obelisk, heralding the start of the renovation.
Al Dragon tweet media
Elliott@EllSvenkeson·
@chroneus @nrehiew_ I am interested in friendly discourse on the subject. Why isn't this the smallest N of pieces that fulfills the requirements?
Elliott tweet media
wh@nrehiew_·
I’m no mathematician, but curious: 1) why did no lab solve this, 2) is this much harder than P1-5, and 3) what specifically about this problem is difficult?
wh tweet media
Elliott@EllSvenkeson·
@nrehiew_ 🤔 The answer I came up with is a fractal of squares filling two opposite corners of a larger square with a tiny diagonal line of empty pixels.
Sergej Chicherin@chroneus·
@mushenka_ I drank it in my youth, my friends got sick, and later I was told it is actually a balm meant to be rubbed in.
муша@mushenka_·
what if I do finally drink that stuff with the snake and the scorpion that has been sitting in my home for about 15 years
муша tweet media
Sergej Chicherin@chroneus·
convenient that there is a live Christmas tree in the yard left over from last year; I never even took the string lights down
Sergej Chicherin@chroneus·
@karpathy Did not get what the author wanted to say, but the right position looks worse: if they take on N8, black's N7 and Q9 make white lose almost 20 points.
Andrej Karpathy@karpathy·
# RLHF is just barely RL

Reinforcement Learning from Human Feedback (RLHF) is the third (and last) major stage of training an LLM, after pretraining and supervised finetuning (SFT). My rant on RLHF is that it is just barely RL, in a way that I think is not too widely appreciated. RL is powerful. RLHF is not.

Let's take a look at the example of AlphaGo. AlphaGo was trained with actual RL. The computer played games of Go and trained on rollouts that maximized the reward function (winning the game), eventually surpassing the best human players at Go. AlphaGo was not trained with RLHF. If it were, it would not have worked nearly as well.

What would it look like to train AlphaGo with RLHF? Well first, you'd give human labelers two board states from Go, and ask them which one they like better. Then you'd collect, say, 100,000 comparisons like this, and you'd train a "Reward Model" (RM) neural network to imitate this human "vibe check" of the board state. You'd train it to agree with the human judgement on average. Once we have a Reward Model vibe check, you run RL with respect to it, learning to play the moves that lead to good vibes.

Clearly, this would not have led anywhere too interesting in Go. There are two fundamental, separate reasons for this:

1. The vibes could be misleading - this is not the actual reward (winning the game). This is a crappy proxy objective. But much worse,
2. You'd find that your RL optimization goes off the rails as it quickly discovers board states that are adversarial examples to the Reward Model. Remember, the RM is a massive neural net with billions of parameters imitating the vibe. There are board states that are "out of distribution" to its training data, which are not actually good states, yet by chance they get a very high reward from the RM.

For the exact same reasons, sometimes I'm a bit surprised RLHF works for LLMs at all. The RM we train for LLMs is just a vibe check in the exact same way. It gives high scores to the kinds of assistant responses that human raters statistically seem to like. It's not the "actual" objective of correctly solving problems, it's a proxy objective of what looks good to humans. Second, you can't even run RLHF for too long because your model quickly learns to respond in ways that game the reward model. These predictions can look really weird, e.g. you'll see that your LLM Assistant starts to respond with something non-sensical like "The the the the the the" to many prompts. Which looks ridiculous to you, but then you look at the RM vibe check and see that for some reason the RM thinks these look excellent. Your LLM found an adversarial example. It's out of domain w.r.t. the RM's training data, in an undefined territory. Yes, you can mitigate this by repeatedly adding these specific examples into the training set, but you'll find other adversarial examples next time around. For this reason, you can't even run RLHF for too many steps of optimization. You do a few hundred/thousand steps and then you have to call it, because your optimization will start to game the RM. This is not RL like AlphaGo was.

And yet, RLHF is a net helpful step of building an LLM Assistant. I think there's a few subtle reasons, but my favorite one to point to is that through it, the LLM Assistant benefits from the generator-discriminator gap. That is, for many problem types, it is a significantly easier task for a human labeler to select the best of a few candidate answers, instead of writing the ideal answer from scratch.
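For the reward-model step described above, a minimal PyTorch sketch of training on pairwise human preferences might look like the following. Everything here is an illustrative assumption rather than anything from the post: the names RewardModel, preference_loss and EMBED_DIM are hypothetical, and a real RM would be a large transformer scoring (prompt, response) text rather than fixed-size embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 128  # hypothetical embedding size for a candidate response

class RewardModel(nn.Module):
    """Toy reward model: maps a response embedding to a scalar 'vibe' score."""
    def __init__(self, embed_dim: int = EMBED_DIM):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),  # one scalar score per response
        )

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(response_embedding).squeeze(-1)

def preference_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: the human-preferred ("chosen") response
    # should score higher than the rejected one, on average.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

rm = RewardModel()
optimizer = torch.optim.Adam(rm.parameters(), lr=1e-4)

# Stand-in for the ~100k human comparisons: random embeddings for
# chosen/rejected response pairs.
chosen = torch.randn(64, EMBED_DIM)
rejected = torch.randn(64, EMBED_DIM)

optimizer.zero_grad()
loss = preference_loss(rm(chosen), rm(rejected))
loss.backward()
optimizer.step()
```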
A good example is a prompt like "Generate a poem about paperclips" or something like that. An average human labeler will struggle to write a good poem from scratch as an SFT example, but they could select a good-looking poem given a few candidates. So RLHF is a kind of way to benefit from this gap of "easiness" of human supervision.

There's a few other reasons, e.g. RLHF is also helpful in mitigating hallucinations, because if the RM is a strong enough model to catch the LLM making stuff up during training, it can learn to penalize this with a low reward, teaching the model an aversion to risking factual knowledge when it's not sure. But a satisfying treatment of hallucinations and their mitigations is a whole different post, so I digress.

All to say that RLHF *is* net useful, but it's not RL. No production-grade *actual* RL on an LLM has so far been convincingly achieved and demonstrated in an open domain, at scale. And intuitively, this is because getting actual rewards (i.e. the equivalent of winning the game) is really difficult in open-ended problem solving tasks. It's all fun and games in a closed, game-like environment like Go, where the dynamics are constrained and the reward function is cheap to evaluate and impossible to game. But how do you give an objective reward for summarizing an article? Or answering a slightly ambiguous question about some pip install issue? Or telling a joke? Or re-writing some Java code to Python?

Going towards this is not in principle impossible, but it's also not trivial and it requires some creative thinking. But whoever convincingly cracks this problem will be able to run actual RL. The kind of RL that led to AlphaGo beating humans in Go. Except this LLM would have a real shot of beating humans in open-domain problem solving.
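The "select the best of a few candidates" idea also has the simplest possible code form: best-of-N sampling against a reward model. This is again a hedged, hypothetical sketch (the stand-in reward_model and best_of_n are invented for illustration), not the RLHF procedure the post describes; in practice the candidates would be sampled completions from the LLM.

```python
import torch
import torch.nn as nn

EMBED_DIM = 128
# Hypothetical stand-in for a trained reward model scoring response embeddings.
reward_model = nn.Sequential(nn.Linear(EMBED_DIM, 256), nn.ReLU(), nn.Linear(256, 1))

def best_of_n(candidate_embeddings: torch.Tensor) -> int:
    # Score every candidate with the reward model and return the index of the best one.
    with torch.no_grad():
        scores = reward_model(candidate_embeddings).squeeze(-1)
    return int(torch.argmax(scores).item())

# e.g. embeddings of 4 sampled "poems about paperclips"
candidates = torch.randn(4, EMBED_DIM)
best_index = best_of_n(candidates)
```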
Andrej Karpathy tweet media
Sergej Chicherin@chroneus·
realized that cold-storage bitcoin is the perfect investment for people like me. When it was falling from 50 to 40, I was simply too lazy to run to an exchange in a panic that everything was lost.
Sergej Chicherin@chroneus·
@LinusEkenstam Could not see the shadow from the individual strand of hair in the left corner. Typically, neither SD nor a GAN can trace the light.
Linus ✦ Ekenstam@LinusEkenstam·
Now we are in the territory where it's hard to tell anymore if this is a photo or AI generated. Critique this.
Linus ✦ Ekenstam tweet media
Sergej Chicherin@chroneus·
Ivan the Terrible ordered the boyar Kozarinov-Golokhvatov, who had taken the schema (monastic vows) to escape execution, to be blown up on a barrel of gunpowder, on the grounds that schema monks are angels and therefore must fly to heaven.
Sergej Chicherin@chroneus·
@magicianbrain I saw several GAN-produced images in exhibitions but never met anyone who used Copilot for coding. The criteria for programmers are more definite, so cutting AI-generated contributions is out of the question for them.
brainlet@magicianbrain·
why is it that artists are so much more hostile to AI making art than programmers are to AI writing code
Moon@paywithmoon·
@chroneus We are aware of the issue and working to fix it.
Sergej Chicherin@chroneus·
@paywithmoon I’ve had a pending transaction for several days. When I click "complete payment" it shows "sorry there was a problem loading this page".
Sergej Chicherin@chroneus·
@paywithmoon It was at the card-creation step. I paid in BTC, the transaction confirmed, but it shows as underpaid by a few cents and voilà: no card, no money, and nothing I can do about it.
Moon@paywithmoon·
@chroneus Can you DM us details? Card support code and the merchant you're spending at?
Sergej Chicherin@chroneus·
@WaitSpotyRussia We have already started creating a copy of your personal data. This usually takes no more than 30 days.
Not Spotify Russia@WaitSpotyRussia·
Spotify did not launch in Russia today
Sergej Chicherin@chroneus·
programmer stuff
Sergej Chicherin tweet media
Sergej Chicherin@chroneus·
Actually, it's on Ostrovityanova
Sergej Chicherin tweet media
Sergej Chicherin@chroneus·
On a walk we spotted a fox running by. Me, in the intonation of a Soviet cartoon: "or maybe it's an angry ostrich," pause, "or maybe not angry at all." The dogs exchange glances and calm down.
Dr. Evgeniy@larsmars·
The word "smile" doesn't even exist in Latin. Smiling was an invention of the Middle Ages
Sergej Chicherin@chroneus·
No — like a bacchante under the spell of the Cadmean god, the queen rushes into the forest thickets, having abandoned the bridal chamber.