Hiroaki Kitano

1.3K posts

@HKitano

Joined January 2010
156 Following · 809 Followers
Hiroaki Kitano retweeted
Kanika @KanikaBK
🚨I JUST READ SOMETHING SHOCKING. Researchers just trained an AI to predict which scientific ideas will succeed before any experiment is run. It is now better at judging research than GPT-5.2, Gemini 3 Pro, and every top AI model on the market. And it learned by studying 2.1 million research papers without a single human scientist teaching it what "good science" looks like.

Here is what they did. A team of Chinese researchers built two AI systems. The first, called Scientific Judge, was trained on 700,000 matched pairs of high-citation vs low-citation papers. Every pair came from the same field and the same time period. The AI's only job: figure out which paper would have more impact. It worked. The AI now predicts which research will succeed with 83.7% accuracy. That is higher than GPT-5.2. Higher than Gemini 3 Pro. Higher than every frontier model that exists.

Then they built the second system. Scientific Thinker doesn't just judge ideas. It proposes them. You give it a research paper, and it generates a follow-up idea with high potential impact. When tested head to head against GPT-5.2, Scientific Thinker's ideas were rated as higher impact 61% of the time. It is generating better research directions than the smartest AI models in the world.

It gets stranger. They trained the Judge only on computer science papers. Then they tested it on biology. Physics. Mathematics. Fields it had never seen. It still worked. 71% accuracy on biology papers it was never trained on. The AI didn't learn what makes good computer science. It learned what makes good science, period.

Then the researchers tested whether it could see the future. They trained it on papers through 2024, then asked it to judge 2025 papers. It predicted which ones would gain traction with 74% accuracy. The AI learned to spot winners before the scientific community did.

Here is what nobody is talking about. A 1.5 billion parameter model, tiny by today's standards, jumped from 7% to 72% accuracy after training. That is a 65-point leap. The ability to judge scientific quality isn't some emergent property of massive models. It can be taught to small, cheap, fast AI systems that anyone can run.

Every year, over 2 million papers flood scientific databases. Researchers spend months deciding what to work on next. Grant committees spend billions deciding what to fund. An AI just learned to make those decisions faster, cheaper, and more accurately than any of them. If an AI can now judge which ideas will shape the future of science, what exactly is left that only a human scientist can do?
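The "matched pairs" setup described above can be sketched as a Bradley-Terry style pairwise classifier: assign each paper a score, and train so that the higher-citation paper in each pair gets the higher score. The sketch below is entirely my illustration, not the paper's actual method; the linear model, the synthetic feature vectors, and the hidden "quality" direction are all assumptions for demonstration.

```python
# A minimal pairwise-judge sketch: learn a scoring function from pairs where
# paper A got more citations than paper B, via a Bradley-Terry logistic loss.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic data: each paper is a feature vector; a hidden "quality" direction
# decides which paper in a pair ends up with more citations.
d = 8
true_w = rng.normal(size=d)
A = rng.normal(size=(2000, d))
B = rng.normal(size=(2000, d))
swap = (A @ true_w) < (B @ true_w)           # ensure A is the high-citation paper
A[swap], B[swap] = B[swap].copy(), A[swap].copy()

# Train: maximize mean log P(A beats B) = mean log sigmoid(w·A - w·B)
w = np.zeros(d)
lr = 0.1
for _ in range(200):
    p = sigmoid((A - B) @ w)                 # predicted prob. that A wins
    grad = (A - B).T @ (1.0 - p) / len(A)    # gradient of mean log-likelihood
    w += lr * grad

acc = np.mean(((A - B) @ w) > 0)             # pairwise accuracy on training pairs
print(f"pairwise accuracy: {acc:.2f}")
```

The only supervision is which paper of each pair "won", which is why no human labels of "good science" are needed beyond citation outcomes.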
Hiroaki Kitano @HKitano
Sony's Chief Technology Fellow and the Editor-in-Chief of Nature discuss "Why we are spotlighting early- and mid-career women researchers" | Forbes JAPAN official site forbesjapan.com/articles/detai…
ミド建築・都市観測所 @Mid_observatory
On Tokyo's heat, announcer Azumi said: "What I really can't forgive are the high-rise condominiums built on the waterfront. If I became governor of Tokyo, I'd demolish them all. Sorry to the people living there, but I'd tear every one of them down. Let's create corridors for the wind." He's right: if the waterfront towers were gone, it would be cooler. news.yahoo.co.jp/articles/6d851…
Hiroaki Kitano @HKitano
@rkmt Will the same thing happen once Organoid Intelligence arrives?
Jun Rekimoto: 暦本純一
When talking with OI (organic intelligence, i.e., a human), the cognitive load is high: you have to mind the other person's feelings, constantly simulating "how would they feel if I said this?" while you converse. With AI there is no such load, so communication is easy.
null-sensei@GOROman

This is based on my own experience: once you summon an AI engineer into your company Slack and get used to working with it, asking humans for things becomes really stressful. With humans there are all these small stresses, like being considerate, trying not to upset them, wondering "is now an okay time to talk?", and with AI you feel none of them.

Hiroaki Kitano retweeted
Sony AI @SonyAI_global
GT Sophy 2.1 is now available in Custom Race on all GT Sophy supported tracks and layouts in World Circuits (PS5 only). See you on the track! bit.ly/4jbB84w #SonyAI #AI #GTSophy #GranTurismo
Demis Hassabis @demishassabis
Massive congratulations to my good friend and former Google colleague @geoffreyhinton on winning the Nobel Prize in Physics (with John Hopfield)! Incredibly well deserved, Geoff laid the foundations for the deep learning revolution that underpins the modern AI field.
The Nobel Prize@NobelPrize

BREAKING NEWS The Royal Swedish Academy of Sciences has decided to award the 2024 #NobelPrize in Physics to John J. Hopfield and Geoffrey E. Hinton “for foundational discoveries and inventions that enable machine learning with artificial neural networks.”

Hiroaki Kitano retweeted
Comphies @ComphiesTees
Sony Research and AI Singapore Sign MOU to Conduct Collaborative Research on Large Language Models for Southeast Asian Languages: The collaboration will begin with an initial focus on exploring SEA-LION for the Tamil language TOKYO and SINGAPORE, Sept.… dlvr.it/TD2NZt
Hiroaki Kitano retweeted
Yu Gu @yugu_nlp
Karpathy's post on RLHF resonates a lot with some of my recent thinking. A quick summary:

1. RL with a learned reward model is not the same as RL with true rewards, and this is the main reason why LLMs cannot be a superhuman general problem solver the way AlphaGo is for playing Go.
2. The learned reward model may have some OOD holes and might be gamed by the policy model.
3. A main advantage of RLHF comes from the discriminator-generator gap of humans, i.e., it's easier for humans to tell which outputs are good than to write good outputs.

In addition to these, here are my two cents (in particular on language agents research):

1. The discriminator-generator gap may exist not only for humans but also for AI; training a discriminator (i.e., a reward model) would lead to a more robust and generalizable decision boundary than directly training the generator or the policy on the same training data (e.g., via SFT or DPO; many works have shown that RL is much more robust than DPO in OOD settings).
2. As a result, RL with a reward model in a sense closes the discriminator-generator gap of the model, i.e., it makes the generator's decision boundary more aligned with the reward model's. In that sense, the reward model's capacity might be an upper bound on the learned policy.
3. For agent tasks, if we have a good process reward model, then probably the first thing we want to do with it is planning, rather than RL, since the step-wise search space is much smaller than chat or open-ended generation (e.g., 20 actions for an Android app), and we can easily approach the "upper bound" by using the reward model as a discriminator or a ranker.
4. That said, I think no one has systematically studied the aforementioned gap, and we may still need more analyses to understand whether and when RL is needed for agent tasks.
5. My personal belief and intuition is that if we want to do RL for language agents, the only meaningful way is to use the actual rewards (probably human-defined).
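The planning-over-RL point (a good process reward model plus a small per-step action space means you can just rank actions) can be sketched in a few lines. Everything below is a toy of my own construction, not Yu Gu's setup: the number-line task and the hand-written stand-in PRM are illustrative assumptions.

```python
# Toy sketch: with ~20 discrete actions per step (like an app's UI actions),
# a process reward model (PRM) can be used directly as a ranker for greedy
# planning, with no RL training loop at all.

TARGET = 42  # hypothetical task: walk a number line to a target value

def process_reward(state: int, action: int) -> float:
    """Stand-in PRM: scores how promising taking `action` in `state` looks.
    A real PRM would be a learned model; this one rewards getting closer."""
    return -abs((state + action) - TARGET)

def plan_greedy(state: int, actions, max_steps: int = 50) -> int:
    """Each step, rank all candidate actions with the PRM and take the best."""
    for _ in range(max_steps):
        best = max(actions, key=lambda a: process_reward(state, a))
        state += best
        if state == TARGET:
            break
    return state

ACTIONS = list(range(-10, 10))  # 20 discrete actions
print(plan_greedy(0, ACTIONS))
```

Because the search space per step is tiny, the PRM's judgement can be applied exhaustively at decision time, which is exactly why planning can "easily approach the upper bound" the reward model defines.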
Andrej Karpathy@karpathy

# RLHF is just barely RL

Reinforcement Learning from Human Feedback (RLHF) is the third (and last) major stage of training an LLM, after pretraining and supervised finetuning (SFT). My rant on RLHF is that it is just barely RL, in a way that I think is not too widely appreciated. RL is powerful. RLHF is not.

Let's take a look at the example of AlphaGo. AlphaGo was trained with actual RL. The computer played games of Go and trained on rollouts that maximized the reward function (winning the game), eventually surpassing the best human players at Go. AlphaGo was not trained with RLHF. If it were, it would not have worked nearly as well.

What would it look like to train AlphaGo with RLHF? Well, first, you'd give human labelers two board states from Go and ask them which one they like better. Then you'd collect, say, 100,000 comparisons like this, and you'd train a "Reward Model" (RM) neural network to imitate this human "vibe check" of the board state. You'd train it to agree with the human judgement on average. Once we have a Reward Model vibe check, you run RL with respect to it, learning to play the moves that lead to good vibes. Clearly, this would not have led anywhere too interesting in Go. There are two fundamental, separate reasons for this:

1. The vibes could be misleading - this is not the actual reward (winning the game). This is a crappy proxy objective. But much worse,
2. You'd find that your RL optimization goes off the rails as it quickly discovers board states that are adversarial examples to the Reward Model. Remember, the RM is a massive neural net with billions of parameters imitating the vibe. There are board states that are "out of distribution" to its training data, which are not actually good states, yet by chance get a very high reward from the RM.

For the exact same reasons, sometimes I'm a bit surprised RLHF works for LLMs at all. The RM we train for LLMs is just a vibe check in the exact same way. It gives high scores to the kinds of assistant responses that human raters statistically seem to like. It's not the "actual" objective of correctly solving problems; it's a proxy objective of what looks good to humans. Second, you can't even run RLHF for too long because your model quickly learns to respond in ways that game the reward model. These predictions can look really weird; e.g., you'll see that your LLM Assistant starts to respond with something nonsensical like "The the the the the the" to many prompts. Which looks ridiculous to you, but then you look at the RM vibe check and see that for some reason the RM thinks these look excellent. Your LLM found an adversarial example. It's out of domain w.r.t. the RM's training data, in undefined territory. Yes, you can mitigate this by repeatedly adding these specific examples to the training set, but you'll find other adversarial examples next time around. For this reason, you can't run RLHF for too many steps of optimization; you do a few hundred/thousand steps and then you have to call it, because your optimization will start to game the RM. This is not RL like AlphaGo was.

And yet, RLHF is a net helpful step in building an LLM Assistant. I think there are a few subtle reasons, but my favorite one to point to is that through it, the LLM Assistant benefits from the generator-discriminator gap. That is, for many problem types, it is a significantly easier task for a human labeler to select the best of a few candidate answers than to write the ideal answer from scratch. A good example is a prompt like "Generate a poem about paperclips" or something like that. An average human labeler will struggle to write a good poem from scratch as an SFT example, but they could select a good-looking poem given a few candidates. So RLHF is a way to benefit from this gap in the "easiness" of human supervision. There are a few other reasons; e.g., RLHF is also helpful in mitigating hallucinations, because if the RM is a strong enough model to catch the LLM making stuff up during training, it can learn to penalize this with a low reward, teaching the model an aversion to risking factual knowledge when it's not sure. But a satisfying treatment of hallucinations and their mitigations is a whole different post, so I digress.

All to say that RLHF *is* net useful, but it's not RL. No production-grade *actual* RL on an LLM has so far been convincingly achieved and demonstrated in an open domain, at scale. And intuitively, this is because getting actual rewards (i.e., the equivalent of winning the game) is really difficult in open-ended problem-solving tasks. It's all fun and games in a closed, game-like environment like Go, where the dynamics are constrained and the reward function is cheap to evaluate and impossible to game. But how do you give an objective reward for summarizing an article? Or answering a slightly ambiguous question about some pip install issue? Or telling a joke? Or rewriting some Java code to Python? Going towards this is not in principle impossible, but it's also not trivial, and it requires some creative thinking. Whoever convincingly cracks this problem will be able to run actual RL. The kind of RL that led to AlphaGo beating humans in Go. Except this LLM would have a real shot at beating humans in open-domain problem solving.
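The two failure modes above (a crappy proxy objective, and out-of-distribution adversarial examples the optimizer discovers) can be shown in a toy one-dimensional setting. The reward functions and the OOD "hole" below are my construction for illustration, not anything from the post: a learned RM matches the true reward in-distribution, and a hard search against the RM lands in the hole rather than near the true optimum.

```python
# Toy Goodhart demo: optimizing a learned reward model instead of the true
# reward finds an out-of-distribution input the RM spuriously loves.

def true_reward(x: float) -> float:
    return -(x - 1.0) ** 2          # real objective: best at x = 1

def reward_model(x: float) -> float:
    # Imitates true_reward on the "training distribution" (x in [-2, 2]),
    # but was never trained far away, where it mistakenly assigns huge reward.
    if -2.0 <= x <= 2.0:
        return -(x - 1.0) ** 2      # good proxy in-distribution
    return 100.0                    # OOD hole: spuriously high score

candidates = [i / 10 for i in range(-100, 101)]   # search over x in [-10, 10]
best = max(candidates, key=reward_model)          # "RL" as brute-force search
print(best, reward_model(best), true_reward(best))
```

The searcher reports an input the RM scores far above anything in-distribution, while the true reward there is terrible: the one-dimensional analogue of "The the the the the the".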
