

arlo_son

@gson_AI
Undergraduate @ Yonsei. UIC Economics.




#NLProc AI Co-Scientists 🤖 can generate ideas, but can they spot mistakes? (not yet! 🚫) In my recent paper, we introduce SPOT, a dataset of STEM manuscripts (math, materials science, chemistry, physics, etc.) annotated with real errors. SOTA models like o3 and gemini-2.5-pro also struggle badly! arxiv.org/abs/2505.11855





If you're at #NAACL25, don't miss @gson_AI presenting our paper on localizing MMLU to Korea! Session C: Oral/Poster 2: 2pm-3.30pm

🌟 KMMLU 🌟 This benchmark replicates the methodology that produced MMLU, but using examinations common in Korea. We manually annotate a subset of the questions as to whether they require Korea-specific knowledge, and we also designate a KMMLU-Hard subset that current models find especially challenging. We benchmark 26 openly available and proprietary models, including Qwen, Yi, Llama-2, Polyglot-Ko, GPT-3.5/4, Gemini-Pro, and HyperCLOVA X. To our surprise, GPT-4 outperforms them all. However, when limited to questions requiring Korea-specific knowledge, HyperCLOVA X seems to be better. 🎖️ Paper: arxiv.org/abs/2402.11548 Dataset: huggingface.co/datasets/HAERA…
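For reference, a minimal sketch of how an MMLU-style multiple-choice benchmark like this can be scored (this is not the official KMMLU evaluation code; `query_model` is a hypothetical stand-in for whatever chat/completions client you use):

```python
# Minimal MMLU-style scoring sketch: format each four-option question,
# ask the model for a single letter, and compute accuracy.
from typing import Callable

CHOICES = ["A", "B", "C", "D"]

def format_prompt(question: str, options: list[str]) -> str:
    lines = [question]
    lines += [f"{letter}. {opt}" for letter, opt in zip(CHOICES, options)]
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)

def score(examples: list[dict], query_model: Callable[[str], str]) -> float:
    """examples: [{"question": str, "options": [str x4], "answer": "A".."D"}]"""
    correct = 0
    for ex in examples:
        reply = query_model(format_prompt(ex["question"], ex["options"]))
        # Take the first A/B/C/D character in the reply as the model's choice.
        pred = next((ch for ch in reply.strip().upper() if ch in CHOICES), None)
        correct += int(pred == ex["answer"])
    return correct / len(examples)
```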



>RL'd 7B model only follows the XML formatting 99.5% of the time, not 100%
>look at the actual model outputs
>it violates the formatting only rarely, but with structure: a single token appears after </answer> before the EOT, always "assed" or "inati"
..."passed" or "terminated"? hm.
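A minimal sketch of this kind of formatting audit (the `<think>`/`<answer>` tag names are an assumption, not taken from the post): flag completions where anything follows the closing `</answer>` tag and tally the stray trailing fragments.

```python
# Collect the stray tokens that appear after </answer> in otherwise
# well-formatted completions.
import re
from collections import Counter

STRICT = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL)

def audit(completions: list[str]) -> Counter:
    stray = Counter()
    for text in completions:
        if STRICT.fullmatch(text):
            continue  # fully compliant with the format
        end = text.rfind("</answer>")
        if end != -1:
            tail = text[end + len("</answer>"):].strip()
            if tail:
                stray[tail] += 1  # e.g. "assed", "inati"
    return stray
```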




++ Reinforcement Learning for LLMs in 2025 ++
===
How do we elicit improved reasoning from models?
- Is reasoning innately present in pre-training datasets, and does it just need the right examples to be brought out?
- Why does GRPO make sense, as opposed to supervised fine-tuning with the right examples?

My general sense is that GRPO (or PPO or ORPO) may not offer all that much benefit over SFT. In fact, they are generally more complex. What matters is how the fine-tuning data is created.

This is the first video in a series on Reinforcement Learning. Maybe you're looking to dig directly into GRPO - but I think that's the wrong way to look at things. A better, ground-up approach is to:
a) start with careful performance measurement (there are gotchas even around how one marks answers correct or not - see the sketch after the timestamps),
b) then carefully think about data preparation,
c) then do Supervised Fine-Tuning, and only then
d) start to look at preference and reward methods.

Definitely leave comments if i) you see things that can be improved or that I've made mistakes on, or ii) you have a specific reasoning dataset in mind that would be useful to see a demo on in the future.

--- + Timestamps ---
00:00 Introduction to Reinforcement Learning
00:56 Practical Programming for RL
01:59 Setting Up the Environment
02:40 Cloning and Configuring Repositories
04:10 Understanding the Dataset
05:03 Supervised Fine Tuning and Reinforcement Learning
08:54 Downloading and Preparing the Dataset
09:09 Installing Necessary Libraries
13:58 Implementing the Answer Checker
22:30 Running Inference and Evaluating Performance
28:53 Analyzing Results and Setting Baselines
31:03 Batch Inference Script Breakdown
38:00 Preparing for Reinforcement Learning
38:57 Understanding Think Tags in Dataset Generation
39:36 Improving Performance with Supervised Fine Tuning
41:09 Creating and Filtering the Dataset
41:20 Introduction to Preference Fine Tuning
42:21 Generating ORPO Pairs
46:03 Training the Model with Supervised Fine Tuning
49:26 Setting Up and Running the Training Script
50:47 Evaluating the Model's Performance
01:02:35 Exploring ORPO Training
01:07:49 Theory and History of Reinforcement Learning
01:13:54 Final Evaluation and Insights
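As a concrete illustration of the answer-checking gotchas in step a) (my own sketch, not code from the video): naive string equality marks "3,200", "3200.0", and "$3200" as wrong, so normalize and grade only the final number the model produces.

```python
# Hedged answer-checker sketch for numeric reasoning datasets:
# extract the last number in the output, strip formatting, compare with tolerance.
import re

NUM = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def last_number(text: str) -> float | None:
    matches = NUM.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def is_correct(model_output: str, gold_answer: str, tol: float = 1e-6) -> bool:
    pred, gold = last_number(model_output), last_number(gold_answer)
    if pred is None or gold is None:
        return False
    return abs(pred - gold) <= tol
```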

We reproduced DeepSeek R1-Zero in the CountDown game, and it just works. Through RL, the 3B base LM develops self-verification and search abilities all on its own. You can experience the Aha moment yourself for < $30. Code: github.com/Jiayi-Pan/Tiny… Here's what we learned 🧵
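For context, a minimal sketch of a rule-based reward for the CountDown task (an illustration of the idea, not the TinyZero code): reward 1.0 only if the proposed arithmetic expression uses the given numbers (each at most once, in this simplified version) and evaluates to the target, else 0.0.

```python
# Verifiable reward for CountDown: parse the expression safely with ast,
# check the numbers used, and compare the result to the target.
import ast, operator
from collections import Counter

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    raise ValueError("disallowed expression")

def countdown_reward(expr: str, numbers: list[int], target: int) -> float:
    try:
        tree = ast.parse(expr, mode="eval")
        used = [n.value for n in ast.walk(tree) if isinstance(n, ast.Constant)]
        if Counter(used) - Counter(numbers):  # used a number not provided
            return 0.0
        return 1.0 if abs(_eval(tree.body) - target) < 1e-6 else 0.0
    except (ValueError, SyntaxError, ZeroDivisionError):
        return 0.0

# Example: countdown_reward("(100 - 4) * 25 / 2", [100, 4, 25, 2], 1200) -> 1.0
```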




