Junhyuck Kim

29 posts

@jhyuckkim

Researcher @Krafton_AI (@PUBG) Prev @CambridgeMLG

Joined August 2023
265 Following · 84 Followers
Pinned Tweet
Junhyuck Kim (@jhyuckkim)
Almost all "flagship" models are now MoEs. But smaller models still prefer to be dense as they target memory-constrained scenarios where total params matter. So we ask: Can we leverage an MoE to produce dense models without having to train them from scratch? 🧵👇
1
8
50
3.2K
Junhyuck Kim retweeted
Dimitris Papailiopoulos (@DimitrisPapail)
BenchPress is here! A way to predict benchmarks without running them. Basically can run 5 "principal" benchmarks and estimate the rest within <3.9%. Kinda nuts it works so well. Evals are rank 2 lol
Yuchen Zeng (@yzeng58)

💻Tired of running so many slow, expensive benchmark evals across every checkpoint? Try ✨BenchPress✨ at microsoft.github.io/benchpress/: provide a few benchmark scores, then get predictions for the remaining ~100 benchmarks, with trust probabilities and calibrated 90% prediction intervals.

How does this work? In his original post (x.com/DimitrisPapail…), @DimitrisPapail first tried the idea as a fun question: collect model-by-benchmark scores into a matrix, find its low-rank structure, and use matrix completion to predict missing benchmark scores from a few observed ones.

We expanded this into a full system: a fully audited 84-model x 133-benchmark score matrix, an optimized matrix-completion predictor, and a reliability layer for trust probabilities and 90% prediction intervals.

Beyond predicting missing scores, we also suggest practical seed benchmark sets. The five-probe set {GPQA-D, HLE, Codeforces, MMLU-Pro, ARC-AGI-1} recovers the rest of a model's public score profile with a MedAE of 3.93 points. A lower-cost set {GPQA-D, MMLU-Pro, Aider Polyglot, MATH-500, AIME 2026} reaches 4.55 points.

See more details below 🧵1/7

This work is with @DimitrisPapail at AI Frontiers, a boutique research lab inside @MSFTResearch.
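
To make the matrix-completion idea above concrete, here is a minimal sketch under loud assumptions: it fits a rank-1 model with alternating least squares on a tiny synthetic score matrix. This is not BenchPress's predictor (which the thread describes only as an optimized matrix-completion method); the function name `complete_rank1` and the toy numbers are hypothetical and chosen purely for illustration.

```python
def complete_rank1(scores, iters=200):
    """Fill None entries with a rank-1 fit u[i] * v[j] via alternating least squares.

    Assumes every row and every column has at least one observed score.
    """
    n_models, n_bench = len(scores), len(scores[0])
    u = [1.0] * n_models  # one latent factor per model
    v = [1.0] * n_bench   # one latent factor per benchmark
    for _ in range(iters):
        # Refit each model factor from that model's observed scores only.
        for i in range(n_models):
            obs = [j for j in range(n_bench) if scores[i][j] is not None]
            u[i] = sum(scores[i][j] * v[j] for j in obs) / sum(v[j] ** 2 for j in obs)
        # Refit each benchmark factor from that benchmark's observed scores only.
        for j in range(n_bench):
            obs = [i for i in range(n_models) if scores[i][j] is not None]
            v[j] = sum(scores[i][j] * u[i] for i in obs) / sum(u[i] ** 2 for i in obs)
    # Keep observed scores as-is; predict only the missing ones.
    return [[scores[i][j] if scores[i][j] is not None else u[i] * v[j]
             for j in range(n_bench)] for i in range(n_models)]


# Synthetic, exactly rank-1 toy data (rows = models, columns = benchmarks).
# The hidden entry was generated as 3 * 40 = 120.
scores = [
    [10.0, 20.0, 30.0, 40.0],
    [20.0, 40.0, 60.0, 80.0],
    [30.0, 60.0, 90.0, None],
]
filled = complete_rank1(scores)
print(round(filled[2][3], 3))
```

Real score matrices are noisy and need more than one factor, so a practical version would use a higher rank plus regularization; the sketch only shows why a few observed scores can pin down the rest when the matrix is low-rank.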

6
6
63
12.9K
Junhyuck Kim retweeted
Kangwook Lee (@Kangwook_Lee)
Okay, people keep asking me "what’s the API cost to run PUBG Ally?" $0. Nothing runs in the cloud. Ally runs fully on-device, using at most 1.5GB of VRAM, including model weights, KV cache, etc. 🤯
Kangwook Lee (@Kangwook_Lee)

x.com/i/article/2067…
3
8
45
7K
Junhyuck Kim (@jhyuckkim)
We see such compression-aware pretraining as a co-design direction worth exploring. Thanks to @sewon__min for introducing the work at ICLR, and to the authors @RyanYixiang @AkshitaB93 for the great work!
1
0
4
188
Junhyuck Kim (@jhyuckkim)
8/ Finally, a complementary direction: we tested compatibility with modularity-aware pretraining (EMO, arxiv.org/abs/2605.06663). Modularity-aware pretraining gives a +3.6pp lift and ~87× lower pre-distillation PPL on the dense student model vs. a regular-MoE teacher.
2
1
5
1.4K
Junhyuck Kim (@jhyuckkim)
7/ Benchmark numbers alone don’t tell the whole story, so we also conducted qualitative analysis. MoE→dense (DO-ACP) wins over dense→dense (D2D) on two fronts: it is more often fluent and gets more facts right. More details and examples in the paper!
1
0
0
129
Junhyuck Kim (@jhyuckkim)
6/ How does our best MoE→dense recipe compare to just pruning a dense model directly? Surprisingly, at matched total params for teacher and student, our MoE→dense (DO-ACP) outperforms dense→dense (D2D) pruning by +6.3pp avg accuracy at ~1.6× faster training wall-clock.
1
0
0
128
Junhyuck Kim (@jhyuckkim)
5/ Across 350 configurations on Qwen3-30B-A3B, a clear pattern emerged: diversity-aware selection (DO-ACP) with no merging (pure pruning) consistently wins after distillation. The pattern holds on DeepSeek and GPT-OSS MoE models too.
1
0
1
127
Junhyuck Kim (@jhyuckkim)
4/ Our intuition is that output diversity should matter for selection. Drawing inspiration from the D-Optimal criterion in experimental design, we introduce a diversity-aware scoring metric (DO-ACP) and compare with other expert scoring metrics.
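
For readers unfamiliar with the D-optimal idea, the sketch below shows a generic greedy log-determinant selection, which is one common way to turn "maximize diversity" into a score. It is not the DO-ACP metric from the paper, whose exact definition is not given in this thread. The feature vectors are hypothetical stand-ins for per-expert output statistics, and `greedy_d_optimal` and `top_by_norm` are names invented here.

```python
def det(M):
    """Determinant via Gaussian elimination with partial pivoting."""
    A = [row[:] for row in M]
    n, d = len(A), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        if abs(A[p][c]) < 1e-12:
            return 0.0
        if p != c:
            A[c], A[p] = A[p], A[c]
            d = -d
        d *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
    return d


def gram(vectors, ridge=1e-6):
    """Gram matrix of inner products, with a small ridge for numerical stability."""
    n = len(vectors)
    return [[sum(a * b for a, b in zip(vectors[i], vectors[j])) + (ridge if i == j else 0.0)
             for j in range(n)] for i in range(n)]


def greedy_d_optimal(features, k):
    """Greedily add the item that most increases det(Gram) of the chosen set."""
    chosen = []
    for _ in range(k):
        best, best_det = None, -1.0
        for i in range(len(features)):
            if i in chosen:
                continue
            d = det(gram([features[j] for j in chosen + [i]]))
            if d > best_det:
                best, best_det = i, d
        chosen.append(best)
    return chosen


def top_by_norm(features, k):
    """Baseline: keep the k items with the largest squared norm, ignoring redundancy."""
    norms = [sum(a * a for a in f) for f in features]
    return sorted(range(len(features)), key=lambda i: -norms[i])[:k]


# Hypothetical features: items 0 and 1 are strong but nearly identical.
features = [[3.0, 0.0, 0.0], [2.9, 0.1, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
print(top_by_norm(features, 2), greedy_d_optimal(features, 2))
```

The norm baseline keeps both near-duplicates, while the determinant-based choice trades a little magnitude for a second direction, which is the intuition the tweet appeals to.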
1
0
1
126
Junhyuck Kim (@jhyuckkim)
We set up a pipeline that decouples these choices [number of experts / scoring / grouping / magnitude scaling] for systematic investigation.
1
0
1
121
Junhyuck Kim (@jhyuckkim)
3/ The design space is wider than it looks. E.g., from 128 experts, we can pick 8 and concatenate, or pick 32 and merge them into 8 groups of 4, etc. Both scoring and grouping metrics have multiple candidates from the expert-pruning/merging literature.
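
The arithmetic behind the two examples in this tweet can be sketched as follows. The assumption here is that merging collapses each group into a single expert-sized block, so the dense FFN width depends only on the number of groups. `plan_dense_ffn` is a hypothetical helper and `EXPERT_HIDDEN = 768` is a placeholder width, not a value taken from the thread.

```python
import math


def plan_dense_ffn(num_selected, num_groups, expert_hidden):
    """Return group size and resulting dense FFN hidden width for one configuration."""
    if num_selected % num_groups != 0:
        raise ValueError("selected experts must split evenly into groups")
    group_size = num_selected // num_groups
    # Assumption: each group is merged into one expert-sized block, then blocks are concatenated.
    return {"group_size": group_size, "dense_hidden": num_groups * expert_hidden}


EXPERT_HIDDEN = 768  # placeholder per-expert hidden width (assumption)

prune_only = plan_dense_ffn(num_selected=8, num_groups=8, expert_hidden=EXPERT_HIDDEN)
prune_and_merge = plan_dense_ffn(num_selected=32, num_groups=8, expert_hidden=EXPERT_HIDDEN)
print(prune_only, prune_and_merge)

# Number of distinct ways to pick 8 experts out of 128, before any grouping choice.
print(math.comb(128, 8))
```

Both configurations land on the same dense width, so they are directly comparable; what differs is which expert weights survive and how they are combined.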
1
0
0
141
Junhyuck Kim (@jhyuckkim)
2/ The structure of MoE makes this natural. Per-expert computations are independent until the weighted sum. Concatenating their weights into a dense FFN preserves intermediate activations. The problem comes down to which experts give the best dense FFN init for distillation.
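
The claim that concatenation preserves intermediate activations can be checked on a toy example. The sketch below makes simplifying assumptions: two tiny experts with a plain ReLU FFN (real MoE experts typically use gated activations), random weights, and no router weighting, so the dense output equals the unweighted sum of expert outputs. All names and sizes are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D_MODEL, HIDDEN, N_EXPERTS = 4, 3, 2

# Each expert is (W_up: HIDDEN x D_MODEL, W_down: D_MODEL x HIDDEN).
experts = [(random_matrix(HIDDEN, D_MODEL, rng), random_matrix(D_MODEL, HIDDEN, rng))
           for _ in range(N_EXPERTS)]
x = [rng.uniform(-1, 1) for _ in range(D_MODEL)]

# Per-expert path: independent hidden activations, then an (unweighted) sum of outputs.
hiddens = [relu(matvec(W_up, x)) for W_up, _ in experts]
outputs = [matvec(W_down, h) for (_, W_down), h in zip(experts, hiddens)]
expert_sum = [sum(vals) for vals in zip(*outputs)]

# Dense path: stack up-projection rows, concatenate down-projection columns.
W_up_dense = [row for W_up, _ in experts for row in W_up]
W_down_dense = [sum((W_down[r] for _, W_down in experts), []) for r in range(D_MODEL)]
dense_hidden = relu(matvec(W_up_dense, x))
dense_out = matvec(W_down_dense, dense_hidden)

# The dense hidden layer is exactly the experts' hidden activations laid side by side.
print(dense_hidden == [h for hs in hiddens for h in hs])
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(dense_out, expert_sum)))
```

Because the router weights are dropped in this construction, the dense FFN is only an initialization, which is consistent with the tweet framing the remaining question as which experts give the best init for distillation.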
1
0
1
213
Junhyuck Kim (@jhyuckkim)
Will be presenting 3 papers at ICLR. If interested, please come by and chat!
Memory Optimization Strategies for Reasoning Models
Sat AM, Pav3 #615
Orak: Benchmark for LLM Agents on Video Games
Sat AM, Pav4 #5101
Likelihood-Gated Policy Optimization
Mon, SPOT workshop
0
1
13
584
Junhyuck Kim (@jhyuckkim)
Letsgoooo 🔥 Proud to have worked on post-training for this release.
Kangwook Lee (@Kangwook_Lee)

My team has been cooking nonstop for a while... and I’m so excited to finally share what we’ve been building!!! Today, we’re releasing four open models, many of which are the best models of the same size 🥳!!!

tldr;
1) Raon-Speech: 9B SOTA speech LLM
2) Raon-SpeechChat: 9B full duplex model
3) Raon-OpenTTS: 0.3B/1B open-data-open-weight SOTA TTS
4) Raon-VisionEncoder: 0.4B vision encoder trained only with public data
huggingface.co/collections/KR…

===

1) Raon-Speech (9B)
Raon-Speech is a speech LLM (LLM + speech understanding + speech generation). It's a bilingual model (English/Korean), and it's ranked #1 on both leaderboards 😎 tldr; it's the best open-model alternative to ChatGPT voice mode.
Model: huggingface.co/KRAFTON/Raon-S…
Tech report: huggingface.co/KRAFTON/Raon-S…
Web demo: raon.krafton.ai ("Speech Chat" menu here. "auto" is a bit unstable, so use "manual" and choose the language!)

2) Raon-SpeechChat (9B)
While a speech LLM is useful, it’s kind of like a walkie-talkie. A full-duplex model is more like a phone, so it is even more useful in many applications. That’s why we also built and are releasing Raon-SpeechChat. Again, on several quantitative evaluation metrics, Raon-SpeechChat scored the best on average.
Model: huggingface.co/KRAFTON/Raon-S…
Tech report: huggingface.co/KRAFTON/Raon-S…
Web demo: raon.krafton.ai ("Full Duplex" menu here.)

3) Raon-OpenTTS (0.3B, 1B)
We’re also releasing Raon-OpenTTS, a state-of-the-art open-data, open-weight TTS model.
Model + data: huggingface.co/KRAFTON/Raon-O…
The 1B model and a detailed tech report are coming soon!

4) Raon-VisionEncoder (0.4B)
Last but not least, we’re releasing Raon-VisionEncoder, a vision encoder trained from scratch using only public data. It closely matches the SOTA vision encoder quality too!
Model: huggingface.co/KRAFTON/Raon-V…
Tech blog: krafton.ai/blog/posts/202…

===

That’s it! I’m incredibly proud of what my team has built! My AI research team at KRAFTON (@Krafton_AI), which undoubtedly is the most cracked team in Korea, has been cooking nonstop for a while for this 😅... This is just the beginning of our planned model releases, so stay tuned!

ps1/ Ah, by the way, you may ask why “Raon”? “Raon” is an old Korean word meaning happy. And, well, we’re kRAftON :-)

ps2/ KRAFTON is one of the four teams participating in Korea’s national frontier-model project, together with SK Telecom. We’re training something very exciting together... and more to come soon!

0
0
7
383