Junhyuck Kim

29 posts

@jhyuckkim

Researcher @Krafton_AI (@PUBG) Prev @CambridgeMLG

Joined August 2023

265 Following · 84 Followers

Pinned Tweet

Junhyuck Kim@jhyuckkim·Jun 9

Almost all "flagship" models are now MoEs. But smaller models still prefer to be dense as they target memory-constrained scenarios where total params matter. So we ask: Can we leverage an MoE to produce dense models without having to train them from scratch? 🧵👇

Junhyuck Kim retweeted

Dimitris Papailiopoulos@DimitrisPapail·4h

BenchPress is here! A way to predict benchmarks without running them. Basically can run 5 "principal" benchmarks and estimate the rest within <3.9%. Kinda nuts it works so well. Evals are rank 2 lol

Yuchen Zeng@yzeng58

💻Tired of running so many slow, expensive benchmark evals across every checkpoint? Try ✨BenchPress✨ at microsoft.github.io/benchpress/: provide a few benchmark scores, then get predictions for the remaining ~100 benchmarks, with trust probabilities and calibrated 90% prediction intervals. How does this work? In his original post (x.com/DimitrisPapail…), @DimitrisPapail first tried the idea as a fun question: collect model-by-benchmark scores into a matrix, find its low-rank structure, and use matrix completion to predict missing benchmark scores from a few observed ones. We expanded this into a full system: a fully audited 84-model x 133-benchmark score matrix, an optimized matrix-completion predictor, and a reliability layer for trust probabilities and 90% prediction intervals. Beyond predicting missing scores, we also suggest practical seed benchmark sets. The five-probe set {GPQA-D, HLE, Codeforces, MMLU-Pro, ARC-AGI-1} recovers the rest of a model's public score profile with a MedAE of 3.93 points. A lower-cost set {GPQA-D, MMLU-Pro, Aider Polyglot, MATH-500, AIME 2026} reaches 4.55 points. See more details below 🧵1/7 This work is with @DimitrisPapail at AI Frontiers, a boutique research lab inside @MSFTResearch.
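The low-rank idea described above can be illustrated with a toy sketch. This is not the BenchPress predictor: the score matrix here is synthetic and exactly rank 2, the probe indices are arbitrary, and `predict_profile` is a hypothetical helper. Real benchmark scores are only approximately low-rank, which is why the actual system reports prediction intervals.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_benchmarks, rank = 40, 20, 2

# Synthetic, exactly rank-2 score matrix (a stand-in for real model-by-benchmark scores).
scores = rng.normal(size=(n_models, rank)) @ rng.normal(size=(rank, n_benchmarks))

# Learn a rank-2 benchmark basis from fully observed reference models.
train, new_model = scores[:-1], scores[-1]
_, _, Vt = np.linalg.svd(train, full_matrices=False)
basis = Vt[:rank]  # shape: (rank, n_benchmarks)

def predict_profile(observed_idx, observed_scores, basis):
    """Fit latent factors on the observed benchmarks, then predict every benchmark."""
    coeffs, *_ = np.linalg.lstsq(basis[:, observed_idx].T, observed_scores, rcond=None)
    return coeffs @ basis

probe_idx = np.array([0, 3, 7, 11, 15])  # five arbitrary "probe" benchmarks (assumption)
predicted = predict_profile(probe_idx, new_model[probe_idx], basis)
max_abs_error = np.max(np.abs(predicted - new_model))
```

Because the toy matrix is exactly rank 2, five probes recover the held-out model's full profile up to floating-point error; with real, noisy scores the error would be nonzero.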

Junhyuck Kim retweeted

Kangwook Lee@Kangwook_Lee·2d

Okay, people keep asking me "what’s the API cost to run PUBG Ally?" $0. Nothing runs in the cloud. Ally runs fully on-device, using at most 1.5GB of VRAM, including model weights, KV cache, etc. 🤯

Kangwook Lee@Kangwook_Lee

x.com/i/article/2067…

Junhyuck Kim retweeted

Kangwook Lee@Kangwook_Lee·6d

x.com/i/article/2067…

Junhyuck Kim@jhyuckkim·Jun 9

Please check out the paper for more details 🙂 Code: github.com/krafton-ai/moe… Paper: arxiv.org/abs/2605.28207 This is joint work with amazing collaborators: @JihunYun_ai, Haechan, @gmkim_ai, Joonghyun, @jaewoong_cho from @Krafton_AI

Junhyuck Kim@jhyuckkim·Jun 9

We see such compression-aware pretraining as a co-design direction worth exploring. Thanks to @sewon__min for introducing the work at ICLR, and to the authors @RyanYixiang @AkshitaB93 for the great work!

Junhyuck Kim@jhyuckkim·Jun 9

8/ Finally, a complementary direction: we tested compatibility with modularity-aware pretraining (EMO, arxiv.org/abs/2605.06663). Modularity-aware pretraining gives a +3.6pp lift and ~87× lower pre-distillation PPL on the dense student model vs. a regular-MoE teacher.

Junhyuck Kim@jhyuckkim·Jun 9

7/ Benchmark numbers alone don’t tell the whole story, so we also conducted qualitative analysis. MoE→dense (DO-ACP) wins over dense→dense (D2D) on two fronts: it is more often fluent and gets more facts right. More details and examples in the paper!

Junhyuck Kim@jhyuckkim·Jun 9

6/ How does our best MoE→dense recipe compare to just pruning a dense model directly? Surprisingly, at matched total params for teacher and student, our MoE→dense (DO-ACP) outperforms dense→dense (D2D) pruning by +6.3pp avg accuracy at ~1.6× faster training wall-clock.

Junhyuck Kim@jhyuckkim·Jun 9

5/ Across 350 configurations on Qwen3-30B-A3B, a clear pattern emerged: diversity-aware selection (DO-ACP) with no merging (pure pruning) consistently wins after distillation. The pattern holds on DeepSeek and GPT-OSS MoE models too.

Junhyuck Kim@jhyuckkim·Jun 9

4/ Our intuition is that output diversity should matter for selection. Drawing inspiration from the D-Optimal criterion in experimental design, we introduce a diversity-aware scoring metric (DO-ACP) and compare with other expert scoring metrics.
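To make the D-optimal intuition concrete, here is a generic greedy D-optimal subset selection. It is not the paper's DO-ACP metric, whose exact definition is not given in this thread; the feature vectors, the `eps` regularizer, and the function name `greedy_d_optimal` are all assumptions for illustration. Each row stands in for some summary of one expert's outputs.

```python
import numpy as np

def greedy_d_optimal(features, k, eps=1e-3):
    """Greedily pick k rows that maximize log det(eps*I + sum of f f^T)."""
    n, d = features.shape
    info = eps * np.eye(d)  # regularized information matrix
    selected = []
    for _ in range(k):
        best, best_val = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            # Log-determinant if candidate i were added to the selection.
            _, logdet = np.linalg.slogdet(info + np.outer(features[i], features[i]))
            if logdet > best_val:
                best, best_val = i, logdet
        selected.append(best)
        info += np.outer(features[best], features[best])
    return selected

# Hypothetical expert features: rows 0, 1, 3 are near-duplicates, row 2 points elsewhere.
features = np.array([[1.00, 0.00],
                     [0.99, 0.01],
                     [0.00, 0.90],
                     [0.98, 0.02]])
picked = greedy_d_optimal(features, k=2)
by_norm = list(np.argsort(-np.linalg.norm(features, axis=1))[:2])
```

In this toy case a magnitude-only ranking keeps two near-duplicate rows (0 and 1), while the log-det criterion keeps rows 0 and 2 because a redundant direction adds almost nothing to the determinant.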

Junhyuck Kim@jhyuckkim·Jun 9

We set up a pipeline that decouples these choices [number of experts / scoring / grouping / magnitude scaling] for systematic investigation.

Junhyuck Kim@jhyuckkim·Jun 9

3/ The design space is wider than it looks. E.g., from 128 experts, we can pick 8 and concatenate, or pick 32 and merge them into 8 groups of 4, etc. Both scoring and grouping metrics have multiple candidates from the expert-pruning/merging literature.
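The two example configurations (pick 8 and concatenate, or pick 32 and merge into 8 groups of 4) can be sketched as one parameterized routine. Everything here is a placeholder: the random scores, the chunk-by-score grouping, and the averaging merge are assumptions standing in for the scoring, grouping, and merging candidates the thread refers to, and `build_dense` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model, d_expert = 128, 4, 6

# Hypothetical expert up-projection weights and per-expert importance scores.
W_up = rng.normal(size=(n_experts, d_model, d_expert))
scores = rng.random(n_experts)

def select_top(scores, k):
    """Indices of the k highest-scoring experts, best first."""
    return np.argsort(scores)[::-1][:k]

def build_dense(W, scores, n_select, n_groups):
    """Select n_select experts, average within n_groups groups, then concatenate."""
    chosen = select_top(scores, n_select)
    groups = np.array_split(chosen, n_groups)  # placeholder grouping: chunks in score order
    merged = [W[g].mean(axis=0) for g in groups]
    return np.concatenate(merged, axis=1)      # shape: (d_model, n_groups * d_expert)

prune_only = build_dense(W_up, scores, n_select=8, n_groups=8)     # pick 8, concatenate
select_merge = build_dense(W_up, scores, n_select=32, n_groups=8)  # pick 32, merge 8 groups of 4
```

Both settings yield a dense up-projection of the same width, so they are directly comparable as initializations for the same student architecture.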

Junhyuck Kim@jhyuckkim·Jun 9

2/ The structure of MoE makes this natural. Per-expert computations are independent until the weighted sum. Concatenating their weights into a dense FFN preserves intermediate activations. The problem comes down to which experts give the best dense FFN init for distillation.
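A small numerical check of the concatenation argument, under simplifying assumptions: toy two-layer ReLU experts (real MoE layers often use gated FFNs), made-up dimensions, and fixed gate weights folded into the down-projection. In a real MoE the gates depend on the input, so the equivalence below is exact only for this fixed-gate toy case.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_expert, n_experts = 8, 16, 4

# Hypothetical toy experts: y = relu(x @ W_up) @ W_down.
W_up = rng.normal(size=(n_experts, d_model, d_expert))
W_down = rng.normal(size=(n_experts, d_expert, d_model))
gates = np.array([0.4, 0.3, 0.2, 0.1])  # assumed fixed routing weights

def relu(z):
    return np.maximum(z, 0.0)

def moe_forward(x):
    """Weighted sum of independent expert outputs."""
    outs = [relu(x @ W_up[i]) @ W_down[i] for i in range(n_experts)]
    return sum(g * o for g, o in zip(gates, outs))

# Dense FFN: concatenate up-projections along the hidden axis and stack the
# down-projections along the same axis, scaling each block by its gate weight.
W_up_dense = np.concatenate(list(W_up), axis=1)
W_down_dense = np.concatenate([g * W for g, W in zip(gates, W_down)], axis=0)

def dense_forward(x):
    return relu(x @ W_up_dense) @ W_down_dense

x = rng.normal(size=(5, d_model))
hidden_dense = relu(x @ W_up_dense)   # intermediate activations of the dense FFN
hidden_expert0 = relu(x @ W_up[0])    # intermediate activations of expert 0
```

The first `d_expert` columns of `hidden_dense` are exactly expert 0's activations, which is the sense in which concatenation preserves intermediate activations.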

Junhyuck Kim@jhyuckkim·Apr 25

Will be presenting 3 papers at ICLR. If interested, please come by and chat!

Memory Optimization Strategies for Reasoning Models (Sat AM, Pav3 #615)

Orak: Benchmark for LLM Agents on Video Games (Sat AM, Pav4 #5101)

Likelihood-Gated Policy Optimization (Mon, SPOT workshop)

Junhyuck Kim@jhyuckkim·Apr 10

@curious_queue @DimitrisPapail Will check those out, thanks for sharing!

Sourya Kakarla@curious_queue·Apr 9

@DimitrisPapail @jhyuckkim lmao it do be like that did you know about oracle? x.com/curious_queue/…

Sourya Kakarla@curious_queue

this is basically @steipete's oracle pattern, right? github.com/steipete/oracle i have been using a custom skill in codex/claude to consult `gpt-5.4-pro` for difficult tasks: github.com/search?q=repo%… `gpt-5.4-pro` is the best model out there that *anyone* can use

Junhyuck Kim retweeted

Dimitris Papailiopoulos@DimitrisPapail·Apr 9

from our last night chat with @jhyuckkim 😂

Dimitris Papailiopoulos@DimitrisPapail

got scooped by Ant. Oh well :p cute idea

Junhyuck Kim retweeted

Dimitris Papailiopoulos@DimitrisPapail·Apr 8

x.com/i/article/2041…

Junhyuck Kim@jhyuckkim·Apr 4

Letsgoooo 🔥 Proud to have worked on post-training for this release.

Kangwook Lee@Kangwook_Lee

My team has been cooking nonstop for a while... and I’m so excited to finally share what we’ve been building!!! Today, we’re releasing four open models, many of which are the best models of the same size 🥳!!!

tldr;
1) Raon-Speech: 9B SOTA speech LLM
2) Raon-SpeechChat: 9B full duplex model
3) Raon-OpenTTS: 0.3B/1B open-data-open-weight SOTA TTS
4) Raon-VisionEncoder: 0.4B vision encoder trained only with public data
huggingface.co/collections/KR…

===

1) Raon-Speech (9B)
Raon-Speech is a speech LLM (LLM + speech understanding + speech generation). It's a bilingual model (English/Korean), and it's ranked #1 on both leaderboards 😎 tldr; it's the best open-model alternative to ChatGPT voice mode.
Model: huggingface.co/KRAFTON/Raon-S…
Tech report: huggingface.co/KRAFTON/Raon-S…
Web demo: raon.krafton.ai ("Speech Chat" menu here. "auto" is a bit unstable, so use "manual" and choose the language!)

2) Raon-SpeechChat (9B)
While a speech LLM is useful, it’s kind of like a walkie-talkie. A full-duplex model is more like a phone, so it is even more useful in many applications. That’s why we also built and are releasing Raon-SpeechChat. Again, on several quantitative evaluation metrics, Raon-SpeechChat scored the best on average.
Model: huggingface.co/KRAFTON/Raon-S…
Tech report: huggingface.co/KRAFTON/Raon-S…
Web demo: raon.krafton.ai ("Full Duplex" menu here.)

3) Raon-OpenTTS (0.3B, 1B)
We’re also releasing Raon-OpenTTS, a state-of-the-art open-data, open-weight TTS model.
Model + data: huggingface.co/KRAFTON/Raon-O…
The 1B model and a detailed tech report are coming soon!

4) Raon-VisionEncoder (0.4B)
Last but not least, we’re releasing Raon-VisionEncoder, a vision encoder trained from scratch using only public data. It closely matches the SOTA vision encoder quality too!
Model: huggingface.co/KRAFTON/Raon-V…
Tech blog: krafton.ai/blog/posts/202…

===

That’s it! I’m incredibly proud of what my team has built!

My AI research team at KRAFTON (@Krafton_AI), which undoubtedly is the most cracked team in Korea, has been cooking nonstop for a while for this 😅... This is just the beginning of our planned model releases, so stay tuned!

ps1/ Ah, by the way, you may ask why “Raon”? “Raon” is an old Korean word meaning happy. And, well, we’re kRAftON :-)

ps2/ KRAFTON is one of the four teams participating in Korea’s national frontier-model project, together with SK Telecom. We’re training something very exciting together... and more to come soon!

Junhyuck Kim@jhyuckkim·Mar 17

Had a lot of fun working on this project, really cool idea from Kangwook. Especially satisfying to see something that felt intuitive actually hold up in the results.

Kangwook Lee@Kangwook_Lee

x.com/i/article/2033…
