Jongho Park
@jon_ghoh
🧑‍💻 PhD student @berkeley_ai 👾 prev: researcher @Krafton_AI, M.S. @WisconsinCS


➿ Looped Diffusion Language Models

Looping has landed in dLLMs, and it is surprisingly effective! It accelerates training convergence 3.34x, improves GSM8K accuracy +8.5% on the same data, and enables test-time depth scaling. Check out our LoopMDM paper for more details!



The vibes in SF feel pretty frenetic right now. The divide in outcomes is the worst I've ever seen.

Over the last 5yrs, a group of ~10k people - employees at Anthropic, OpenAI, xAI, Nvidia, Meta TBD, founders - have hit retirement wealth of well above $20M (back of the envelope AI estimation). Everyone outside that group feels like they can work their well-paying (but <$500k) job for their whole life and never get there.

Worse yet, layoffs are in full swing. Many software engineers feel like their life's skill is no longer useful. The day to day role of most jobs has changed overnight with AI.

As a result,

1. The corporate ladder looks like the wrong building to climb. Everyone's trying to align with a new set of career "paths": should I be a founder? Is it too late to join Anthropic / OpenAI? should I get into AI? what company stock will 10x next? People are demanding higher salaries and switching jobs more and more.

2. There’s a deep malaise about work (and its future). Why even work at all for “peanuts”? Will my job even exist in a few years? Many feel helpless. You hear the “permanent underclass” conversation a lot, esp from young people. It's hard to focus on doing good work when you think "man, if I joined Anthropic 2yrs ago, I could retire"

3. The mid to late middle managers feel paralyzed. Many have families and don't feel like they have the energy or network to just "start a company". They don't particularly have any AI skills. They see the writing on the wall: middle management is being hollowed out in many companies.

4. The rich aren’t particularly happy either. No one is shedding tears for them (and rightfully so). But those who have "made it" experience a profound lack of purpose too. Some have gone from <$150k to >$50M in a few years with no ramp. It flips your life plans upside down. For some, comparison is the thief of joy. For some, they escape to NYC to "live life". For others still, they start companies "just cuz", often to win status points. They never imagined that by age 30, they'd be set. I once asked a post-economic founder friend why they didn't just sell the co and they said "and do what? right now, everyone wants to talk to me. if i sell, I will only have money."

I understand that many reading this scoff at the champagne problems of the valley. Society is warped in this tech bubble. What is often well-off anywhere else in the world is bang average here. Unlike many other places, tenure, intelligence and hard work can be loosely correlated with outcomes in the Bay.

Living through a societally transformative gold rush in that environment can be paralyzing. "Am I in the right place? Should I move? Is there time still left? Am I gonna make it?" It psychologically torments many who have moved here in search of "success". Ironically, a frequent side effect of this torment is to spin up the very products making everyone rich in hopes that you too can vibecode your path to economic enlightenment.


Today we’re releasing EMO, a new mixture-of-experts (MoE) model trained so modular structure emerges directly from data without human-defined priors. EMO can use a small subset of its experts for a given task while keeping near full-model performance. 🧵



Apparently Claude is now so embedded in US military decision-making that they would exercise the Defense Production Act (or other means) to continue using it in this illegal war. Anthropic should never have agreed to the DoW-Palantir deal.





OpenAI has warned US lawmakers that its Chinese rival DeepSeek is using unfair and increasingly sophisticated methods to extract results from leading US AI models to train the next generation of its breakthrough R1 chatbot bloomberg.com/news/articles/…

🚨 Your AI is lying to you with complete confidence. Harvard & MIT just proved ChatGPT hallucinates 110% less when you force it to argue with itself. The technique is called "Recursive Meta-Cognition" and it's embarrassingly simple. Here's how to make AI actually think:

Not All Bits Are Equal: What We Learned From 1700 Experiments on Memory-Optimal Reasoning

Given a fixed memory budget, how should you allocate across model weights, KV cache, and test-time compute to maximize accuracy in reasoning models? For example: would you choose a 32B, 4-bit model with a 14k token budget, or an 8B, 16-bit model with 30k tokens?

We ran 1,700 experiments on the Qwen3 family to find out. We varied:
- Model size (0.6B-32B)
- Weight precision (4/8/16-bit via GPTQ)
- Serial test-time compute (token budgets 2k→30k via budget forcing)
- Parallel test-time compute (Maj@K, up to K=16)
- KV cache compression (eviction: R-KV, StreamingLLM; quantization: HQQ at 2/4/8-bit)

This is great work led by my Krafton/UW collaborators @jhyuckkim (Krafton), @ethan_ewer (undergrad!! at UW-Madison), @taehong_moon (Krafton), @jon_ghoh (UC Berkeley)

Here is a summary of the main findings:

At the 8-bit 4B threshold, the optimal memory strategy flips
For models effectively smaller than 8-bit 4B, spend memory on more (or higher-precision) weights, not on longer generations. For larger models, do the opposite: allocate memory to longer generations until performance saturates. This threshold isn't arbitrary: it sits right at the point where weights dominate KV cache/token count.

The reasoning task matters (i.e., your mileage may vary; no universal recipe!)
Math reasoning (e.g., AIME25): 4-bit quantization is almost always a bad idea. An 8B model at 16-bit outperforms a 14B model at 4-bit with similar memory spent. The numerical precision of the weights seems to matter for "reasoning heavy" tasks. Almost as if the model’s capacity to utilize test-time compute is decimated by quantization...
Knowledge-heavy tasks (GPQA-D): 4-bit is broadly memory-optimal. Here, parameter count matters more than precision. The interpretation is that you want more effective weights to store things, and raw parameter count dominates test-time compute.

How does parallel test-time compute (a fancy name for majority voting) factor in?
Majority voting (Maj@K) increases KV cache linearly with K. It improves the memory–accuracy trade-off only when the model is ≥ 8-bit 4B effective size; the optimal K grows with the memory budget. Below that scale, serial test-time compute should be preferred.

Weight quantization alone isn't enough!
Both KV cache eviction and KV quantization push the memory-optimal Pareto frontier higher across all model sizes we tested. Should you prefer KV eviction or quantization?
- Small models (<8-bit 4B): KV cache eviction wins
- Large models (≥8-bit 4B): Both strategies are competitive

Latency/throughput note: End-to-end latency is dominated by generation length. When you care a lot about latency, 8-bit often sits at a better speed–accuracy point than 4-bit.

Note on batching: When model weights, i.e., params, are amortized across concurrent generations (that is, batched inference), the trade-off shifts, as you’d expect. At a batch size of 16, the 0.6B model never appears on the Pareto frontier. The 8-bit 4B model always appears, no matter the batch size, in the ~1-2GB memory region (good model setting for mobile devices!)

What This Means
***There is no universal memory-optimal strategy for reasoning models!*** The right choice depends on almost every parameter involved, but here is one way to choose:

If effective size < 8-bit 4B
- Spend your bits on model capacity/precision over longer token budgets.
- Prefer 8-bit for math-heavy tasks.
- Use KV eviction over KV quantization.
- Stick to serial scaling; Maj@K is memory-inefficient here.

If effective size ≥ 8-bit 4B
- Increase token budget till gains saturate.
- Maj@K always helps, so grow K with available memory.
- KV quantization is competitive with eviction; choose based on implementation and maybe taste 😊

Important Caveat: These findings are specific to the Qwen3 family on AIME25 and GPQA-D. The thresholds and strategies will vary with different architectures, training methods, and task distributions.
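
To make the trade-off in the thread above concrete, here is a rough back-of-the-envelope sketch of how weight memory and KV cache memory add up, and why Maj@K grows the KV cache linearly with K. This is not the authors' code. The `ModelConfig` class, the helper names, and the layer/head counts are all assumptions chosen for illustration (check the actual model config before relying on them), and the estimate ignores quantization metadata, activations, and framework overhead.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical container holding only what the estimate needs."""
    params: float   # total parameter count
    layers: int     # number of transformer layers
    kv_heads: int   # key/value heads (after grouped-query attention)
    head_dim: int   # dimension of each attention head


def weight_bytes(cfg: ModelConfig, weight_bits: int) -> float:
    """Memory for the weights alone (no quantization scales/zero-points)."""
    return cfg.params * weight_bits / 8


def kv_bytes_per_token(cfg: ModelConfig, kv_bits: int = 16) -> float:
    """KV cache per generated token: keys and values at every layer."""
    return 2 * cfg.layers * cfg.kv_heads * cfg.head_dim * kv_bits / 8


def total_gib(cfg: ModelConfig, weight_bits: int, tokens: int,
              k: int = 1, kv_bits: int = 16) -> float:
    """Weights plus KV cache for k parallel generations of `tokens` each."""
    kv = k * tokens * kv_bytes_per_token(cfg, kv_bits)
    return (weight_bytes(cfg, weight_bits) + kv) / GIB


# ASSUMED architecture values, for illustration only.
BIG = ModelConfig(params=32e9, layers=64, kv_heads=8, head_dim=128)
SMALL = ModelConfig(params=8e9, layers=36, kv_heads=8, head_dim=128)

# The two options from the thread's opening question.
for name, cfg, bits, tokens in [
    ("32B, 4-bit, 14k tokens", BIG, 4, 14_000),
    ("8B, 16-bit, 30k tokens", SMALL, 16, 30_000),
]:
    print(f"{name}: ~{total_gib(cfg, bits, tokens):.1f} GiB")

# Maj@K: the weights are shared, but every extra sample adds a full KV cache.
for k in (1, 4, 16):
    print(f"8B, 16-bit, 30k tokens, K={k}: "
          f"~{total_gib(SMALL, 16, 30_000, k=k):.1f} GiB")
```

Under these assumed values, both options carry the same weight memory (16e9 bytes), so the comparison comes down to the KV cache: 262,144 bytes per token for the larger model versus 147,456 bytes per token for the smaller one. Lowering `kv_bits` is a crude stand-in for KV quantization; eviction would instead cap the `tokens` term.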













