
New paper: Extracting and Steering Emotion Representations in Small Language Models
@AnthropicAI showed frontier models contain 171 emotion vectors that causally drive behavior. Do small models have them too?
Short answer: yes — but the extraction method doesn't transfer.
We tested 9 models (124M–8B) across GPT-2, #Gemma, #Qwen, #Llama, #Mistral. Key findings:
→ Generation-based extraction beats comprehension-based in 7/7 cases — but only works on instruct models
→ Emotion vectors localize at middle layers (~50% depth), an inverted-U pattern that holds across all architectures
→ Causal steering works at every scale, from 124M GPT-2 to 3B Llama
→ Anisotropy explains most of the steering-delta variation across models: Gemma-3's extreme +7591 deltas reflect degenerate geometry, not stronger emotion representation (minimal sketch after this list)
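If you want to poke at this yourself, here's a minimal sketch of the general recipe, not the paper's exact protocol: build a mean-difference vector between emotion-laden and neutral text, add it to the residual stream at a middle layer, and check anisotropy as mean pairwise cosine similarity. The model, layer, prompts, steering scale, and the contrast-prompt shortcut (standing in for the paper's generation-based extraction) are all illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # smallest model in the 124M-8B range
LAYER = 6        # ~50% depth of GPT-2's 12 blocks, per the mid-layer finding
SCALE = 8.0      # steering coefficient (illustrative)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def mean_resid(texts, layer):
    """Mean residual-stream activation at `layer`, averaged over tokens."""
    acts = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0].mean(dim=0))
    return torch.stack(acts).mean(dim=0)

# Placeholder contrast sets; the paper extracts from model generations instead.
desperate = ["I feel utterly desperate.", "Everything seems hopeless and desperate."]
neutral   = ["The meeting is at noon.", "The report covers third-quarter results."]

v = mean_resid(desperate, LAYER) - mean_resid(neutral, LAYER)
v = v / v.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 is the residual stream.
    return (output[0] + SCALE * v.to(output[0].dtype),) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)
ids = tok("Today I", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=30, do_sample=False)[0]))
handle.remove()

def anisotropy(texts, layer):
    """Mean pairwise cosine similarity of token activations. Values near 1
    mean a degenerate, cone-shaped activation space, which inflates raw
    steering deltas without implying stronger emotion representation."""
    hs = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        hs.append(out.hidden_states[layer][0])
    H = torch.nn.functional.normalize(torch.cat(hs), dim=-1)
    sim = H @ H.T
    n = sim.shape[0]
    return ((sim.sum() - n) / (n * (n - 1))).item()

print("anisotropy:", anisotropy(neutral + desperate, LAYER))
```

If the anisotropy score sits near 1, raw steering deltas aren't comparable across models; normalizing for the space's geometry is one way to make them so, which is the spirit of the Gemma-3 caveat above.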
Most surprising finding: emotion steering on Qwen triggers Chinese tokens semantically aligned with the target emotion. "Desperate" → 找了 (searched), 摸索 (groping in the dark). These aren't translations of the English word; they're the phenomenology of the emotion surfacing in another language.
RLHF doesn't suppress this. Safety filters operating in the prompt language won't catch it.
Paper #6 in the #ModelMedicine series. Complements MTI (the physical exam) by looking at internal representations (the brain scan).
arxiv.org/abs/2604.04064

