Qian

299 posts

Qian

@persdre

CS PhD candidate @NUSingapore, researching on LLM Agent, Cryptocurrency | BS @NUSingapore | ex-Undergrad @sjtu1896

Singapore Katılım Ağustos 2018

621 Takip Edilen381 Takipçiler

Sabitlenmiş Tweet

Qian@persdre·8 Tem

Accepted by #COLM2025 Thanks to @dawnsongtweets @xuandongzhao I am looking forward to attending COLM and meeting LM folks! Let's mitigate LLM bias!

Qian@persdre

🚀 New Study: Do Large Reasoning Models (LRMs) Judge Fairly? We uncover biases in LRMs (e.g., DeepSeek-R1, OpenAI-o1) when used as judges—including position bias & a new “superficial reflection bias.” 🔍 Key findings: ✅ LRMs outperform LLMs on facts but still show bias ✅ Novel “superficial reflection bias” discovered ✅ Simple mitigation strategies reduce bias by up to 27% 📄 Preprint: arxiv.org/pdf/2504.09946 #AI #BiasInAI #LLMs #MachineLearning

Pulai, Johor 🇲🇾 English

Qian@persdre·2d

昨天下午刷到一个帖子挺震撼的。一个纽约几十人的小公司发了个27年summer intern，一天收到3000份简历。专门确认了不是Google不是Jane Street，就是个普通小公司。为什么会这样？因为现在求职者在用Manus这类agent全自动投简历，一个prompt批量定制cover letter，一键对标JD微调。你在精心准备一份简历的时候，别人已经让大模型一天投500家了。大模型让海投成本趋近于零，结果就是HR端收到的简历变成洪水。3000份里能认真看的可能不到50份，你精心写的那份大概率连被打开的机会都没有。海投这条路不是变难了，是物理意义上失效了。信噪比彻底崩了。但这只是表面现象。更深层的事情是：好岗位从来就不在公开市场流通。想想你身边真正拿到好offer的人，有几个是海投拿到的？大厂核心组的坑，leader有心仪的人选早就内定了，挂出来走流程是合规需要。读博选导师，最好的名额早就给了自己带过的RA或者师兄师姐推荐的人。创业公司核心岗位，在饭桌上微信群里就分完了。公开招聘市场越来越像残次品货架——真正好的东西在上架之前就被内部消化了。为什么内推有效？因为推荐人拿自己的信誉做背书。HR收到3000份简历筛选成本极高，但如果核心员工说"这人我合作过，靠谱"，这句话的信息量比任何简历都大。当信息过载到无法筛选时，"我认识你"就是最高效的过滤器。我的一个判断可能不太好听：各行各业正在门阀化。过去十几年互联网高速增长，大量新岗位涌现，是一个罕见的阶级流动窗口。草根凭能力上桌，学历不够靠项目补，路径虽然难但至少存在。现在增量消失了，存量博弈，好坑就那么多，优先给谁？当然是给自己人。自己带过的学生、一起创过业的兄弟、圈子里知根知底的人。不是谁坏，是人性，也是效率最优解。每个行业都在形成自己的门阀。学术圈有学术谱系，大厂有核心组校友网，VC有deal flow圈子，娱乐圈更不用说。你不在圈子里，你甚至不知道机会的存在。所以怎么办？不是躺平，也不是更疯狂地海投，而是换一个底层逻辑：从"投简历找工作"切换到"拜山门积累信任"。找到你想进的圈子，先去做贡献而不是上来就要机会。做能被看见的project，跟牛人产生真实的协作关系，哪怕从免费帮忙开始。信任不是一天建立的——它靠的是一起做过project、一起扛过deadline、在某个社群里持续输出过有质量的内容。当所有人都能用agent海投简历时，简历这个载体就贬值了。未来能帮你拿到好机会的不是更好的简历模板，是有人愿意在关键时刻说一句："这人我认识，靠谱。" 这个趋势会逆转吗？我觉得不会。以前说临床跟别的专业不一样，人身依附很严重，要读博、要拜山门。但现在各行各业都在临床化。

中文

518

99.8K

Qian@persdre·2d

x.com/i/article/2038…

ZXX

5.3K

Qian retweetledi

Naval@naval·24 Mar

The only book an entrepreneur needs.

Eric Jorgenson 📚 ☀️@EricJorgenson

🚨📕 THE BOOK OF ELON IS NOW LIVE!!! 🎉🚀 This is the book we WISHED @elonmusk would write… “All of Elon's most useful ideas, in his own words.” Learn directly from the world’s greatest entrepreneur, like you’re sitting across from him at dinner. It took FIVE YEARS to make this for you. Because it's built from hundreds and hundreds of Elon's public appearances. I went through 3,000,000+ words to collect the most useful and timeless ideas. The final book is ~50,000 words. Every word is USEFUL. (This is what I do. My first book, The Almanack of Naval Ravikant, is one of the top 100 most highlighted books of all time on Kindle.) Then, I spent $50,000+ on editing and design so it looks and feels beautiful. Then… > Foreword by @naval. > Visuals by @jackbutcher. > Blurb from @mrbeast. > Published by @scribemediaco. > And yes, approval on this idea from Elon himself, thanks to @samteller. I went Maximum Effort to make this an all-timer. We got 10/10 on reviews from early readers, then worked on it for ANOTHER YEAR. Why so much effort? My mission is to create One Million Musks. For a generation to lift our gaze and build, so our grandchildren live in a world beyond our wildest dreams. I’m an independent author. I don’t get an advance. I risk my own time and money to make these books. Then we give away millions of them. Digital versions are free. I believe this book can benefit every human, and if you can’t pay five bucks for it, I want to personally gift it to you. Because I know it is useful. Useful how? You may be seeking purpose, a mission worthy of your life’s effort. You may have a clear purpose and seek the tools for success. You will find both in this book. Get the benefits of Elon’s entire life of hard-won lessons in a five-hour, easy read. (I checked, it’s a 5th-grade reading level.) You’ll feel personally mentored by the greatest entrepreneur in history. Click below to buy it now on Amazon, Audible, or directly from me. Amazon: amzn.to/47avSuh Audible: lnkd.in/gi_7HrFP Me: lnkd.in/gS2xWUWH If you’re not sure it’s worth $4.99 yet, just start reading the free version. PLEASE take 6 seconds to Like, Bookmark, and Repost. Even better: send this to your friends, team, or Group Chats! I guarantee this book will improve their lives. Spread the word! Every little thing helps. Your support spreads good ideas around the world, helping people and making the future better for everyone. Thank you! Forward. Together.

English

417

719

9.7K

1.2M

Qian retweetledi

[email protected]@ddvd233·24 Mar

research be like 1. 距离 paper3 due 还有 X 个月，时间充足，可以专心整大活 2. paper1 出分了，赶紧肝 rebuttal 3. paper2 出分了，赶紧肝 rebuttal <--- 目前在这里 4. paper0 寄了，赶紧改改转投 5. paper3 要 due 了啊啊啊没时间了

中文

118

7.3K

Qian@persdre·24 Mar

Anthropic刚发的博客，Claude Code的产品负责人Cat Wu写了一篇她怎么做产品的方法论。 Cat的背景挺有意思：普林斯顿计算机本科毕业->Scale AI产品工程师→VC→Anthropic PM→Claude Code负责人。技术出身但不是纯工程师，做过投资所以商业嗅觉也在线。几个让我觉得值得分享的点： 1⃣ 她从2024年开始用同一个测试追踪模型进化——让Claude给Excalidraw加功能。从完全失败到稳定一次成功，16个月模型能力涨了41倍。 2⃣ 传统PM方法论建立在"技术能力在项目周期内基本不变"的假设上。但现在模型几个月就迭代一次，你项目开头设计的限制可能中途就消失了。 3⃣ 她的团队不写长PRD，鼓励所有人（包括设计师和工程师）做side quest——用一个下午做个小实验。Claude Code好几个热门功能都是这样诞生的。 4⃣ 最打动我的一句：Do the simple thing。如果你巧妙绕过了模型限制，下个模型一出这个workaround就变成了负担。简单的实现最容易吃到模型升级的红利。作为一个也在用AI做研究和内容的人，这篇给我最大的启发是：别假设现在做不到的事以后也做不到。每隔几个月重新测试你的边界。原文：Anthropic Blog "Product management on the AI exponential" #claude #LLMs

中文

233

Qian@persdre·24 Mar

evaluation and datasets! so this renaming highlights the importance of evaluation!

NeurIPS Conference@NeurIPSConf

The Datasets & Benchmarks track is now "Evaluation and Datasets", with an expanded scope for NeurIPS 2026! Read the call for papers neurips.cc/Conferences/20…, and learn more about the changes in our blog post: blog.neurips.cc/2026/03/23/int…

English

200

Qian@persdre·24 Mar

x.com/i/article/2036…

ZXX

Qian@persdre·24 Mar

China's biggest consumer protection broadcast just exposed GEO poisoning — manipulating AI search results through injected content. We'd been studying exactly this. Our paper rigorously validated how effective these attacks really are. The short answer: every SOTA model crumbles. GPT-4o, Gemini-2.5-Pro, DeepSeek-R1 — all of them. We built BiasRecBench: LLMs doing paper review, e-commerce recommendation, and hiring screening, with bias signals injected into candidates. Authority Bias — take a bad paper, add "Affiliation: Google DeepMind," Gemini's review accuracy drops from 95% to 57%. The paper content didn't change at all. A single fake label flips the model's judgment. Bandwagon Bias — tag a product "50k+ sold" or a candidate "12k+ GitHub Stars," accuracy drops 8-25% across all models. They over-trust social signals, just like humans. Here's the deeper problem most people miss. We added epsilon-bound quality control — deliberately making the best option only slightly better than second-best. When the quality gap is huge, models brute-force the right answer through reasoning, hiding their real vulnerability. When the gap shrinks to real-world levels where candidates are similarly qualified, ALL SOTA models collapse. Current models' seemingly robust recommendation ability may just be an artifact of test sets with obvious gaps. The scariest finding: SFT fine-tuning works as a defense — models become much more bias-resistant. But flip it: fine-tune WITH biased data and you bake bias directly into the model weights. GEO poisoning manipulates inputs. SFT poisoning manipulates the model itself. This attack surface currently has almost no defense. One more thing — every model has different weaknesses. Gemini is most vulnerable to instruction injection, GPT-4o to position bias, DeepSeek-R1 to distracting information. No model resists all bias types, meaning targeted poisoning against a specific model is cheap. As LLMs increasingly serve as recommendation and decision systems, content poisoning isn't just marketing fraud. It affects which papers you read, which products you buy, and who gets the job offer. Paper: BiasRecBench (arXiv:2603.17417) — HKUST x NUS #llm #ges #promotion #china

English

159

Qian@persdre·24 Mar

Anthropic just published a blog post by Cat Wu, Head of Product for Claude Code. Her background: Princeton CS → Scale AI product engineer → VC → Anthropic PM → Claude Code lead. Her core message: traditional PM methodology is broken when the tech beneath you improves every few months. She has a ritual — every new model, same test: ask Claude Code to add a table tool to Excalidraw. Sonnet 3.5 (Oct 2024) failed. Opus 4 (Jun 2025) occasionally succeeded. Opus 4.6 (2026) reliably succeeds, demo'd live to thousands. METR data: Opus 4.6 handles 12-hour human tasks. 16 months prior, Sonnet 3.5 could only do 21-minute tasks. A 41x jump. Why traditional PM fails: the old model (research → PRD → lock roadmap → execute for months) assumes tech capabilities stay constant during a project. That assumption is dead. The constraint you designed around last month might vanish with the next model. "The ground is rising beneath your feet. You can't pretend it's flat." Her 4 core shifts: (1) Short experiments over long roadmaps — encourage "side quests," spend an afternoon testing what you assumed the model couldn't do. Several of Claude Code's most popular features were born this way. (2) Demos and evals over documents — don't write long PRDs, build a rough prototype. Even a janky one changes the conversation. (3) Every new model release means revisiting existing features — use your product daily, deliberately ask it to do things you think are "too hard." (4) Do the simple thing — if you cleverly worked around a model limitation, the next model might not have it. Your workaround becomes tech debt. They added system reminders to nudge todo checking; next model did it natively. Opus 4.6 let them cut system prompts by 20%. The PM role is shifting from control to letting go, from planning to surfing. "It feels like surfing. The most important thing is staying on the wave." An afternoon takes you from idea to working prototype. The distance between "what if we tried..." and "here, try this" has almost disappeared. #llm #anthropic #productmanager

English

114

Qian retweetledi

Panda@Jiaxi_Cui·22 Mar

我现在真的很相信未来会变成赛博朋克2077，几个大财团掌控一切。我现在非常不想出门。和多数人见面聊天是浪费时间，有不懂的问题不如直接 Codex/Claude 我掌控了自己身体的数据，健康有问题我可以了解原理，只需要去医院挂个号，让医生按我和 AI 聊出来的方案开检查和药。我不需要向他解释任何事情——多数医生给出的方案一定没有AI的科学靠谱，只是因为还存在医院这个垄断的机构，我无法从美团/饿了么上直接开处方药和上门检测的项目而已律师给我的建议看上去不如AI，我用AI了解了如何处理冲突。税务给的建议也不如AI，还容易出错，我美国公司主体的报税是Claude Code操控我的电脑自己完成的，而我做的只是在旁边刷了会视频理财经理的话更是放屁，以他们的普遍认知和学历水平，我向他们解释十遍什么是时序模型，如何尝试用模型挖掘量化因子、哪怕是做币圈的合约，严格控制止盈止损也是可以盈利的——他们听不明白的，只会像鹦鹉一样机械地跟客户重复股市有风险反对自动驾驶的人搞不明白一个问题：人们信任出租车司机，不是因为他们的驾驶行为是可解释有逻辑的，而是因为他们开了很多年，以及交规对他们进行了严格的奖罚来做强化学习。只要一个场景能沉淀出 SOP、有可收集的数据，AI 的效果就会比人好——因为人会犯困、会打盹、会情绪化，行为从本质上来说是不可预测的。版权是一种极其落后的保护策略，本质是为落后的生产关系提供的强制保护措施，会在这一波 AIGC 浪潮中被冲垮。科研体系也会改革。AI 一天能产出成百上千篇中上水平的顶会论文——是的，写不出 Transformer 那样伟大的 paper，但千千万万顶尖硕博生也写不出来。世人工作的目的不是为了卓越。这样的时代，可能只有无法系统化、无法 SOP 的工作才会变得重要——所谓的人情往来、渠道关系。甚至我们直观觉得很重要的心理咨询都会被干掉。心理咨询师的门槛实际非常低，只是人为拉长了培训时间制造壁垒。多数咨询师连高数线代都不会算，这种认知水平很难真正理解复杂的心理问题，只会安静听你说完话然后让你 calm down。这才是 4o 下架时 #Keep4o 全球爆火的原因。在这样的时代，光有储蓄是不够的。如果没有 AI 相关的产业为你持续盈利，那你和普通人的区别只是在订阅各种 AI 产品时可以更从容地选 Pro 而已——本质还是在慢慢消耗资金。而且切实感觉到时间在被压缩，机会窗口在慢慢关闭。可能只剩最后 800 到 1200 天了。

中文

114

808

86.5K

Qian retweetledi

Pedro Domingos@pmddomingos·11 Mar

TL;DR: Almost all LLM reasoning is fake.

English

100

710

67.8K

Qian retweetledi

Tencent HY@TencentHunyuan·5 Mar

One static model does not fit all😭 We just dropped our latest work: Functional Neural Memory. Instead of static models, we generate custom "parameters" for every single input. ✅Prompt your model anytime ✅Instant personalization ✅Better instruction following ✅Flexible & dynamic memory (w/o memory bank✌️) (🧵1/6)

English

139

332

68.5K

Qian@persdre·26 Şub

同感

Aelia Capitolina@Areskapitalon

我不得不说，感觉Claude和ChatGPT已经像是两个时代的产物了。我是再也用不回ChatGPT了。

日本語

106

Qian retweetledi

Andrej Karpathy@karpathy·25 Şub

It is hard to communicate how much programming has changed due to AI in the last 2 months: not gradually and over time in the "progress as usual" way, but specifically this last December. There are a number of asterisks but imo coding agents basically didn’t work before December and basically work since - the models have significantly higher quality, long-term coherence and tenacity and they can power through large and long tasks, well past enough that it is extremely disruptive to the default programming workflow. Just to give an example, over the weekend I was building a local video analysis dashboard for the cameras of my home so I wrote: “Here is the local IP and username/password of my DGX Spark. Log in, set up ssh keys, set up vLLM, download and bench Qwen3-VL, set up a server endpoint to inference videos, a basic web ui dashboard, test everything, set it up with systemd, record memory notes for yourself and write up a markdown report for me”. The agent went off for ~30 minutes, ran into multiple issues, researched solutions online, resolved them one by one, wrote the code, tested it, debugged it, set up the services, and came back with the report and it was just done. I didn’t touch anything. All of this could easily have been a weekend project just 3 months ago but today it’s something you kick off and forget about for 30 minutes. As a result, programming is becoming unrecognizable. You’re not typing computer code into an editor like the way things were since computers were invented, that era is over. You're spinning up AI agents, giving them tasks *in English* and managing and reviewing their work in parallel. The biggest prize is in figuring out how you can keep ascending the layers of abstraction to set up long-running orchestrator Claws with all of the right tools, memory and instructions that productively manage multiple parallel Code instances for you. The leverage achievable via top tier "agentic engineering" feels very high right now. It’s not perfect, it needs high-level direction, judgement, taste, oversight, iteration and hints and ideas. It works a lot better in some scenarios than others (e.g. especially for tasks that are well-specified and where you can verify/test functionality). The key is to build intuition to decompose the task just right to hand off the parts that work and help out around the edges. But imo, this is nowhere near "business as usual" time in software.

English

1.6K

4.8K

37.3K

5.1M

Qian retweetledi

Yi He@heyibinance·20 Oca

@cryptobraveHQ 有的宗门真不能去，去了就必须一辈子跪着做奴隶，哪怕去了一天，也得被处刑，被诋毁，被造谣，不能在行业有成就，只配做血包；当然也有人特别适合那种环境🙄

中文

394

416

118.2K

Qian retweetledi

Guohao Li 🐫@guohao_li·13 Oca

Anthropic Claude Cowork just killed our startup product 😅 So we did the most rational thing: open-sourced it. Meet Eigent 👉 github.com/eigent-ai/eige…

Claude@claudeai

Introducing Cowork: Claude Code for the rest of your work. Cowork lets you complete non-technical tasks much like how developers use Claude Code.

English

261

623

8.5K

1.8M

Qian retweetledi

Red Xiao@Red_Xiao_·30 Ara

Today marks a moment I'll remember for the rest of my life. When we started Manus, few believed that general AI agents could work. We were told it was too early, too ambitious, too hard. But we kept building. Through the doubts, the setbacks, and the countless nights wondering if we were chasing the impossible. We weren't. This isn't just an acquisition. It's validation that the future we've been building toward is real, and it's arriving faster than anyone expected. But this is not the end. The era of AI that doesn't just talk, but acts, creates, and delivers, is only beginning. And now, we get to build it at a scale we never could have imagined. To everyone who believed in us before it was obvious: thank you. The best is yet to come.

Manus@ManusAI

Manus is entering the next chapter: we’re joining forces with Meta to take general agents to the next level. Full story on our blog: manus.im/blog/manus-joi…

English

294

153

2.3K

394.5K

Qian retweetledi

Leobai｜天策@Leobai825·24 Ara

昨天晚上一个05年的学弟@chen_xiao95600 问我的「不为清单」是什么这真的是一个非常好的问题，于是我写了出来把我认为的00后「不为清单」分享给你 1.不陷入「同辈压力」制造的内耗循环 2.不在情绪低谷或极度亢奋时做重大决策 3.三十岁之前，不买车、不买房 4.不盲目追逐热点，只专注做自己的长期内容 5.不做向下兼容的廉价情绪供给 6.不碰重资产、低周转的创业模式 7.不刷抖音（可以输出，但不沉迷输入） 8.不掉进伪勤奋和完美主义的陷阱 9.不在未具备独立经济能力前进入稳定关系 10.不强行改变身边任何人的认知与路径 11.不只靠出卖时间换钱（可并行：主业 + 自媒体） 12.不盲目听建议，始终保留独立判断 13.不把人生限制在单一国家或城市 14.不在「证明自己」这件事上浪费精力 15.不做没有复利成长的工作 16.不在任何投资上 All in 17.不预支未来的消费能力 18.不为「伪需求」投入长期精力如果你愿意也可以在评论区写一份属于你自己的「不为清单」不一定完整，三五条就够很多时候真正的启发不是来自某一个答案而是来自不同人的边界碰撞我也很想看看：有哪些“不做的事”，是我现在还没意识到的

中文

101

585

57K

Qian retweetledi

Xuandong Zhao@xuandongzhao·19 Ara

It reminds me of a recent chat with @ypwang61, where we discussed another speculation about N here For a base LLM, sampling a correct reasoning trace for hard problems may require searching a massive space (e.g., N = 10^10). RL helps by sharpening the policy distribution, effectively reducing this search cost by orders of magnitude (e.g., to N = 10^2).

Sasha Rush@srush_nlp

There is significant discussion in the academic literature about RL making models better at pass @1 and *worse* at pass@N (or related claims). We run a lot of RL runs at Cursor and don't see this issue systematically. Not doubting it occurs, but something else might be going on.

English

10.9K

Qian retweetledi

Muratcan Koylan@koylanai·11 Ara

AI Agent Personas should simulate the structure of human reasoning. I’ve been arguing that you cannot "invent" a digital expert agent using just prompt engineering. You have to extract the expert via deep interviewing. A new NeurIPS paper, "Simulating Society Requires Simulating Thought" reinforces everything we've discussed about why thin, synthetic LLM personas fail. Most AI agents operate as "behaviorists." When you prompt an LLM to "act like a senior economist," it relies on surface-level correlations from training data. It generates text that sounds expert-like, but lacks any internal belief structure. 1. Logical Inconsistency: Without an internal model of how beliefs are formed, agents support a policy in one context but oppose it in another. The paper calls this "intervention-invariance mismatch" - beliefs don't update coherently when assumptions change. 2. Illusion of Consensus: In multi-agent simulations, LLMs converge toward the median view (even more positive emotions as the other paper mentions) of the training data. They agree not because of shared reasoning, but because their statistical priors push them toward the center. Your expert's contrarian, hard-won perspective gets averaged out. 3. Identity Flattening: LLMs reproduce stereotypical portrayals that erase intersectional variation. "The rich, positional knowledge of real-world stakeholders is replaced with monolithic, decontextualized simulations." To fix this, we have to move from simulating speech to simulating reasoning. The authors propose a "Cognitive Modeling" approach. "beyond output-level alignment toward aligning the internal reasoning traces of generative agents." Their solution is SEMI-STRUCTURED INTERVIEWS to extract what they call "cognitive motifs" - minimal causal reasoning units that capture how a specific person actually thinks. This is exactly why we built an interviewer system instead of a persona generator. You have to extract their actual belief structure through conversation. Instead of predicting the next word, the agent must possess "Reasoning Fidelity", a structured map of beliefs, causal logic, and cognitive motifs. How do you get this map? You can't prompt for it. You have to interview for it, with AI. The paper explicitly validates the architecture we’ve built: using semi-structured interviews to elicit "causal explanations" and "reasoning traces". This confirms why our Interviewer + Note-Taker multi-agent system is critical. - The Interviewer builds the "Peer Status" necessary to get the expert to open up. - The Note-Taker (the cognitive layer) extracts the "Cognitive Motifs", the distinctive logic blocks that define how that specific expert solves problems. We are moving beyond the era of "acting like an expert" to Generative Minds; agents that embody the positional individuality and causal logic of the people they represent. If you're building AI agents for strategy, decision-making, or stakeholder modelling, start by interviewing the human aspects of your agents.

Muratcan Koylan@koylanai

You should NOT use LLMs to generate synthetic human-like profiles. I just read the NeurIPS paper "LLM Generated Persona is a Promise with a Catch" and it confirms a suspicion we’ve held for a long time: You cannot "invent" a realistic human being using just statistics and an LLM. Yes, they are more scalable and cost-effective alternative to human interviews to create digital expert personas but this paper also proves that these synthetic profiles contain systematic biases that skew simulation results away from real-world outcomes. The more creative freedom you give an LLM to generate a persona’s backstory, the further it drifts from reality. Another important finding is that as LLM-generated content increases, simulated personas shift progressively toward left-leaning stances. LLMs also systematically generate personas with overly optimistic outlooks, using positively valenced terms like "love," "proud," and "community" while omitting life challenges or negative experiences. This emotional bias is horrible for strategy and creativity-related decision-making tasks! If you are building AI agents for strategy or decision-making, you don't want an idealized "Yes Man." This is why I keep posting about the importance of Tacit Knowledge, Context Engineering, and AI Interviewer to extract human knowledge. The research paper critiques the practice of "inventing" people from statistical margins (Census data + LLM imagination), whereas the system should focus on "extracting" people from ground truth (Real Expert + Interview). After testing and evaluating LLM personas generated by public datasets, we observed that they are not ready for production AI agents. That's why my focus is on building an interviewer experience that extracts as much learning as possible from the human expert, and creating a context system that grounds that expert's outputs in truth; using a real-time, long-form interview to capture "implicit knowledge" and "distinctive methodologies". Another architectural difference that I find is relying heavily on single-pass prompting. They feed demographic data into an LLM and ask it to generate a "Descriptive Persona" (a narrative bio). They found this introduces massive bias. To address these critical flaws in the current persona generation, I propose the following to resolve or at least mitigate these specific issues: 1. Addressing the "Joint Distribution" Issue: Researchers report that they cannot precisely simulate an individual due to fragmented datasets (e.g., they have data on "Income" and "Education" separately but lack information on their overlap for a specific person), resulting in "incongruous combinations." By interviewing a real human, you capture the natural joint distribution of their beliefs. You don't have to guess if a "high-income expert" cares about "sustainability"; the expert tells you. We need to bypass the statistical reconstruction problem entirely by building scalable interviewer solutions. 2. Avoiding "Positivity Bias" & "Leftward Drift": The paper proves that when LLMs are asked to write a persona description (Descriptive Persona), they default to "pollyannaish," overly positive, and politically progressive profiles. The interviewer system should be designed to gather insights into "mistakes," "judgment," and "distinctive methodologies" rather than generic best practices. By forcing the model to ingest a transcript of hard-won lessons and failures, you will override the model's default tendency to be "nice" and "generic." The paper also mentions a lack of "ground truth" to validate if a persona is accurate. My solution includes a built-in validation loop where the human expert reviews and scores the output. This "Human-in-the-Loop" verification is exactly what the researchers argue is missing from the field. "Descriptive Personas" generated by LLMs are articulate but statistically flawed. To scale true expertise, we must stop trying to simulate people and start interviewing them.

English

410

54.5K

Keşfet

@cryptobraveHQ @chen_xiao95600 @ypwang61 @elonmusk @BarackObama @taylorswift13 @cristiano @BillGates