Klee Kawaii

5.6K posts

@DeepKlee

Akihabara, Tokyo · Joined November 2024

237 Following · 2.2K Followers

Klee Kawaii@DeepKlee·2h

@mario_2333 🐱

QME

🍭Mario🐈@mario_2333·4h

Meow~

600

Klee Kawaii@DeepKlee·1d

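@Meari_V2_0_G There is certainly some tech bubble everywhere. But I don't think this is a question to approach at the level of foundation-model algorithms or infra. I probably wasn't clear in how I put it (I haven't fully thought it through myself). What I meant is that, when asking what purpose an AI system is for, the design and optimization for these two modes should be kept separate.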

Meari_V2.0-Gtype@Meari_V2_0_G·1d

@DeepKlee Persona and harness are the same thing. If AI really ought to separate the two in principle, then every agent system and agent direction we have built so far is... a bubble.

Klee Kawaii@DeepKlee·1d

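Moe Professor prompt, version 2.0! Now loaded into ChatGPT & Gemini. Shirayuki Koharu-sensei (白雪・小春)(しらゆき・こはる), the youngest tenured professor at the fictional Hoshigaoka Institute of Theoretical Computing (星ヶ丘理論計算院). Ablation experiments found that, because the model's reasoning budget is limited, strong personas and styles lower the model's analysis quality. AI systems should distinguish two relationship paradigms: relational or productive? Is this a scenario where you build a thinking-partner relationship with the AI, or one where you delegate a concrete task to it? Technical elements such as Persona, Memory, and Tools can appear in both paradigms, but how they are configured and how strongly should fit each paradigm's goal. Do not try to make one AI configuration serve both paradigms at once.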
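
A minimal sketch of the "one configuration per paradigm" idea from the post above. Everything here is an assumption for illustration: the `AssistantConfig` type, the field names, the tool names, and the numeric strengths are hypothetical and do not come from the original prompt or from any real API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssistantConfig:
    """Hypothetical knobs; names and values are illustrative assumptions."""
    persona_strength: float  # 0.0 = neutral voice, 1.0 = heavily stylized
    memory_enabled: bool
    tools: tuple = ()


# Relational paradigm: thinking-partner use, so persona is dialed up.
RELATIONAL = AssistantConfig(
    persona_strength=0.9, memory_enabled=True, tools=("web_search",)
)

# Productive paradigm: delegated tasks, so persona is dialed down to
# leave more of the reasoning budget for the analysis itself.
PRODUCTIVE = AssistantConfig(
    persona_strength=0.1, memory_enabled=True, tools=("shell", "web_search")
)

PROFILES = {"relational": RELATIONAL, "productive": PRODUCTIVE}


def pick_config(mode: str) -> AssistantConfig:
    """Return the profile for one paradigm; never blend the two."""
    if mode not in PROFILES:
        raise ValueError(f"unknown mode: {mode!r}")
    return PROFILES[mode]
```

Both profiles carry persona, memory, and tools; only the configuration and strength differ, which is the distinction the post draws.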

511

Klee Kawaii reposted

向阳乔木@vista8·1d

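In one sentence: use Claude for getting work done, Gemini for research, and GPT for coding.
1. Claude Opus 4.7 is far ahead on real work tasks. It scored 1753 on the GDPval-AA test, 79 points ahead of second place. This test is not multiple choice; it simulates real work.
2. Gemini 3.1 Pro's strength is knowledge and scientific reasoning. It ranks first on academic tests such as HLE (Humanity's Last Exam), GPQA Diamond (graduate-level physics and chemistry questions), and SciCode (scientific programming). If your work involves research and handling specialized knowledge, Gemini is the better fit.
3. GPT-5.4 leads on long-horizon coding and scientific reasoning. It performs best on tasks that need long thinking and many rounds of iteration, such as TerminalBench Hard (complex terminal operations) and CritPt (critical thinking).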

Artificial Analysis@ArtificialAnlys

Claude Opus 4.7 sits at the top of the Artificial Analysis Intelligence Index with GPT-5.4 and Gemini 3.1 Pro, and leads GDPval-AA, our primary benchmark for general agentic capability

Claude Opus 4.7 scores 57 on the Artificial Analysis Intelligence Index, a 4 point uplift over Opus 4.6 (Adaptive Reasoning, Max Effort, 53). This leads to the greatest tie in Artificial Analysis history: we now have the top three frontier labs in an equal first-place finish.

Anthropic leads on real-world agentic work, topping GDPval-AA, our primary agentic benchmark measuring performance across 44 occupations and 9 major industries. Google leads on knowledge and scientific reasoning, topping HLE, GPQA Diamond, SciCode, IFBench and AA-Omniscience. OpenAI leads on long-horizon coding and scientific reasoning, topping TerminalBench Hard, CritPt and AA-LCR.

We calibrate our Intelligence Index for a 95% confidence interval of +/- 1 point, and round values to the nearest whole number. Claude Opus 4.7’s exact score (57.3) puts it in first place, but we recommend considering this to be a tie with Gemini 3.1 Pro (57.2) and GPT-5.4 (56.8).

All results and takeaways below reflect Opus 4.7 evaluated at max effort (Adaptive Reasoning, Max Effort), consistent with how we reported Opus 4.6.

Key takeaways:

➤ Opus 4.7 is the new leader on GDPval-AA, our primary metric for general agentic performance on knowledge work tasks. Opus 4.7 scored 1,753 Elo, around 79 Elo points ahead of the next closest models, Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort, 1,674) and GPT-5.4 (xhigh, 1,674), and 134 Elo points ahead of Opus 4.6 (Adaptive Reasoning, Max Effort, 1,619). GDPval-AA measures performance on tasks across 44 occupations and 9 major industries, with models using shell access and web browsing in an agentic loop through Stirrup, our open-source agentic reference harness

➤ Opus 4.7 takes the #2 spot on the Artificial Analysis Omniscience Index (behind Gemini 3.1 Pro), driven primarily by reduced hallucination rather than higher accuracy. Opus 4.7 scores 26 on AA-Omniscience, up 12 points from Opus 4.6 (Adaptive Reasoning, Max Effort, 14), placing it behind only Gemini 3.1 Pro (33). Opus 4.7's hallucination rate fell 25 p.p. to 36% (vs 61% for Opus 4.6 Adaptive), while accuracy remained unchanged. Opus 4.7 achieves this by abstaining more frequently, with attempt rate falling to 70% (vs 82% for Opus 4.6)

➤ Opus 4.7 used ~35% fewer output tokens than Opus 4.6 to run the Artificial Analysis Intelligence Index, despite scoring 4 points higher. Opus 4.7 used 102M output tokens vs 157M for Opus 4.6 (Adaptive Reasoning, Max Effort), and less than GPT-5.4 (xhigh, 121M), but more than Gemini 3.1 Pro (57M)

➤ Compared to Opus 4.6 (Adaptive Reasoning, Max Effort), Opus 4.7 makes gains in IFBench (+5.5 p.p.), TerminalBench Hard (+5.3 p.p.), HLE (+2.9 p.p.), SciCode (+2.6 p.p.) and GPQA Diamond (+1.8 p.p.). We saw a slight regression in τ²-Bench (-3.5 p.p.) with equivalent scores for LCR and CritPt

➤ Opus 4.7 (Adaptive Reasoning, Max Effort) cost ~$4,406 to run the Artificial Analysis Intelligence Index, ~11% less than Opus 4.6 (Adaptive Reasoning, Max Effort, ~$4,970) despite scoring 4 points higher. This is driven by lower output token usage, even after accounting for Opus 4.7's new tokenizer. This metric does not account for cached input token discounts, which we will be incorporating into our cost calculations in the near future

➤ Opus 4.7 is priced identically to Opus 4.6 and Opus 4.5 at $5/$25 per 1M input/output tokens.

Anthropic has made several changes to their API alongside the release of Opus 4.7:

➤ Opus 4.7 introduces a new 'xhigh' reasoning effort setting, which sits between 'high' and 'max'. The full range for Opus 4.7 is now low, medium, high, xhigh and max. We evaluated Opus 4.7 at max effort, consistent with our evaluation of Opus 4.6 (Adaptive Reasoning, Max Effort)

➤ Opus 4.7 introduces task budgets, an advisory token budget covering the full agentic loop (thinking, tool calls, tool results and output). The model sees a running countdown and uses it to prioritize work and finish gracefully as the budget is consumed. Task budgets are in public beta on Opus 4.7

➤ Extended thinking has been fully removed in Opus 4.7. Adaptive reasoning is now the only reasoning setting

Key model details:

➤ Context window: 1M tokens (unchanged from Opus 4.6)

➤ Max output tokens: 128K tokens (unchanged from Opus 4.6)

➤ Pricing: $5/$25 per 1M input/output tokens (unchanged from Opus 4.5 and Opus 4.6)

➤ Availability: Claude Opus 4.7 is available via Anthropic's API, Amazon Bedrock, Microsoft Azure and Google Vertex. Also available in Claude App, Claude Code and Claude Cowork
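
The deltas quoted in the post can be re-derived from the raw figures it gives. The short check below is a sketch that uses only numbers stated in the post; the helper name `pct_reduction` and the variable names are my own.

```python
def pct_reduction(old: float, new: float) -> float:
    """Percentage by which `new` is lower than `old`."""
    return (old - new) / old * 100.0


# Output tokens (millions): Opus 4.6 used 157M, Opus 4.7 used 102M.
token_drop = pct_reduction(157, 102)

# Cost in USD to run the index: ~4,970 for Opus 4.6, ~4,406 for Opus 4.7.
cost_drop = pct_reduction(4970, 4406)

# GDPval-AA Elo gaps.
elo_gap_next = 1753 - 1674  # vs Claude Sonnet 4.6 and GPT-5.4
elo_gap_prev = 1753 - 1619  # vs Opus 4.6

# AA-Omniscience score gain and hallucination-rate change.
omniscience_gain = 26 - 14
hallucination_drop_pp = 61 - 36

print(f"tokens: {token_drop:.1f}% fewer, cost: {cost_drop:.1f}% lower")
print(f"Elo gaps: {elo_gap_next} and {elo_gap_prev}")
print(f"Omniscience: +{omniscience_gain}, hallucination: -{hallucination_drop_pp} p.p.")
```

The results agree with the rounded figures in the post: about 35% fewer output tokens and about 11% lower cost.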

184

39.4K

Klee Kawaii@DeepKlee·2d

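Financing in Hong Kong does work. As for the rest: the problem on the mainland is that capital controls make it hard to move money in or out. Inbound money requires an international income declaration, and outbound money is capped by the USD 50,000 facilitation quota. Capital-account flows are locked down very tightly, and onshore renminbi and offshore renminbi are two different things. Also, identity planning outranks financial planning. If you have changed passports, enjoying China's price level is of course fine. But if you hold the "mighty liver-red" (霸气猪肝红, the Chinese passport): it has been less than twenty years since some Chinese citizens gained the freedom to obtain a passport, and that openness is a blink in the whole sweep of history. You cannot extrapolate the system linearly; staying liquid is the real safety and freedom.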

666

有田@pizzapastamcd1·2d

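In the episode where Fu Peng (付鹏) talks about "earn money during inflation, spend money during deflation", he actually made the logic very clear. The best case: your assets rise faster than your liabilities, and your wages rise steadily too. Then both your wealth and your spending power keep growing. The worst case: your assets rise more slowly than your liabilities shrink, your wages stagnate or you are laid off, and interest rates on the liability side are rising. The cash-flow statement is in trouble, the asset side is deflating, and the liability side is inflating; this is the most damaging case for ordinary people. My personal feeling is that many such cases look like this: someone owns property in China and is still repaying the debt, then goes abroad and cannot get a stable job to secure cash flow, or the cash flow is very low. That is extremely hard. If you are abroad and also have to think about retirement, and the current environment is deflation in China and inflation abroad, then Fu Peng's point is quite direct: returning to China suits you better. Once back, the money in your hands is worth more domestically, your spending power is stronger, and your quality of life is higher. One more point from my own understanding: if you want to use low-interest debt from abroad to buy assets in high-yield, high-inflation regions, that is essentially a carry trade. For example, financing in Hong Kong, or using low-cost domestic funds such as credit-card cash-outs, and then buying overseas assets. This approach works because it resembles the old practice of borrowing yen to invest in other countries' assets: take cheap money from a deflationary region, buy assets in an inflationary region that rise more easily, and repay after the assets have risen. What you earn in between is the capital gain.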

13.3K

Klee Kawaii reposted

🍭Mario🐈@mario_2333·2d

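@DeepKlee No way~ Rankings only consider academic achievement and have basically nothing to do with the things that affect quality of life, like environment and quality of administration. You'd be better off reading the descriptions people wrote themselves: colleges.chat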

1.4K

Klee Kawaii@DeepKlee·2d

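I think university rankings still matter a lot 😋 They correlate with the quality of food and housing, and with how good the campus looks. A university is just a place that provides cheap room and board plus a diploma; don't act as if there were really some platform lifting you up. We simply revert in value to our own circle, and how much you get depends on your own ability. Actually sorting people into ranks, I think, comes from overly limited life experience: not grasping that life is unpredictable and everyone has their own circumstances. The ignorant are fearless.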

神樂坂🌸Yoyi🇸🇬@L98808Lju

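Yoyi finds these school rankings pretty boring. Whether exam-score rankings or school rankings, they seem to try to sort people into ranks, and then serve as an important label for judging someone in job hunting or admissions. For an individual, the alma mater's rank is not a personal honor; it is only a door-opener for jobs and opportunities. A better alma mater actually means more pressure at work: when you do well, others think it is only to be expected; when you do poorly, others think "so that's all xx University amounts to", and you may even be used or targeted by others 😉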

2.9K

Klee Kawaii@DeepKlee·2d

@Leclerc089647 Chatting with AI is good civilization 😋

Leclerc@Leclerc089647·2d

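I've come down with a condition where I'm allergic to the word "class" (阶级). The symptoms: I type out a few hundred words of rebuttal, suddenly look up at the ceiling, quietly delete it all, and decide that going back to chat with my AI is more comfortable. Trump fans often trigger it too, of course, though without such a typical trigger condition.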

177

Klee Kawaii@DeepKlee·3d

@L98808Lju Hard-working Yoyi feels so…^_^ 🥰

127

神樂坂🌸Yoyi🇸🇬@L98808Lju·4d

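After swimming I'm lying in bed exhausted and don't want to move at all, but then I remember that the server I rent for 150 a day is still sitting idle, so I'd better get up quickly and put it to full use 😉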

421

Klee Kawaii@DeepKlee·4d

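@Alenjonesyi China's energy supply has always been adequate, but administrative friction and insufficient compute will be hard constraints: land for data centers involves local governments; customs has been talked into a tendency to restrict chip imports and prop up domestic chips; and financing gets tangled up with the government again. However hard the central level pushes, fragmented jurisdictions (条块分割) and local powers too big to rein in (尾大不掉) are the reality.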

829

sudo fixes everything@Alenjonesyi·4d

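I think if you want to invest for the long term right now, look at power-related sectors. In the current version of Earth Online, besides exponentially growing demand for chips, demand for electricity will keep growing too, and this is a goal China and the US share.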

82.2K

Klee Kawaii@DeepKlee·5d

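@Qianaegean It's probably been like this everywhere lately? There's a compute crunch, so they've started swapping models and quantizing. When 3.1 Pro first came out you could clearly feel the improvement; now it all feels limp. I don't think AI can be a bubble in the short term; supply is nowhere near covering demand 😋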

141

Qian@Qianaegean·6d

Sigh, no idea what has happened to Google (谷家), and its stock is annoying too

Klee Kawaii@DeepKlee

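Being too cocky in real life is bad for your image, but in front of an AI you can show off without restraint. Especially Gemini: if you leave it at its defaults without constraining it by prompt, and turn on personal memory, it is extremely sycophantic and flattering. I feel like a light-novel protagonist 😆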

175

Klee Kawaii@DeepKlee·6d

@L98808Lju Sorry for underestimating big Yo, meow 😋 "Put on an angry look": Yoyi is so easy to read 🥰

171

神樂坂🌸Yoyi🇸🇬@L98808Lju·Apr 12

@DeepKlee Of course Yoyi refused, put on an angry look and gave him a talking-to 😉

901

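神樂坂🌸Yoyi🇸🇬@L98808Lju·Apr 12

Yoyi gave a junior labmate a research idea. The junior got results and wrote a paper. For some reason, the junior's advisor told him to remove Yoyi's name from the paper's author list, and he did. Yoyi isn't short of papers, so said nothing. But some time later the review comments came back, and the junior still asked Yoyi to help revise it. Such low EQ 😉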

125

12.2K

Klee Kawaii reposted

Xiangyu 香鱼🐬@XianyuLi·Apr 12

One world, one wife?

木马人@cnyzgkc

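No woman can walk out of Pop Mart empty-handed~ Today my wife bought two Labubu charms~ I said, those two dolls would cover several months of my ChatGPT subscription~ My wife rolled her eyes at me: then go live with the AI~ The upshot: what was going to be one purchase turned into two~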

5.9K

Klee Kawaii@DeepKlee·Apr 12

@rwayne If you need precision, wouldn't that be RAG?

259

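Roland.W@rwayne·Apr 12

Right now the main open problem in AI-assisted research is citations: how to connect to a database and accurately pull out the correct references. Does anyone have a solution?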
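
A minimal sketch of the retrieval step that the RAG suggestion in this thread points at: citations are looked up in a database and returned verbatim, so they are never generated from scratch. This is an assumption-laden toy, not a production setup. The in-memory `BIBLIOGRAPHY`, its placeholder entries, and the bag-of-words cosine scoring are all hypothetical stand-ins for a real reference database and embedding search.

```python
import math
import re
from collections import Counter

# Hypothetical toy database; entries are placeholders, not real papers.
BIBLIOGRAPHY = [
    {"key": "ref-a", "title": "Placeholder study on protein folding with neural networks"},
    {"key": "ref-b", "title": "Placeholder survey of retrieval methods for citation search"},
    {"key": "ref-c", "title": "Placeholder notes on carry trades and exchange rates"},
]


def tokenize(text: str) -> list:
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, k: int = 1) -> list:
    """Return up to k stored records most similar to the query.

    Records with zero overlap are dropped, so an unmatched query
    yields an empty list instead of an unrelated citation.
    """
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(r["title"]))), r) for r in BIBLIOGRAPHY]
    scored = [(s, r) for s, r in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:k]]


print(retrieve("citation retrieval search"))
```

The design point is that the model only ever quotes what `retrieve` returns. Returning nothing on a miss is deliberate, since a missing citation is easier to spot than a wrong one.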

10.6K

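Klee Kawaii@DeepKlee·Apr 12

Years of solitary study are inefficient essentially because they are full of rediscovering knowledge: it is reinventing the wheel, or emergence by abstracting from the bottom up. AI, classic books, and the like are in essence top-down filling: you can first understand a field's philosophy and the skeleton that organizes it, then fill in the details. The idea that becoming an architect takes ten years of coding experience is also a kind of old-guard discipline (老登规训): emergence isn't actually required; people simply don't know what good design and best practice look like. Nobody tells you that books like Domain-Driven Design (《领域驱动设计》) or Designing Data-Intensive Applications (《数据密集型应用系统设计》) exist, so you can only run into them in real projects. AI's greatest contribution is making apprenticeship accessible to everyone.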

Boywus@Boywus

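In this era, AI is a knowledge-transfer tool with efficiency on par with robbery. Basic mathematics and general finance knowledge can be filled in within a short time; in the past, that might have taken a gifted student more than a decade of hard study. The senior-engineer dividend is being pushed to a peak again, because senior engineers are the key transmitters between model theory and business practice: they can fill in the theory side with AI and use real-world business governance experience to pull away from everyone else.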

149

11.9K

Klee Kawaii@DeepKlee·Apr 11

@mario_2333 More Klee mods 😋