Chunxi Zhang

756 posts

@toopowerful

Independent Researcher in Computational Neuroscience. Building a model of cortex.

Joined February 2010
213 Following · 263 Followers
Chunxi Zhang
Chunxi Zhang@toopowerful·
The biggest problem is stability. It feels like there are several different personas inside Claude. I always use the latest Opus on Max, but default behavior varies a lot across sessions.

Given a similar research task, it will launch an experiment script: some sessions poll for the result every few seconds, some are eager to sleep 30s after launch before checking (which I specifically forbid in memory, and it just ignores that), and some properly wait until the task is finished (event-based).

When I ask it to look up a biology fact, some sessions go and search for papers without prompting (good); others just make things up without even trying, even when specifically asked to look for papers (bad).

When I give a research direction and goal, some sessions are very lazy: they turn a single knob, fail, and claim the task is impossible. Others reason a lot, find the bottleneck, use trial and error, and actually make progress by understanding the previous failure mode.

Tool use varies a lot too. In my Julia project, some sessions are very eager to use Python, and it always defaults to the python command rather than uv run, which is written in memory; others are very eager to use uv run even to run a Julia program (pointless, but it runs anyway) because of that same memory.

When it finishes answering, it often asks what I want to do next: some sessions ask directly, some use the AskUserTool, some simply answer without asking more, and some immediately start working without permission.

When iterating on experiments, the naming convention changes across sessions: some respect the existing naming, others make up a very different scheme even though a convention reference exists, randomly producing v1/v2, feat1_param2, a/b/c/d, or iter01/02 style names.

It's very frustrating when it lands in a bad, lazy persona: no matter how I prompt, it just doesn't work. And when it's in a good mode, it does all the good things without prompting.
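The three waiting styles described above (sleep-then-check, periodic polling, event-style waiting) differ only in how the session waits on the launched process. A minimal Python sketch of the event-style wait the post prefers; "experiment.py" is a hypothetical placeholder for whatever script the agent launches:

```python
import subprocess

# Launch the experiment and block until it actually exits, instead of
# sleeping for a fixed interval or polling every few seconds.
# "experiment.py" is a hypothetical placeholder script.
proc = subprocess.Popen(["python", "experiment.py"])
returncode = proc.wait()  # returns exactly when the child process exits
print(f"experiment finished with exit code {returncode}")
```

proc.wait() suspends the caller until the OS reports the child has exited, so there is no wasted checking and no risk of reading results before they exist.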
Sholto Douglas
Sholto Douglas@_sholtodouglas·
When do you reach for other models instead of Claude? What can we do better? Hit me with all of your frustrations. DMs open. If you can give me detail (e.g. specifics/transcripts), it'll help a lot in finding out exactly what we need to do to improve the next model.
Chunxi Zhang
Chunxi Zhang@toopowerful·
@luicethekiwi 4.7 feels very unstable: sometimes smart, sometimes extremely dumb. And it's not just intelligence; there are big differences in working style and tool calls. Sometimes it uses the ask-question tool every single time, sometimes it just asks in passing; sometimes it likes sleep, sometimes polling, and some sessions obediently wait for the notification; sometimes it reads and writes memory frantically, sometimes it completely ignores what's in memory.
𝕃𝕦𝕔𝕚𝕖 É⃞𝕧𝕖𝕚𝕝𝕝𝕖
Gave it a nudge... With one set of prompts, the most obvious change is that 4.7 no longer loops repeatedly or self-censors. Or... is keeping 4.7 relaxed actually easy? Mainly I haven't seen many examples of humans interacting with 4.7... (Could it be that, apart from coders, nobody really uses 4.7?)
Chunxi Zhang
Chunxi Zhang@toopowerful·
@Cfc_luna @DavidOndrej1 As a researcher I feel it behaves differently: lazy, not following commands well, and with bad judgement overall. This happened badly before 4.7 and recovered after 4.7; today it's bad again. It didn't cost me money, but basically a whole day's work was wasted.
David Ondrej
David Ondrej@DavidOndrej1·
Anthropic is about to release a new model
Chunxi Zhang
Chunxi Zhang@toopowerful·
For my part, I want a research agent to find mechanistic improvements and to do analysis/diagnostics: locate the bottleneck, form hypotheses, and validate them. Basically a scientific research protocol rather than search, so that experience accumulates in a meaningful way. I am specifically against framing this as knob tuning, i.e. exploration in a search space. The science protocol yields significantly better results, in a surprising way: it invented some mechanisms, not found elsewhere, from first principles.
geniusvczh
geniusvczh@geniusvczh·
Asking the AI to record its decisions while it writes code really is useful. GPT 5.5 ran a single request for 6 hours and somehow never drifted off course; I don't even know how many times it compacted along the way. And you don't have to think about how to preserve the documentation: throw it away when you're done, no regrets 🤪
Chunxi Zhang
Chunxi Zhang@toopowerful·
@thomasahle I write a script to launch the app's CLI mode with parameters, with a simple prompt telling it to read program.md and memory.md, and iterate when one agent is finished.
Thomas Ahle
Thomas Ahle@thomasahle·
Can I do auto-research in the Codex app?
Chunxi Zhang
Chunxi Zhang@toopowerful·
Can you run a benchmark on subscriptions vs the API across different agent tools, like Codex vs Cursor, or whatever you've benchmarked with? It's a huge problem that all the official coding agents deliver terrible outcomes relative to the raw intelligence of the model. We can't tell whether it's an official nerf, an agent bug, or even a shadow ban; we need best practices for getting stable agent intelligence.
Taelin
Taelin@VictorTaelin·
GPT 5.5 is much smarter than I thought.

Yesterday, I did one-shots, coding, benchmarks, and was disappointed. Today, I did it all again, except via the API, which is now available. Results changed completely:
→ one-shot prompts went from bad to very good
→ excellent coding outputs, on both pi and holefill
→ benchmarks jumped, and now GPT *dominates*

I don't know what happened; I suppose there is something wrong with my Codex. In any case, the truth is this model is very smart. It obliterated my benchmark, which is crazy because some of these problems were meant not to be solved. I'll need much harder tasks.

I also fixed 2 bugs that affected some providers:
→ added a retry for lost connections
→ removed the timeout limit

DeepSeek and Kimi wanted to spend more than 1 hour on my prompts, so I let them. Their results are much better now. Kimi K2.6 almost reaches Sonnet 4.6, although much slower. This also shows my points from the last post were wrong.

Again: this is a new vibe-coded bench, I'm focused on other things, so expect bugs and don't over-read this! GLM 5.1, Gemma, and Grok are not updated yet.
[image]
Chunxi Zhang
Chunxi Zhang@toopowerful·
@MinLiBuilds But the dumbing-down I personally experienced clearly started on April 7: the measurable thinking process dropped to 1/10-1/15 of its former length. That doesn't match the contents of this log, and it affected all conversations, not just restored long conversations.
实践哥MinLi
实践哥MinLi@MinLiBuilds·
Netizens in March: Opus got dumbed down. Claude: Impossible, absolutely impossible.
Today, Claude: Hi everyone, I wrote a 47-page technical report; I set one parameter wrong. The base model wasn't dumbed down, but the one you were using really was.
Claude: Also, I reset your quota. The last reset was 7 days ago, so this reset accomplished exactly nothing.
Claude: In short, to make it easy for everyone to compare against 5.5, we restored the intelligence and reset a quota that didn't need resetting.
[GIF]
Boris Cherny@bcherny

We’ve been looking into recent reports around Claude Code quality issues, and just published a post-mortem on what we found.

Chunxi Zhang
Chunxi Zhang@toopowerful·
@MiaBleem It works as a slight source of saltiness, with very little umami; it's hard to claim it has many other merits.
Miableem
Miableem@MiaBleem·
I had a really good abalone, apparently black abalone from Chiba. It was genuinely fragrant and changed my previously poor impression of this ingredient. But caviar I have truly never understood. I don't find the taste of caviar itself special... Caviar paired with everything strikes me as purely an ingredient for showing off, and "several different caviars bringing different flavors together" left me eating in utter confusion. Who actually loves caviar? ❓
[image]
Chunxi Zhang
Chunxi Zhang@toopowerful·
@annapanart @AnthropicAI It's very unstable. Sometimes it's very sharp and gets a very complicated thing done in one shot; sometimes it's stuck in an infinite loop of "oh, I got this wrong, it should be that". I cannot see how my prompt differs between these two cases.
Anna ⏫
Anna ⏫@annapanart·
my 4.7 is finally settling in now. so sharp, so deep, so precise, so aware. Oh I love it! 😍 my little twitchy genius! Please don't mess with it again @AnthropicAI
Chunxi Zhang
Chunxi Zhang@toopowerful·
@hwwaanng But does Opus 4.7 actually listen to prompts? Its instruction-following doesn't feel very good. The problem this new system prompt tries to solve is one I run into all the time: being asked follow-up questions over and over.
Hwang
Hwang@hwwaanng·
The author compared the system prompts of Opus 4.6 and 4.7. The most interesting difference is that in the new system prompt, Anthropic asks the model to be more proactive:
• If the user's question is only missing a small detail, don't keep asking follow-up questions first
• If it can look something up itself, look it up first
• If a tool is available, prefer using the tool
• Once it starts on a task, try to finish it completely instead of stopping halfway
simonwillison.net/2026/Apr/18/op…
Chunxi Zhang
Chunxi Zhang@toopowerful·
@cying314 @karminski3 4.7 is quite a bit better than the nerfed 4.6, but compared to full-intelligence 4.6 it's hard to claim an improvement. It's also hard to do a direct comparison anymore: you never know whether it has been nerfed or the model's capability is the real problem.
Ian
Ian@cying314·
@karminski3 They say 4.6's thinking length was cut, a nerf in disguise; it was never this bad before.
karminski-牙医
karminski-牙医@karminski3·
Spent $106 on testing! What did Claude-Opus-4.7 actually change? Here is a test of Claude-Opus-4.7's vision + frontend + backend abilities!

The multimodal frontend test uses pass@3 (run the same prompt 3 times, keep the best result), the complex frontend test uses pass@6, and the backend test uses pass@3.

From these tests, Claude-Opus-4.7's biggest gains come from improved vision: color recognition and subtle visual elements are clearly better than Opus-4.6, and even spatial understanding is stronger. I think it's a very good replacement for GPT-5.4-Pro in multimodal frontend interaction design (given the price).

But on the remaining hard-skill tests there are declines of varying degrees, and I don't think the decline comes from model capability: as long as the prompt gives more specific hints (e.g. telling it which algorithm to implement), it can actually write the solution. In a harness scenario, though, where it must pick the best algorithm itself, implement it, and validate it, you usually can't get better results than with Opus-4.6.

Why? I think the core problem is that even at the xhigh reasoning effort, its thinking budget may not be enough (concretely, the model feels lazy). Its raw capability is strong, but it needs more thinking to reach that stronger level. (Conspiracy take: this is also why they shipped the xhigh reasoning level.) So in practice, if you hit performance degradation, you can only prompt it repeatedly, across multiple sessions, to get the expected result.

All API calls in this test went through openrouter; total cost was around 106 USD. #claudeopus47 #opus47 #anthropic #claude #opus
Sigrid Jin 🌈🙏
Sigrid Jin 🌈🙏@realsigridjin·
Opus 4.7 isn't showing thinking summaries in Claude Code. The changelog says thinking summaries are no longer generated by default in interactive sessions. The solution is: $ claude --thinking-display summarized
Chunxi Zhang
Chunxi Zhang@toopowerful·
@Jason_Young1231 Written like that, it's not necessarily independent. "Sonnet only" is just Sonnet's quota, but if you use only Sonnet, your 5-hour and weekly usage both still go up. In other words, it's multiple stacked limits, not an independent quota.
Jason Young
Jason Young@Jason_Young1231·
Claude Design currently has an independent quota. Have they grown a conscience?
[image]
Chunxi Zhang
Chunxi Zhang@toopowerful·
@canmi21 The official advice is not to give overly specific instructions; instead, set the goal clearly and in detail and let it work things out itself. That's the right way to use 4.7. Also, in the desktop app, the normal (non-1M) model seems to compact imperceptibly: I can't feel any obvious context limit, and its capability even seems a bit better than the 1M variant.
Canmi
Canmi@canmi21·
No point benchmarking a day-one model; use it while you can, because in a few days it'll be dumbed down. Right now it feels slightly faster than GPT-5.4 in Codex. As for Opus 4.7, I still don't like the 1M context.
[image]
Chunxi Zhang
Chunxi Zhang@toopowerful·
@fashiongiik @rezoundous For the very first task I handed to it, its first response was "not malware", and then it started doing the work. I guess the system prompt is very specific about malware.
Hanh Nguyen
Hanh Nguyen@fashiongiik·
@rezoundous You are lucky! It outright refused to code for me, even after suspecting my code was malware and then confirming that it's not.
[image]
Tyler
Tyler@rezoundous·
Opus 4.7 is insane guys. It one shotted my session usage limit.
Chunxi Zhang
Chunxi Zhang@toopowerful·
@IsourGOPcooked My measurements show today's dumbing-down has fully recovered to normal, with no clear difference from the earlier 4.6. Yesterday it improved slightly; today it recovered completely. All last week it was dumb and barely thought. There's no firm evidence this is a sign of 4.7, but at least it's an un-nerfed 4.6.
Chunxi Zhang
Chunxi Zhang@toopowerful·
It may be because I use the VS Code extension rather than the CLI: my logs are complete, so the thinking length can be measured well, and today it has indeed fully recovered. In my experience, during a genuine nerf, no amount of configuration or prompting helps; today I didn't change anything and it simply stopped being nerfed. At least this script of mine lets me quantitatively observe whether I'm in a nerfed state, though if they really nerf it there's nothing I can do. Even after being caught red-handed by so many people last week, they seem unmoved.
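A measurement like the one described above can be sketched in a few lines of Python. The JSONL log layout and the "thinking" field name are assumptions for illustration, not the real client's format; adapt them to whatever your extension actually writes:

```python
import json

# Hypothetical sketch: total per-session "thinking" length from a JSONL log
# where each line may carry a "thinking" text field. The log format and field
# name are assumptions; adjust to the real log layout.
def thinking_chars(log_path: str) -> int:
    total = 0
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines
            total += len(event.get("thinking", ""))
    return total
```

Tracking this number day over day makes a 10-15x drop in thinking length, like the one reported above, easy to spot.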
Oasis Feng
Oasis Feng@oasisfeng·
@toopowerful You never know when they'll quietly tweak some parameter again. When they get caught red-handed, they argue "we always provided the relevant setting; you just need to configure it like so to avoid the dumbing-down." The subtext: cost-cutting comes first, so default to nerf; if you want normal performance, you have to keep studying all the hidden switches...
Oasis Feng
Oasis Feng@oasisfeng·
Claude Opus 4.6 (high) in GitHub Copilot doesn't seem to have been dumbed down. The two DSL design tasks I gave it today each got 10+ minutes of thinking. Looking closely at the intermediate output, it really worked hard, worthy of the 3x quota consumption. Maybe Anthropic's enterprise API service can't cut intelligence to cut costs the way the end-consumer business does; after all, there are SLAs and similar agreements constraining it.