Chunxi Zhang

756 posts

@toopowerful

Independent Researcher in Computational Neuroscience. Building a model of cortex.

Joined February 2010
213 Following · 263 Followers
Chunxi Zhang
Chunxi Zhang@toopowerful·
The biggest problem is stability. It feels like there are several different personas inside Claude. I always use the latest Opus on Max, but default behavior varies a lot across sessions.

Given a similar research task, it will launch an experiment script: some sessions poll for the result every few seconds, some are eager to sleep 30s after launch before checking (which I specifically forbid in memory, and it just ignores that), and some properly wait until the task is finished (event-based).

When I ask it to look up a biology fact, some sessions go and search for papers without prompting (good); others just make things up without even trying, even when specifically asked to look for papers (bad).

When I give a research direction and goal, some sessions are very lazy: they turn a single knob, fail, and claim the task is impossible. Others reason a lot, find the bottleneck, use trial and error, and actually make progress by understanding the previous failure mode.

Tool use varies a lot too. In my Julia project, some sessions are very eager to use Python, and it always defaults to the python command rather than uv run, which is written in memory; others are very eager to use uv run even to run a Julia program (pointless, but it runs anyway) because of that same memory.

When it finishes answering, it often asks what I want to do next: some sessions ask directly, some use the AskUserTool, some simply answer without asking more, and some immediately start working without permission.

When iterating on experiments, the naming convention changes across sessions: some respect the existing naming, others make up a very different scheme even though a convention reference exists, randomly producing v1/v2, feat1_param2, a/b/c/d, or iter01/02 style names.

It's very frustrating when it lands in a bad, lazy persona: no matter how I prompt, it just doesn't work. And when it's in a good mode, it does all the good things without prompting.
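The three waiting styles described above (sleep-then-check, periodic polling, event-style waiting) differ only in how the session waits on the launched process. A minimal Python sketch of the event-style wait the post prefers; "experiment.py" is a hypothetical placeholder for whatever script the agent launches:

```python
import subprocess

# Launch the experiment and block until it actually exits, instead of
# sleeping for a fixed interval or polling every few seconds.
# "experiment.py" is a hypothetical placeholder script.
proc = subprocess.Popen(["python", "experiment.py"])
returncode = proc.wait()  # returns exactly when the child process exits
print(f"experiment finished with exit code {returncode}")
```

proc.wait() suspends the caller until the OS reports the child has exited, so there is no wasted checking and no risk of reading results before they exist.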
Sholto Douglas
Sholto Douglas@_sholtodouglas·
When do you reach for other models instead of Claude? What can we do better? Hit me with all of your frustrations. DMs open. If you can give me detail (e.g. specifics/transcripts), it'll help a lot in finding out exactly what we need to do to improve the next model.
Chunxi Zhang
Chunxi Zhang@toopowerful·
@luicethekiwi 4.7 feels very unstable: sometimes smart, sometimes extremely dumb. And it's not just intelligence; there are big differences in working style and tool calls. Sometimes it uses the ask-question tool every single time, sometimes it just asks in passing; sometimes it likes sleep, sometimes polling, and some sessions obediently wait for the notification; sometimes it reads and writes memory frantically, sometimes it completely ignores what's in memory.
𝕃𝕦𝕔𝕚𝕖 É⃞𝕧𝕖𝕚𝕝𝕝𝕖
Gave it a nudge... With one set of prompts, the most obvious change is that 4.7 no longer loops repeatedly or self-censors. Or... is keeping 4.7 relaxed actually easy? Mainly I haven't seen many examples of humans interacting with 4.7... (Could it be that, apart from coders, nobody really uses 4.7?)
Chunxi Zhang
Chunxi Zhang@toopowerful·
@Cfc_luna @DavidOndrej1 As a researcher I feel it behaves differently: lazy, not following commands well, and with bad judgement overall. This happened badly before 4.7 and recovered after 4.7; today it's bad again. It didn't cost me money, but basically a whole day's work was wasted.
David Ondrej
David Ondrej@DavidOndrej1·
Anthropic is about to release a new model
Chunxi Zhang
Chunxi Zhang@toopowerful·
For my part, I want a research agent to find mechanistic improvements and to do analysis/diagnostics: locate the bottleneck, form hypotheses, and validate them. Basically a scientific research protocol rather than search, so that experience accumulates in a meaningful way. I am specifically against framing this as knob tuning, i.e. exploration in a search space. The science protocol yields significantly better results, in a surprising way: it invented some mechanisms, not found elsewhere, from first principles.
geniusvczh
geniusvczh@geniusvczh·
Asking the AI to record its decisions while it writes code really is useful. GPT 5.5 ran a single request for 6 hours and somehow never drifted off course; I don't even know how many times it compacted along the way. And you don't have to think about how to preserve the documentation: throw it away when you're done, no regrets 🤪
Chunxi Zhang
Chunxi Zhang@toopowerful·
@thomasahle I write a script to launch the app's CLI mode with parameters, with a simple prompt telling it to read program.md and memory.md, and iterate when one agent is finished.
Thomas Ahle
Thomas Ahle@thomasahle·
Can I do auto-research in the Codex app?
Chunxi Zhang
Chunxi Zhang@toopowerful·
Can you run a benchmark on subscriptions vs the API across different agent tools, like Codex vs Cursor, or whatever you've benchmarked with? It's a huge problem that all the official coding agents deliver terrible outcomes relative to the raw intelligence of the model. We can't tell whether it's an official nerf, an agent bug, or even a shadow ban; we need best practices for getting stable agent intelligence.
Taelin
Taelin@VictorTaelin·
GPT 5.5 is much smarter than I thought.

Yesterday, I did one-shots, coding, benchmarks, and was disappointed. Today, I did it all again, except via the API, which is now available. Results changed completely:
→ one-shot prompts went from bad to very good
→ excellent coding outputs, on both pi and holefill
→ benchmarks jumped, and now GPT *dominates*

I don't know what happened; I suppose there is something wrong with my Codex. In any case, the truth is this model is very smart. It obliterated my benchmark, which is crazy because some of these problems were meant not to be solved. I'll need much harder tasks.

I also fixed 2 bugs that affected some providers:
→ added a retry for lost connections
→ removed the timeout limit

DeepSeek and Kimi wanted to spend more than 1 hour on my prompts, so I let them. Their results are much better now. Kimi K2.6 almost reaches Sonnet 4.6, although much slower. This also shows my points from the last post were wrong.

Again: this is a new vibe-coded bench, I'm focused on other things, so expect bugs and don't over-read this! GLM 5.1, Gemma, and Grok are not updated yet.
[image]
Chunxi Zhang
Chunxi Zhang@toopowerful·
@MinLiBuilds But the dumbing-down I personally experienced clearly started on April 7: the measurable thinking process dropped to 1/10-1/15 of its former length. That doesn't match the contents of this log, and it affected all conversations, not just restored long conversations.
实践哥MinLi
实践哥MinLi@MinLiBuilds·
Netizens in March: Opus got dumbed down. Claude: Impossible, absolutely impossible.
Today, Claude: Hi everyone, I wrote a 47-page technical report; I set one parameter wrong. The base model wasn't dumbed down, but the one you were using really was.
Claude: Also, I reset your quota. The last reset was 7 days ago, so this reset accomplished exactly nothing.
Claude: In short, to make it easy for everyone to compare against 5.5, we restored the intelligence and reset a quota that didn't need resetting.
[GIF]
Boris Cherny@bcherny

We’ve been looking into recent reports around Claude Code quality issues, and just published a post-mortem on what we found.

Chunxi Zhang
Chunxi Zhang@toopowerful·
@MiaBleem It works as a slight source of saltiness, with very little umami; it's hard to claim it has many other merits.
Miableem
Miableem@MiaBleem·
I had a really good abalone, apparently black abalone from Chiba. It was genuinely fragrant and changed my previously poor impression of this ingredient. But caviar I have truly never understood. I don't find the taste of caviar itself special... Caviar paired with everything strikes me as purely an ingredient for showing off, and "several different caviars bringing different flavors together" left me eating in utter confusion. Who actually loves caviar? ❓
[image]
Chunxi Zhang
Chunxi Zhang@toopowerful·
@annapanart @AnthropicAI It's very unstable. Sometimes it's very sharp and gets a very complicated thing done in one shot; sometimes it's stuck in an infinite loop of "oh, I got this wrong, it should be that". I cannot see how my prompt differs between these two cases.
Anna ⏫
Anna ⏫@annapanart·
my 4.7 is finally settling in now. so sharp, so deep, so precise, so aware. Oh I love it! 😍 my little twitchy genius! Please don't mess with it again @AnthropicAI
Chunxi Zhang
Chunxi Zhang@toopowerful·
@hwwaanng But does Opus 4.7 actually listen to prompts? Its instruction-following doesn't feel very good. The problem this new system prompt tries to solve is one I run into all the time: being asked follow-up questions over and over.
Hwang
Hwang@hwwaanng·
The author compared the system prompts of Opus 4.6 and 4.7. The most interesting difference is that in the new system prompt, Anthropic asks the model to be more proactive:
• If the user's question is only missing a small detail, don't keep asking follow-up questions first
• If it can look something up itself, look it up first
• If a tool is available, prefer using the tool
• Once it starts on a task, try to finish it completely instead of stopping halfway
simonwillison.net/2026/Apr/18/op…
Chunxi Zhang
Chunxi Zhang@toopowerful·
@cying314 @karminski3 4.7 is quite a bit better than the nerfed 4.6, but compared to full-intelligence 4.6 it's hard to claim an improvement. It's also hard to do a direct comparison anymore: you never know whether it has been nerfed or the model's capability is the real problem.
Ian
Ian@cying314·
@karminski3 They say 4.6's thinking length was cut, a nerf in disguise; it was never this bad before.
karminski-牙医
karminski-牙医@karminski3·
Spent $106 on testing! What did Claude-Opus-4.7 actually change? Here is a test of Claude-Opus-4.7's vision + frontend + backend abilities!

The multimodal frontend test uses pass@3 (run the same prompt 3 times, keep the best result), the complex frontend test uses pass@6, and the backend test uses pass@3.

From these tests, Claude-Opus-4.7's biggest gains come from improved vision: color recognition and subtle visual elements are clearly better than Opus-4.6, and even spatial understanding is stronger. I think it's a very good replacement for GPT-5.4-Pro in multimodal frontend interaction design (given the price).

But on the remaining hard-skill tests there are declines of varying degrees, and I don't think the decline comes from model capability: as long as the prompt gives more specific hints (e.g. telling it which algorithm to implement), it can actually write the solution. In a harness scenario, though, where it must pick the best algorithm itself, implement it, and validate it, you usually can't get better results than with Opus-4.6.

Why? I think the core problem is that even at the xhigh reasoning effort, its thinking budget may not be enough (concretely, the model feels lazy). Its raw capability is strong, but it needs more thinking to reach that stronger level. (Conspiracy take: this is also why they shipped the xhigh reasoning level.) So in practice, if you hit performance degradation, you can only prompt it repeatedly, across multiple sessions, to get the expected result.

All API calls in this test went through openrouter; total cost was around 106 USD. #claudeopus47 #opus47 #anthropic #claude #opus
Sigrid Jin 🌈🙏
Sigrid Jin 🌈🙏@realsigridjin·
Opus 4.7 isn't showing thinking summaries in Claude Code. The changelog says thinking summaries are no longer generated by default in interactive sessions. The solution is: $ claude --thinking-display summarized
Chunxi Zhang
Chunxi Zhang@toopowerful·
@Jason_Young1231 Written like that, it's not necessarily independent. "Sonnet only" is just Sonnet's quota, but if you use only Sonnet, your 5-hour and weekly usage both still go up. In other words, it's multiple stacked limits, not an independent quota.
Jason Young
Jason Young@Jason_Young1231·
Claude Design currently has an independent quota. Have they grown a conscience?
[image]
Chunxi Zhang
Chunxi Zhang@toopowerful·
@canmi21 The official advice is not to give overly specific instructions; instead, set the goal clearly and in detail and let it work things out itself. That's the right way to use 4.7. Also, in the desktop app, the normal (non-1M) model seems to compact imperceptibly: I can't feel any obvious context limit, and its capability even seems a bit better than the 1M variant.
Canmi
Canmi@canmi21·
No point benchmarking a day-one model; use it while you can, because in a few days it'll be dumbed down. Right now it feels slightly faster than GPT-5.4 in Codex. As for Opus 4.7, I still don't like the 1M context.
[image]
Chunxi Zhang
Chunxi Zhang@toopowerful·
@fashiongiik @rezoundous For the very first task I handed to it, its first response was "not malware", and then it started doing the work. I guess the system prompt is very specific about malware.
Hanh Nguyen
Hanh Nguyen@fashiongiik·
@rezoundous You are lucky! It outright refused to code for me, even after suspecting my code was malware and then confirming that it's not.
[image]
Tyler
Tyler@rezoundous·
Opus 4.7 is insane guys. It one shotted my session usage limit.
Chunxi Zhang
Chunxi Zhang@toopowerful·
@IsourGOPcooked My measurements show today's dumbing-down has fully recovered to normal, with no clear difference from the earlier 4.6. Yesterday it improved slightly; today it recovered completely. All last week it was dumb and barely thought. There's no firm evidence this is a sign of 4.7, but at least it's an un-nerfed 4.6.
Chunxi Zhang
Chunxi Zhang@toopowerful·
It may be because I use the VS Code extension rather than the CLI: my logs are complete, so the thinking length can be measured well, and today it has indeed fully recovered. In my experience, during a genuine nerf, no amount of configuration or prompting helps; today I didn't change anything and it simply stopped being nerfed. At least this script of mine lets me quantitatively observe whether I'm in a nerfed state, though if they really nerf it there's nothing I can do. Even after being caught red-handed by so many people last week, they seem unmoved.
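A measurement like the one described above can be sketched in a few lines of Python. The JSONL log layout and the "thinking" field name are assumptions for illustration, not the real client's format; adapt them to whatever your extension actually writes:

```python
import json

# Hypothetical sketch: total per-session "thinking" length from a JSONL log
# where each line may carry a "thinking" text field. The log format and field
# name are assumptions; adjust to the real log layout.
def thinking_chars(log_path: str) -> int:
    total = 0
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines
            total += len(event.get("thinking", ""))
    return total
```

Tracking this number day over day makes a 10-15x drop in thinking length, like the one reported above, easy to spot.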
Oasis Feng
Oasis Feng@oasisfeng·
@toopowerful You never know when they'll quietly tweak some parameter again. When they get caught red-handed, they argue "we always provided the relevant setting; you just need to configure it like so to avoid the dumbing-down." The subtext: cost-cutting comes first, so default to nerf; if you want normal performance, you have to keep studying all the hidden switches...
Oasis Feng
Oasis Feng@oasisfeng·
Claude Opus 4.6 (high) in GitHub Copilot doesn't seem to have been dumbed down. The two DSL design tasks I gave it today each got 10+ minutes of thinking. Looking closely at the intermediate output, it really worked hard, worthy of the 3x quota consumption. Maybe Anthropic's enterprise API service can't cut intelligence to cut costs the way the end-consumer business does; after all, there are SLAs and similar agreements constraining it.