callightman

359 posts

@CallightmanCom

Joined May 2013

124 Following, 40 Followers

callightman@CallightmanCom·22h

So low-code is completely dead, then?

callightman@CallightmanCom·3d

@gsh1152539 @ZhiLiank83gzx How come I run into you everywhere? It really is a woman; stop being stubborn about it.

2.5K

callightman@CallightmanCom·11 Mar

@loryoncloud I don't have high expectations; I just wanted to hook it up to Immersive Translate…

Lory@loryoncloud·11 Mar

@CallightmanCom Your model is running way too…

Lory@loryoncloud·9 Mar

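#ClaudeCode #LLM #OpenClaw A rare recommendation post: oMLX, a great tool for cutting your token usage and speeding up concurrent responses (Apple M-series chips only). oMLX is an open-source MLX inference server designed for Apple Silicon. Its core breakthrough is an SSD-backed paged cache that fixes the fatal pain point of running local LLMs on a Mac. With a tool-call-heavy wrapper like OpenClaw, traditional tools such as LM Studio have to recompute a 20k+ token system prompt on every call (the equivalent of rereading a ten-thousand-character essay). Its real innovations are:
1. Tiered cache architecture: hot data lives in RAM and cold data on SSD (LRU policy), so even a Mac Mini with 16 GB of RAM can run multiple instances.
2. Smart prefix caching: identical system prompts are stored only once, and different user sessions share the base cache.
3. Paged caching: prompts are split into pages for storage, so when dynamic content changes only the changed part is recomputed.
Measured results show it can speed up OpenClaw responses 5-10x and supports 8x concurrency. You can manage it right from the menu bar, and it is reasonably smooth to use. If you can put it to good use, or have your lobster tweak it to fit your own setup better, I think it will definitely make for a better experience; users running multiple lobsters / Agents Teams in particular should be able to fix slow responses under high concurrency. GitHub: github.com/jundot/omlx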

284

35.5K
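
The tiered cache and shared-prefix ideas in the oMLX post above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not oMLX's actual implementation: the class `TieredPrefixCache`, its methods, and the use of `pickle` files as a stand-in for SSD pages are all hypothetical.

```python
from collections import OrderedDict
import hashlib
import os
import pickle
import tempfile


class TieredPrefixCache:
    """Hypothetical sketch: hot entries stay in RAM, LRU entries spill to disk."""

    def __init__(self, ram_capacity, spill_dir):
        self.ram_capacity = ram_capacity  # max number of entries kept in RAM
        self.spill_dir = spill_dir        # directory standing in for the SSD tier
        self.hot = OrderedDict()          # key -> value, least recently used first

    @staticmethod
    def key_for(prefix_tokens):
        # Identical prefixes hash to the same key, so they are stored only once.
        return hashlib.sha256(repr(list(prefix_tokens)).encode()).hexdigest()

    def _path(self, key):
        return os.path.join(self.spill_dir, key + ".pkl")

    def put(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)  # mark as most recently used
        while len(self.hot) > self.ram_capacity:
            old_key, old_value = self.hot.popitem(last=False)  # evict the LRU entry
            with open(self._path(old_key), "wb") as f:
                pickle.dump(old_value, f)

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        path = self._path(key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                value = pickle.load(f)
            os.remove(path)
            self.put(key, value)  # promote the cold entry back into RAM
            return value
        return None  # cache miss: the caller would have to recompute the prefix
```

A caller would compute `key_for(system_prompt_tokens)` once per request; sessions that share the same system prompt get the same key and therefore reuse a single cached entry, whether it currently sits in RAM or on disk.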

callightman@CallightmanCom·11 Mar

@0xdizz Try liaobots; it costs a little over 20% of the list price.

161

0x叮当猫@0xdizz·11 Mar

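My relay service got more expensive recently. Can anyone recommend a good Claude relay? Before long, poor people won't even be able to afford top-tier models. What then? It feels like it's going to be a worse world.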

2.2K

callightman@CallightmanCom·11 Mar

@loryoncloud M3 Max, 128 GB, running Qwen 9B and 27B

Lory@loryoncloud·11 Mar

@CallightmanCom You could check the issues on GitHub. It's also possible your machine is running a model that doesn't match its specs.

332

callightman@CallightmanCom·9 Mar

@llqoli I can't believe this ranking didn't put my post with over a thousand comments near the top.

Ronnie W.@llqoli·9 Mar

Using Codex on your phone is a more comfortable way to get tasks done.

869

callightman@CallightmanCom·9 Mar

@terrywang @muuyuu666 How does this compare with Qwen3.5 9B itself?

terrific@terrywang·9 Mar

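@muuyuu666 The core problem is that Qwen3.5-4B doesn't have enough parameters. If you can, keep the apps running on that Mac Mini to a minimum to free up memory for Qwen3.5-9B; even at 4-bit or 6-bit it will give better results. Remember to use MLX rather than llama.cpp on Apple Silicon; llama.cpp is for Linux, Windows, and other platforms that can't use MLX. I recommend this one: huggingface.co/Jackrong/MLX-Q…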

322

terrific@terrywang·9 Mar

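llama.cpp's effective bandwidth utilization on the same machine: 48.57 t/s x 4.21 GB/t = 204.4 GB/s. llama.cpp's Q8_0 and Q8_K_XL formats use a specific block structure, so during the decode phase the GPU's SIMD units have to dequantize on the fly. MLX's 8-bit quantization format, by contrast, maps closely onto Apple MPS native instructions, so it can use efficient hardware-level instructions to dequantize almost instantly and feed the results straight to the compute units.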

terrific@terrywang

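Take the Qwen3.5 4B dense model as an example: 4.21B parameters with 8-bit quantization, so each parameter takes 1 byte. The model data that must be read to generate each token (data per token) is therefore at least about 4.21 GB. MLX's effective bandwidth utilization: 89.28 t/s x 4.21 GB/t = 375.8 GB/s, which is 91.4% of the 410 GB/s physical memory-bandwidth limit of the lower-tier M4 Max (32-core GPU) and 68.9% of the limit of the top-spec 40-core version.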

5.3K
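
The bandwidth arithmetic in the two posts above can be reproduced with a short script. This is a sketch under stated assumptions: the function names are made up for illustration, and the 546 GB/s figure for the 40-core M4 Max is Apple's published memory bandwidth, which the posts do not state explicitly. Recomputing gives roughly 91.7% and 68.8%, a few tenths of a point off the percentages quoted in the post.

```python
def effective_bandwidth_gb_s(tokens_per_s, gb_per_token):
    """Bytes that must stream from memory per second during decode, in GB/s."""
    return tokens_per_s * gb_per_token


def utilization(effective_gb_s, peak_gb_s):
    """Fraction of the peak memory bandwidth actually used."""
    return effective_gb_s / peak_gb_s


GB_PER_TOKEN = 4.21    # 4.21B parameters x 1 byte each (8-bit), from the post
PEAK_32_CORE = 410.0   # GB/s, M4 Max with 32-core GPU, from the post
PEAK_40_CORE = 546.0   # GB/s, M4 Max with 40-core GPU (assumption: Apple's spec)

mlx = effective_bandwidth_gb_s(89.28, GB_PER_TOKEN)
llama_cpp = effective_bandwidth_gb_s(48.57, GB_PER_TOKEN)

print(f"MLX:       {mlx:.1f} GB/s")
print(f"llama.cpp: {llama_cpp:.1f} GB/s")
print(f"MLX vs 32-core peak: {utilization(mlx, PEAK_32_CORE):.1%}")
print(f"MLX vs 40-core peak: {utilization(mlx, PEAK_40_CORE):.1%}")
```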

callightman@CallightmanCom·9 Mar

@terrywang I spent today trying to run the MLX version of Qwen3.5 with ollama and still haven't gotten it working… Do you know of a simpler way to get it running?

callightman@CallightmanCom·8 Mar

@Erwinminion @geekbb It's mainly aimed at HTML and logs; getting those two right is enough.

Erwin@Erwinminion·8 Mar

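@geekbb The root cause of high token consumption isn't long text; it's that the agent's perception pipeline is badly designed: it starts by cramming tens of thousands of lines of raw DOM and full logs straight into the context. Adding a middle layer for text compression is just spraying perfume on a mountain of garbage code. The real endgame is to replace the outdated perception layer and use a high-fidelity physical environment for precise data extraction, rather than finding fancy ways to compress garbage text.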

1.5K

Geek@geekbb·8 Mar

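Can it really save this many tokens? It adds a proxy layer in front of Claude Code that filters, compresses, and restructures output, then sends the shorter result into the model's context. The headline benefit is roughly 60% to 90% lower token consumption; it is implemented in Rust with zero dependencies and less than 10 ms of added overhead.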

Pedram Amini@pedramamini

Such a simple idea... RTK is a CLI wrapper for reducing the token cost generated from all of your favorite tools: github.com/rtk-ai/rtk Depicted are my personal savings from a few days of running the tool. I'm seeing friends with >60% efficiency scores. If you're curious about the other, few, global changes I make to my Claude Code setup, well read this blog my AI wrote about it: pedsidian.pedramamini.com/Claude/Blog/20… Covers LSP, memory, and more.

181

48.2K

callightman@CallightmanCom·4 Mar

@MingtaKaivo @JoshKale How did you get to 140 tps? At this size it should theoretically only manage about 20 tps, right?

Mingta Kaivo 明塔开沃@MingtaKaivo·3 Mar

@JoshKale ran Qwen 30B on M4 Max at 140 tok/s. if M5 delivers 4x that bandwidth, local 70B inference stops being a science project and becomes a daily workflow

9.9K
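
The gap between the "about 20 tps" estimate and the reported 140 tok/s can be explored with a simple bandwidth ceiling: decode speed is bounded by memory bandwidth divided by the bytes read per token. The sketch below is illustrative only. The function name is hypothetical, the 546 GB/s bandwidth is Apple's published figure for the 40-core M4 Max, and the 3B-active case assumes a mixture-of-experts variant, which the post does not state.

```python
def decode_ceiling_tps(bandwidth_gb_s, active_params_b, bits_per_param):
    """Upper bound on decode tokens/s if every active weight is read once per token."""
    gb_per_token = active_params_b * bits_per_param / 8  # billions of params -> GB
    return bandwidth_gb_s / gb_per_token


BANDWIDTH = 546.0  # GB/s, assumption: M4 Max with 40-core GPU

# Dense 30B model: all 30B parameters are read for every token.
print(f"30B dense, 8-bit: {decode_ceiling_tps(BANDWIDTH, 30, 8):.1f} tok/s")
print(f"30B dense, 4-bit: {decode_ceiling_tps(BANDWIDTH, 30, 4):.1f} tok/s")

# Hypothetical MoE case: only about 3B parameters are active per token.
print(f"3B active, 8-bit: {decode_ceiling_tps(BANDWIDTH, 3, 8):.1f} tok/s")
print(f"3B active, 4-bit: {decode_ceiling_tps(BANDWIDTH, 3, 4):.1f} tok/s")
```

Under these assumptions a dense 30B model tops out in the range of tens of tokens per second, while a model that reads only about 3B parameters per token has a ceiling well above 140 tok/s.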

Josh Kale@JoshKale·3 Mar

Apple just made the most important chip design change in Apple Silicon history. The M5 Pro and M5 Max are now 2 chips fused together. They broke the CPU and GPU apart onto separate pieces of silicon and reconnected them using the same packaging tech that goes into data center AI chips.
The clever part: the M5 Pro and M5 Max use the EXACT SAME CPU. Identical. The only difference is how much GPU they bolt on:
- 20 cores for Pro
- 40 for Max
Like Lego blocks for silicon.
Why this matters: smaller chips are cheaper to make, run cooler under heavy loads, and, most importantly, Apple can now scale the GPU up or down without redesigning the whole chip. This is how they'll build the Ultra. This is how they'll build whatever comes after.
Everyone's going to talk about speed today. The real story is that Apple just changed how it builds chips and this is the foundation for the next decade of Apple Silicon.

Marques Brownlee@MKBHD

Finally, new M5 Pro and M5 Max Macbook Pros: apple.com/newsroom/2026/…

243

3.8K

589.7K

callightman@CallightmanCom·4 Mar

@cnfinancewatch Is there any way to optimize this on a Mac?

327

华尔街观察 Xtrader@cnfinancewatch·4 Mar

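Tongyi Qianwen's new Qwen3.5-35B-A3B has pushed single-GPU long-context inference to a new level. It is a 35-billion-parameter model that activates only 3 billion parameters per generated token, using a gated delta hybrid architecture: 256 experts with only 8 called at a time, and attention computed only once every 4 layers, so it is inherently light on compute and VRAM. The most striking part is long-context performance: a single RTX 3090 with 24 GB of VRAM runs the full 262K context at 112 tokens/s, and speed barely drops from 4K to 262K context! With a traditional 35B model, long context makes the KV cache balloon and speed collapse, and 24 GB of VRAM simply can't cope. In this model only 10 of the 40 layers use regular attention; the other 30 are recurrent layers with fixed memory use, so VRAM usage barely changes however long the context gets. At the full 262K context, total VRAM use is only 22.4 GB, comfortably within a 24 GB card. Even more impressive is the community effort: in 5 days, across 15 GPUs, peak speed reached 176 tok/s. Within 48 hours, all kinds of GPUs posted high scores:
• Before tuning: about 50 tok/s by default
• After tuning: doubled outright, with the 5090 reaching 176 tok/s
It comes down to 5 settings: offload all layers to the GPU, compress the KV cache, max out the context, trim the recurrent state, and enable Flash Attention. Performance changed qualitatively overnight. In one sentence: one consumer GPU, 24 GB of VRAM, and zero API fees are enough to run a 35B-class model with 262K context at around 100 tok/s. This isn't a minor upgrade; it's a milestone breakthrough for local AI inference.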

150

36.1K

callightman@CallightmanCom·4 Mar

@Jimmy_JingLv The thing is, there's no longer any need for a desktop app.

1.5K