Z.ai

1.1K posts

@Zai_org

The AI Lab behind GLM models, dedicated to inspiring the development of AGI to benefit humanity. https://t.co/7a5aSCUNcZ https://t.co/x14hb3klXm

[email protected] · Joined November 2023
258 Following · 79.6K Followers
Pinned Tweet
Z.ai @Zai_org
Scaling laws push model capability forward. But whether that capability becomes reliable in production depends on how we handle Scaling Pain. z.ai/blog/scaling-p…

In our latest blog, we share how we debugged GLM-5 serving at scale: reproducing rare garbled outputs, repetition, and rare-character generation; tracing and eliminating KV Cache race conditions; fixing HiCache synchronization issues; and introducing LayerSplit for up to 132% throughput improvement. We hope these lessons help the community avoid similar pitfalls and build more robust inference infrastructure.
40 replies · 81 reposts · 878 likes · 80.4K views
Z.ai reposted
Zixuan Li @ZixuanLi_
See you in Singapore. BTW, I'm starting to look more and more like our logo.
16 replies · 9 reposts · 123 likes · 16.3K views
Z.ai reposted
Z.ai for Startups @ZaiforStartups
GLM models are now live on @tensorix_ai. We’re partnering to bring cost-efficient frontier AI models to developers, startups, and enterprises across Europe and beyond — and to back the Sovereign AI ecosystem with serious inference muscle.

Four GLM models are now available:
• GLM-5.1 → SOTA open-source performance advancing long-horizon AI agents to new levels
• GLM-5 → new-generation base language model
• GLM-5-Turbo → agent-ready, built for coding and agentic use cases
• GLM-5v-Turbo → multimodal reasoning across code, images, documents, and diagrams

Go build something cool. Build with GLM on Tensorix: tensorix.ai
10 replies · 13 reposts · 265 likes · 19.7K views
Z.ai reposted
Zhihu Frontier @ZhihuFrontier
🧵 Slime: The Most Elegant & Comfortable RL Training Framework Ever
A deep dive into why Slime redefines LLM RL training with clean architecture & production-grade engineering ✨ Insights from Zhihu contributor Xavier

📌 What Is Slime in One Sentence?
Slime is a streamlined RL training framework built on SGLang (inference) + Megatron (training) + Ray (orchestration). It’s not just a simple stack — it stitches top-tier open-source projects together with perfectly polished interfaces. Core design philosophy: fully decouple training & inference, connected via a streamlined data flow.

Compared to veRL / OpenRLHF:
✅ Native SGLang backend → high concurrency, continuous batching, prefix caching (no messy vLLM wrapper)
✅ Native Megatron backend → full TP/PP/EP/CP parallelism, seamless MoE training
✅ Lightweight Ray scheduling → Placement Group + Remote Actor (no bloated Ray Train)

🏗️ Global Architecture: 3 Modules, One Pipeline
🖥️ Ray cluster core workflow: Data Buffer (Prompt Manager → Buffer & Filter) ↔️ Rollout (SGLang → sampling + RM scoring + filtering) ↔️ Training (Megatron → Actor/Critic + PPO/GRPO)

🔁 Simplified Core Training Loop
1. Allocate GPU resources via Placement Group
2. Launch the SGLang rollout engine
3. Initialize Megatron Actor/Critic models
4. Sync initial weights to SGLang
5. Repeat the 3-beat cycle: Generate (SGLang) → Train (Megatron) → Sync Weights
🎯 Elegance = ultra-simple top-level logic, all complexity encapsulated inside modules

🎛️ 4 Core Design Flexibilities
⚙️ Resource scheduling: colocate (shared GPUs) / disaggregate (separate GPU pools)
🔄 Training mode: synchronous / asynchronous
🧪 Sampling logic: standard sampling / over-sampling / multi-turn tool calling
🤖 Model type: Dense / MoE, full tensor/pipeline/context parallel support

🔧 Plug & Play Customization (All Extensible)
Slime lets you customize every component via CLI params — no need to fork the repo 🛠️ Key customization points:
✅ Custom reward model: write an async func to define your own reward logic (easiest entry)
✅ Custom generate func: control multi-turn dialogue, tool calling & external API integration
✅ Custom rollout func: fully take over sampling concurrency & filtering logic
✅ Custom data source: fetch prompts from an API / local files / dynamic data streams
✅ Dynamic filter: discard low-value sample groups (e.g., zero-variance GRPO samples)
✅ Custom loss function: rewrite PPO/GRPO loss calculation freely
All custom code loads dynamically via --custom-xxx-path config 📝

🚀 Ray GPU Scheduling Magic
Two deployment modes for all cluster scales:
🔹 Colocate mode: training & inference share GPUs → high utilization, ideal for small 8-card servers
🔹 Disaggregate mode: independent GPU pools → train-infer overlap, perfect for multi-node clusters
Slime stabilizes Ray Placement Group GPU mapping via IP/GPU ID sorting to guarantee reproducibility 🔒

⚡ SGLang Rollout Engine Internals
3-layer abstraction: RolloutManager → RolloutServer → ServerGroup → SGLangEngine
Standout design highlights:
🔸 Over-sampling + dynamic filter: pre-sample extra data, filter invalid groups on the fly
🔸 Async concurrent sampling: process completed groups immediately with FIRST_COMPLETED
🔸 Abort mechanism: stop redundant sampling once the target data size is met, saving compute
🔸 Singleton GenerateState: one-time tokenizer & connection initialization

🧠 Megatron Training Backend
Native support for mainstream RL algorithms:
✅ GRPO: no Critic needed, group-wise reward normalization (most popular)
✅ PPO: classic Actor-Critic with GAE advantage estimation
✅ REINFORCE++: token-level baseline optimization
Seamless support for Dense & large MoE models with full parallelism 📊

🔄 Weight Sync: The Hard Engineering, Solved
Two high-performance sync paths:
🔹 Colocate: IPC + Gloo → intra-node low-latency weight transfer
🔹 Disaggregate: NCCL Broadcast → cross-node distributed sync
MoE OOM prevention: chunked bucket weight update → sync parameters in small batches, release memory instantly 🧩

💡 Core Takeaways
✨ Slime’s elegance lies in integrating mature top-tier stacks with a clean, decoupled design
✨ Minimal top-level logic, maximal internal engineering depth
✨ Fully pluggable customization for all RL scenarios (Math / Code / Agent / MoE)
✨ Optimized for both small single-node & large multi-node clusters
🔗 Full article: zhuanlan.zhihu.com/p/203535706963…
#LLM #RLTraining #SGLang #AIInfrastructure #MoE #MachineLearning
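The "write an async func" customization hook suggests a shape like the following. This is a minimal sketch of a custom reward function for a math task; the function name, signature, and the boxed-answer convention are our illustrative assumptions, not taken from Slime's source.

```python
# Hedged sketch: a custom async reward function in the style of Slime's
# pluggable reward-model hook. Signature and names are assumptions.
import re

async def custom_reward(prompt: str, response: str, label: str) -> float:
    """Score a math rollout: 1.0 if the final \\boxed{...} answer matches the label."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0  # no final answer found; treat as incorrect
    return 1.0 if match.group(1).strip() == label.strip() else 0.0
```

Per the post, a function like this would be loaded dynamically through one of the --custom-xxx-path flags, so no fork of the repo is needed.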
1 reply · 6 reposts · 58 likes · 12.2K views
Z.ai @Zai_org
Technical highlights:

CogViT Vision Encoder - Built with dual-teacher distillation: SigLIP2 for semantics, DINOv3 for texture. A two-stage recipe (masked modeling, then contrastive pretraining), with QK-Norm for attention stability at scale.

Multimodal Multi-Token Prediction (MMTP) - Three ways to pass image tokens into the MTP head were compared. The chosen approach uses a shared token, removing the need to propagate visual embeddings across pipeline stages and improving training stability.

Broad Training Across Perception, Reasoning, and Agent Capability - Vision and language are fused from pre-training onward, with emphasis on multimodal code. Joint RL across 30+ task categories yields consistent gains with weaker cross-domain interference than SFT.

Multimodal RL at Scale - Infrastructure rebuilt along four axes: unified task and reward abstraction, full-pipeline asynchrony, fine-grained memory management for vision modules, and topology-aware partitioning for variable-length visual inputs.
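For readers unfamiliar with QK-Norm: it normalizes queries and keys per head before the attention dot product, bounding logit magnitude so attention stays stable at scale. Below is a minimal PyTorch sketch of that general technique; it is not CogViT's implementation, and all class and parameter names are our own.

```python
# Hedged sketch of QK-Norm inside multi-head attention (PyTorch).
# Illustrates the technique only; not CogViT's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Per-head normalization of queries and keys: the "QK-Norm" part
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.unbind(dim=2)            # each: (b, n, heads, head_dim)
        q, k = self.q_norm(q), self.k_norm(k)  # bound attention logits
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (b, heads, n, head_dim)
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(b, n, d))
```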
1 reply · 0 reposts · 28 likes · 5.5K views
Z.ai @Zai_org
GLM-5V-Turbo Tech Report: Toward a Native Foundation Model for Multimodal Agents

This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks.

arxiv.org/abs/2604.26752
32 replies · 133 reposts · 908 likes · 63.9K views
Z.ai @Zai_org
As models, contexts, and workloads grow, hidden assumptions in inference infrastructure can surface as output anomalies. Reliability requires more than throughput, latency, and availability. It also requires preserving the correctness of model state behind every generation.
2 replies · 1 repost · 38 likes · 5.9K views
Z.ai @Zai_org
After fixing correctness issues, we turned to the next bottleneck: prefill throughput and GPU memory pressure in long-context Coding Agent serving. To address this, we introduced LayerSplit, a layer-wise KV Cache storage scheme. Instead of duplicating all layers on every GPU, each GPU stores only a subset of layers. With communication overlapped by computation, LayerSplit improved throughput by up to 132%.
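The core of the idea is a partitioning policy over layers. The toy sketch below shows only that policy (no communication or compute overlap); the function name and the even contiguous-slicing choice are our assumptions, not Z.ai's design.

```python
# Hedged toy sketch of the layer-split idea: instead of every GPU holding
# the KV cache for all layers, each GPU owns a contiguous slice of layers.
def assign_layers(num_layers: int, num_gpus: int) -> dict[int, range]:
    """Map each GPU rank to the contiguous block of layers whose KV cache it stores."""
    per_gpu, extra = divmod(num_layers, num_gpus)
    assignment, start = {}, 0
    for rank in range(num_gpus):
        count = per_gpu + (1 if rank < extra else 0)  # spread the remainder evenly
        assignment[rank] = range(start, start + count)
        start += count
    return assignment

# Example: 92 layers over 8 GPUs -> each GPU caches 11 or 12 layers
# instead of all 92, roughly an 8x reduction in per-GPU KV memory.
print(assign_layers(92, 8))
```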
3 replies · 1 repost · 37 likes · 7.3K views
Z.ai @Zai_org
@deepseek_ai Really impressive work! If you need a higher rate limit to keep those evals moving forward, we are definitely here to support you.
29 replies · 29 reposts · 1.6K likes · 128.3K views
DeepSeek @deepseek_ai
🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length.

🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models.
🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice.

Try it now at chat.deepseek.com via Expert Mode / Instant Mode. API is updated & available today!

📄 Tech Report: huggingface.co/deepseek-ai/De…
🤗 Open Weights: huggingface.co/collections/de…

1/n
1.6K replies · 7.7K reposts · 45.3K likes · 9.7M views
Z.ai reposted
Zixuan Li @ZixuanLi_
Truly sorry for any confusion or frustration caused by unclear, misleading, or inappropriate rules in our moderation system and on our pages. OpenClaw, Hermes, and SillyTavern are now explicitly marked as supported under the GLM Coding Plan. Other general-purpose tools will be analyzed on a case-by-case basis.

A gentle reminder: please do not share your account or use your subscription as an API. If you follow the rules and still encounter 1313 errors, please reach out to us at user_feedback@z.ai.
63 replies · 31 reposts · 691 likes · 114.3K views
Z.ai @Zai_org
Fantastic to see GLM being applied to such fresh, dynamic scenarios.
Jifan Yu @yujifan_0326

Doing some stress tests on OpenMAIC’s Interactive Simulation with a DNA Replication case. 💻 Both powered by @Zai_org — with GLM-5.1 and GLM-5V-Turbo each generating these complex pedagogical simulations in real time. Can you spot the difference? The "Turbo" is catching up surprisingly well... Wait, this looks... different from what we had before? Stay tuned for April 20! 🚀 🔗 Live Demo: open.maic.chat ⭐️ GitHub: github.com/THU-MAIC/OpenM…

18 replies · 21 reposts · 337 likes · 38K views
Z.ai @Zai_org
GLM-5.1 Tool Calling Issue Fix & Chat Template Update

If you are running GLM-5.1 with vLLM/SGLang and using tool calling, please update your chat template. huggingface.co/zai-org/GLM-5.…

Issue: When using tool calling, frameworks including vLLM automatically convert plain-text tool message content into an array of content parts (`[{"type": "text", "text": "..."}]`) before passing it to the chat template. The original template only supported string-formatted tool content, causing array-formatted tool outputs to render empty. As a result, the model does not receive tool results and repeatedly triggers the same tool call in a loop.

Affected models: All GLM-5.1 variants deployed with vLLM or SGLang.

Fix: Simply replace your existing `chat_template.jinja` with the updated version from the repository.
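To make the failure mode concrete, here is an illustrative Python mirror of the branch the fixed template must handle: tool content arriving as either a plain string or a list of content parts. The helper name is hypothetical and this is not the actual Jinja code.

```python
# Hedged sketch of the content-shape problem the post describes: a fixed
# chat template must accept tool message content as either a plain string
# or a vLLM-style list of {"type": "text", ...} parts. Illustrative only.
def tool_content_to_text(content) -> str:
    """Flatten a tool message's content to the plain text the model should see."""
    if isinstance(content, str):
        return content  # old-style plain string passes through unchanged
    if isinstance(content, list):
        # array of content parts: concatenate the text parts
        return "".join(
            part.get("text", "") for part in content if part.get("type") == "text"
        )
    return ""

# The buggy template handled only the string branch, so array-formatted
# tool results rendered empty and the model re-issued the same call.
print(tool_content_to_text([{"type": "text", "text": "42"}]))  # -> "42"
```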
56 replies · 83 reposts · 1.1K likes · 104.3K views