
Happy llama3.1 day to those who celebrate
ishan

@0xishand
inference (dynamo + sglang) @nvidia | prev. @brevdev (acq.) @agora_io (acq.), @columbia | my views ≠ employer views


"Your job won't be taken by AI, it will be taken by Meek Mill using AI" - Jensen Huang

You can just do things in prime-rl: like teaching GLM5 to answer math in <2000 tokens, using 16 nodes for training and 12 nodes for inference in a 2P4D configuration, all with just uv run rl @ rl.toml ( @samsja19 told me I should tweet more things)

1/4 LLMs solve research-grade math problems but struggle with basic calculations. We bridge this gap by turning them into computers. We built a computer INSIDE a transformer that can run programs for millions of steps in seconds, solving even the hardest Sudokus with 100% accuracy


Useful for modding/reverse engineering Claude Code: CC is not open source, but the installed npm package contains a single minified JS file whose logic is readable to Claudes, who are very clever and know how this kinda stuff works.


Today I’m sharing a new research paper that explores a new idea in mixture-of-experts architectures called “DynaMoE”. DynaMoE is a Mixture-of-Experts framework where:
- the number of active experts per token is dynamic.
- the total number of experts can be scheduled differently across layers.
From my findings, the best model uses a descending expert schedule, where the early layers have the most experts and the final layer has the least (1 expert). This removes the rigid Top-K routing used in most MoE models and improves parameter efficiency and training stability. Paper: arxiv.org/abs/2603.01697
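The two ideas (a dynamic number of active experts per token, and a descending per-layer expert schedule) can be sketched in plain Python. The threshold-based router and the linear schedule below are illustrative assumptions on my part, not the paper's exact formulation:

```python
import math

def descending_expert_schedule(num_layers, max_experts):
    """Linearly decrease the expert count from max_experts at the
    first layer down to a single expert at the final layer."""
    last = num_layers - 1
    return [round(max_experts - (max_experts - 1) * l / last)
            for l in range(num_layers)]

def dynamic_route(router_logits, threshold=0.2):
    """Pick a *dynamic* number of experts per token: every expert whose
    softmax routing probability clears `threshold` (at least one)."""
    m = max(router_logits)                      # stabilize the softmax
    exps = [math.exp(x - m) for x in router_logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    chosen = [i for i, p in enumerate(probs) if p >= threshold]
    # Fall back to the single best expert if none clears the bar.
    return chosen or [max(range(len(probs)), key=probs.__getitem__)]

print(descending_expert_schedule(4, 8))      # [8, 6, 3, 1]
print(dynamic_route([2.0, 1.0, 0.1, 0.1]))   # [0, 1]
```

A confident token activates few experts while an ambiguous one activates many, which is the intuition behind dropping fixed Top-K.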



I guarantee that any industry expert, with a little time and effort, can make a better (or at least more focused) skill than the default Anthropic ones. This is not an insult to Anthropic; it's just a reminder that specialist experts know more about their jobs than AI labs do.



🚀 Our new blog: 1.53X over GB200 - Deploying DeepSeek on GB300 NVL72, with 226 TPS/GPU on long-context inference!

Together with @nvidia, we have achieved new milestones on GB300 NVL72 for 128K/8K long-context serving:
⚡ 226 TPS/GPU peak throughput (1.53X vs GB200)
🧠 1.87X TPS/User gain with MTP under matched throughput
💾 1.6X higher decode batch size via GB300's 288GB HBM3e
⏱ 8.6s TTFT for 128K prefill with dynamic chunked PP
🔧 1.35X faster FMHA kernel via 2x SFU softmax throughput on Blackwell Ultra

Powered by: PD disaggregation + Wide-EP + chunked PP + MTP overlap scheduling + FP8 attention, orchestrated with NVIDIA Dynamo @NVIDIAAIDev
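As a quick sanity check on the headline figures (my arithmetic, not from the blog): the 1.53X claim implies a GB200 baseline of roughly 148 TPS/GPU, and multiplying the per-GPU peak across the 72 GPUs in one NVL72 rack gives the aggregate number, assuming the per-GPU figure already reflects rack-scale serving:

```python
# Back-of-envelope check of the headline figures (illustrative arithmetic only).
gb300_tps_per_gpu = 226          # reported peak throughput on GB300 NVL72
speedup_vs_gb200 = 1.53

# Implied GB200 baseline from the 1.53X claim.
gb200_tps_per_gpu = gb300_tps_per_gpu / speedup_vs_gb200
print(f"implied GB200 baseline: {gb200_tps_per_gpu:.1f} TPS/GPU")  # ~147.7

# Aggregate throughput across one 72-GPU NVL72 rack at the peak figure.
rack_tps = gb300_tps_per_gpu * 72
print(f"per-rack peak: {rack_tps} TPS")  # 16272
```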