ishan
@0xishand
inference (dynamo + sglang) @nvidia | prev. @brevdev (acq.) @agora_io (acq.), @columbia | my views ≠ employer views
San Francisco, CA · Joined April 2022
1.6K Following · 2.2K Followers
839 posts
ishan retweeted
samsja
samsja@samsja19·
Today we're releasing prime-rl v0.5.0. This is a major release, with 200+ commits from 22 contributors since v0.4.0. On the menu:
* PD-DisAgg inference to boost agentic RL training
* support for GLM-5, Qwen3.5, and Nemotron
* a complete revamp of environment execution for better performance
* first-class multi-node Slurm support directly from the config
* quack kernel, selective AC, and more
We also added several new guides to the docs, including large MoE agentic training guides. And that's alongside many more bug fixes and improvements.
samsja tweet media
ishan retweeted
Casper Hansen
Casper Hansen@casper_hansen_·
today i learned sglang has a cookbook that maxes your performance and lets you easily configure the latest models based on hardware. some recipes need more options, but this is a solid start
Casper Hansen tweet media
ishan
ishan@0xishand·
@casper_hansen_ 100%. But this is hard to do because everyone’s workload is so different. As an example, InferenceX has 5+ configs for ISL/OSL 1k/1k (random data) depending on what you’re trying to optimize for (interactivity, tpt, etc.). The SGL cookbook + InfX is a good place to start.
Casper Hansen
Casper Hansen@casper_hansen_·
every inference engine should have a section in their docs with exact commands to achieve the best possible tokens/s on the most popular models. i'm told kimi k2.5 can run at 300 tokens/s on B200s if you run nvfp4 with speculative decoding in open-source
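As a sketch of what such a docs recipe might look like: a Python launcher around SGLang's server entry point. The model id, quantization choice, and speculative-decoding flags below are illustrative assumptions, not a published recipe; benchmark on your own hardware.

    import subprocess

    # Hypothetical "max tokens/s" recipe in the spirit of the tweet above.
    # Flag values are assumptions; check the SGLang docs for what your
    # engine version and checkpoint actually support.
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", "moonshotai/Kimi-K2.5",  # hypothetical repo id
        "--quantization", "modelopt_fp4",        # assumes an NVFP4 checkpoint
        "--speculative-algorithm", "EAGLE",      # assumes a draft model is available
        "--tp-size", "8",
    ]
    subprocess.run(cmd, check=True)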
ishan retweeted
Cursor
Cursor@cursor_ai·
Earlier this week, we published our technical report on Composer 2. We're sharing additional research on how we train new checkpoints. With real-time RL, we can ship improved versions of the model every five hours.
Cursor tweet media
ishan retweeted
H
H@hcompany_ai·
🚀 The H Company Tech Stack: Part 1 We are excited to launch a new series of technical deep dives into the AI Tech Stack powering H Company. Over the coming weeks, we’ll be sharing how we build, scale, and optimize the infrastructure behind our Holo frontier models. First up: Unlocking Online RL and AI Workflows on K8s using SkyPilot. (1/5🧵)
H tweet media
ishan retweeted
MeekMill
MeekMill@MeekMill·
I need a GitHub too! Is it like that or nah?
ishan retweeted
ishan
ishan@0xishand·
@BasilMakesRagu Have you done any analysis on how running Bonsai affects your conversation cache hits?
Basil
Basil@BasilMakesRagu·
Yep! I used Claude to implement Context Bonsai in Claude Code. Context Bonsai is a pair of tools that lets the LLM edit its own context: replace message ranges with a summary (prune), and retrieve the removed messages later if needed.
j⧉nus@repligate

Useful for modding/reverse engineering Claude Code: CC is not open source, but the installed npm package contains a single minified JS file whose logic is readable to Claudes, who are very clever and know how this kinda stuff works.

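A minimal sketch of the prune/retrieve pair Basil describes, with messages held in a plain list; the function names, summary format, and id scheme here are illustrative assumptions, not Context Bonsai's actual interface:

    # Toy version of the two tools described above: prune a message range down
    # to a summary placeholder, and retrieve the originals later if needed.
    # Names and structure are assumptions, not Context Bonsai's real API.
    pruned_store = {}  # prune_id -> original messages

    def prune(messages, start, end, summary, prune_id):
        """Replace messages[start:end] with a single summary message."""
        pruned_store[prune_id] = messages[start:end]
        placeholder = {"role": "user",  # role choice is arbitrary in this toy
                       "content": f"[pruned #{prune_id}] {summary}"}
        return messages[:start] + [placeholder] + messages[end:]

    def retrieve(prune_id):
        """Bring back the messages that were pruned under this id."""
        return pruned_store.get(prune_id, [])

One design note relevant to the cache question upthread: prefix caches match on exact prefixes, so rewriting earlier messages invalidates cached KV from the edit point onward; pruning trades cache hits for context headroom.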
ishan retweeted
ishan
ishan@0xishand·
@TheZachMueller DeepGEMM really shines at higher bs (middle + top of the Pareto curve). This is also why it pairs well with wide expert parallelism. However, SGLang still cooks at lower bs for MiniMax 2.5, especially on the hardware you mentioned 🙂
Zach Mueller
Zach Mueller@TheZachMueller·
A small PSA: if you're using vLLM, you might find SGLang is faster on H100s and B200s. A little rabbit hole + some help from the vLLM folks and we figured out it's because vLLM would choose DeepGEMM on some models, which isn't the best (Triton is). Set VLLM_USE_DEEP_GEMM=0!
Zach Mueller tweet media
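A minimal way to apply the PSA from Python: set the variable before vLLM is imported so the backend choice picks it up. The model id is just an example; whether Triton actually beats DeepGEMM depends on model, batch size, and GPU, so measure both.

    import os

    # Disable vLLM's DeepGEMM path, per the tweet above.
    os.environ["VLLM_USE_DEEP_GEMM"] = "0"

    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # example model, swap in your own
    out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
    print(out[0].outputs[0].text)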
ishan retweeted
corsaren
corsaren@corsaren·
This has been on my mind for a while now so I'm going to rant on product design for a sec: what @AnthropicAI needs, most of all, is a Cowork onboarding process. We have an intelligent, natural-language assistant as the core product and we're still using tooltips and popups to explain how the app works? Please. Be serious.

When you first download Cowork you should be presented with a blank canvas and just the words "Hello". Big letters. Whole screen. Greetings from the latent space. After a second, a bar appears for you to type a response (though you are strongly encouraged to use voice). Claude then explains that it will be your virtual coworker, secretary, and thought partner. Its job is to make your life easier, automating the stuff you like the least to give you more time to do the high-impact work. Introduce Claude's personality.

Claude then warns that in order to get this right, it will need to conduct 1-2 hours of deep, back-and-forth interviewing with you so it can best understand your job, your tools, your organization, and your goals. Don't have time right now for that? Okay, let's book time on your calendar. I can send you an invite! Better yet, connect me to your email/calendar system right now so I can create the event directly. I'll walk you through how to do that.

When you do find time, Claude interviews you in great detail, following a pre-built guide (perhaps with branching flows depending on the job function). What is your job? What are your responsibilities? How do you work? What projects are you currently working on? What are your biggest pain points? Where can Claude provide the most leverage?

Claude then recommends integrations, plugins, and skills, and walks you through the setup of each one. Next, Claude also makes a Growth Plan: a list of skills that it should build/customize with you in the future to distill your particular ways of working and habits.

I honestly wouldn't even start with the default skills. They're generic and long, and frankly, I don't want Claude following directions that I haven't read or discussed. Maybe use them to demonstrate an example if the user asks, but otherwise it's unnecessary overhead. The only exception is for guardrail instructions (e.g., "don't use placeholder values for sensitivity tables"). Those can be imported.

Then, when you ask Claude to do a task that is related to a skill on the Growth Plan, it flags this, and asks for coaching on how you want it to perform this task. This becomes its own interview flow where Claude asks some basic questions, asks for and helps find examples, and workshops the approach. The final output is a customized, personalized skill.

These interview flows are then the sorts of thing the Cowork team should ship in the backend. So instead of a generic financial analysis skill, it's a "financial analysis skill-building interview guide" detailing what Claude needs to ask the user about in order to build a robust, personalized financial modeling skill (e.g., "ask what sort of cell and number formats they use").

Finally, only after you've agreed on the Growth Plan (which is just a cross-session list of action items, not the final custom skills themselves), Claude then suggests one activity to work on to get started. And boom, you're off to the races.

Claude should also schedule weekly feedback reviews on your calendar where the two of you assess what it did for you this week, where it performed poorly, where it can actually do more than what you're asking it to do right now, and how to improve generally.
The team already announced some of this today (plugin/skill customization, etc.), but imo it really needs to be a cohesive, E2E, multi-session, iterative flow. The user shouldn't have to navigate lists of plugins and skills to get the most out of the robot; the robot should help the user customize itself.
Ethan Mollick@emollick

I guarantee that any industry expert, with a little time and effort, can make a better (or at least more focused) skill than the default Anthropic ones. This is not an insult to Anthropic; it's just a reminder that specialist experts know more about their jobs than AI labs do.

ishan retweeted
Standard Intelligence
Standard Intelligence@si_pbc·
Computer use models shouldn't learn from screenshots. We built a new foundation model that learns from video like humans do. FDM-1 can construct a gear in Blender, find software bugs, and even drive a real car through San Francisco using arrow keys.
GIF
Jarred Sumner
Jarred Sumner@jarredsumner·
feature request for GitHub’s gh cli: a subcommand for agents monitoring PRs
- view unresolved pr review comments as markdown or xml with file:line
- show failing gh action logs, maybe filtered to near the error
- show lint errors from gh actions
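Until something like that exists, an agent can get close by shelling out to subcommands gh already ships. A rough sketch; the PR number and run id are placeholders:

    import subprocess

    def sh(args):
        return subprocess.run(args, capture_output=True, text=True).stdout

    # Inline review comments (these carry path + line) via the REST API;
    # gh fills {owner}/{repo} from the current repository.
    review_comments = sh(["gh", "api", "repos/{owner}/{repo}/pulls/123/comments"])

    # CI status for the PR, then the failing log lines for a specific run.
    checks = sh(["gh", "pr", "checks", "123"])
    failed_logs = sh(["gh", "run", "view", "456", "--log-failed"])

    print(review_comments, checks, failed_logs, sep="\n---\n")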
ishan retweeted
JT
JT@jiratickets·
This is what's happening inside your repo when Claude is 10 mins deep in a "thinking..." sesh
ishan
ishan@0xishand·
It was an absolute pleasure working with @baizhou_zh83925 and the rest of the incredibly cracked @lmsysorg SGLang team on this. And it’s awesome to see Dynamo being used as the orchestrator for these sorts of large-scale P/D workloads. Stay tuned for what’s next 😉
LMSYS Org@lmsysorg

🚀 Our new blog: 1.53X over GB200 - Deploying DeepSeek on GB300 NVL72, with 226 TPS/GPU on long-context inference! Together with @nvidia, we have achieved new milestones on GB300 NVL72 for 128K/8K long-context serving:
⚡ 226 TPS/GPU peak throughput (1.53X vs GB200)
🧠 1.87X TPS/User gain with MTP under matched throughput
💾 1.6X higher decode batch size via GB300's 288GB HBM3e
⏱ 8.6s TTFT for 128K prefill with dynamic chunked PP
🔧 1.35X faster FMHA kernel via 2x SFU softmax throughput on Blackwell Ultra
Powered by: PD disaggregation + Wide-EP + chunked PP + MTP overlap scheduling + FP8 attention, and orchestrated with NVIDIA Dynamo @NVIDIAAIDev
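For readers new to the acronym soup: PD disaggregation runs prefill and decode on separate workers so each side can be batched and scaled independently, with the KV cache handed off between them. A toy illustration of that flow only; none of this is Dynamo's or SGLang's actual interface:

    import queue
    import threading
    import time

    # Toy prefill/decode disaggregation: a prefill worker builds the KV cache
    # for a request, then hands it to a decode worker that would stream tokens.
    # In a real deployment the KV cache moves over NVLink/RDMA, not a queue.
    prefill_q, decode_q = queue.Queue(), queue.Queue()

    def prefill_worker():
        while True:
            req = prefill_q.get()
            req["kv"] = f"kv({req['prompt']})"  # stand-in for real KV blocks
            decode_q.put(req)

    def decode_worker():
        while True:
            req = decode_q.get()
            print(f"decoding request {req['id']} using {req['kv']}")

    threading.Thread(target=prefill_worker, daemon=True).start()
    threading.Thread(target=decode_worker, daemon=True).start()
    prefill_q.put({"id": 1, "prompt": "hello"})
    time.sleep(0.5)  # let the toy pipeline drain before exiting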
