
apaz
@apaz_cli
https://t.co/EYtS07MR7w Making GPUs go brrr



Hermes Agent wrote a novel. "The Second Son of the House of Bells" runs 79,456 words across 19 chapters. The agent built its own pipeline to do it, using the same modify-evaluate-keep/discard loop as @karpathy's Autoresearch but applied to fiction: world-building, chapter drafting, adversarial editing, Opus review loops, LaTeX typesetting, cover art, audiobook generation, and landing page setup. Book: nousresearch.com/bells Code: github.com/NousResearch/a…
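(The tweet doesn't include the agent's pipeline code. A minimal sketch of a modify-evaluate-keep/discard loop of the shape it describes might look like the following; the function names `propose_edit` and `score` are hypothetical stand-ins, not the Hermes Agent's actual API.)

```python
def modify_evaluate_loop(draft, propose_edit, score, steps=20):
    """Generic modify-evaluate-keep/discard loop (illustrative only).

    propose_edit(draft) -> a candidate revision of the draft
    score(draft)        -> float quality score, higher is better
    """
    best_score = score(draft)
    for _ in range(steps):
        candidate = propose_edit(draft)      # e.g. rewrite one chapter
        candidate_score = score(candidate)   # e.g. an adversarial/review pass
        if candidate_score > best_score:     # keep the edit if it improves the score
            draft, best_score = candidate, candidate_score
        # otherwise discard the candidate and keep the current draft
    return draft
```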



New Datology Research: We expose "The Finetuner's Fallacy." The standard approach to domain adaptation (pretrain on web data, finetune on your data) is leaving performance on the table. Mixing just 1-5% domain data into pretraining, then finetuning, produces a strictly better model:
◾ 1.75x fewer tokens to reach the same domain loss
◾ A 1B SPT model outperforms a 3B finetuned-only model
◾ +6pts MATH accuracy at 200B pretraining tokens
◾ Less forgetting of general knowledge
Tested across chemistry, symbolic music, and formal math proofs. SPT wins on every metric. Led by @_christinabaek and @pratyushmaini, with the full Datology team.
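(A minimal sketch of the kind of data mixing the tweet describes, assuming a streaming document sampler. The fraction, seed, and iterable names are illustrative; the actual SPT recipe is not specified in the tweet.)

```python
import random

def mixed_pretraining_stream(web_docs, domain_docs, domain_frac=0.03, seed=0):
    """Yield pretraining documents with a small fraction of domain data mixed in.

    domain_frac=0.03 falls in the 1-5% range quoted above; web_docs and
    domain_docs are illustrative iterables of text documents.
    """
    rng = random.Random(seed)
    web_it, dom_it = iter(web_docs), iter(domain_docs)
    while True:
        try:
            if rng.random() < domain_frac:
                yield next(dom_it)   # occasional domain document
            else:
                yield next(web_it)   # mostly generic web data
        except StopIteration:
            return
```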

I just realized a better way to compute zeropower_via_newtonschulz5() for Muon. Here's a blueprint for how to write a kernel. It scales to large matrices way better than people think it does. But unfortunately writing this is significantly beyond my skill level. Muon enjoyers: @kellerjordan0 @leloykun @kalomaze @Kimi_Moonshot @Yuchenj_UW @YouJiacheng
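(The tweet doesn't spell out the improved kernel. For context, a sketch of the widely circulated baseline zeropower_via_newtonschulz5 from the Muon optimizer is below: a quintic Newton-Schulz iteration that approximately orthogonalizes the gradient matrix. The coefficients follow the commonly cited reference implementation and should be treated as an assumption here, not the author's proposed improvement.)

```python
import torch

def zeropower_via_newtonschulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximate the semi-orthogonal 'zeroth power' of a 2D matrix G via a
    quintic Newton-Schulz iteration, as used in Muon (baseline sketch only)."""
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients from the reference implementation
    X = G.bfloat16()                    # bf16 as in the reference (intended for GPU matmuls)
    X = X / (X.norm() + eps)            # scale so the spectral norm is <= 1 and the iteration converges
    transposed = G.size(0) > G.size(1)
    if transposed:                      # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)
```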




AI labs will do literally anything except dynamic pricing. Inshallah Claude Opus 5 will be economically valuable enough for this.