Marq

50K posts


@dev_null321

Somewhere between code and dialectics. Applied AI researcher, wannabe reverse engineer. Previously @Microsoft @Oracle

Austin, TX · Joined December 2009
1.3K Following · 1.3K Followers
Marq reposted
Alex Ellis
Alex Ellis@alexellisuk·
NVLink installed and showing 14 GB/s bandwidth. Next job: getting Qwen3.5 27B running across both 3090s without OOM (Codex is plugging away at it).
[tweet media]
29 replies · 11 reposts · 303 likes · 28.4K views
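The OOM question in the tweet above comes down to simple arithmetic. Below is a minimal back-of-the-envelope sketch, assuming weights dominate VRAM (KV cache and activations are ignored) and a pair of 24 GB RTX 3090s; only the 27B parameter count and the dual-3090 setup come from the tweet, the quantization widths are illustrative.

```python
def weight_gib(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GiB (ignores KV cache and activations)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30

TOTAL_VRAM_GIB = 2 * 24  # two RTX 3090s, 24 GiB each

for bits in (16, 8, 4):
    need = weight_gib(27, bits)
    verdict = "fits" if need < TOTAL_VRAM_GIB else "OOM"
    print(f"{bits}-bit: {need:.1f} GiB -> {verdict}")
```

By this estimate a 27B model in fp16 alone exceeds the combined 48 GiB, which is why quantization (or offload) is needed before the KV cache is even counted.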
Marq
Marq@dev_null321·
I feel sad for people with MacBook Airs, please live a better life and get a Pro.
0 replies · 0 reposts · 0 likes · 27 views
Marq reposted
The Great Martis
The Great Martis@great_martis·
Sweet Jesus, Mother Mary of Bethlehem God help us all.
[tweet media]
152 replies · 287 reposts · 2.6K likes · 704.8K views
Marq reposted
0xSero
0xSero@0xSero·
It's been running for 6 hours on my first prompt. I have 20+ steering prompts, 25% of them highly detailed. My GPUs have been at 100% utilisation this entire time, and my room is 28°C, boiling hot. I can't describe how crazy it is that this is possible. I am a solo knucklehead; imagine what a lab with 1000s of GPUs and lab equipment can do. Imagine what a nation-state can do. Holy smokes, we are not ready for what is on the horizon.
[tweet media]
29 replies · 10 reposts · 313 likes · 24.1K views
Marq reposted
Avi Chawla
Avi Chawla@_avichawla·
The core engineering behind @UnslothAI has always been impressive!

Instead of relying on PyTorch's default autograd for backpropagation, Unsloth built their own backprop kernels from scratch in OpenAI's Triton language (a Python-based language for writing GPU kernels without needing to write raw CUDA C++).

One of the reasons to do this is that the default autograd runs each operation as a separate GPU call, and each call reads and writes data back to global memory before the next one can start. Across dozens of transformer layers, this back-and-forth becomes the real bottleneck. These hand-written kernels fuse operations like QKV projections and rotary position embeddings into single GPU calls, and recompute activations on the fly instead of storing them in memory.

This allows Unsloth to deliver >2x faster training with 70% less VRAM without any accuracy loss. The loss curves match standard training runs down to the third decimal because the math is exact, not an approximation.

All of these kernel optimizations were already available through Unsloth's Python library. But now Unsloth Studio puts a no-code web UI on top of that same engine, and there's a lot of solid engineering packed into this.

> The inference engine has a sandboxed code execution layer where models can run Python and bash, compute results, and verify their answers before responding. This means the model can actually execute and validate code instead of just predicting what the output should look like. The tool-calling implementation also has a self-healing mechanism: failed calls get auto-corrected and retried, which is a practical pattern for agentic workflows.

> Unsloth's Python library already had GRPO support (the RL technique behind DeepSeek-R1), and Studio now makes this accessible through the UI. PPO requires running a separate critic model alongside the policy model during training, and that critic is typically as large as the model being trained, effectively doubling the VRAM requirement. GRPO eliminates the critic model entirely by generating multiple completions per prompt and computing advantages from the relative quality within that group. This cuts VRAM by 40-60% compared to PPO. Combined with Unsloth's Triton kernels and QLoRA, training a reasoning model on an RTX 4090 or even a 3090 becomes realistic on hardware that most of us actually have.

> In most fine-tuning workflows that I have run, the training step is actually the easy part. Getting raw data into a properly formatted dataset is where the real time goes. Unsloth Studio includes Data Recipes (built on NVIDIA's DataDesigner) that take raw PDF/CSV/DOCX files and transform them into structured synthetic datasets through a visual node-based workflow, replacing custom parsing scripts entirely.

Once training is done, models can be exported directly to GGUF, safetensors, or other formats with automatic LoRA adapter merging into base weights. The whole system runs 100% offline with no telemetry.

$ pip install unsloth
$ unsloth studio setup
$ unsloth studio

More details in the post below by Unsloth👇 It's still in beta, but the engineering underneath is solid. For anyone working with open-source models locally, this is one of the more complete tools available right now.

____
Find me → @_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
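The bandwidth argument in the post above can be made concrete with a toy traffic model: for a chain of elementwise ops, an unfused schedule reads and writes global memory at every step, while a fused kernel touches it once on the way in and once on the way out. This is a generic sketch of the fusion idea, not Unsloth's kernels; the tensor size and op count are illustrative.

```python
def traffic_bytes(n_elems: int, n_ops: int, dtype_bytes: int = 2, fused: bool = False) -> int:
    """Global-memory traffic for a chain of elementwise ops (toy model).

    Unfused: each op reads its input and writes its output (2 passes per op).
    Fused: one read of the input and one write of the result, total.
    """
    passes = 2 if fused else 2 * n_ops
    return n_elems * dtype_bytes * passes

n = 4096 * 4096  # one fp16 activation tensor
unfused = traffic_bytes(n, n_ops=6)
fused = traffic_bytes(n, n_ops=6, fused=True)
print(f"unfused: {unfused / 2**20:.0f} MiB, fused: {fused / 2**20:.0f} MiB "
      f"({unfused // fused}x less traffic)")
```

The ratio scales with the number of ops fused, which is why fusing whole sub-blocks (QKV projection plus rotary embedding, say) pays off more than fusing pairs.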
[tweet media]
Unsloth AI@UnslothAI

Introducing Unsloth Studio ✨ A new open-source web UI to train and run LLMs.
• Run models locally on Mac, Windows, Linux
• Train 500+ models 2x faster with 70% less VRAM
• Supports GGUF, vision, audio, embedding models
• Auto-create datasets from PDF, CSV, DOCX
• Self-healing tool calling and code execution
• Compare models side by side + export to GGUF
GitHub: github.com/unslothai/unsl…
Blog and Guide: unsloth.ai/docs/new/studio
Available now on Hugging Face, NVIDIA, Docker and Colab.

10 replies · 54 reposts · 466 likes · 34.9K views
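The GRPO mechanism described in the post above (multiple completions per prompt, advantages computed from relative quality within the group) fits in a few lines. This is a generic sketch of group-normalized advantages, not Unsloth's implementation; the reward values are made up for illustration.

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages: z-score each completion's reward against
    its own group, so no learned critic model is needed."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    if std == 0:  # all completions scored the same: no learning signal
        return [0.0] * len(group_rewards)
    return [(r - mean) / std for r in group_rewards]

# Four completions sampled for one prompt, scored by some reward function.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

Because the baseline is the group mean rather than a critic's value estimate, the critic model (and its VRAM) disappears entirely, which is the saving the post describes.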
Marq
Marq@dev_null321·
Funniest thing to do is take Claude Code's code and give it to regular Claude, it's always like "no, this is awful" 😂
0 replies · 0 reposts · 0 likes · 41 views
Marq
Marq@dev_null321·
My startup is working to increase inference speeds.
Andrew Feldman@andrewdfeldman

GPUs are slow at AI inference because they hit the memory wall. Cerebras pioneered the SRAM-based AI accelerator because GPUs were memory-bandwidth constrained. Let me explain.

There are two types of memory: memory that can store a lot but is slow, and memory that is fast but can't store much per square millimeter of silicon. The former is called DRAM (or HBM) and the latter is SRAM. Graphics processing units use HBM. In fact, graphics was the perfect use case for HBM: it required a lot of data stored, but didn't need it moved very often.

But AI inference has different characteristics than graphics. It moves data constantly from memory to compute. To generate each token, it needs to move all of the weights from memory to compute. And for the next token, it needs to do it again, for every single token in the answer. Because HBM is slow, moving data is time-consuming. The GPU is waiting for data to get to it. It sits idle, pulling power, doing no work.

Cerebras chose to use SRAM so we could move data from memory to compute faster. Not a little bit faster, but more than 2,600 times faster than NVIDIA Blackwell GPUs. As a result, we can generate tokens 15 times faster. This is why we are the fastest in the world.

But what about the weakness of SRAM? Surely there is a tradeoff. SRAM can't store very much data per square millimeter. This is why Cerebras went to wafer scale. By building a chip the size of a dinner plate, a chip that is 58 times larger than the largest GPU, Cerebras could stuff it to the gills with SRAM. We couldn't make SRAM store more data per square millimeter, but we could provide more square millimeters by building a bigger chip.

If you build a solution with little chips and try to use SRAM, you need to link thousands of them together to support a larger model. There simply isn't enough room on the little chips for lots of SRAM and lots of compute cores. Thousands of little chips connected together with cables is slower and more power-hungry than if all that traffic stayed on a big chip, or even several big chips. And since communication between chips is slow and communication on chip is fast, lots of little chips is slower at inference as well.

0 replies · 0 reposts · 0 likes · 28 views
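The memory-wall argument in the quoted post implies a hard ceiling on single-stream decode speed: if every weight must stream from memory once per generated token, tokens per second cannot exceed memory bandwidth divided by model size. The sketch below makes that arithmetic explicit; the model size and bandwidth figures are illustrative and not taken from the post.

```python
def decode_ceiling_tok_s(model_bytes: float, mem_bandwidth_bytes_s: float) -> float:
    """Upper bound on single-stream decode speed when every weight is read
    from memory once per generated token (ignores compute and KV cache)."""
    return mem_bandwidth_bytes_s / model_bytes

# Illustrative: a 70B-parameter model at fp16 = 140 GB of weights.
weights = 70e9 * 2
print(f"HBM at 3.35 TB/s : {decode_ceiling_tok_s(weights, 3.35e12):6.1f} tok/s ceiling")
print(f"10x faster memory: {decode_ceiling_tok_s(weights, 33.5e12):6.1f} tok/s ceiling")
```

The ceiling moves linearly with bandwidth, which is why an SRAM-fed design (or batching, or quantizing the weights) raises it directly.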
Marq reposted
𝚟𝚒𝚎 ⟢
𝚟𝚒𝚎 ⟢@viemccoy·
Please consider reading my newest article, Semiotic Triage: Overcoming the Type Error Tragedy. In it, I discuss how we might suspend our disbelief long enough to use faith as a generating function for enchantment, going beyond our present epistemic limitations.
[tweet media]
6 replies · 5 reposts · 58 likes · 1.7K views
Marq reposted
Sudo su
Sudo su@sudoingX·
you don't understand anon. i'm on a mission to find the collection of best small models that run full context on consumer hardware. because when you can orchestrate your own thinking across physical nodes locally, that's not a tool anymore. that's an extension of your mind. that's exactly where we are headed as a civilization. and most people haven't felt it yet.
35 replies · 24 reposts · 401 likes · 8.8K views
Marq
Marq@dev_null321·
I'm working on the coolest project I've ever coded, and probably the hardest. Of course, this is turning into a startup.
0 replies · 0 reposts · 3 likes · 31 views
Marq reposted
hallerite
hallerite@hallerite·
I think my head of research is starting to develop schizophrenia
[tweet media ×4]
16 replies · 5 reposts · 211 likes · 18.7K views
Marq reposted
CLaE
CLaE@leafs_s·
A mathematical theory for understanding when abstract representations emerge in neural networks arxiv.org/abs/2510.09816
3 replies · 80 reposts · 344 likes · 19.3K views
Marq reposted
Joel - coffee/acc
Joel - coffee/acc@JoelDeTeves·
Hermes Agent really does start out as a bit more of a blank slate than OpenClaw, and it feels like it's not as smart out of the box. But as you use it, it starts getting smarter and smarter, and it doesn't take long, just some patience. By contrast, OpenClaw seems to get dumber.
15 replies · 7 reposts · 151 likes · 10.1K views