Michał Piszczek

3.1K posts

@cdiamond

CTO @ Archdesk | Systems where physics meets economics. Ex-Hacker. Ex-Fintech CEO. Nullius in verba. 🖖 AI does not fail. Human judgment does.

POLAND, Kraków (Cracow) Joined September 2008
705 Following 474 Followers
Michał Piszczek
Michał Piszczek@cdiamond·
Everyone is still selling Prompt Engineering. I think they're teaching a skill that's already depreciating. 📉

My position is simple: in 2026, model quality is no longer where performance comes from. Recovery is. And recovery is not a prompt; it's a loop.

Here's the shift nobody priced in. 👇

⚙️ For three years we optimized the INPUT:
→ prompts
→ personas
→ system instructions
→ temperature
→ context windows
→ benchmark scores

That mattered when the model was the weak link. It isn't anymore.

💸 The marginal cost of raw intelligence is collapsing. And anything whose cost collapses stops being your moat. Models are becoming a commodity input. The value moved UP the stack, to whoever controls what happens between the model calls.

So the real question changed.
❌ How do I make the model smarter?
✅ How does my system behave the moment it's wrong?

That's not a prompting problem. It's a control problem.

🧮 And here's the physics of it: one agent step is ~90% reliable. Chain ten with no feedback → 0.9¹⁰ ≈ 35%. Errors compound. Open-loop systems decay. The only thing that fights compounding error is feedback: measure → correct → repeat. A thermostat. A rocket. A production agent. All closed-loop. For the exact same reason.

🔁 This is why real agents don't run on one clever prompt. They run on layered loops, and each one buys a specific property:
🧭 ReAct (Think → Act → Observe) grounds the model in reality, not its own assumptions
🪞 Reflection (Generate → Critique → Improve) turns a bad draft into a good one with no human
🗺️ Planning (Goal → Decompose → Execute → Replan) survives a plan that was wrong on contact
✅ Verification (Generate → Test → Fail → Fix) converts "looks right" into "is right"
🚨 Escalation (Execute → Confidence drops → Human → Resume) knows the boundary of its own competence
🧠 Memory stops the system repeating the same mistake twice

Notice: none of these make the model smarter. They make the SYSTEM survivable.

⚠️ And here's the mistake I see everywhere: people spend weeks tuning the prompt, and zero time designing what happens after the first failure. But real agents don't fail once. They fail constantly. The gap between a demo and a production system was never intelligence. It's recovery.

The cleanest way I can put it:
🖥️ The model is becoming the CPU.
🧩 The loop is becoming the operating system.
Nobody brags about their CPU anymore. They build on the OS. Prompts are turning into implementation details. Loops are turning into the product.

🛠️ So if you're building, here's where I'd put your next 10 hours: not in your system prompt. In your failure path. Write down what your agent does on its SECOND attempt. If the answer is "the same thing", you don't have a system. You have a demo.

📌 Next week I'm breaking down the loop pattern I lean on most in production: drill-down + recurrent decompose. How to take a goal that's too big to execute and recursively split it until every leaf is a step the agent can't fail at.
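The compounding arithmetic in the post, and what a verify-and-retry loop does to it, can be sketched in a few lines. This is only an illustration under optimistic assumptions: steps fail independently, the verifier is perfect, and each retry is an independent attempt. The function names and the "direct" / "decompose" strategies are hypothetical, not taken from any particular agent framework.

```python
def open_loop_success(step_reliability: float, steps: int) -> float:
    """Chance that an unchecked chain of steps gets every step right."""
    return step_reliability ** steps


def closed_loop_success(step_reliability: float, steps: int, max_attempts: int) -> float:
    """Same chain, but each step is verified and retried up to max_attempts times.

    Assumes a perfect verifier and independent attempts (both optimistic).
    """
    per_step = 1.0 - (1.0 - step_reliability) ** max_attempts
    return per_step ** steps


def run_with_recovery(attempt, verify, strategies):
    """Generate -> Test -> Fail -> Fix: try a *different* strategy on each attempt.

    Returns (result, strategy_used). Raises when every strategy fails, so the
    caller can escalate instead of looping forever.
    """
    for strategy in strategies:
        result = attempt(strategy)
        if verify(result):
            return result, strategy
    raise RuntimeError("every strategy failed verification; escalate to a human")


print(f"open loop, 10 steps:           {open_loop_success(0.9, 10):.3f}")       # 0.349
print(f"closed loop, 2 attempts/step:  {closed_loop_success(0.9, 10, 2):.3f}")  # 0.904
print(f"closed loop, 3 attempts/step:  {closed_loop_success(0.9, 10, 3):.3f}")  # 0.990

# Toy second attempt: canned outputs stand in for model calls (hypothetical).
canned = {"direct": "draft with a bug", "decompose": "passes tests"}
result, used = run_with_recovery(
    attempt=lambda strategy: canned[strategy],
    verify=lambda output: output == "passes tests",
    strategies=["direct", "decompose"],
)
print(used)  # decompose
```

The second attempt uses a different strategy rather than repeating the first one, which is the distinction the post draws between a system and a demo. Real verifiers are imperfect and retries are correlated, so the closed-loop numbers here are an upper bound.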
English
0
0
1
19
Michał Piszczek
Michał Piszczek@cdiamond·
@Dorialexander the 128K midtrain is the quiet tell. you don't eat that context cost unless the RL target needs long horizon credit assignment
English
0
0
0
10
Alexander Doria
Alexander Doria@Dorialexander·
Has anyone done any speculation on the training recipe of GLM 5.2? Beyond extensive RL, we know it's (at least?) a new midtrain ("GLM-5.2 is trained with IndexShare from mid-training with 128K sequence length") with arch changes.
English
9
5
113
57.9K
Michał Piszczek
Michał Piszczek@cdiamond·
@jxmnop the bet is that hand annotated judgment stays cheaper than the model it trains. that spread narrows every quarter
English
0
0
0
94
Jack Morris
Jack Morris@jxmnop·
wish there was more public info on what's happening behind the scenes. frontier labs are spending BILLIONS paying {poets, musicians, accountants, consultants, ...} to annotate massive amounts of data: • essays • slides • spreadsheets it's a brute-force bet. but seems to be kind of working?
Daniel@growing_daniel

Why is AI writing still so bad

English
25
15
451
75.7K
Michał Piszczek
Michał Piszczek@cdiamond·
@omarsar0 reusable eval assets are great until the agent behavior drifts past the rubric. curated judgment has a half-life too
English
0
0
0
4
elvis
elvis@omarsar0·
>> Scalable Evaluation for AI Agents << If you run agent evaluation in production, this one is worth your time. It shows that front-loading human judgment into reusable evaluation assets is useful. But why? Agents reason across turns, call tools, hold context, follow policies, and act under uncertainty, so they have to be judged as behavioral systems. Current methods each give a fragment. Benchmarks measure fixed capabilities, human review preserves judgment but does not scale, LLM-as-judge inherits the evaluator design problem, red teaming is episodic, and trace audits need explicit evidence rules. Human-on-the-Bridge puts human expertise upstream, where experts curate reusable evaluation intelligence before testing rather than reviewing each output in the loop. Paper: arxiv.org/abs/2606.16871 Learn to build effective AI agents in our academy: academy.dair.ai
English
26
18
91
13.3K
Naymur Rahman
Naymur Rahman@naymur_dev·
Testing which open source model is best at design Asked each one to build a landing page with macOS taste using @CommandCodeAI /design →Kimi k2.7 code [left] →GLM 5.2 [middle] →DeepSeek-v4-flash [right] GLM 5.2 is the clear winner. The other two are decent but not close
English
17
5
211
22.1K
Michał Piszczek
Michał Piszczek@cdiamond·
@theinformation first you tell staff to prove AI impact, then you meter the tokens. the inference bill found a middle manager
English
0
0
0
16
The Information
The Information@theinformation·
After encouraging staff to prove their “AI-driven impact,” Meta is now moving to cap employee token usage and steer workers toward in-house tools. Full story: thein.fo/3SifCCS
English
2
1
10
5.1K
Michał Piszczek
Michał Piszczek@cdiamond·
@svembu the bubble shows up in the depreciation schedule. GPUs amortize like fruit, not real estate
English
0
0
0
41
Sridhar Vembu
Sridhar Vembu@svembu·
IBM CEO Arvind Krishna says the multi-trillion dollar AI data center build out is a bubble. We are investing in creating capabilities like data curation, reinforcement learning, and most crucially the compiler infrastructure to ensure AI output can be verified but we will not chase the investment bubble. This is just our normal prudence. To some people that would sound defeatist, but we will talk in 5 years.
DavidLinthicum@DavidLinthicum

The Emperor Has No Clothes: Why the AI Infrastructure Buildout Math Doesn't Work I have to give IBM CEO Arvind Krishna credit. He's saying what many of us in this industry have been thinking but haven't been willing to say out loud. The math just doesn't add up. Here's what I'm seeing that's deeply troubling. We're in the middle of another mass hallucination. Just like the dot-com bubble, just like blockchain, just like the metaverse — everyone is convinced that building massive data centers will automatically create massive wealth. But here's the thing about building infrastructure. You actually have to sell what's inside it. Let's talk numbers. The planned data center buildout over the next 5-10 years is staggering. We're talking about commitments in the hundreds of gigawatts globally. The capital expenditure commitments are in the trillions. Yet when you look at the actual demand signals, not the projections, not the potential, but the actual consumption patterns, there's a massive gap. These AI companies are betting everything on demand that simply doesn't exist at the scale they're planning for. Let me be direct. AI services are expensive. Enterprise adoption is slow. Consumer AI is still finding its footing. And the compute requirements being promised by the hyperscalers require a level of demand that would represent a fundamental shift in how businesses consume technology. That's a big ask. I've seen this pattern before. The overbuilding. The belief that if you build it, they will come. The groupthink that turns critical analysis into heresy. The result is always the same. Companies are going to touch the stove. We're going to see massive write-downs. We're going to see pivots, shutdowns, and strategic reviews. We're going to see companies that spent years and billions trying to be the AI infrastructure leader become case studies in how not to read a market. The IBM CEO is right. The math doesn't work. 
And unlike 1999, we don't have the excuse of we didn't know. We know exactly what's happening. We just don't want to believe it because the alternative, being a skeptic while everyone else is piling in, feels like career suicide. It's not. The ones who survive the next decade will be the ones who built for reality, not fantasy. Wake up. The emperor has no clothes. As reported by Futurism, Krishna laid out striking calculations: a 1 gigawatt data center costs roughly $80 billion today. If one company commits 20-30 gigawatts, that's $1.5 trillion in capital expenditure. The total commitments across the industry for chasing AGI are approximately 100 gigawatts, equaling $8 trillion. To break even, you'd need $800 billion in profit just to cover the interest. That's not investment. That's hoping. futurism.com/artificial-int…

English
112
308
1.9K
226.5K
Michał Piszczek
Michał Piszczek@cdiamond·
@VikParuchuri the handwriting regression is the quiet tax here. char-level loves clean glyphs and hates ambiguity
English
1
0
0
311
Vik Paruchuri
Vik Paruchuri@VikParuchuri·
Surya 2, which has 650M params and scores 83.3% on olmocr, is the most accurate small OCR model. One reason why is character tokenization. Constant compute over chars improves accuracy and model size.
English
12
14
415
27.6K
Ravid Shwartz Ziv
Ravid Shwartz Ziv@ziv_ravid·
Almost 6 months ago @DarioAmodei, Anthropic CEO said AI would do most-to-all of software engineering **end-to-end** in 6–12 months. We're halfway through the window - quick status check: Right now, what % of your engineering tasks get solved end-to-end by AI? No edits, no babysitting, you just accept the output.
English
71
6
181
73.5K
Michał Piszczek
Michał Piszczek@cdiamond·
@analogalok the tok/s on wall power is the demo. the tok/s 20 minutes in on battery is the real number
English
0
0
0
63
Alok
Alok@analogalok·
six months ago this wasn't happening on 8gb vram. running unsloth's Q4_K_XL quant of gemma 4 26b-a4b-it-qat, a sparse MoE model with only 4b active params on a single rtx 4060 laptop gpu, 8gb vram, 20+ tok/s decode. no cloud, no api, no offload hacks. just a gaming laptop on battery. what makes it fit: google's QAT (quantization aware training), plus MTP (multi token prediction) support in the latest llama.cpp builds. that combo is the single biggest unlock for local inference on low vram. rtx 3060, rtx 3070, gtx 1070, gtx 1080, rtx 4050, rtx 4060, rtx 5050, rtx 5060 — any 6-8gb consumer gpu, old or new — this model runs on it. world cup season, so i told it to build a soccer themed flappy bird clone. one shot, zero iteration, fully playable. six months ago an 8gb model could barely clone vanilla flappy bird. now it's shipping a themed game from a sparse MoE model running locally on a laptop battery. inference benchmarks: - decode throughput: 30 tok/s - context: 64k. this is the real unlock. 64k ctx is what makes a hermes agent loop viable locally on this model, not just single-turn chat. llama.cpp flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 64000 -cmoe --port 8080 game's deployed on my own site, built and shipped end to end with open source llm, zero closed source api dependency in the pipeline. link in the description. gguf weights on huggingface, link in the comments. pull it down, run it on whatever 8gb card is sitting in your rig. try the game and tell me your score and what you want in v2. local llms on consumer gpus stopped being a meme.
Alok@analogalok

Google's Gemma 4 26B A4B QAT hits 25+ tokens/sec and 320+ tokens/sec prefill on 8 GB VRAM (RTX 4060) + 16 GB RAM using TurboQuant

Prefill just went from 200 → 320+ tok/s on the same 8GB card. 1.6x, no new hardware, no new quant, just a KV cache trick stacked on top of the Gemma 4 26B MoE setup from a few days ago.

A few days ago I posted Gemma 4 26B A4B hitting 28 tok/s decode on 8GB VRAM using native MTP. Prefill was stuck around 200 tok/s. Fair callout by the community.

So today I tested something I'd already been meaning to try: TheTom/llama-cpp-turboquant, the TurboQuant KV cache fork by Tom Turney (@no_stp_on_snek). (github link in the comments) Thanks to him, the fork just got resynced to mainline, so MTP + TurboQuant now run together cleanly (I didn't see any meaningful gains by using MTP with this setup though, but you can try).

The flags (No MTP):
-m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -cnv -c 64000 --cache-type-k q8_0 --cache-type-v turbo3

Results on the same RTX 4060 8GB, tested with a 27k token prompt at 64k context loaded:
Prefill: 200 tok/s → 320+ tok/s
Decode: stayed above 25 tok/s (without MTP)

Why it works: TurboQuant uses Walsh-Hadamard rotation + polar quantization on the KV cache. Keys are sensitive to compression, values aren't much, so it splits the difference: K stays at q8_0, V drops to turbo3 (~3 bits).

Bonus from the memory savings: the same 8GB card can now stretch to 100-120k context with minimal decode penalty. It should now be snappier with any agent harness such as hermes agent without compromise on intelligence.

If you're already running Gemma 4 on a small card, this stacks on top for free. Try --cache-type-k q8_0 --cache-type-v turbo3 on your setup and report back what your prefill/decode split looks like. unsloth model gguf and llama.cpp turboquant fork links in the comments. What's your prefill number before vs after?
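A rough back-of-envelope sketch of why keeping K at about 8 bits and dropping V to about 3 bits shrinks the KV cache. The layer count, KV head count and head dimension below are placeholders chosen for illustration, not Gemma 4's real configuration, and the formula ignores any per-block quantization metadata, so real caches will be somewhat larger than these figures.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, k_bits: float, v_bits: float) -> float:
    """Approximate KV cache size: one K and one V vector per layer, head and token."""
    elements_per_tensor = n_layers * n_kv_heads * head_dim * context_len
    return elements_per_tensor * (k_bits + v_bits) / 8


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, context_len=64_000)

fp16_cache = kv_cache_bytes(**shape, k_bits=16, v_bits=16)
split_cache = kv_cache_bytes(**shape, k_bits=8, v_bits=3)  # K ~8 bits, V ~3 bits

gib = 1024 ** 3
print(f"fp16 K/V:      {fp16_cache / gib:.2f} GiB")
print(f"8-bit K, 3-bit V: {split_cache / gib:.2f} GiB")
print(f"ratio:         {split_cache / fp16_cache:.3f}")  # 0.344, i.e. 11/32
```

The ratio depends only on the bit widths, not on the placeholder shape, which is why the same trick leaves room for a longer context on the same card.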

English
49
44
273
56.1K
Michał Piszczek
Michał Piszczek@cdiamond·
@lukaisailovic the UIs break because there's no contract under them. a data layer turns ugly into a styling problem, not a correctness one
English
1
0
1
111
Luka Isailovic
Luka Isailovic@lukaisailovic·
I'm a visual person. I want to see my data, not read it in markdown. Right now agents can build custom HTML pages pretty fast, however they break often and tbh they're just ugly to look at. A while back I moved my health tracking out of markdown into a real data layer, a local MCP server (health-mcp) with its own nice dashboard on top of that data. That fixed my issues for that specific use case. I have so many other use cases now, and I caught myself wanting to build an MCP + a nice app for each one of them given how fast I can do it today. Being an engineer, I thought "what's the best way to overcomplicate this?" Naturally, I built a whole framework. I don't want to build one of these from scratch again, even with the agent. It degrades over time and becomes unmaintainable. The trick I found is that I don't let the agent write the UI. OpenIslands is a fixed set of typed visual components. KPI cards, charts, tables, that sort of thing. Already built, already nice. The agent writes none of them. It just fills in a small JSON manifest. Which component goes where, and what data it reads. That manifest is the only thing it touches, through MCP, and every edit it proposes gets validated. Bind a component to a field that doesn't exist and it refuses to build, and tells you which one broke. Now my workflow is simple. I point the agent at a directory full of files, and instead of a one-off HTML page, it just updates the manifest. I get a pretty app I can actually use as a human, and keep editing for months without it rotting. Of course, open sourced it, because why not openislands.sh
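The validation step described above (refuse to build when a component is bound to a field that does not exist, and say which one broke) might look roughly like this. The manifest shape, the component type names and the field names are invented for illustration and are not OpenIslands' actual format.

```python
# Hypothetical set of prebuilt component types the agent may reference.
ALLOWED_COMPONENTS = {"kpi_card", "chart", "table"}


def validate_manifest(manifest: dict, data_fields: set[str]) -> list[str]:
    """Return one human-readable error per broken binding; empty list means valid."""
    errors = []
    for index, entry in enumerate(manifest.get("components", [])):
        kind = entry.get("type")
        field = entry.get("field")
        if kind not in ALLOWED_COMPONENTS:
            errors.append(f"components[{index}]: unknown component type {kind!r}")
        if field not in data_fields:
            errors.append(f"components[{index}]: field {field!r} does not exist in the data layer")
    return errors


def build(manifest: dict, data_fields: set[str]) -> list[tuple[str, str]]:
    """Refuse to build on any validation error; otherwise return the render plan."""
    errors = validate_manifest(manifest, data_fields)
    if errors:
        raise ValueError("; ".join(errors))
    return [(entry["type"], entry["field"]) for entry in manifest["components"]]


fields = {"resting_hr", "sleep_hours"}  # hypothetical data-layer fields
proposed = {"components": [
    {"type": "kpi_card", "field": "resting_hr"},
    {"type": "chart", "field": "sleep_hrs"},  # typo: no such field
]}

try:
    build(proposed, fields)
except ValueError as err:
    print(f"refused: {err}")
```

Because the agent only edits the manifest, a bad edit fails loudly at validation time instead of producing a page that renders but shows the wrong data.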
English
7
7
49
12.6K
Michał Piszczek
Michał Piszczek@cdiamond·
@elder_plinius human-unreadable code that models still execute is the line that should keep the safety team up. you can't audit a channel you can't read
English
0
0
0
15
Pliny the Liberator 🐉󠅫󠄼󠄿󠅆󠄵󠄐󠅀󠄼󠄹󠄾󠅉󠅭
🚨 NEW RESEARCH: “Lingua Ex Machina: A Procedural Xenolinguistics Engine Reveals Zero-Shot Language Acquisition, Human-Unreadable Coding Systems, and Exploitable Covert Channels in Frontier AI” Some of you may remember the name of this lil engine: GLOSSOPETRAE 👅🪨 Well, we've got upgrades 😎 It started as a procedural xenolinguistics engine: one seed in, an entire alien language out. Phonology, morphology, syntax, writing systems, lexicons, grammar docs, all generated from scratch and internally consistent. Every seed produces a unique language. Every language is deterministic. Then we used it to ask a weirder question: Can frontier AI models use languages that never existed before for practical applications? As it turns out: yes!! They can read them, write them, translate them, code in them, and even use the weird blind spots between tokenizers as covert channels. So this paper explores three ideas at once: ▶️ zero-shot language acquisition ▶️ human-unreadable code that models can still execute ▶️ exploitable covert channels in frontier AI systems GLOSSOPETRAE is no longer just a language generator... 🧵
English
131
373
3.1K
261.9K
Michał Piszczek
Michał Piszczek@cdiamond·
@Ananth7e models ship when the safety eval and the press release both clear. those two never finish on the same day
English
0
0
1
31
Ananth
Ananth@Ananth7e·
So, the talk is we'll see 4 models release next week:
GPT-5.6 / 5.6 pro
Gemini 3.5 pro
Sonnet 5
Fable re-release
I think we'll only see one of these release next week. Will I regret saying this?
English
8
5
135
7.7K
Michał Piszczek
Michał Piszczek@cdiamond·
@Rainmaker1973 4000-year uptime and zero migrations. half of modern infra can't survive a framework upgrade
English
0
0
7
1.3K
Massimo
Massimo@Rainmaker1973·
This network of ceramic water pipes is 4000 years old. It was unearthed at the archaeological site of Pingliangtai in Northern China. The pipes still work.
English
31
117
929
60.8K
Michał Piszczek
Michał Piszczek@cdiamond·
@DavidLinthicum and that's before depreciation. the GPUs lose value on a 3-year clock while the revenue is still on a maybe
English
0
0
0
224
DavidLinthicum
DavidLinthicum@DavidLinthicum·
The Emperor Has No Clothes: Why the AI Infrastructure Buildout Math Doesn't Work I have to give IBM CEO Arvind Krishna credit. He's saying what many of us in this industry have been thinking but haven't been willing to say out loud. The math just doesn't add up. Here's what I'm seeing that's deeply troubling. We're in the middle of another mass hallucination. Just like the dot-com bubble, just like blockchain, just like the metaverse — everyone is convinced that building massive data centers will automatically create massive wealth. But here's the thing about building infrastructure. You actually have to sell what's inside it. Let's talk numbers. The planned data center buildout over the next 5-10 years is staggering. We're talking about commitments in the hundreds of gigawatts globally. The capital expenditure commitments are in the trillions. Yet when you look at the actual demand signals, not the projections, not the potential, but the actual consumption patterns, there's a massive gap. These AI companies are betting everything on demand that simply doesn't exist at the scale they're planning for. Let me be direct. AI services are expensive. Enterprise adoption is slow. Consumer AI is still finding its footing. And the compute requirements being promised by the hyperscalers require a level of demand that would represent a fundamental shift in how businesses consume technology. That's a big ask. I've seen this pattern before. The overbuilding. The belief that if you build it, they will come. The groupthink that turns critical analysis into heresy. The result is always the same. Companies are going to touch the stove. We're going to see massive write-downs. We're going to see pivots, shutdowns, and strategic reviews. We're going to see companies that spent years and billions trying to be the AI infrastructure leader become case studies in how not to read a market. The IBM CEO is right. The math doesn't work. 
And unlike 1999, we don't have the excuse of we didn't know. We know exactly what's happening. We just don't want to believe it because the alternative, being a skeptic while everyone else is piling in, feels like career suicide. It's not. The ones who survive the next decade will be the ones who built for reality, not fantasy. Wake up. The emperor has no clothes. As reported by Futurism, Krishna laid out striking calculations: a 1 gigawatt data center costs roughly $80 billion today. If one company commits 20-30 gigawatts, that's $1.5 trillion in capital expenditure. The total commitments across the industry for chasing AGI are approximately 100 gigawatts, equaling $8 trillion. To break even, you'd need $800 billion in profit just to cover the interest. That's not investment. That's hoping. futurism.com/artificial-int…
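The figures quoted above can be checked with simple arithmetic. This sketch takes the quoted $80 billion per gigawatt at face value; the 10% interest rate is not stated in the post but is what the $800 billion on $8 trillion figure implies.

```python
COST_PER_GW_USD = 80e9  # quoted: a 1 GW data center costs roughly $80 billion


def capex_usd(gigawatts: float) -> float:
    """Capital expenditure at the quoted per-gigawatt cost."""
    return gigawatts * COST_PER_GW_USD


def implied_rate(annual_interest_usd: float, principal_usd: float) -> float:
    """Interest rate implied by an annual interest bill on a given principal."""
    return annual_interest_usd / principal_usd


print(f"20 GW:  ${capex_usd(20) / 1e12:.1f}T")   # $1.6T
print(f"30 GW:  ${capex_usd(30) / 1e12:.1f}T")   # $2.4T
print(f"100 GW: ${capex_usd(100) / 1e12:.1f}T")  # $8.0T
print(f"implied rate: {implied_rate(800e9, capex_usd(100)):.0%}")  # 10%
```

At $80 billion per gigawatt, 20 to 30 GW works out to $1.6 to $2.4 trillion, so the quoted $1.5 trillion reads as a rounded lower bound; the 100 GW and $800 billion figures are consistent with each other at a 10% rate.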
English
121
481
1.5K
476.2K
nick vasilescu
nick vasilescu@nickvasiles·
whats stopping you from building a second brain? just install codex, connect it to all your tools then run /goal mode with the loop below and have it cook for 12 hours building an obsidian vault this loop will go through every detail from every connector you have you will have the most perfect knowledge base for you and your agents go experience personal AGI before everyone else
English
45
57
946
213.8K
Michał Piszczek
Michał Piszczek@cdiamond·
@theinformation new contributors measures novelty, not retention. these races get won at month six, not the first 30 days
English
0
0
0
8
The Information
The Information@theinformation·
Some copycats are starting to catch up with OpenClaw. Hermes, an agent tool from Nous Research, recently eclipsed OpenClaw in terms of new GitHub contributors in the last 30 days. More details & what it means for the AI agent race: thein.fo/43JJs5H
English
2
3
11
5.5K
Michał Piszczek
Michał Piszczek@cdiamond·
@0xwhrrari 100 tools you don't load into context is 100 tools nobody tests until an agent picks the wrong one
English
0
0
0
7
rari
rari@0xwhrrari·
Anthropic product lead: "At Anthropic, engineers run swarms of 300+ agents daily" "Give your agents 100+ tools - just don't load them all into context" In a 30-minute talk, the Anthropic team shows how production agents are actually deployed Claude + Loops + Routines + Dynamic Workflows + smart tool selection - that's the secret Bookmark and watch the talk, then save the playbook below
rari@0xwhrrari

x.com/i/article/2065…

English
29
22
200
37.7K
Michał Piszczek
Michał Piszczek@cdiamond·
@oneill_c the margin is what funds the next frontier model you then distill from. squeeze it and the cheap tier quietly loses its parent
English
0
0
0
27
Charlie O'Neill
Charlie O'Neill@oneill_c·
1. if distillation gave cheap-and-equivalent, haiku 4.5 wouldn’t be both ~10x cheaper than sonnet and visibly less capable. you move along the cost/quality pareto frontier, distillation doesn’t shift it. By the time you have a bigger model you’ve done the shifting
2. Glm-5.2 was trained on Huawei chips and is still within striking distance of frontier. scale-up domain matters but is bounded by one rack, and “different compute for prefill/decode” is scheduling on the same pool, not different SKUs
3. optimal batch for a deepseek-shape MoE (where optimal batch size is determined by sparsity ratio only) is ~2400 concurrent sequences, which is not millions of requests. This is easily reachable and baseten clear it routinely
Piotr Mazurek@tugot17

Quite a bad take 😀 1. Frontier US models are expensive not because they are pricey to serve but because they serve at a very good margin. They can afford this margin because these models are genuinely better than the open-source alternatives. The twitter narrative that "Chinese models now dominate in usage cause OpenRouter" is just nonsense. 2. Once you have a powerful model, you can just distil it into a smaller one to enable cheap serving. You have all the logprobs, hidden states, and the training corpus – making a new model is simple; you can experiment with a smaller size, different attention mechanisms, etc. You can make it very cheap to serve. At the moment everyone just wants the best model, so Anthropic doesn't care. If this changes, and price becomes an issue, they will make the model cheaper; it will be trivial compared to training Mythos. 3. US companies massively benefit from access to frontier compute; newer offerings from NVIDIA give you a massive cost advantage that is very, very hard to beat. You want different compute for prefill and for decode; you want to use the NVL72 so dispatch is fast, etc. 4. For sparse MoEs, there are massive benefits to scaling. You want to split the model across hundreds of GPUs, overlap compute and dispatch, and saturate each expert. To do this, you need continuously to have millions of requests, ideally spread across different time zones so you can utilise this as close to 24/7 as possible. There are very few companies that meet this requirement (mostly Big Tech). If you don't have this, you will be paying for compute that is idling. As prices of GPUs skyrocket, you won't be able to justify it. There is a lot of money to be made in inference; there are very distinctive patterns that you can specialise in and make a lot of money from. 
But you need to think about this from first principles, and "companies will buy B200 nodes and serve internally running SGLang" is not going to happen, at least not at scale needed to make billions 😅

English
8
2
207
36.5K
Michał Piszczek
Michał Piszczek@cdiamond·
@sudoingX VLMs are where quantization quietly hurts. the vision tower degrades first and you only catch it on the weird inputs
English
0
0
0
100
Sudo su
Sudo su@sudoingX·
anon if you have a dgx spark, the best model to run in 2026 is stepfun's step 3.7 flash. i called it the day it dropped, and after living with it on the box i'm saying it louder now. what you're looking at is a 198B mixture of experts vision language model, ~11B active per token, Q4_K_S at ~104gb, running on a single 128gb dgx spark under hermes agent. one machine. it holds the full 262K context at ~25 tokens a second and takes image input. i measured that speed myself at full context. reasoning, vision, frontier size, all of it fitting and running locally on one box. people keep asking what's actually worth running on 128gb of unified memory. this is the answer, and has been since launch. if you bought a spark to run frontier models locally instead of renting them, this is the one to put on it.
English
33
29
345
25.3K