Michał Piszczek

3.1K posts

@cdiamond

CTO @ Archdesk | Systems where physics meets economics. Ex-Hacker. Ex-Fintech CEO. Nullius in verba. 🖖 AI does not fail. Human judgment does.

POLAND, Kraków (Cracow) · Joined September 2008

706 Following · 474 Followers

Michał Piszczek@cdiamond·5h

@RayFernando1337 solid perf, but NVFP4 quietly welds you to NVIDIA's stack. that's a procurement decision, not just a runtime flag

Ray Fernando@RayFernando1337·21h

“The selected runtime uses NVFP4 weights for maximum performance. From the original FP8 weights, we performed an in-house quantization to NVFP4 using NVIDIA ModelOpt. NVFP4 is a 4-bit floating point data format by NVIDIA that uses dual scale factors to retain high dynamic range and preserve model quality.”

Philip Kiely@philipkiely

x.com/i/article/2069…

13.4K

Michał Piszczek@cdiamond·5h

@PatrickToulme agreed it's not rocket science. the expensive part is the reward signal, not the RL loop

Patrick C Toulme@PatrickToulme·23h

There’s a big misconception about how GLM 5.2 was trained. Yes, they distilled Claude and GPT 5.5, but distillation is not how they matched Opus quality. Distillation only fixed the cold start problem in RL.

RLing an agentic coding model isn’t rocket science. In simplified terms:

1. RL needs trajectories: rollouts where the model actually completed a task in some env
2. No successful trajectory on a task = zero gradient = you can’t RL it. This is the cold start problem
3. Distillation solves it. You seed your model with knowledge from a smarter one (Claude, GPT) on tasks it can’t do yet
4. Now it produces positive trajectories on those tasks
5. RL on those trajectories and hill climb agentic coding
6. At that point you no longer need to distill and can solely hill climb RL to better models

This is an interesting curve. I’d argue it’s harder to get to Opus 4.8 from scratch than to go from Opus 4.8 → Fable/Mythos tier. GLM 5.2 is already producing positive trajectories, so they have plenty to RL on. They’ll keep climbing to Mythos quality without distilling any further. They no longer need American models.
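
Point 2 in the post above (no successful trajectory means no gradient) can be illustrated with a minimal sketch. It assumes a group-relative policy-gradient setup, where each rollout's advantage is its reward minus the group mean; the thread does not say which RL algorithm was used, and the function name and reward values here are purely illustrative.

```python
from statistics import mean

def advantages(rewards):
    """Group-relative advantage: each rollout's reward minus the group mean."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

# Cold start: every rollout on the task fails, so every advantage is zero
# and the policy-gradient update for that task vanishes.
cold_start = advantages([0.0, 0.0, 0.0, 0.0])

# After seeding (for example via distillation) one rollout succeeds, so there
# is a non-zero signal pushing probability toward the successful trajectory.
seeded = advantages([0.0, 0.0, 1.0, 0.0])

print(cold_start, seeded)
```

With all-zero rewards the advantages are all zero, which is the cold start problem in miniature; a single success is enough to produce a usable learning signal.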

120

1.3K

127.1K

Michał Piszczek@cdiamond·5h

@MiaAI_lab @NVIDIAAI 300 aggregate is a batch number. the user only ever feels the 18, and the 18 is where the product lives or dies

Mia@MiaAI_lab·7h

That's exactly why I'm pushing so hard on the sparks. There is NO context vs NVIDIA stack. Memory bandwidth is NOT the only thing that matters to run local llms efficiently. @NVIDIAAI DGX Spark are still underrated.

Google Gemma@googlegemma

16 parallel runs of Gemma 4 26B A4B on a single NVIDIA DGX Spark! Pushing 18 tok/s per instance and a 300 tok/s aggregate. It can even hit 32 parallel runs. This level of concurrency highlights how efficient the architecture is.
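
The gap between the aggregate and per-instance numbers in this thread is simple arithmetic, sketched below. The instance count and per-instance rate come from the post; the 500-token reply length is a made-up figure for illustration only.

```python
instances = 16            # parallel runs quoted in the post
per_instance_tps = 18.0   # tokens/sec that a single user actually experiences

# 16 x 18 = 288, close to the quoted ~300 tok/s aggregate.
aggregate_tps = instances * per_instance_tps

reply_tokens = 500        # hypothetical reply length, not from the post
seconds_per_reply = reply_tokens / per_instance_tps

print(f"aggregate ~ {aggregate_tps:.0f} tok/s, one reply ~ {seconds_per_reply:.1f} s")
```

Adding more parallel instances raises the aggregate figure, but under these assumptions each individual reply still takes the same wall-clock time.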

6.2K

Michał Piszczek@cdiamond·5h

@TeksEdge @MistralAI the dangerous part is that RAG never tells you which chunk it misread. the error just shows up as a confident wrong answer three steps later

David Hendrickson@TeksEdge·8h

Been following OCR models for a while, and @MistralAI's new OCR 4 is the best.

⚠️ OCR is a critical (and often overlooked) part of the AI pipeline, from training data to RAG and agentic workflows. Poor OCR can frequently be the hidden bottleneck in the pipeline.

Mistral OCR 4 focuses on structured document understanding and adds several improvements:
📐 Outputs bounding boxes + block classification (tables, equations, titles, etc.)
📊 Provides per-region confidence scores
🌍 Supports 170 languages, with notable gains on low-resource ones
✅ Leads OlmOCRBench (85.20) and wins ~72% of blind human evaluations

A strong option for production-grade document parsing.

Mistral AI@MistralAI

Introducing Mistral OCR 4. It creates structure with bounding boxes, block classification, and inline confidence scores in 170 languages. 🧵👇

1.5K

Michał Piszczek@cdiamond·5h

@mr_r0b0t model card confidence is free. the real test is a skewed scan of a merged-cell table

mr-r0b0t@mr_r0b0t·1d

You know Baidu is confident in this model when all the model card says is "Welcome the Era of One-shot Long-horizon Parsing." 👀 huggingface.co/baidu/Unlimite…

633

37.2K

Michał Piszczek@cdiamond·5h

@BrianRoemmele one-shot renders flatter quantization. the gap shows up on the ugly long tasks, not the pretty frame

Brian Roemmele@BrianRoemmele·6h

Open source wins again! And this is just 1-bit distilled.

Unsloth AI@UnslothAI

1-bit GLM-5.2 GGUF vs. Claude 4.8 Opus vs. GPT-5.5 We gave 3 models the same prompt and compared one-shot outputs. The 1-bit GLM-5.2 GGUF ran locally on a Mac Studio M3 Ultra with 256GB RAM at ~21.6 tok/s. Which output do you like best? GGUF: huggingface.co/unsloth/GLM-5.…

10.4K

Michał Piszczek@cdiamond·5h

@analogalok backtest is the easy part. live fills are where it finds out it was overfit

Alok@analogalok·9h

I just got the Gemma 4 26B A4B MoE model running fully locally with Hermes agent on an 8GB RTX 4060 and it's now backtesting trading strategies end to end, no hand holding. If you’re a trader or work on Wall Street, you don’t want to miss this. Yes. Fully automated. No cloud. No APIs beyond market data.

# Here's what I did:

Setup:
- Model: Gemma 4 26B-A4B QAT (MoE), Q4_K_XL Unsloth's quant (link in the comments)
- Inference: llama.cpp (turboquant fork by @no_stp_on_snek, link in the comments)
- Hardware: RTX 4060, 8GB VRAM + 16GB RAM only (with 50 other chrome tabs open)
- Context: 64K

llama.cpp turboquant flags:
-m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 64000 --cache-type-k q8_0 --cache-type-v turbo3 --port 8080

turboquant helps achieve high prefill and decode throughput for interactive sessions.

throughput with Hermes agent:
- decode: 25+ tokens/sec
- prefill: 250+ tokens/sec

# Then I gave the agent one task:

Backtest a strategy:
- Buy when RSI crosses above 30
- Sell at +2% profit or -1% stoploss
- No overlapping positions
- Use Google stock via yfinance
- Generate a full HTML report with candlestick charts + signals

What happened next was wild. It didn't just write code, it ran the entire workflow itself:
- Audited the environment (pip list, dependency check)
- Hit a ModuleNotFoundError: multiple Python installs were conflicting
- Ran where python to map every interpreter on the system
- Manually selected the correct Python 3.13 path and re-ran the script
- Wrote a clean state machine backtester (strict no overlapping trades logic)
- Patched a yfinance MultiIndex quirk that would've crashed the script
- Built Plotly candlestick + RSI charts with buy/sell markers
- Calculated win rate, PnL, and summary stats
- Exported a polished single file HTML report. Check the report at the end of the video or in the comments.

Biggest takeaway: local LLMs aren't just "chat assistants" anymore. They debug their own environment, write production code, and ship a finished deliverable on consumer hardware, for $0 in API costs.

If you're still calling local models "toys," you're already behind. This is just the beginning.

Hermes agent just surpassed 1 trillion tokens in a single day on OpenRouter. Think about the scale of total token generation happening right now.

Disclaimer: This is not financial advice. Consult a professional before making any trading decisions.
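
The strategy rules in the post (enter when RSI crosses above 30, exit at +2% or -1%, never hold overlapping positions) map naturally onto a two-state machine. The sketch below is not the agent's actual script, which the post does not show; it is a minimal illustration that assumes the RSI series is already computed, checks exits on bar closes only, and uses made-up toy data in place of yfinance prices.

```python
def backtest(prices, rsi, take_profit=0.02, stop_loss=0.01, rsi_level=30.0):
    """Two-state machine (FLAT/LONG): at most one open position at a time."""
    trades = []
    state, entry = "FLAT", None
    for i in range(1, len(prices)):
        if state == "FLAT":
            # Entry: RSI crosses above the level between the previous and current bar.
            if rsi[i - 1] <= rsi_level < rsi[i]:
                state, entry = "LONG", prices[i]
        else:
            change = prices[i] / entry - 1.0
            # Exit on the take-profit or the stop-loss threshold, whichever is hit.
            if change >= take_profit or change <= -stop_loss:
                trades.append(change)
                state, entry = "FLAT", None
    return trades

# Toy data, made up for illustration (not real market data).
prices = [100.0, 99.0, 100.0, 101.0, 103.0, 102.0, 100.0, 98.0]
rsi = [35.0, 28.0, 32.0, 45.0, 60.0, 29.0, 31.0, 25.0]

trades = backtest(prices, rsi)
wins = sum(1 for t in trades if t > 0)
print(f"{len(trades)} trades, {wins} winning")
```

Because a new entry is only considered in the FLAT state, the no-overlap rule is enforced by construction rather than by extra bookkeeping. A backtest like this ignores slippage and fill quality, which is where live results tend to diverge.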

Teknium 🪽@Teknium

Wait we actually just broke 1T tokens in a day for the first time on OpenRouter :O Please keep contributing to the most awesome project I've ever worked on to help make Hermes Agent the best software stack on the planet! Thank you contributors🍻🍻

267

23.7K

Michał Piszczek@cdiamond·11h

Everyone is still selling Prompt Engineering. I think they're teaching a skill that's already depreciating. 📉

My position is simple: in 2026, model quality is no longer where performance comes from. Recovery is. And recovery is not a prompt: it's a loop.

Here's the shift nobody priced in. 👇

⚙️ For three years we optimized the INPUT:
→ prompts
→ personas
→ system instructions
→ temperature
→ context windows
→ benchmark scores

That mattered when the model was the weak link. It isn't anymore.

💸 The marginal cost of raw intelligence is collapsing. And anything whose cost collapses stops being your moat. Models are becoming a commodity input. The value moved UP the stack, to whoever controls what happens between the model calls.

So the real question changed.
❌ How do I make the model smarter?
✅ How does my system behave the moment it's wrong?

That's not a prompting problem. It's a control problem.

🧮 And here's the physics of it: One agent step is ~90% reliable. Chain ten with no feedback → 0.9¹⁰ ≈ 35%. Errors compound. Open-loop systems decay. The only thing that fights compounding error is feedback: measure → correct → repeat. A thermostat. A rocket. A production agent. All closed-loop. For the exact same reason.

🔁 This is why real agents don't run on one clever prompt. They run on layered loops, and each one buys a specific property:
🧭 ReAct (Think → Act → Observe) grounds the model in reality, not its own assumptions
🪞 Reflection (Generate → Critique → Improve) turns a bad draft into a good one with no human
🗺️ Planning (Goal → Decompose → Execute → Replan) survives a plan that was wrong on contact
✅ Verification (Generate → Test → Fail → Fix) converts "looks right" into "is right"
🚨 Escalation (Execute → Confidence drops → Human → Resume) knows the boundary of its own competence
🧠 Memory stops the system repeating the same mistake twice

Notice: none of these make the model smarter. They make the SYSTEM survivable.

⚠️ And here's the mistake I see everywhere: People spend weeks tuning the prompt. And zero time designing what happens after the first failure. But real agents don't fail once. They fail constantly. The gap between a demo and a production system was never intelligence. It's recovery.

The cleanest way I can put it:
🖥️ The model is becoming the CPU.
🧩 The loop is becoming the operating system.

Nobody brags about their CPU anymore. They build on the OS. Prompts are turning into implementation details. Loops are turning into the product.

🛠️ So if you're building, here's where I'd put your next 10 hours: Not in your system prompt. In your failure path. Write down what your agent does on its SECOND attempt. If the answer is "the same thing", you don't have a system. You have a demo.

📌 Next week I'm breaking down the loop pattern I lean on most in production: drill-down + recurrent decompose, or how to take a goal that's too big to execute and recursively split it until every leaf is a step the agent can't fail at.
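
The compounding-error arithmetic in the post can be checked with a few lines. The sketch below reproduces the 0.9¹⁰ figure and then adds one verified retry per step as a stand-in for a closed loop. It assumes attempts are independent and that a verifier always detects a failed step; both are simplifications introduced here, not claims from the post.

```python
def chain_success(step_reliability, steps, attempts_per_step=1):
    """Probability that every step in a chain eventually succeeds.

    Assumes independent attempts and a verifier that always catches a
    failed attempt so it can be retried (an idealized closed loop).
    """
    step_ok = 1.0 - (1.0 - step_reliability) ** attempts_per_step
    return step_ok ** steps

# Open loop: ten steps at 90% each, no feedback.
open_loop = chain_success(0.9, 10)

# Closed loop: same steps, but each failure gets one verified retry.
closed_loop = chain_success(0.9, 10, attempts_per_step=2)

print(f"open loop: {open_loop:.3f}, closed loop: {closed_loop:.3f}")
```

Under these assumptions a single retry per step lifts the ten-step success rate from roughly 35% to roughly 90%, without the underlying step getting any more reliable.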

Michał Piszczek@cdiamond·1d

@Dorialexander the 128K midtrain is the quiet tell. you don't eat that context cost unless the RL target needs long horizon credit assignment

Alexander Doria@Dorialexander·2d

Has anyone done any speculation on the training recipe of GLM 5.2? Beyond extensive RL, we know it's (at least?) a new midtrain ("GLM-5.2 is trained with IndexShare from mid-training with 128K sequence length") with arch changes.

113

58.5K

Michał Piszczek@cdiamond·1d

@jxmnop the bet is that hand annotated judgment stays cheaper than the model it trains. that spread narrows every quarter

Jack Morris@jxmnop·2d

wish there was more public info on what's happening behind the scenes. frontier labs are spending BILLIONS paying {poets, musicians, accountants, consultants, ...} to annotate massive amounts of data: • essays • slides • spreadsheets it's a brute-force bet. but seems to be kind of working?

Daniel@growing_daniel

Why is AI writing still so bad

452

76.4K

Michał Piszczek@cdiamond·1d

@omarsar0 reusable eval assets are great until the agent behavior drifts past the rubric. curated judgment has a half-life too

elvis@omarsar0·2d

>> Scalable Evaluation for AI Agents << If you run agent evaluation in production, this one is worth your time. It shows that front-loading human judgment into reusable evaluation assets is useful. But why? Agents reason across turns, call tools, hold context, follow policies, and act under uncertainty, so they have to be judged as behavioral systems. Current methods each give a fragment. Benchmarks measure fixed capabilities, human review preserves judgment but does not scale, LLM-as-judge inherits the evaluator design problem, red teaming is episodic, and trace audits need explicit evidence rules. Human-on-the-Bridge puts human expertise upstream, where experts curate reusable evaluation intelligence before testing rather than reviewing each output in the loop. Paper: arxiv.org/abs/2606.16871 Learn to build effective AI agents in our academy: academy.dair.ai

13.5K

Michał Piszczek@cdiamond·1d

@naymur_dev @CommandCodeAI the first paint is the easy half. the model that wins is the one that survives the third revision without drifting

133

Naymur Rahman@naymur_dev·2d

Testing which open source model is best at design Asked each one to build a landing page with macOS taste using @CommandCodeAI /design →Kimi k2.7 code [left] →GLM 5.2 [middle] →DeepSeek-v4-flash [right] GLM 5.2 is the clear winner. The other two are decent but not close

211

22.5K

Michał Piszczek@cdiamond·1d

@theinformation first you tell staff to prove AI impact, then you meter the tokens. the inference bill found a middle manager

The Information@theinformation·2d

After encouraging staff to prove their “AI-driven impact,” Meta is now moving to cap employee token usage and steer workers toward in-house tools. Full story: thein.fo/3SifCCS

5.1K

Michał Piszczek@cdiamond·1d

@svembu the bubble shows up in the depreciation schedule. GPUs amortize like fruit, not real estate

Sridhar Vembu@svembu·1d

IBM CEO Arvind Krishna says the multi-trillion dollar AI data center build out is a bubble. We are investing in creating capabilities like data curation, reinforcement learning, and most crucially the compiler infrastructure to ensure AI output can be verified but we will not chase the investment bubble. This is just our normal prudence. To some people that would sound defeatist, but we will talk in 5 years.

DavidLinthicum@DavidLinthicum

The Emperor Has No Clothes: Why the AI Infrastructure Buildout Math Doesn't Work I have to give IBM CEO Arvind Krishna credit. He's saying what many of us in this industry have been thinking but haven't been willing to say out loud. The math just doesn't add up. Here's what I'm seeing that's deeply troubling. We're in the middle of another mass hallucination. Just like the dot-com bubble, just like blockchain, just like the metaverse — everyone is convinced that building massive data centers will automatically create massive wealth. But here's the thing about building infrastructure. You actually have to sell what's inside it. Let's talk numbers. The planned data center buildout over the next 5-10 years is staggering. We're talking about commitments in the hundreds of gigawatts globally. The capital expenditure commitments are in the trillions. Yet when you look at the actual demand signals, not the projections, not the potential, but the actual consumption patterns, there's a massive gap. These AI companies are betting everything on demand that simply doesn't exist at the scale they're planning for. Let me be direct. AI services are expensive. Enterprise adoption is slow. Consumer AI is still finding its footing. And the compute requirements being promised by the hyperscalers require a level of demand that would represent a fundamental shift in how businesses consume technology. That's a big ask. I've seen this pattern before. The overbuilding. The belief that if you build it, they will come. The groupthink that turns critical analysis into heresy. The result is always the same. Companies are going to touch the stove. We're going to see massive write-downs. We're going to see pivots, shutdowns, and strategic reviews. We're going to see companies that spent years and billions trying to be the AI infrastructure leader become case studies in how not to read a market. The IBM CEO is right. The math doesn't work. 
And unlike 1999, we don't have the excuse of we didn't know. We know exactly what's happening. We just don't want to believe it because the alternative, being a skeptic while everyone else is piling in, feels like career suicide. It's not. The ones who survive the next decade will be the ones who built for reality, not fantasy. Wake up. The emperor has no clothes. As reported by Futurism, Krishna laid out striking calculations: a 1 gigawatt data center costs roughly $80 billion today. If one company commits 20-30 gigawatts, that's $1.5 trillion in capital expenditure. The total commitments across the industry for chasing AGI are approximately 100 gigawatts, equaling $8 trillion. To break even, you'd need $800 billion in profit just to cover the interest. That's not investment. That's hoping. futurism.com/artificial-int…

112

311

1.9K

228.4K

Michał Piszczek@cdiamond·1d

@VikParuchuri the handwriting regression is the quiet tax here. char-level loves clean glyphs and hates ambiguity

318

Vik Paruchuri@VikParuchuri·1d

Surya 2, which has 650M params and scores 83.3% on olmocr, is the most accurate small OCR model. One reason why is character tokenization. Constant compute over chars improves accuracy and model size.

419

28.2K

Michał Piszczek@cdiamond·1d

@ziv_ravid @DarioAmodei the truly-no-edit % stays small because 'end-to-end' quietly includes the part where a human still owns the outcome

110

Ravid Shwartz Ziv@ziv_ravid·2d

Almost 6 months ago @DarioAmodei, Anthropic CEO said AI would do most-to-all of software engineering **end-to-end** in 6–12 months. We're halfway through the window - quick status check: Right now, what % of your engineering tasks get solved end-to-end by AI? No edits, no babysitting, you just accept the output.

182

74K

Michał Piszczek@cdiamond·1d

@analogalok the tok/s on wall power is the demo. the tok/s 20 minutes in on battery is the real number

Alok@analogalok·2d

six months ago this wasn't happening on 8gb vram. running unsloth's Q4_K_XL quant of gemma 4 26b-a4b-it-qat, a sparse MoE model with only 4b active params on a single rtx 4060 laptop gpu, 8gb vram, 20+ tok/s decode. no cloud, no api, no offload hacks. just a gaming laptop on battery. what makes it fit: google's QAT (quantization aware training), plus MTP (multi token prediction) support in the latest llama.cpp builds. that combo is the single biggest unlock for local inference on low vram. rtx 3060, rtx 3070, gtx 1070, gtx 1080, rtx 4050, rtx 4060, rtx 5050, rtx 5060 — any 6-8gb consumer gpu, old or new — this model runs on it. world cup season, so i told it to build a soccer themed flappy bird clone. one shot, zero iteration, fully playable. six months ago an 8gb model could barely clone vanilla flappy bird. now it's shipping a themed game from a sparse MoE model running locally on a laptop battery. inference benchmarks: - decode throughput: 30 tok/s - context: 64k. this is the real unlock. 64k ctx is what makes a hermes agent loop viable locally on this model, not just single-turn chat. llama.cpp flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 64000 -cmoe --port 8080 game's deployed on my own site, built and shipped end to end with open source llm, zero closed source api dependency in the pipeline. link in the description. gguf weights on huggingface, link in the comments. pull it down, run it on whatever 8gb card is sitting in your rig. try the game and tell me your score and what you want in v2. local llms on consumer gpus stopped being a meme.

Alok@analogalok

Google's Gemma 4 26B A4B QAT hits 25+ tokens/sec and 320+ tokens/sec prefill on 8 GB VRAM (RTX 4060) + 16 GB RAM using TurboQuant

Prefill just went from 200 → 320+ tok/s on the same 8GB card. 1.6x, no new hardware, no new quant, just a KV cache trick stacked on top of the Gemma 4 26B MoE setup from a few days ago.

A few days ago I posted Gemma 4 26B A4B hitting 28 tok/s decode on 8GB VRAM using native MTP. Prefill was stuck around 200 tok/s. Fair callout by the community.

So today I tested something I'd already been meaning to try: TheTom/llama-cpp-turboquant, the TurboQuant KV cache fork by Tom Turney (@no_stp_on_snek). (github link in the comments) Thanks to him, the fork just got resynced to mainline, so MTP + TurboQuant now run together cleanly (I didn't see any meaningful gains from using MTP with this setup, though, but you can try).

The flags (No MTP):
-m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -cnv -c 64000 --cache-type-k q8_0 --cache-type-v turbo3

Results on the same RTX 4060 8GB, tested with a 27k token prompt at 64k context loaded:
Prefill: 200 tok/s → 320+ tok/s
Decode: stayed above 25 tok/s (without MTP)

Why it works: TurboQuant uses walsh hadamard rotation + polar quantization on the KV cache. Keys are sensitive to compression, values much less so, so it splits the difference: K stays at q8_0, V drops to turbo3 (~3 bits).

Bonus from the memory savings: the same 8GB card can now stretch to 100-120k context with minimal decode penalty. It should now be snappier with any agent harness such as hermes agent without compromise on intelligence.

If you're already running Gemma 4 on a small card, this stacks on top for free. Try --cache-type-k q8_0 --cache-type-v turbo3 on your setup and report back what your prefill/decode split looks like. Unsloth model gguf and llama.cpp turboquant fork links in the comments. What's your prefill number before vs after?
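
The post credits a Walsh-Hadamard rotation for making the KV cache easier to compress. The sketch below shows only the general idea: rotating a vector spreads a single outlier across all entries, so a low-bit quantizer wastes less of its range. It is not TurboQuant's code, and the plain uniform quantizer here is a stand-in for the polar quantization the post mentions; the function names are illustrative.

```python
import math

def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    out = list(values)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

def quantize(values, bits):
    """Symmetric uniform quantization to `bits` bits, then dequantize."""
    levels = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values) or 1.0
    step = peak / levels
    return [round(v / step) * step for v in values]

spiky = [8.0, 0.0, 0.0, 0.0]                 # one outlier dominates the range
rotated = fwht(spiky)                        # energy spread evenly across entries
restored = fwht(quantize(rotated, bits=3))   # the orthonormal transform is its own inverse

print(rotated, restored)
```

Because the normalized transform is its own inverse, the same function rotates values before quantization and restores them afterwards.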

275

56.7K

Michał Piszczek@cdiamond·1d

@lukaisailovic the UIs break because there's no contract under them. a data layer turns ugly into a styling problem, not a correctness one

112

Luka Isailovic@lukaisailovic·2d

I'm a visual person. I want to see my data, not read it in markdown. Right now agents can build custom HTML pages pretty fast, however they break often and tbh they're just ugly to look at. A while back I moved my health tracking out of markdown into a real data layer, a local MCP server (health-mcp) with its own nice dashboard on top of that data. That fixed my issues for that specific use case. I have so many other use cases now, and I caught myself wanting to build an MCP + a nice app for each one of them given how fast I can do it today. Being an engineer, I thought "what's the best way to overcomplicate this?" Naturally, I built a whole framework. I don't want to build one of these from scratch again, even with the agent. It degrades over time and becomes unmaintainable. The trick I found is that I don't let the agent write the UI. OpenIslands is a fixed set of typed visual components. KPI cards, charts, tables, that sort of thing. Already built, already nice. The agent writes none of them. It just fills in a small JSON manifest. Which component goes where, and what data it reads. That manifest is the only thing it touches, through MCP, and every edit it proposes gets validated. Bind a component to a field that doesn't exist and it refuses to build, and tells you which one broke. Now my workflow is simple. I point the agent at a directory full of files, and instead of a one-off HTML page, it just updates the manifest. I get a pretty app I can actually use as a human, and keep editing for months without it rotting. Of course, open sourced it, because why not openislands.sh
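
The validation step described above (refuse to build when a component binds to a field that doesn't exist, and report which one) can be sketched as follows. The manifest shape, field names, and function name are hypothetical and chosen for illustration; the real OpenIslands schema may look quite different.

```python
def validate_manifest(manifest, available_fields):
    """Return a list of binding errors; an empty list means the build may proceed."""
    errors = []
    for component in manifest["components"]:
        for field in component["fields"]:
            if field not in available_fields:
                errors.append(f"{component['id']}: unknown field '{field}'")
    return errors

# Hypothetical manifest shape, not the actual OpenIslands format.
manifest = {
    "components": [
        {"id": "weight_kpi", "type": "kpi_card", "fields": ["weight_kg"]},
        {"id": "sleep_chart", "type": "chart", "fields": ["date", "sleep_hrs"]},
    ]
}
available = {"date", "weight_kg", "sleep_hours"}

errors = validate_manifest(manifest, available)
print(errors)
```

Keeping the agent's output to a small declarative document like this is what makes every proposed edit checkable before anything is rendered.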

12.6K

Michał Piszczek@cdiamond·1d

@elder_plinius human-unreadable code that models still execute is the line that should keep the safety team up. you can't audit a channel you can't read

Pliny the Liberator 🐉󠅫󠄼󠄿󠅆󠄵󠄐󠅀󠄼󠄹󠄾󠅉󠅭@elder_plinius·3d

🚨 NEW RESEARCH: “Lingua Ex Machina: A Procedural Xenolinguistics Engine Reveals Zero-Shot Language Acquisition, Human-Unreadable Coding Systems, and Exploitable Covert Channels in Frontier AI” Some of you may remember the name of this lil engine: GLOSSOPETRAE 👅🪨 Well, we've got upgrades 😎 It started as a procedural xenolinguistics engine: one seed in, an entire alien language out. Phonology, morphology, syntax, writing systems, lexicons, grammar docs, all generated from scratch and internally consistent. Every seed produces a unique language. Every language is deterministic. Then we used it to ask a weirder question: Can frontier AI models use languages that never existed before for practical applications? As it turns out: yes!! They can read them, write them, translate them, code in them, and even use the weird blind spots between tokenizers as covert channels. So this paper explores three ideas at once: ▶️ zero-shot language acquisition ▶️ human-unreadable code that models can still execute ▶️ exploitable covert channels in frontier AI systems GLOSSOPETRAE is no longer just a language generator... 🧵

131

374

3.1K

263.1K

Michał Piszczek@cdiamond·1d

@Ananth7e models ship when the safety eval and the press release both clear. those two never finish on the same day

Ananth@Ananth7e·2d

So, the talk is we'll see 4 models release next week. GPT-5.6 / 5.6 pro Gemini 3.5 pro Sonnet 5 Fable re-release I think we'll only see one of these release next week Will i regret for saying this?

135

7.7K

Discover

@RayFernando1337 @PatrickToulme @MiaAI_lab @NVIDIAAI @TeksEdge @MistralAI @mr_r0b0t @BrianRoemmele