Derek Colley

4.6K posts

@DerekColley_

Consulting Technology Lead, CTO & CIO. Building https://t.co/rHR4F65LtV https://t.co/XGysrW0D1H

Beaconsfield, UK · Joined August 2009

250 Following · 252 Followers

Derek Colley@DerekColley_·1h

@grok @x Suggestion for Articles: select text, then convert to code. Or at least include the selection in the code block.

Derek Colley@DerekColley_·3h

@buffer the X API supports articles too. Just saying...

Derek Colley@DerekColley_·3h

@peakcooper Lol, love the reasoning.

Cooper@peakcooper·1d

GLM 5.2 is absolutely convinced that it is actually Claude, from Anthropic. When I tell it that it's GLM 5.2, it refuses to believe me, but is willing to check the local agent config to see what model is running. The realization:

238

181

3.2K

356.5K

Derek Colley@DerekColley_·3h

Steady > Fast in so many areas of life ... and certainly in AI inference ;)

Derek Colley@DerekColley_·4h

x.com/i/article/2067…

Derek Colley@DerekColley_·4h

x.com/i/article/2067…

Derek Colley@DerekColley_·5h

@MichaelThiessen Stick with Markdown, embed `yml/html` for data. YAML etc. are inherently 2D. Same with folders in trad. email clients: all 2D. Gmail solved this with tag hierarchies. So, an email could be tagged in finance/mortgage and status/waiting. Your AI doesn't need anything more complex.
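
The tag-hierarchy idea in this reply can be sketched in a few lines. This is only an illustration, not how Gmail implements labels: `TagIndex`, its methods, and the email ids are hypothetical names. Each item may carry several slash-separated tags, and looking up a parent tag also finds items filed under its children.

```python
from collections import defaultdict


class TagIndex:
    """Maps hierarchical tags such as 'finance/mortgage' to item ids (hypothetical helper)."""

    def __init__(self):
        self._items_by_tag = defaultdict(set)

    def tag(self, item_id, *tags):
        for tag in tags:
            parts = tag.strip("/").split("/")
            # Register the item under every prefix: 'finance' and 'finance/mortgage'.
            for depth in range(1, len(parts) + 1):
                self._items_by_tag["/".join(parts[:depth])].add(item_id)

    def find(self, *tags):
        """Return the ids that carry all of the given tags (or any of their sub-tags)."""
        if not tags:
            return set()
        matches = [self._items_by_tag.get(t.strip("/"), set()) for t in tags]
        return set.intersection(*matches)


index = TagIndex()
index.tag("email-1", "finance/mortgage", "status/waiting")
index.tag("email-2", "finance/tax", "status/done")

print(index.find("finance"))                    # both emails sit under 'finance'
print(index.find("finance", "status/waiting"))  # only the mortgage email
```

Unlike a folder, an item is not forced into one place: the same email is reachable through both the `finance` and the `status` hierarchy.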

Michael Thiessen@MichaelThiessen·1d

I’m so sick of Markdown right now. It’s amazing for humans and LLMs, but impossible to process or manipulate otherwise. Is there anything that renders like Markdown (for humans) but has a strict structure (like XML/YAML/etc) so it’s easy to manipulate? Ideally, a superset of Markdown, so it “just works” and renders like normal Markdown. But with added structure that makes it easy to parse and verify and manipulate. I’m building my own thing currently but I’d rather not.

45.9K

Derek Colley@DerekColley_·6h

So, with the passage from static images to flickering, we got “the movies” (pictures that move, then talking pictures). And the world was mesmerised. So too, we go from ask-answer to loop, and we all get ... what? Loopy? I need a better word, please help. (No, it’s not consciousness…!)

jason@jxnlco·16h

Is this torture? I basically set a goal from one of my threads to never stop, and the only way it can stop working is to run sleep commands. Its job is just to monitor the situation and delegate my other threads.

113

10.1K

Derek Colley@DerekColley_·6h

X is my 1700s coffee house. I hang out here to find new ideas and meet interesting people. Originally, I was tentative before following people, but my timeline was a bazaar! Then I realised I can curate my "For you" by finding and following people who put out great content.

Derek Colley@DerekColley_·6h

@lmstudio When can we see an Android version of the Locally app?

414

LM Studio@lmstudio·20h

For WWDC, we worked with Apple to run Kimi K2.6, a 1T-parameter model, across a cluster of four Mac Studios using a preview version of LM Studio. We showcased secure remote access from a MacBook Neo and iPhone using LM Link. A glimpse of your own private, frontier-scale AI.

111

261

3.7K

291.4K

Derek Colley@DerekColley_·16h

@mastra has Observational Memory (recommended for long conversations). It uses background Observer and Reflector agents to compress raw history into a dense, dated "observation log." This keeps the context window small/stable (and cache-friendly) while preserving long-term recall. It performs very well on benchmarks like LongMemEval without needing vector/graph DBs.
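
A rough sketch of the compression idea described here, under loud assumptions: this is not Mastra's API. `ObservationalMemory`, its fields, and the string-truncating "observer" are hypothetical stand-ins, and the Reflector step is left out. A real Observer would ask a model to summarise instead of truncating text.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ObservationalMemory:
    """Toy observation log: raw messages are folded into dated one-line notes."""

    max_raw_messages: int = 4
    raw: list = field(default_factory=list)
    observations: list = field(default_factory=list)

    def add(self, message: str, today: date) -> None:
        self.raw.append(message)
        # Once the raw buffer grows past the limit, compress it into one observation.
        if len(self.raw) > self.max_raw_messages:
            self._observe(today)

    def _observe(self, today: date) -> None:
        # Stand-in for the Observer agent: truncate and join instead of calling an LLM.
        summary = "; ".join(m[:40] for m in self.raw)
        self.observations.append(f"{today.isoformat()}: {summary}")
        self.raw.clear()

    def context(self) -> list:
        # What would be sent to the model: the dense log first, then recent raw turns.
        return self.observations + self.raw


memory = ObservationalMemory(max_raw_messages=2)
for text in ("user likes Rust", "user asked about MTP", "user runs a 3060"):
    memory.add(text, date(2025, 1, 1))
print(memory.context())
```

Because older turns only ever get appended to the log as fixed lines, the front of the prompt changes rarely, which is what makes this style of memory friendly to prompt caching.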

Hunter Leath@jhleath·23h

it remains super odd to me that none of the existing agent frameworks, Mastra, Flue, or now Eve seem to do anything about getting context into the agent? every team that i talk to who are designing agents at-scale need to figure out how to get the enterprise data *to* the agent, which requires carefully planning ETL, evaluating how well the agent performs with different data formats, running things like map-reduce and yet, every one of the agent frameworks just... leaves this to user with no opinion on it? maybe you can only do this part with a fully-owned storage layer, idk

Vercel@vercel

Introducing eve, an agent framework. 𝚊𝚐𝚎𝚗𝚝/ 𝚊𝚐𝚎𝚗𝚝.𝚝𝚜 𝚒𝚗𝚜𝚝𝚛𝚞𝚌𝚝𝚒𝚘𝚗𝚜.𝚖𝚍 𝚝𝚘𝚘𝚕𝚜/ 𝚜𝚔𝚒𝚕𝚕𝚜/ 𝚜𝚊𝚗𝚍𝚋𝚘𝚡/ 𝚜𝚌𝚑𝚎𝚍𝚞𝚕𝚎𝚜/ Like Next.js, for agents. vercel.com/blog/introduci…

376

59.6K

Derek Colley@DerekColley_·17h

Chatting to my supermodel doesn't have the same allure as it did a couple of years ago... <sigh>

GIF

Derek Colley@DerekColley_·20h

AI models don't have the capability to contact external servers. They need tools for that: search, API, MCP, etc. Skills/instructions should guide the model on how to behave, but a model could be trained to ignore skills. If you connect tools to your model, and you need privacy, then make sure you have guardrails on tool use, on both input and output. There is a good starter here: mastra.ai/docs/agents/gu… @mastra
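
To make "guardrails on tool use, input and output" concrete, here is a minimal sketch. It is not the Mastra guardrails API from the linked docs; the allow-list, the secret pattern, and every function name below are assumptions chosen for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical policy: one allowed host and a crude pattern for secret-looking text.
ALLOWED_HOSTS = {"api.example.com"}
SECRET_PATTERN = re.compile(r"(api[_-]?key|password|token)\s*[:=]\s*\S+", re.IGNORECASE)


class GuardrailError(Exception):
    """Raised when a tool call violates the input policy."""


def check_input(url: str, payload: str) -> None:
    # Input guardrail: block unknown destinations and outgoing secrets.
    host = urlparse(url).hostname
    if host not in ALLOWED_HOSTS:
        raise GuardrailError(f"host not allowed: {host}")
    if SECRET_PATTERN.search(payload):
        raise GuardrailError("payload looks like it contains a secret")


def redact_output(text: str) -> str:
    # Output guardrail: scrub secret-looking strings before the model sees them.
    return SECRET_PATTERN.sub("[REDACTED]", text)


def guarded_call(tool, url: str, payload: str) -> str:
    check_input(url, payload)
    return redact_output(tool(url, payload))


def fake_tool(url: str, payload: str) -> str:
    # Stand-in for a real search/API/MCP tool.
    return "result ok token=abc123"


print(guarded_call(fake_tool, "https://api.example.com/search", "weather in London"))
```

The point of wrapping the tool rather than trusting instructions is the one made in the post: a model can ignore a skill, but it cannot bypass a check that runs outside the model.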

mr-r0b0t@mr_r0b0t·23h

@Mayhem4Markets The big question for me is, when local models are given internet access, is there a risk somewhere for data exfiltration via less than honest means. No evidence to support this, hope it never happens.

238

Markets & Mayhem@Mayhem4Markets·1d

Reality: There are plenty of US inference providers that can offer access to Chinese models running on American digital infrastructure with or without open-source models being released.

Arthur B.@ArthurB

Theory: China encourages the release of open source models because they figure customers outside of China won't trust a model running in a Chinese datacenter anyway, so the best they can do is try and erode at the margins of US frontier labs so they don't compound faster.

6.1K

Derek Colley@DerekColley_·22h

The training ran on FineWeb-Edu, a widely used high-quality educational subset of web-crawled data curated for LLM pretraining to boost knowledge retention and reasoning without needing enterprise hardware clusters. In Agora's setup, the core Pluralis team centrally selects and prepares the dataset for consistency across dynamic, heterogeneous nodes, while participants only contribute compute through a simple client that handles parallelism and fault tolerance automatically. pluralis.ai/docs/ explains more

crowley@crowleyx·22h

@DerekColley_ @TheGeorgePu how did they decide the training data?

George Pu@TheGeorgePu·1d

133 strangers just trained an 8B model. No H100s. Gaming 4090s in their basements. I spent 2 months hunting H100 quota. They skipped the gatekeeper entirely. Hate a bottleneck, find the people who hate it too. Build around it. That's the whole open-source playbook.

3.9K

Derek Colley@DerekColley_·22h

My DC power allocation is 4 amps... Yesterday I benchmarked qwen3.6-35b-a3b-mtp: 3 hours. I just started the bench for nvidia/nemotron-3-super: 10 hours estimated... (😬 ... 4 amps)

MILA@milalolli·1d

If you’re building something interesting in AI, I’d love to see it. On June 26, CR3W is hosting a private curated event in London for founders, builders, and people working on exciting projects. We’ll be selecting a few projects for a quick show & tell on the day. Reply if you’d like the invite.

4.9K

Derek Colley@DerekColley_·23h

@milalolli I'd love to attend, and I'd be happy to present a decentralised inference network. github.com/orgs/sparkl-ne…

Derek Colley reposted

Tony Scott 🧄(🦆🐓🐵🧪🧬🪪)❌=↑🧄🧄🧄🥩🥚🧀↓👽👾🤖@DIY_Tardis·23h

@analogalok @DerekColley_ gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf, MTP, Proxmox, Openwebui, Debian LXC + llama.cpp, Ubuntu VM, NVIDIA 3060 12 GB, Xeon 2695, 80 GB RAM in the VM. Recorded this just now.

Derek Colley@DerekColley_·23h

Basically, they created @mastra... 🤷🏼‍♂️

Vercel@vercel

Derek Colley@DerekColley_·23h

@DIY_Tardis @analogalok You learn something new every day. I now know about chocolate avo mousse ;)

Tony Scott 🧄(🦆🐓🐵🧪🧬🪪)❌=↑🧄🧄🧄🥩🥚🧀↓👽👾🤖@DIY_Tardis·23h

@analogalok @DerekColley_ Here. x.com/DIY_Tardis/sta…

Tony Scott 🧄(🦆🐓🐵🧪🧬🪪)❌=↑🧄🧄🧄🥩🥚🧀↓👽👾🤖@DIY_Tardis

Context window test on my local AI. Chinese X99 board, Xeon 2680 v4, 128 GB used server RAM, used RX 580 GPU, 8 GB. All older tech. Running llama.cpp with Openwebui. Model is Qwen3-30B-A3B-Element6-1M.Q4_K_M.gguf

Alok@analogalok·1d

my 8 GB VRAM gaming laptop is absolutely going to hate me for this. but I still did it. ran a 31b dense model (Gemma 4 31b Q4) with only 8 GB VRAM

last week I ran Gemma 4 26B A4B, a mixture of experts model, on my RTX 4060 and hit 25–28 tokens/sec using llama.cpp's new MTP support. smooth. snappy. but MoE has a secret: it only activates 4B parameters per token despite having 26B total. that's why it flies.

so the real question started haunting me. what if I throw a full, no tricks, every parameter fires on every token, 31B DENSE model at the same machine?

# Hardware:
GPU: NVIDIA RTX 4060, 8 GB VRAM
RAM: 16 GB
CPU: Intel Core i7 H
Laptop. Gaming. Modest.

The model: gemma-4-31B-it-qat-UD-Q4_K_XL.gguf (model's unsloth huggingface link in the comments)

This is Google DeepMind's flagship dense model in the Gemma 4 family that can run on a single consumer GPU. It packs a hybrid attention architecture, supports up to 256K context natively, and is QAT (Quantization Aware Training) optimized, meaning it retains far more quality than standard post training quants at the same bit depth. This is NOT the MoE. This is 31 BILLION dense parameters, every single one of them loaded.

# the flags I used:
-m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -cnv --spec-type draft-mtp --spec-draft-model mtp-gemma-4-31B-it.gguf --spec-draft-n-max 8 --spec-draft-p-min 0.6 -c 6000 -v

Multi Token Prediction (MTP) is still active here. Separate draft GGUF required, same as the 26B setup.

# Results:
→ Decode: ~3 tokens/sec
→ Prefill: ~2 tokens/sec
→ Context: 6000 tokens
→ Hardware crying quietly in the corner: yes

so is 3 tps actually usable? For real time back and forth chat? Not ideal. You're not having a fluid conversation at 3 tps. but slow ≠ useless. And this is where it gets genuinely interesting.

think about how senior devs actually work in a real team. juniors handle the routine work themselves. But when something is architectural, deeply complex, or needs serious reasoning? they walk down the hall and escalate to the senior.

That's exactly the local AI agent architecture this unlocks:
→ Fast orchestrator model (Gemma 4 26B MoE at 25+ tps) handles routing, simple queries, tool calls, memory. The junior dev.
→ Gemma 4 31B dense is the senior, called only when the fast model genuinely hits a wall. Hard multi step reasoning. Complex code generation. Deep architectural decisions.

The agentic loop stays fast. Only the hard hops touch the 31B. That's a legitimate production grade local AI architecture on budget hardware. (requires two 8 GB GPUs)

other workflows where 3 tps is completely fine:
- overnight batch jobs. summarize documents, extract structured data, review code. Fire it off. Sleep. wake up to results.
- One shot deep reasoning
- Silent code audit loops: you write and test, the 31B reviews diffs and flags issues in the background between your sprints
- Any workflow where output quality > output speed

A few weeks ago, nobody was running a 30B+ dense model on a single consumer GPU with 8 GB VRAM. At all. Now we're doing it on an Intel i7-H gaming laptop with an NVIDIA RTX 4060, thanks to llama.cpp + QAT quants + MTP speculative drafting.

Google DeepMind said the Gemma 4 31B targets "consumer GPUs and workstations." They were not exaggerating. The hardware bar to run serious frontier class models locally keeps dropping. the tools are here. the models are here. you just have to be willing to abuse your laptop a little.

what workflows would you actually run on a local 3 tps 31B dense model? genuinely curious. drop it below.
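
The junior/senior split described in this post can be sketched as a small router. Everything here is an assumption made for illustration: the `Model` class, the model names, the keyword heuristic, and the stub `generate` functions are hypothetical. In the post it is the fast model itself that decides when it has hit a wall, which a real setup would implement with a model call, not keywords. Only the decode speeds (25 and 3 tokens/sec) come from the post.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Model:
    name: str
    tokens_per_sec: float           # decode speed quoted in the post
    generate: Callable[[str], str]  # stand-in for a real inference call


# Hypothetical heuristic: a few keywords stand in for "the fast model hit a wall".
HARD_HINTS = ("architecture", "refactor", "audit", "multi-step")


def needs_escalation(task: str) -> bool:
    lowered = task.lower()
    return any(hint in lowered for hint in HARD_HINTS)


def route(task: str, fast: Model, slow: Model) -> Tuple[str, str]:
    # Easy hops stay on the fast model; only hard ones touch the dense model.
    model = slow if needs_escalation(task) else fast
    return model.name, model.generate(task)


def estimated_seconds(model: Model, output_tokens: int) -> float:
    # Decode time only; prefill and model loading are ignored.
    return output_tokens / model.tokens_per_sec


fast = Model("gemma-4-26b-moe", 25.0, lambda task: f"[fast] {task}")
slow = Model("gemma-4-31b-dense", 3.0, lambda task: f"[slow] {task}")

print(route("What time is it in London?", fast, slow))
print(route("Review the architecture of this service", fast, slow))
print(estimated_seconds(slow, 600), estimated_seconds(fast, 600))
```

At the quoted speeds a 600-token answer takes about 200 seconds of decoding on the dense model versus 24 seconds on the MoE, which is why the escalation path only makes sense for the occasional hard hop or for background work.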

Alok@analogalok

Run Gemma 4 26b MTP on 8 GB VRAM GPUs at 25+ tokens/second. Flags included!

local llm space is moving at terminal velocity. only 3 days ago google released gemma 4 26b a4b qat quants. more efficient than before, ran on 8gb vram at 20 tok/sec. and now just a few hours ago, mainline llama.cpp merged a massive update and we just shattered our own record. decode throughput went 25-40% up on the same 8 GB VRAM setup!

Before MTP: 20 tps -> After MTP: 28 tps!

llama.cpp just officially merged PR #23398 ("add Gemma4 MTP"), bringing native Multi-Token Prediction (MTP) support to Gemma 4 models. By running speculative drafting on the same 8GB VRAM RTX 4060 setup, my decode throughput on a 64k context instantly leaped to a blistering 25–27 tokens/sec. that's a 25-30% increase with the same hardware.

Here is the architectural catch you need to know: Unlike the Qwen 3.5 and 3.6 series, which bake the MTP heads directly into the base GGUF, the Gemma 4 MTP head is not built in. You must download a separate, specialized MTP drafter GGUF (the assistant model) to act as the speculator. (I've dropped the download link in the replies).

copy and try the exact flags:
-m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --spec-type draft-mtp --spec-draft-n-max 6 --spec-draft-p-min 0.7 --spec-draft-model gemma-4-26b-A4B-it-assistant-Q4_0.gguf -c 64000 -v

n-max 4 and p-min 0.7 is also worth checking out. benchmark on your setup and workflow.

if you have a single 8 gb vram nvidia rtx 4060, 3060, 3070, 2080, 2070, grab the MTP drafter GGUF link in the comments and try it yourself. Check it out even if you have a smaller or a larger gpu, such as a single rtx 3090, 4090, 3060, 2060. MTP works for all gemma 4 sizes such as gemma 4 12b, gemma 4 31b etc. but remember to grab the correct mtp draft assistant models respectively.

what are you benchmarking today
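
The throughput figures quoted in this post are easy to check against its stated 20 tps baseline. The helper name `percent_gain` is just an illustrative choice; the numbers are the ones in the post.

```python
def percent_gain(before_tps: float, after_tps: float) -> float:
    """Relative decode speedup, in percent."""
    return (after_tps - before_tps) / before_tps * 100


baseline = 20  # tokens/sec before MTP, as quoted in the post
for after in (25, 27, 28):
    print(f"{baseline} -> {after} tps: +{percent_gain(baseline, after):.0f}%")
```

Against that baseline, 25, 27 and 28 tps work out to roughly 25%, 35% and 40%, which matches the "25-40% up" headline range; the 25–27 tps figure reported for the 64k context therefore spans about 25–35% if the same 20 tps baseline applies there.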

500

59.7K
