Steven (Batman) Batchelor-Manning

3.8K posts

@S_BatMan

Founder | CTO | Building Exceptional Technology

Bratislava, Slovakia · Joined September 2010

235 Following · 1.4K Followers

Pinned Tweet

Steven (Batman) Batchelor-Manning@S_BatMan·4d

x.com/i/article/2057…

Steven (Batman) Batchelor-Manning@S_BatMan·8h

@bayendor The question will be whether you got the right layers and the right type of memory for your Hermes Agent. There are a lot of good examples of agentic memory out there, all pushing in different directions x.com/S_BatMan/statu…

Steven (Batman) Batchelor-Manning@S_BatMan

x.com/i/article/2050…

david bayendor@bayendor·1d

Just finished wiring a 3-layer memory stack into Hermes Agent.
🧠 Layer 1: Honcho
Session + peer memory on PostgreSQL. Handles context tracking, history, and multi-agent coordination. ~12K messages indexed across 4 containers.
⚡ Layer 2: hermes-lcm
Working memory layer for conversational continuity between turns, with sensitive pattern redaction.
📚 Layer 3: GBrain
Long-term knowledge graph powered by PGLite (WASM Postgres). 301 pages, 1,150 chunks. Hybrid search + multi-query expansion.
Now letting the agents run on it for the next week to see how the memory behavior evolves 👀
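Layer 2 above is described as doing "sensitive pattern redaction". As a rough illustration only, here is a minimal Python sketch of regex-based redaction applied before a turn is written to working memory. The `redact` and `remember` helpers and the three patterns are hypothetical assumptions, not hermes-lcm's actual code or pattern list.

```python
import re

# Illustrative patterns only (assumed, not hermes-lcm's real list).
# Each match is replaced by a typed placeholder so the turn stays readable.
SENSITIVE_PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("API_KEY", re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")),
    ("IPV4", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]


def redact(text: str) -> str:
    """Return text with every sensitive match swapped for [LABEL]."""
    for label, pattern in SENSITIVE_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


def remember(working_memory: list, turn: str) -> None:
    """Hypothetical write path: only redacted text reaches working memory."""
    working_memory.append(redact(turn))


if __name__ == "__main__":
    memory: list = []
    remember(memory, "Mail ops@example.com, key sk-abcdefghijklmnop1234, host 10.0.0.5")
    print(memory[0])  # Mail [EMAIL], key [API_KEY], host [IPV4]
```

Typed placeholders such as `[EMAIL]` keep the stored turn readable for the agent while dropping the secret itself; a real deployment would need a broader, tested pattern set.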

Steven (Batman) Batchelor-Manning@S_BatMan·8h

Super useful read if you experiment with llama-server params a lot but aren't sure what to tweak.

witcheer@witcheer

when I started tuning llama-server, I changed flags randomly until something worked (or didn't). ncmoe 30? ncmoe 10? why is it suddenly 5x slower? what even is the KV cache eating my VRAM for? so I measured everything. every flag, every ncmoe value, the exact VRAM cost per layer, the exact point where performance falls off a cliff. this is the reference I built for myself. 16 flags, each explained with the "when to change it" and "what breaks if you get it wrong" enjoy

Steven (Batman) Batchelor-Manning@S_BatMan·22h

@LottoLabs That's where machines like the Spark have their strength: running models 24/7 on negligible wattage vs. a desktop PC GPU stack, for those tasks where you can afford a little extra time for big savings.

Lotto@LottoLabs·1d

Another awesome feature: you can now derate your card's power and get the same TPS as non-MTP. Huge efficiency gains for discrete memory bros

clem 🤗@ClementDelangue

llama.cpp with MTP support makes local models fast enough to use as daily drivers 🚀 Qwen3.6-27B dense generation below on A10G: From 25 tok/s to 45 tok/s (+78%)!

Steven (Batman) Batchelor-Manning@S_BatMan·22h

@sudoingX I regularly find that just using my own harness instead of something like Claude Code can double the perceived sharpness of any model. The answers given are only as good as the questions asked

Sudo su@sudoingX·1d

this is so annoying. most of the work done by claude code, my cursor reviewer finds it half complete and untested every single time. absolute slap on anthropic engineers face by cursor.

Sudo su@sudoingX

look at this. opus 4.7 max thinking on claude code, the moment my cursor reviewer pushes back. realizes it's fucked. same opus 4.7. same max thinking. same task. caught by the same model running in cursor's harness. cursor wins every single time and it's not close. claude code feels like a q4 quant of the same opus 4.7 max these days. half the rigor, missed context, sloppy pre-audits. idk if anthropic is nerfing the subscription model or if the harness is being built by retards. cursor has nailed something the claude code team hasn't, and i'm having to rework everything now. cursor is my primary frontier now. it's what i always wanted claude code to be.

Steven (Batman) Batchelor-Manning@S_BatMan·22h

@LottoLabs I've found it particularly useful for long-running tasks on Qwen 27B

Lotto@LottoLabs·1d

Wait maybe the gb10 isn’t that bad 😂

mr-r0b0t@mr_r0b0t

16 local AI agents streaming at once! MiniMax M2.7 NVFP4 — 2x GB10, no cloud APIs.

Steven (Batman) Batchelor-Manning@S_BatMan·22h

Looking for speedy Qwen goodness on your Spark? Look no further:

mr-r0b0t@mr_r0b0t

@0xSero I gotchu! github.com/r0b0tlab/qwen3…

Steven (Batman) Batchelor-Manning@S_BatMan·22h

@_avichawla When picking the ideal memory solution for your agent, you are spoilt for choice, with new systems and approaches arriving daily. I looked through 19 of the most popular to help others find what's best for them x.com/i/status/20505…

Steven (Batman) Batchelor-Manning@S_BatMan

x.com/i/article/2050…

Avi Chawla@_avichawla·1d

Build human-like memory for your Agents (open-source)!
Every agentic and RAG system struggles with real-time knowledge updates and fast data retrieval. Graphiti solves these issues with its continuously evolving and temporally-aware knowledge graph.
Like humans, Graphiti organizes an Agent's memories into episodes, extracts entities and their relationships from these episodes, and stores them in a knowledge graph (refer to the image below as you read):
1) Episode subgraph: Captures raw data with timestamps, retaining every detail for easy historical lookup.
2) Semantic entity subgraph: Extracts entities (e.g., “Alice,” “Google”) and facts (“Alice works at Google”). Everything is versioned, so outdated info gets replaced.
3) Community subgraph: Groups related entities into clusters, with summaries for faster retrieval.
Graphiti delivers up to 18.5% higher accuracy with 90% lower latency when compared to tools like MemGPT.
It's fully open-source with 26k+ stars. I have shared the repo in the replies.
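The "everything is versioned" idea in the Graphiti post can be illustrated with a small in-memory sketch, assuming a hypothetical `TemporalMemory` class: raw episodes are kept with their timestamps, and a new fact for the same subject and predicate closes the previous version instead of deleting it, so both the current value and its history stay queryable. This is not Graphiti's API or storage model, and the community subgraph is left out.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class Fact:
    """One edge in the entity subgraph, e.g. (Alice, works_at, Google)."""
    subject: str
    predicate: str
    obj: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means "current version"


@dataclass
class TemporalMemory:
    # Stand-in for the episode subgraph: raw text kept with its timestamp.
    episodes: List[Tuple[datetime, str]] = field(default_factory=list)
    facts: List[Fact] = field(default_factory=list)

    def add_episode(self, at: datetime, text: str,
                    extracted: List[Tuple[str, str, str]]) -> None:
        """Store the raw episode, then upsert facts from a (hypothetical) extractor."""
        self.episodes.append((at, text))
        for subject, predicate, obj in extracted:
            self._upsert(Fact(subject, predicate, obj, valid_from=at))

    def _upsert(self, new: Fact) -> None:
        for fact in self.facts:
            if (fact.subject, fact.predicate) == (new.subject, new.predicate) \
                    and fact.valid_to is None:
                if fact.obj == new.obj:
                    return  # identical fact is already current
                fact.valid_to = new.valid_from  # close old version, keep it
        self.facts.append(new)

    def current(self, subject: str, predicate: str) -> Optional[str]:
        for fact in self.facts:
            if (fact.subject, fact.predicate) == (subject, predicate) \
                    and fact.valid_to is None:
                return fact.obj
        return None

    def history(self, subject: str, predicate: str) -> List[str]:
        return [f.obj for f in self.facts
                if (f.subject, f.predicate) == (subject, predicate)]


if __name__ == "__main__":
    mem = TemporalMemory()
    mem.add_episode(datetime(2024, 1, 1), "Alice joined Google.",
                    [("Alice", "works_at", "Google")])
    mem.add_episode(datetime(2025, 3, 1), "Alice now works at NVIDIA.",
                    [("Alice", "works_at", "NVIDIA")])
    print(mem.current("Alice", "works_at"))  # NVIDIA
    print(mem.history("Alice", "works_at"))  # ['Google', 'NVIDIA']
```

Closing the old version rather than overwriting it is one way to reconcile "outdated info gets replaced" with the episode layer's goal of "retaining every detail for easy historical lookup".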

Steven (Batman) Batchelor-Manning retweeted

うたこ/utako@VRC@utako_vrchat·1d

@S_BatMan This model is tuned for Nvidia Spark, but I was able to achieve 120 tok/s on an RTX 5090 with --draft-max=3! A model supporting NVFP4 + MTP + GGUF simultaneously is rare! Would love to see a vision-capable version too.

Steven (Batman) Batchelor-Manning retweeted

mr-r0b0t@mr_r0b0t·2d

It’s official! @NVIDIAAI / MiniMax-M2.7-NVFP4 Optimized specifically for your SM120/121 DGX Spark (GB10) and RTX 6000/5090 Blackwell tensor cores! Full native FlashInfer/CUTLASS Finalizing the benchmarks and documentation now 😁

Steven (Batman) Batchelor-Manning@S_BatMan·2d

@KishanVavdara @0xSero I'm guessing others are already getting better results than this, but I'm enjoying the journey of doing it myself :)

Steven (Batman) Batchelor-Manning@S_BatMan·2d

@KishanVavdara @0xSero So far I've managed to get DeepSeek V4 Flash at Q2_K-XL on the Spark to around 12 tokens/second for a small test context. There's a long way to go fixing kernel issues with llama.cpp before I can get that to a better speed and context length, but let's see.

0xSero@0xSero·17 May

It had to happen.
- deepseek-v4-flash reap on 1 spark
- integration with vLLM-studio
- research multi-device inference
- dynamo disaggregated inference

Steven (Batman) Batchelor-Manning@S_BatMan·2d

Another interesting system worth a deep dive

Paul Iusztin@pauliusztin_

2 months ago, I started building unified memory layers with knowledge graphs.
Here’s the most common question I’ve been asked: “How do you handle entity resolution and deduplication without corrupting the graph?”
I didn’t want to give a high-level answer based on assumptions... So I spent a lot of time studying how systems like mem0, cognee, and Neo4j approach this problem, while experimenting with orchestration patterns using @PrefectIO to improve durability and reduce costs.
I discovered most people treat entity resolution and deduplication as the same thing. But they’re not. The best memory systems separate naming from identity.
Here’s what you have to do after your LLM extracts an entity:
1/ Resolution → "What should we call this?"
This layer handles:
- Typos
- Acronyms
- Surface-form similarity
It uses exact, fuzzy, and semantic matching, but only against the names of nodes of the same type.
Examples:
- “NYC” ↔ “New York City”
- “JP Morgan” ↔ “JPMorgan Chase”
At this stage, the system only updates the canonical names used later for soft matching. No graph merges happen yet, because similar names are NOT strong enough evidence that two entities are identical. For example:
- Apple: Company ≠ Fruit
- Jensen Huang: CEO of NVIDIA ≠ a doctor in Taipei with the same name
2/ Deduplication → "Is this the same real-world entity?"
Now we embed the full entity context and compare it against existing nodes using semantic + fuzzy similarity across the full context. Based on the similarity score (0 → 1), there are 3 outcomes:
- High confidence (≥0.95) → auto-merge
- Medium confidence (>0.85 and <0.95) → human review
- Low confidence (≤0.85) → new node
This is critical because false merges silently corrupt the graph.
The smartest design decision I found was treating evidence strength as permission strength:
- Weak evidence earns a new node
- Strong evidence earns a merge
- Uncertain evidence earns a review queue
This kept the graph clean as memory scaled.
But the graph model is only half the problem...
Building KGs on top of unstructured data is expensive. These operations all cost money:
- LLM extraction
- Entity resolution
- Embeddings
- Deduplication
If one downstream step fails and you replay the whole pipeline, you burn tokens recomputing work you already paid for.
So the best architectures split workflows into checkpointed tasks with retries + caching.
This is where tools like @PrefectIO fit extremely well for production memory systems. It lets you:
- Cache expensive extraction steps
- Retry only failed stages
- Batch embeddings efficiently
- Scale ingestion safely
- Add observability into every phase
All without restarting from scratch after every failure.
P.S. What’s been the hardest part for you when building long-term memory for agents via KG?
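Both ideas in the post, threshold-gated merges and checkpointed stages with retries, can be sketched in plain Python. This is a minimal illustration under stated assumptions: the thresholds are the ones quoted in the post, the similarity score is assumed to come from an embedding comparison done elsewhere, and `CheckpointedPipeline` is a hypothetical stand-in for the caching and retry behavior the post attributes to orchestration tools like Prefect, not Prefect's API.

```python
import hashlib
from typing import Callable, Dict, Optional

# Thresholds quoted in the post; treat them as starting points to tune.
AUTO_MERGE_AT = 0.95
REVIEW_ABOVE = 0.85


def dedup_decision(score: float) -> str:
    """Evidence strength as permission strength, for a similarity in [0, 1]."""
    if score >= AUTO_MERGE_AT:
        return "merge"     # strong evidence earns a merge
    if score > REVIEW_ABOVE:
        return "review"    # uncertain evidence earns a review queue
    return "new_node"      # weak evidence earns a new node


class CheckpointedPipeline:
    """Hypothetical runner: caches each stage's successful output and
    retries only the stage that failed."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts
        self.cache: Dict[str, object] = {}
        self.runs: Dict[str, int] = {}  # times each stage actually executed

    def step(self, name: str, fn: Callable[[str], object], payload: str) -> object:
        key = f"{name}:{hashlib.sha256(payload.encode()).hexdigest()}"
        if key in self.cache:
            return self.cache[key]  # replay hit: nothing is recomputed
        last_error: Optional[Exception] = None
        for _ in range(self.max_attempts):
            self.runs[name] = self.runs.get(name, 0) + 1
            try:
                result = fn(payload)
            except Exception as err:  # retry this stage only
                last_error = err
                continue
            self.cache[key] = result  # checkpoint on success
            return result
        raise RuntimeError(
            f"stage {name!r} failed {self.max_attempts} times") from last_error


if __name__ == "__main__":
    print([dedup_decision(s) for s in (0.97, 0.90, 0.60)])
    # ['merge', 'review', 'new_node']
```

Keying the cache on the stage name plus a hash of its input means a replay after a downstream failure skips stages that already succeeded, which is the token-saving property the post describes.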

Steven (Batman) Batchelor-Manning@S_BatMan·2d

@0xVibly 👋 researching more than building atm

0xVibly@0xVibly·2d

Twitter is cool, but it’s 100x better when your feed is full of builders. I want to meet more people working on tech, AI, startups, design, SaaS, web dev, and programming. If that’s you, drop a hi 👋

Steven (Batman) Batchelor-Manning@S_BatMan·2d

@populartourist Great, now I've lost another afternoon to good advice. This platform is plagued with meaningful, helpful advice that makes me want to change stuff. I want time for my other hobbies, god damn it

wd 🔺@populartourist·3d

Hot take: Qwen3.6 27B adheres to instructions like rails if prompt is formatted in JSON. Thank you for your attention to this matter.

Steven (Batman) Batchelor-Manning@S_BatMan·2d

@pupposandro @luceboxai A lot of people seem to be finding their way in through gateway projects such as @ollama and @lmstudio, and then the slippery slope of llama.cpp, vLLM, etc. begins

Sandro@pupposandro·2d

Local AI is still too complex for 99% of people, and the fix isn't better kernel tutorials or cleaner docs. It's solving the entire software stack. Local LLMs are still waiting for their plug-and-play moment. Exactly what we're building @luceboxai for

Hikari∣LocalLLM⚡@Hikari_07_jp

Local LLM is incredibly complex. Hardware selection, quantization, harnesses, engines, tensor parallelism, unmodified models, MTP… Despite its complexity, local LLM is irresistibly fascinating. I started using X because there was almost no one close to me who could share this excitement with me.

Steven (Batman) Batchelor-Manning@S_BatMan·2d

@DailyDoseOfDS_ There are a few different agentic memory approaches; make sure you use the one that suits your agent's needs x.com/S_BatMan/statu…

Steven (Batman) Batchelor-Manning@S_BatMan

x.com/i/article/2052…

Daily Dose of Data Science@DailyDoseOfDS_·2d

5 levels of evolution of AI Agents, explained visually:

Steven (Batman) Batchelor-Manning@S_BatMan·2d

Is your agent struggling to remember? Want to power up Hermes or OpenClaw? Learn about agentic memory systems, pick what suits you, and unlock the superintelligence you need x.com/S_BatMan/statu…

Steven (Batman) Batchelor-Manning@S_BatMan

x.com/i/article/2050…

Steven (Batman) Batchelor-Manning@S_BatMan·3d

Drop in a quick /goal to regression-test today's code, and look forward to a morning of reading an AI's negative reviews of my work over coffee.

Steven (Batman) Batchelor-Manning@S_BatMan·3d

@spiritbuun ... that is a very good point :) I'll pull it into my onion of a franken-llama.cpp, thanks :)

buun@spiritbuun·3d

@S_BatMan Nah, mainline doesn't accept slop like this. The beauty though is that with vibecoding we can all just have our own slop forks that do the niche things we want them to do.

buun@spiritbuun·3d

Was OOMing with my favorite quant + MTP + mmproj. I've added a new flag to my fork: --mmproj-gpu-swap. If you're tight on space, this puts mmproj on CPU until it's needed, then temporarily destroys MTP in order to use it, then swaps it back. Virtually no discernible speed hit
