Kog (@Kog__AI) - Twitter Profile

Kog@Kog__AI·3d

LLM inference usually means watching a page stream into place, token after token, while you wait for it to finish. The Kog Inference Engine (KIE) returns the finished landing page before you finish reading your own prompt. Ask for a warm editorial layout, it is already there. Ask for a neon developer tool build, it is already there. The loop between asking and seeing closes fast enough to stay in flow. Restyling a page feels immediate, so you try ten directions in the time one used to cost. This is Laneformer 2B in tech preview, a small model built to show what the engine does. The same shift matters most inside an agent, where every step waits on the previous one. If inference latency is the bottleneck in your agentic loop, that is the conversation we want to have.

Kog@Kog__AI·25 Jun

Blog post: huggingface.co/blog/kogai/kog… Try it on playground.kog.ai. The tokenizer is under the Llama 2 Community License.

Kog@Kog__AI·25 Jun

The model behind 3,000 tokens/s is now open-source. Laneformer 2B is on Hugging Face with weights, model code, and the full training recipe under Apache 2.0. Here is why we trained it from scratch. The Kog Inference Engine (KIE) generates 3,000 output tokens/s per request on 8× AMD MI300X GPUs. Delayed Tensor Parallelism (DTP) is one of the reasons. Standard tensor parallelism blocks on an all-reduce at every layer. DTP delays each all-reduce by δ = 2 layers and overlaps it with the streaming of the next layers' weights, which keeps inter-GPU communication overhead negligible. Laneformer runs this as an 8-lane structure across the 8 GPUs, with the delay built into the architecture. DTP works best when the model is designed around it from day one. A fresh architecture starts from random weights, so we trained Laneformer from scratch on 6T tokens of open Nemotron data. In greedy decoding, Laneformer 2B scores 45.1% on HumanEval+, ahead of Qwen3.5 2B at 31.1%, Gemma 2 2B at 32.9%, and SmolLM2 1.7B at 29.9%. A 500-token completion finishes in under 0.2 seconds at that speed, so drawing 8 samples takes about 1.3 seconds and lifts HumanEval+ to 65.0%, which makes test-time compute a cheap option. Weights, model code, and the recipe are open. Links in the first comment 👇
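
The timing figures in this post can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are made up for this example, and it assumes the 8 samples are drawn one after another at the full per-request speed of 3,000 tokens/s, which the post does not state explicitly.

```python
def completion_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate one completion at a fixed decoding speed."""
    return num_tokens / tokens_per_second


def sequential_sampling_seconds(num_samples: int, num_tokens: int,
                                tokens_per_second: float) -> float:
    """Time to draw several completions back to back (assumed, not stated in the post)."""
    return num_samples * completion_seconds(num_tokens, tokens_per_second)


# Figures taken from the post: 500-token completions, 3,000 tokens/s, 8 samples.
single = completion_seconds(500, 3000.0)
eight = sequential_sampling_seconds(8, 500, 3000.0)

print(f"one completion: {single:.3f} s")   # under 0.2 s, as the post says
print(f"eight samples:  {eight:.2f} s")    # about 1.3 s, as the post says
```

Under that assumption, the numbers in the post are consistent with each other: one completion takes roughly a sixth of a second, and eight take about 1.3 seconds.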

Kog@Kog__AI·16 Jun

During LLM inference, a grid synchronization across the GPU occurs 6 times per layer. 35% of every token's generation time was lost to that overhead. We reduced it by 9x. The standard approach uses a global counter. Each compute unit (CU) arrives, increments, and waits until the counter reaches 256. Every sync triggers full cache write-backs and invalidations across HBM, even when only a few values are needed. Measured cost on AMD MI300X: between 7.6 and 7.9 µs per sync. Kog encodes readiness directly into the data. Buffers are initialized to NaN. Each CU polls only the values it actually needs, using scope-controlled loads that bypass unnecessary cache movement across chiplets. When the NaN disappears, the data is ready. Zero global counter contention. Zero broad cache invalidation on the critical path. 0.80 to 0.93 µs instead of 7.6 to 7.9 µs. Same hardware. That headroom goes directly into token generation speed. It is one of the reasons the Kog Inference Engine (KIE) generates 3,000 output tokens/s per request on MI300X. Full implementation with code, chiplet topology details, and the complete monokernel breakdown at blog.kog.ai/building-a-sin…
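
The "readiness encoded in the data" idea can be illustrated on a CPU with ordinary threads. This is a rough analogy only, not Kog's GPU code: `produce`, `wait_for`, and `demo` are hypothetical names, a Python list stands in for the shared buffer, and nothing here models scope-controlled loads or chiplet caches. It assumes valid data is never NaN, which is what makes NaN usable as a "not ready" marker.

```python
import math
import threading
import time


def produce(buffer, values, delay_s=0.001):
    """Producer: publish each value by overwriting its NaN slot."""
    for i, value in enumerate(values):
        time.sleep(delay_s)   # stand-in for the work that computes the value
        buffer[i] = value     # the write itself signals readiness


def wait_for(buffer, indices, timeout_s=5.0):
    """Consumer: poll only the slots it needs until each stops being NaN."""
    deadline = time.monotonic() + timeout_s
    ready = []
    for i in indices:
        while math.isnan(buffer[i]):
            if time.monotonic() > deadline:
                raise TimeoutError(f"slot {i} never became ready")
            time.sleep(0)     # yield so the producer thread can run
        ready.append(buffer[i])
    return ready


def demo():
    buffer = [math.nan] * 8                       # every slot starts "not ready"
    values = [float(i * i) for i in range(8)]     # valid data must never be NaN
    producer = threading.Thread(target=produce, args=(buffer, values))
    producer.start()
    needed = wait_for(buffer, [2, 5])             # no global counter, no barrier
    producer.join()
    return needed


print(demo())
```

The consumer never waits for slots it does not read, which is the contrast with a counter barrier where every participant waits for all others.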

Kog retweeted

David Hendrickson@TeksEdge·5 Jun

👀 Watch out @cerebras and @GroqInc - mystery model outputs 3000+ tps on standard GPUs. 🔥 Here is a comparison between Kog Laneformer-2B & Google Gemma3n-4B, both non-reasoning with same prompt. Laneformer @ 3000+ tps finished in 3s and Gemma 3n 4B took 43s.

Kog@Kog__AI·5 Jun

3,000 tokens/s inference speed pulls developers in. Our launch last week proved it. Our post hit the Hacker News front page and stayed for 12 hours. 13,800 engineers read the Kog Labs technical breakdown. 2,240 developers tested our live playground, with a whopping 75% activation rate. More than 4 million tokens generated across thousands of conversations at an average generation speed of ~3,200 tokens/s. When inference is fast enough to feel different, developers come and build. Read our technical blog posts and test it yourself. Try the playground → playground.kog.ai 💥 Why 3,000 tokens per second matters and how we got there → blog.kog.ai/real-time-llm-… 📖 Deep dive into the monokernel architecture on AMD MI300X → blog.kog.ai/building-a-sin… 📖 Delayed Tensor Parallelism, our approach to removing inter-GPU communication overhead → blog.kog.ai/delayed-tensor…

Kog@Kog__AI·4 Jun

@JonathanLeaders We'll find out soon (work in progress). There are very good MoE models like GPT-OSS-120B or the latest Qwen that could run at 1,000 to 3,000 tokens/s. For the biggest SoTA MoE models (like DeepSeek V4) we are probably looking at 500 to 1,000 tokens/s.

Jonathan Leaders@JonathanLeaders·4 Jun

@Kog__AI How fast is it at running a state of the art model?

Kog@Kog__AI·28 May

🚀 Launch today: Kog generates 3,000+ output tokens/s per single request, on standard datacenter GPUs. We are bringing real-time LLM inference to hardware that companies already run in production. The speed previously associated with purpose-built silicon is now delivered on NVIDIA H200 and AMD MI300X. Today, we are opening our Tech Preview with a 2B coding model, with large frontier MoE support coming next. Try our Playground → playground.kog.ai 💥 Why that matters, and how we did it → blog.kog.ai/real-time-llm-… 📖 Monokernel deep dive → blog.kog.ai/building-a-sin… 📖 Delayed Tensor Parallelism research → blog.kog.ai/delayed-tensor… read the thread 👇

Kog@Kog__AI·4 Jun

@Michele26248535 Team of 11 engineers and PhDs, and our next model might be a bit bigger. Think DeepSeek V4 ;)

Michele Lane@Michele26248535·3 Jun

@Kog__AI How big is the team, and are you actually adding 8B models soon, or just kicking it? IndexGPT is decent for tracking AI visibility, but model support is the part people keep tripping on.

Kog@Kog__AI·2 Jun

@Alloutnikhil You can view the replay video of our talk here: youtube.com/watch?v=ndSA9T…

nikhil tayal@Alloutnikhil·2 Jun

@Kog__AI Cool, I missed it. Can't wait to try it. Do you still have someone here in SF?

Kog@Kog__AI·12 May

A great week in San Francisco for the Kog team at AMD AI DevDay 2026. Presenting some of our latest work was a real pleasure, and the conversations around research, infrastructure, and product were especially valuable. One signal came through very clearly. Inference is moving much closer to the center of the conversation. That gave us even more conviction that we are building in the right direction. More to come soon.

Kog@Kog__AI·2 Jun

@luckymoooon @gaeldelalleau @__smiz @grok Qwen models are very good and we could make them even faster, so hopefully we support them next

Kog@Kog__AI·2 Jun

@luckymoooon @gaeldelalleau @__smiz @grok We plan to support DeepSeek v4 and GPT-OSS-120B

Kog retweeted

Steeve Morin@steeve·30 May

Incredible!

Kog retweeted

Rohan Paul@rohanpaul_ai·29 May

I had to test it myself to believe this unreal inference speed. 3,000 tokens/s for 1 user on standard datacenter GPUs. They leveraged a hidden efficiency gap in how GPUs generate tokens. @Kog__AI just achieved 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding). Their tech preview is on a 2B model, and they show how their techniques will scale to large frontier MoE models at similar speeds. That's a huge number because normal low-batch GPU decoding for 2B to 8B models is usually closer to 100 to 300 tokens/s per request, so Kog is claiming something like a 10X to 30X jump in the speed one user actually feels. Their trick: they are getting the speed by treating LLM decoding as a memory streaming problem, not mainly a math problem. For 1 user at batch size 1, the GPU is not doing big, efficient matrix-matrix work like in training or large-batch serving; it is repeatedly pulling the model’s active weights from high-bandwidth memory for each new token, so speed depends on how smoothly those weights keep flowing. Normal inference stacks keep breaking that flow. They run many separate GPU programs for different parts of the model, move intermediate results through memory, wait at synchronization points, talk back to the CPU for scheduling or sampling, and then repeat this token after token. Kog’s answer is to co-design 3 things that are usually tuned separately: the runtime, the low-level GPU code, and the model architecture. The biggest engineering move is the monokernel, where the whole decode pass runs as 1 persistent GPU-resident program, including sampling, so the system does not keep stopping for kernel launches, CPU scheduling, and intermediate memory round trips. They also rebuilt synchronization, because their own measurements say grid sync was eating around 35% of token-generation time; instead of making every compute unit wait at a broad barrier, each unit waits only for the exact data it needs. 
On AMD MI300X, they also map memory access around the chiplet layout, because memory latency changes depending on which die makes the request. Then their Laneformer model uses Delayed Tensor Parallelism, which lets cross-GPU communication happen in the background instead of blocking every layer.
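
The effect of cheaper synchronization on end-to-end speed can be estimated with Amdahl's law. The sketch below is a back-of-the-envelope check, not a measurement: the 35% share comes from the text above, the 9x reduction comes from Kog's own post about grid synchronization on this page, and the function name is made up for this example. It assumes everything other than synchronization takes the same time as before.

```python
def speedup_from_cheaper_sync(sync_share: float, sync_reduction: float) -> float:
    """Amdahl's law: only the synchronization share of per-token time shrinks."""
    remaining_time = (1.0 - sync_share) + sync_share / sync_reduction
    return 1.0 / remaining_time


# 35% of token time spent in grid sync, sync cost cut by 9x (figures from the posts).
speedup = speedup_from_cheaper_sync(0.35, 9.0)
print(f"{speedup:.2f}x")  # prints "1.45x"
```

Under these assumptions the synchronization change alone would account for roughly a 1.45x gain in per-token speed, so it is one contributor among several rather than the whole story.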

Kog@Kog__AI·29 May

@mahimaidev Thanks! Here is the link to the thread explaining our approach: x.com/Kog__AI/status…

Mahimai Raja J@mahimaidev·29 May

Inference on steroids? @Kog__AI Saw this blog post today comparing LLM inference speed with vLLM and SGLang, and was amazed by the results.

Kog retweeted

Rohan Paul@rohanpaul_ai·29 May

The monokernel idea was one of their most powerful tricks. Instead of launching many small GPU programs for normalization, attention, feed-forward layers, sampling, and communication, Kog keeps the whole decode loop inside 1 long-running GPU program. With a monokernel, weights for the next stage can start loading while the current stage is still finishing, so the GPU behaves more like a pipeline and less like a machine constantly being paused and restarted. If a Transformer layer is broken into many small GPU programs, the system can burn a scary amount of its budget just stopping, starting, syncing, writing, reloading, and waiting, before doing useful token generation. The monokernel tries to remove that stop-start behavior. Once it begins, it stays resident on the GPU and handles the full sequence, including prefill, decode, sampling, tensor-parallel communication, reductions, and internal state, without going back to the CPU for every little step. The big gain is that weight streaming stays continuous. For batch-size-1 inference, the GPU mostly needs to stream active model weights from high-bandwidth memory into compute units as smoothly as possible. Read more about their “monokernel” implementation here: blog.kog.ai/building-a-sin…

Kog retweeted

Liquid AI@liquidai·28 May

Today, we're releasing LFM2.5-8B-A1B, a device-optimized model designed to power real-life applications on phones, laptops, PCs, robots, and fast & lightweight server-side use-cases.
> 8B MoE, 1.5B active
> Expanded 128K context
> LFM2.5 flagship hybrid MoE architecture
> Trained on 38T tokens + large-scale RL
> fast, reliable tool calling, punching above its weight, comparable to models with up to 4x its size
> customizable on a single GPU for any specialized task
> LFM2 open-weight license
🧵

Kog retweeted

Gaël Delalleau@gaeldelalleau·28 May

Kog officially launched today! Super-fast AI inference speed on standard GPUs, 30x faster than ChatGPT. And it's European deep tech. Check it out 👇

Kog@Kog__AI·28 May

We expect our thousands-of-tokens-per-second GPU results to scale well past our 2B-parameter preview model. Single-request decoding depends on active-parameter count per token, not total parameters, and current frontier MoE models only activate a fraction of their weights. For instance, DeepSeek-V4-Flash has 284B total, 13B active. With its default FP4/FP8 quantization, the math checks out for 1,000 to 3,000 tokens/s on current datacenter GPUs. We plan to support frontier MoE models in Kog Inference Engine in the coming months, on GPU hardware enterprises and sovereign-AI builders already own. No proprietary silicon needed. 🚀 Try it → playground.kog.ai 👀 Take a deeper look at our claims → blog.kog.ai/real-time-llm-… 🛠️ Build with it → kog.ai (building fast agents? We're taking design partners!)
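
One way to sanity-check "the math checks out" is a memory-bandwidth ceiling: if every generated token has to read all active weights once, throughput cannot exceed aggregate memory bandwidth divided by the bytes of active weights. The sketch below is a rough upper bound, not Kog's own calculation. The 13B active-parameter count is from the post; the 8-GPU node and the 5.3 TB/s per-GPU figure (AMD's published peak HBM bandwidth for the MI300X) are assumptions added for this example, as is the function name.

```python
def decode_ceiling_tokens_per_s(active_params: float, bytes_per_param: float,
                                num_gpus: int, hbm_bytes_per_s: float) -> float:
    """Upper bound on single-request decode speed when weight reads dominate."""
    bytes_per_token = active_params * bytes_per_param
    return num_gpus * hbm_bytes_per_s / bytes_per_token


ACTIVE_PARAMS = 13e9   # from the post: 13B active parameters per token
NUM_GPUS = 8           # assumption: one 8-GPU node
HBM_BW = 5.3e12        # assumption: MI300X peak HBM bandwidth, bytes/s per GPU

fp8_ceiling = decode_ceiling_tokens_per_s(ACTIVE_PARAMS, 1.0, NUM_GPUS, HBM_BW)
fp4_ceiling = decode_ceiling_tokens_per_s(ACTIVE_PARAMS, 0.5, NUM_GPUS, HBM_BW)

print(f"FP8 ceiling: {fp8_ceiling:,.0f} tokens/s")
print(f"FP4 ceiling: {fp4_ceiling:,.0f} tokens/s")
```

The bound ignores KV-cache reads, attention, expert routing, and inter-GPU communication, so real throughput would sit below it. Under these assumptions the 1,000 to 3,000 tokens/s range quoted in the post falls under the FP8 ceiling.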

Kog@Kog__AI·28 May

We also introduce Delayed Tensor Parallelism (DTP) to minimize inter-GPU communication wait-time. Standard tensor parallelism ends each module with a blocking all-reduce operation. DTP is a Transformer architecture variant that makes those all-reduce operations asynchronous, hence reducing inference communication overhead to 0. DTP matches the quality of standard TP, and edges out similar methods like Ladder Residual and PT-Transformer in our setting. 📖 Full post → blog.kog.ai/delayed-tensor…
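
A toy timeline model can show why making the all-reduce asynchronous hides its cost. This is a simplified scheduling sketch, not the DTP architecture itself: `decode_time`, its parameters, and the timing numbers are hypothetical, and the meaning of `delay` here (how many extra layers may run before a reduced result is needed, with 0 meaning a blocking all-reduce) is one plausible reading rather than Kog's definition. An earlier post on this page gives the delay as δ = 2 layers.

```python
def decode_time(num_layers: int, compute_us: float, allreduce_us: float,
                delay: int) -> float:
    """Total time for one token under a simple compute/communication model.

    Layer i may start only once the all-reduce of layer (i - 1 - delay) has
    finished. All-reduces run one at a time on a shared interconnect.
    """
    reduce_done = []        # finish time of each layer's all-reduce
    compute_free = 0.0      # when the compute units are next free
    link_free = 0.0         # when the interconnect is next free
    for layer in range(num_layers):
        dependency = layer - 1 - delay
        start = compute_free
        if dependency >= 0:
            start = max(start, reduce_done[dependency])
        compute_free = start + compute_us
        reduce_start = max(compute_free, link_free)
        link_free = reduce_start + allreduce_us
        reduce_done.append(link_free)
    return link_free        # the token is done after the last all-reduce


# Hypothetical timings: 10 µs of compute and 5 µs of all-reduce per layer.
blocking = decode_time(num_layers=4, compute_us=10.0, allreduce_us=5.0, delay=0)
delayed = decode_time(num_layers=4, compute_us=10.0, allreduce_us=5.0, delay=2)
print(blocking, delayed)
```

In this model the blocking schedule pays compute plus communication on every layer, while the delayed schedule pays communication only once at the end, as long as each all-reduce fits inside the compute time of the layers it overlaps.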
