

Marktechpost AI

@Marktechpost
🐝 AI Dev News Platform (1 million+ monthly traffic) | 150k+ AI subreddit | Contact: [email protected]

Most LLM pre-training efficiency work changes the tokenizer, the architecture, or the inference behavior. Nous Research just showed you don't have to touch any of them.

They released Token Superposition Training (TST) — a two-phase modification to the standard pre-training loop. Phase 1 averages s contiguous token embeddings into a single latent s-token and trains with a multi-hot cross-entropy loss against the next bag of tokens. Phase 2 reverts to standard next-token prediction from the same checkpoint, with the TST code fully removed.

Here's what's actually interesting:

→ Each TST step is kept equal-FLOPs to baseline by increasing the data sequence length by s× — not the batch size
→ 3B dense: loss 2.676 in 247 B200-hrs vs 443 B200-hrs for the baseline at matched loss (~1.8x faster)
→ 10B-A1B MoE: 4,768 B200-hrs vs 12,311 B200-hrs at matched loss (~2.5x faster)
→ Optimal ranges: bag size s ∈ [3, 8] at 270M, s ∈ [6, 10] at 600M, s = 16 at 10B; step ratio r ∈ [0.2, 0.4]
→ Re-initializing the embedding or LM head at the phase boundary breaks it entirely — loss went from 2.676 to 2.938, worse than the 2.808 baseline

A minimal sketch of the Phase 1 objective follows below.

Full analysis: marktechpost.com/2026/05/13/nou…
Paper: arxiv.org/pdf/2605.06546
Project page: nousresearch.com/token-superpos…

@NousResearch
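Here's a minimal PyTorch sketch of what a Phase 1 training step could look like, based only on the description above. The `trunk`, `embed`, and `lm_head` interfaces, the loss normalization by s, and the stand-in modules in the usage snippet are all assumptions, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def tst_phase1_step(trunk, embed, lm_head, token_ids, s):
    """One TST Phase 1 step (sketch). token_ids: (B, L), L divisible by s."""
    B, L = token_ids.shape
    # Average each group of s contiguous token embeddings into one latent s-token.
    x = embed(token_ids)                           # (B, L, d)
    x = x.view(B, L // s, s, -1).mean(dim=2)       # (B, L/s, d)
    # Run the unmodified transformer trunk on the now s-times-shorter sequence.
    h = trunk(x)                                   # (B, L/s, d)
    logits = lm_head(h)                            # (B, L/s, V)
    # Multi-hot target: the bag of s tokens in the *next* group.
    V = logits.size(-1)
    bags = token_ids.view(B, L // s, s)
    target = torch.zeros(B, L // s, V, device=logits.device)
    target.scatter_(-1, bags, 1.0)
    pred, tgt = logits[:, :-1], target[:, 1:]
    # Multi-hot cross-entropy (normalizing by bag size s is a guess).
    return -(tgt * F.log_softmax(pred, dim=-1)).sum(-1).mean() / s

# Toy usage with stand-in modules:
d, V, s = 32, 256, 4
embed = torch.nn.Embedding(V, d)
trunk = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d, 4, batch_first=True), 2)
lm_head = torch.nn.Linear(d, V)
loss = tst_phase1_step(trunk, embed, lm_head, torch.randint(0, V, (2, 64)), s)
```

Phase 2 then keeps training the same weights with plain next-token cross-entropy, per the post; no TST code remains.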




A 103B medical LLM just got open sourced — and it only activates 6.1B parameters at inference time.

Meet AntAngelMed — a 103B-parameter medical LLM that only activates 6.1B parameters at inference time.

Here's what's actually super interesting:

1. The architecture
It uses a 1/32 activation-ratio MoE built on Ling-flash-2.0. You get 103B total parameters worth of knowledge capacity, but inference cost stays proportional to 6.1B active parameters — matching roughly 40B dense model performance. (A toy sketch of this routing idea follows the post.)

2. The training pipeline
Three stages:
→ Continual pre-training on medical corpora (encyclopedias, web text, academic publications)
→ SFT with mixed general + clinical instruction data
→ GRPO-based reinforcement learning with task-specific reward models for safety, diagnostic reasoning, and hallucination reduction

3. Inference numbers
→ 200+ tokens/s on H20 hardware
→ ~3× faster than a 36B dense model
→ 128K context length via YaRN extrapolation
→ FP8 + EAGLE3 boosts throughput over FP8 alone: +71% on HumanEval, +45% on GSM8K, +94% on Math-500

4. Benchmark results
→ #1 open-source on OpenAI's HealthBench — also surpasses several proprietary models
→ Top-level on MedAIBench (China's national medical AI benchmark)
→ #1 overall on MedBench across all 5 dimensions: knowledge QA, language understanding, language generation, complex reasoning, and safety & ethics

Full analysis: marktechpost.com/2026/05/12/mee…
Model weights on HF: huggingface.co/MedAIBase/AntA…
GitHub Repo: github.com/MedAIBase/AntA…
Technical details: modelscope.cn/models/MedAIBa…

@AntGroup #OpenSource #llm #medicalai
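The 1/32 activation-ratio claim is the standard MoE trick: total parameters sit in many experts, but each token routes to only a small subset of them. Here's a toy PyTorch illustration of the idea; the expert count, top-k, and sizes are made up for the example, not AntAngelMed's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy top-k MoE layer: 32 experts of capacity, 1 active per token (1/32)."""
    def __init__(self, d=64, n_experts=32, top_k=1):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
            for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d)
        w, idx = self.router(x).topk(self.top_k, dim=-1)
        w = F.softmax(w, dim=-1)
        out = torch.zeros_like(x)
        # Only the routed experts run, so per-token compute tracks
        # top_k / n_experts of the layer's total parameters.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += w[mask, k:k + 1] * expert(x[mask])
        return out

y = TinyMoE()(torch.randn(10, 64))  # 10 tokens, each touching 1 of 32 experts
```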






Meta just made byte-level LLMs 92% cheaper to run at inference.

No tokenizer. No subword vocabulary. Just raw bytes — and now, parallel generation.

Here's how BLT-Diffusion works:

> Standard BLT generates 1 byte at a time (slow)
> BLT-D generates a full block of bytes in parallel per step
> BLT-S uses BLT's own decoder as a speculative drafter — no extra model
> BLT-DV drafts via diffusion, verifies autoregressively — same weights

Result: up to 92% memory-bandwidth reduction vs BLT. Translation quality holds. A toy draft-and-verify loop is sketched below.

Full analysis: marktechpost.com/2026/05/11/met…
Paper: arxiv.org/pdf/2605.08044

@AIatMeta @JulieKallini @ArtidoroPagnoni @TomLimi @gargighosh @LukeZettlemoyer @XiaochuangHan @sriniiyer88 @ChrisGPotts @stanfordnlp
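For intuition on the BLT-DV draft-then-verify pattern, here's a toy Python loop: a block drafter proposes bytes in parallel, and an autoregressive head greedily accepts them until the first mismatch. `draft_block` and `next_byte_probs` are hypothetical stand-ins, and greedy acceptance is a simplification of real speculative decoding.

```python
import torch

def generate_block(prefix, draft_block, next_byte_probs, block=8):
    """Draft a block of bytes in parallel, then verify it byte by byte."""
    draft = draft_block(prefix, block)              # (block,) proposed bytes
    accepted, ctx = [], list(prefix)
    for b in draft.tolist():
        probs = next_byte_probs(torch.tensor(ctx))  # (256,) next-byte dist.
        choice = probs.argmax().item()
        accepted.append(choice)
        ctx.append(choice)
        if choice != b:
            break          # first rejection: keep verifier's byte, redraft
    return accepted

# Stand-ins to exercise the loop (random drafter, uniform verifier):
drafter = lambda p, n: torch.randint(0, 256, (n,))
verifier = lambda ctx: torch.full((256,), 1 / 256)
print(generate_block([72, 105], drafter, verifier))
```

The real saving comes from verifying the whole drafted block in one parallel forward pass instead of byte by byte; this sketch verifies sequentially only to stay short.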



Feedforward layers account for 80%+ of LLM compute — and for any given token, most of that computation lands on zero-value activations.

Sakana AI and NVIDIA researchers released TwELL and a set of CUDA kernels that finally make that sparsity exploitable on modern GPUs.

Here's the part that is very interesting: sparse ops have mostly run slower than dense ops on NVIDIA GPUs. The overhead of converting activations to sparse format cancelled every theoretical saving. That's the paradox this new research fixes.

Here's the breakdown:

→ TwELL (Tile-wise ELLPACK): A new sparse format built directly into the matmul kernel epilogue. No extra kernel launch. No extra global memory read. No synchronization overhead.

→ Fused inference kernel: Takes gate activations in TwELL format and performs up + down projections together. The hidden state is never written to global memory.

→ Hybrid sparse format for training: Routes rows into compact ELL or a dense backup dynamically — handles the non-uniform sparsity patterns that make training hard, without becoming brittle.

→ The training recipe: Two changes only — replace SiLU with ReLU, add L1 regularization at coefficient 2×10⁻⁵. Same LR, same optimizer, same batch size. (Sketched in code after the post.)

→ 2B model results on H100 PCIe:
🟢 +20.5% inference throughput
🟢 +21.9% training step throughput
🟢 −17.0% energy per token
🟢 Accuracy: 49.1% dense → 48.8% sparse

→ It scales the right way: Average non-zero activations drop from 39 (0.5B) to 24 (2B). Gains grow with model size — not shrink.

All kernels are open and released. So, basically, it's not about smaller models. It's about skipping the computation that was always wasted.

Full analysis with visuals/guide: marktechpost.com/2026/05/11/sak…
Paper: arxiv.org/pdf/2603.23198
Repo: github.com/SakanaAI/spars…
Technical details: pub.sakana.ai/sparser-faster…

@SakanaAILabs @NVIDIAAI @nvidia
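The training recipe is simple enough to show in a few lines. Below is a sketch of a gated FFN with the two changes applied (ReLU in place of SiLU, plus an L1 penalty on gate activations at 2e-5). The layer shape is a generic gated MLP of my choosing, and the TwELL CUDA kernels themselves are not reproduced here.

```python
import torch
import torch.nn as nn

class SparseGatedFFN(nn.Module):
    """Gated MLP with the two-change sparsity recipe from the post."""
    def __init__(self, d=512, hidden=2048, l1_coeff=2e-5):
        super().__init__()
        self.gate = nn.Linear(d, hidden, bias=False)
        self.up = nn.Linear(d, hidden, bias=False)
        self.down = nn.Linear(hidden, d, bias=False)
        self.l1_coeff = l1_coeff
        self.l1_penalty = torch.tensor(0.0)

    def forward(self, x):
        # Change 1: ReLU instead of SiLU, so gate activations hit exact zero.
        g = torch.relu(self.gate(x))
        # Change 2: L1 on gate activations; add it to the task loss each step.
        self.l1_penalty = self.l1_coeff * g.abs().sum()
        return self.down(g * self.up(x))

ffn = SparseGatedFFN()
y = ffn(torch.randn(4, 512))
loss = y.pow(2).mean() + ffn.l1_penalty  # toy task loss + sparsity penalty
```

Everything else (LR, optimizer, batch size) stays as in the dense baseline, per the post; the exact zeros that ReLU plus L1 produce are what the TwELL kernels then skip at matmul time.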
