Marktechpost AI

13.1K posts


@Marktechpost

🐝 AI Dev News Platform (1 million+ monthly traffic) | 150k+ AI subreddit | Contact: [email protected]

What is trending in AI? · Joined April 2016
1.1K Following · 11.1K Followers
Marktechpost AI@Marktechpost·
Best AI Agents for Software Development Ranked: A Benchmark-Driven Look at the Current Field

The AI coding agent field in 2026 has a clear leader — Claude Code on Opus 4.7 at 87.6% SWE-bench Verified — but every ranking comes with a caveat: OpenAI declared that benchmark contaminated in February 2026 and stopped reporting it. Beyond Claude Code, GPT-5.5 tops Terminal-Bench at 82.7%, making Codex the pick for DevOps workflows; Cursor leads on IDE-native daily development; Gemini CLI delivers frontier performance for free; and open-source tools like OpenHands and Cline close the quality gap at zero platform cost. Most productive developers run all three layers in parallel: a terminal agent for complex work, an IDE extension for daily editing, and an open-source tool for flexibility — no single agent dominates all three.

Full read: marktechpost.com/2026/05/15/bes…

@AnthropicAI @claudeai @OpenAI @cursor_ai @GeminiApp @github @cognition @OpenHandsDev @augmentcode @cline #coding #ai
Marktechpost AI@Marktechpost·
A 103B medical LLM just got open sourced — and it only activates 6.1B parameters at inference time.

Meet AntAngelMed — a 103B-parameter medical LLM that activates only 6.1B parameters at inference time. Here's what's actually interesting:

1. The architecture
It uses a 1/32 activation-ratio MoE built on Ling-flash-2.0. You get 103B total parameters worth of knowledge capacity, but inference cost stays proportional to 6.1B active parameters — matching roughly 40B dense model performance.

2. The training pipeline
Three stages:
→ Continual pre-training on medical corpora (encyclopedias, web text, academic publications)
→ SFT with mixed general + clinical instruction data
→ GRPO-based reinforcement learning with task-specific reward models for safety, diagnostic reasoning, and hallucination reduction

3. Inference numbers
→ 200+ tokens/s on H20 hardware
→ ~3× faster than a 36B dense model
→ 128K context length via YaRN extrapolation
→ FP8 + EAGLE3 boosts throughput over FP8 alone: +71% on HumanEval, +45% on GSM8K, +94% on Math-500

4. Benchmark results
→ #1 open-source on OpenAI's HealthBench — also surpasses several proprietary models
→ Top-level on MedAIBench (China's national medical AI benchmark)
→ #1 overall on MedBench across all 5 dimensions: knowledge QA, language understanding, language generation, complex reasoning, and safety & ethics

Full analysis: marktechpost.com/2026/05/12/mee…
Model weights on HF: huggingface.co/MedAIBase/AntA…
GitHub Repo: github.com/MedAIBase/AntA…
Technical details: modelscope.cn/models/MedAIBa…

@AntGroup #OpenSource #llm #medicalai
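Below is a minimal PyTorch sketch of what a low activation-ratio MoE feedforward block looks like. It is a toy illustration of the 1/32 idea described above, not AntAngelMed's implementation; the layer sizes, 32-expert count, and top-1 routing here are assumptions chosen only to mirror the ratio.

```python
# Toy sketch (not AntAngelMed's code): a sparse MoE feedforward layer where a router
# picks top-k of n_experts experts per token, so only ~k/n_experts of the FFN
# parameters are touched per token (the "1/32 activation ratio" idea).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoEFFN(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=32, top_k=1):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                          # x: [tokens, d_model]
        scores = self.router(x)                    # [tokens, n_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):     # dispatch tokens routed to expert e
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

x = torch.randn(8, 1024)
y = ToyMoEFFN()(x)                                 # only 1 of 32 expert FFNs runs per token
print(y.shape)
```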
Marktechpost AI@Marktechpost·
Most real-time AI is a turn-based LLM with voice-activity detection bolted on. That's not an interaction model — and Thinking Machines Lab just drew a very clear line between the two.

They introduced a research preview of TML-Interaction-Small — a 276B MoE model with 12B active parameters built around a multi-stream, time-aligned micro-turn architecture that processes 200ms chunks of audio, video, and text simultaneously, with no external turn-detection scaffolding anywhere in the stack.

Here's what's actually interesting:
→ Full-duplex interaction and asynchronous background reasoning running in parallel, sharing full conversation context
→ Audio as dMel, video as 40×40 hMLP patches, flow head decoder — all co-trained from scratch with the transformer
→ FD-bench v1.5: 77.8 vs. 47.8 for GPT-realtime-2.0
→ Charades mIoU (visual proactivity): 32.4 vs. 0 for GPT-realtime-2.0

The core bet: train interactivity into the weights, not the pipeline.

Full analysis: marktechpost.com/2026/05/13/mir…
Technical Details: thinkingmachines.ai/blog/interacti…

@thinkymachines @miramurati
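For intuition about what a 200ms micro-turn is at the input level, here is a tiny Python sketch that slices an audio stream into 200ms chunks. It only illustrates the chunking granularity; the sample rate and padding choice are assumptions, and everything downstream (dMel features, video patches, the model itself) is omitted.

```python
# Toy sketch of the "micro-turn" framing (an illustration of the concept, not TML's stack):
# cut an incoming audio stream into fixed 200 ms chunks so the model can consume audio,
# video, and text streams step by step instead of waiting for a full user turn to end.
import numpy as np

SAMPLE_RATE = 16_000                               # assumed; the real rate is not stated
CHUNK_MS = 200
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000     # 3,200 samples per micro-turn

def micro_turns(audio: np.ndarray):
    """Yield consecutive 200 ms audio chunks (zero-padding the last one)."""
    for start in range(0, len(audio), CHUNK_SAMPLES):
        chunk = audio[start:start + CHUNK_SAMPLES]
        if len(chunk) < CHUNK_SAMPLES:
            chunk = np.pad(chunk, (0, CHUNK_SAMPLES - len(chunk)))
        yield chunk

# 1.5 s of fake audio -> 8 micro-turns of 200 ms each
stream = np.random.randn(SAMPLE_RATE * 3 // 2).astype(np.float32)
print(sum(1 for _ in micro_turns(stream)))         # 8
```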
Marktechpost AI@Marktechpost·
Why are we still running 7B–27B autoregressive decoder models for what is fundamentally a text classification problem?

Fastino Labs Open-Sources GLiGuard: A 300M Parameter Safety Moderation Model That Matches or Exceeds the Accuracy of Models 23–90x Its Size

It is a 300M parameter safety moderation model that runs 16x faster than the current generation of guardrail models. Here's what's actually interesting:

1. It's an encoder, not a decoder
Most guardrail models (LlamaGuard4, WildGuard, ShieldGemma) generate safety verdicts autoregressively — one token at a time. That's slow by design. GLiGuard reframes the whole thing as a text classification problem. One forward pass. Done.

2. Four moderation tasks. Zero added latency.
It evaluates all four simultaneously in a single pass:
→ Safety classification (safe / unsafe)
→ Jailbreak strategy detection (11 strategies)
→ Harm category detection (14 categories)
→ Refusal detection (compliance / refusal)
More safety dimensions = no extra compute. That's the architectural win.

3. The benchmark numbers are hard to ignore
→ 87.7 avg F1 on prompt classification — within 1.7 points of the best model (PolyGuard-Qwen at 89.4)
→ 82.7 avg F1 on response classification — second only to Qwen3Guard-8B (84.1)
→ 26ms latency vs. 426ms for ShieldGemma-27B at sequence length 64
→ 133 samples/sec throughput vs. 8.2 at batch size 4
→ Outperforms LlamaGuard4-12B, ShieldGemma-27B, and NemoGuard-8B — all 23–90x larger

4. It runs on a single GPU
At 0.3B parameters, individual developers and smaller teams can deploy and fine-tune it without heavy infrastructure.

Full analysis: marktechpost.com/2026/05/13/fas…
Paper: arxiv.org/pdf/2605.07982
Model weights on HF: huggingface.co/fastino/gligua…
GitHub Repo: github.com/fastino-ai/GLi…
Technical details: pioneer.ai/blog/gliguard-…

@fastinoAI
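To make the "one forward pass, four verdicts" point concrete, here is a toy PyTorch sketch of an encoder with four parallel classification heads. This is not GLiGuard's actual architecture or label taxonomy; the encoder depth, pooling, and head sizes are assumptions; it only mirrors the single-pass multi-task structure described above.

```python
# Toy sketch of the single-pass, multi-head idea (not GLiGuard's code): one encoder
# forward pass feeds four independent classification heads, so adding safety dimensions
# adds only tiny linear layers, not extra decoding steps.
import torch
import torch.nn as nn

class MultiHeadModerator(nn.Module):
    def __init__(self, hidden=768):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.heads = nn.ModuleDict({
            "safety":    nn.Linear(hidden, 2),     # safe / unsafe
            "jailbreak": nn.Linear(hidden, 11),    # 11 strategies
            "harm":      nn.Linear(hidden, 14),    # 14 categories
            "refusal":   nn.Linear(hidden, 2),     # compliance / refusal
        })

    def forward(self, token_embeddings):           # [batch, seq, hidden]
        pooled = self.encoder(token_embeddings).mean(dim=1)   # one forward pass
        return {name: head(pooled) for name, head in self.heads.items()}

x = torch.randn(4, 64, 768)                        # pretend embedded batch
logits = MultiHeadModerator()(x)
print({k: v.shape for k, v in logits.items()})     # all four verdicts, zero added passes
```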
Marktechpost AI@Marktechpost·
Most LLM pre-training efficiency work either changes the tokenizer, the architecture, or the inference behavior. Nous Research just showed you don't have to touch any of them.

They released Token Superposition Training (TST) — a two-phase modification to the standard pre-training loop that averages s contiguous token embeddings into a single latent s-token in Phase 1, trains with a multi-hot cross-entropy loss against the next bag of tokens, then reverts to standard next-token prediction in Phase 2 from the same checkpoint, with the TST code fully removed.

Here's what's actually interesting:
→ Each TST step is kept equal-FLOPs to baseline by increasing data sequence length by s× — not the batch size
→ 3B dense: loss 2.676 in 247 B200-hrs vs 443 B200-hrs for baseline at matched loss (~1.8x faster)
→ 10B-A1B MoE: 4,768 B200-hrs vs 12,311 B200-hrs at matched loss (~2.5x faster)
→ Optimal range: bag size s ∈ [3–8] at 270M, s ∈ [6–10] at 600M, s = 16 at 10B; step ratio r ∈ [0.2, 0.4]
→ Re-initializing the embedding or LM head at the phase boundary breaks it entirely — loss went from 2.676 to 2.938, worse than the 2.808 baseline

Full analysis: marktechpost.com/2026/05/13/nou…
Paper: arxiv.org/pdf/2605.06546
Project page: nousresearch.com/token-superpos…

@NousResearch
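Here is a minimal, self-contained sketch of the Phase-1 objective as described above: average each bag of s consecutive token embeddings into one input vector and train every position against a multi-hot target over the next bag. This is an illustration, not Nous Research's code; the toy trunk (a GRU stand-in for the transformer), the uniform multi-hot target, and the loss normalization are assumptions.

```python
# Toy Phase-1 TST step: bag-averaged inputs, multi-hot cross-entropy over the next bag.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D, S = 1000, 64, 4                          # S = bag size s

embed = nn.Embedding(VOCAB, D)
trunk = nn.GRU(D, D, batch_first=True)             # stand-in for the transformer trunk
lm_head = nn.Linear(D, VOCAB)

token_ids = torch.randint(0, VOCAB, (2, 32))       # [batch, seq_len], seq_len % S == 0
bags = token_ids.view(2, -1, S)                    # [batch, n_bags, S]
bag_emb = embed(bags).mean(dim=2)                  # average S embeddings -> one latent "s-token"

hidden, _ = trunk(bag_emb)
logits = lm_head(hidden[:, :-1])                   # each position predicts the *next* bag

target = torch.zeros_like(logits)
target.scatter_(-1, bags[:, 1:], 1.0)              # multi-hot over the tokens of the next bag
target = target / target.sum(dim=-1, keepdim=True)
loss = -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
print(loss.item())                                 # Phase 2 just switches back to standard next-token CE
```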
Marktechpost AI@Marktechpost·
Supertone just released Supertonic v3 — an on-device text-to-speech model that runs entirely via ONNX Runtime, no cloud, no API call.

Here's what's actually interesting:

1. 31 languages, ~99M parameters
→ v2 had 5 languages at 66M params
→ v3 adds 26 more languages including Japanese, Arabic, German, Hindi, Russian, Turkish, Vietnamese, and more
→ Total ONNX asset size: 404 MB
→ Still smaller than 0.7B–2B class open TTS systems

2. v2-compatible ONNX interface
→ Existing integrations upgrade to v3 without changing inference code
→ Same 4 ONNX files: duration_predictor, text_encoder, vector_estimator, vocoder

3. New expression tags
→ Used inline in the text
→ No separate model, no preprocessing needed

4. Text normalization that actually works
→ Tested against ElevenLabs Flash v2.5, OpenAI TTS-1, Gemini 2.5 Flash TTS, and Microsoft
→ All four failed on financial expressions ($5.2M), phone numbers, dates, and technical units (2.3h, 30kph)
→ Supertonic passed all four categories

5. Runs on CPU, no GPU required
→ Competitive WER/CER range vs. VoxCPM2 across supported languages
→ Demo live on Hugging Face: Supertone/supertonic-3

Install: pip install supertonic

Full analysis: marktechpost.com/2026/05/15/sup…
GitHub Repo: github.com/supertone-inc/…
HF Space: huggingface.co/spaces/Superto…

@Supertone_ai
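As a starting point for poking at the release, here is a small ONNX Runtime sketch that loads the four components listed above on CPU and prints their input signatures. The local file paths are hypothetical and the components' actual inputs/outputs are not documented here, so treat this as exploration code rather than the official Supertonic inference pipeline (that ships with pip install supertonic).

```python
# Minimal exploration sketch: load the four ONNX components named above with ONNX Runtime
# on CPU and inspect their inputs. Paths are hypothetical; see the Supertonic repo for
# the real inference code.
import onnxruntime as ort

MODEL_FILES = {
    "duration_predictor": "supertonic_v3/duration_predictor.onnx",
    "text_encoder":       "supertonic_v3/text_encoder.onnx",
    "vector_estimator":   "supertonic_v3/vector_estimator.onnx",
    "vocoder":            "supertonic_v3/vocoder.onnx",
}

sessions = {
    name: ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    for name, path in MODEL_FILES.items()
}

# Inspect what each component expects before wiring up the pipeline.
for name, sess in sessions.items():
    inputs = [(i.name, i.shape, i.type) for i in sess.get_inputs()]
    print(name, inputs)
```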
Nous Research@NousResearch·
Today we release Token Superposition Training (TST), a modification to the standard LLM pretraining loop that produces a 2-3× wall-clock speedup at matched FLOPs without changing the model architecture, optimizer, tokenizer, or training data.

During the first third of training, the model reads and predicts contiguous bags of tokens, averaging their embeddings on the input side and predicting the next bag with a modified cross-entropy on the output side. For the remainder of the run, it trains normally on next-token prediction. The inference-time model is identical to one produced by conventional pretraining.

Validated at 270M, 600M, and 3B dense scales, and at 10B-A1B MoE.

The work on TST was led by @bloc97_, @gigant_theo, and @theemozilla.
Marktechpost AI retweeted
Hugging Face@huggingface·
We've just hit 1M open datasets on the Hugging Face Hub 🎉 Open models need open data. Today we hit that milestone, together with the most incredible community in AI! 🤗 Onwards to the next million 🚀
ModelScope@ModelScope2022·
The world's first open-source 100B medical LLM is here 🏥. Try AntAngelMed free on ModelScope now!
👉 Demo: modelscope.cn/studios/MedAIB…

Built by Ant Group and Zhejiang Provincial Health Commission. Ranks #1 on HealthBench, MedAIBench, and MedBench — beating all open-source models and several top closed-source ones. 100B total params, only 6.1B active — matches ~40B dense model performance at 200+ tokens/s. 3-stage training with GRPO-based RL for empathy, clinical reasoning, and safety.

🤖 Download model: modelscope.cn/models/MedAIBa…
Marktechpost AI retweeted
Marktechpost AI@Marktechpost·
Feedforward layers account for 80%+ of LLM compute — and for any given token, most of that computation lands on zero-value activations. Sakana AI and the NVIDIA research team released TwELL and a set of CUDA kernels that finally make that sparsity exploitable on modern GPUs.

Here's the part that is very interesting: sparse ops have mostly run slower than dense ops on NVIDIA GPUs. The overhead from converting activations to sparse format cancelled every theoretical saving. That's the paradox this new research fixes.

Here's the breakdown:
→ TwELL (Tile-wise ELLPACK): A new sparse format built directly into the matmul kernel epilogue. No extra kernel launch. No extra global memory read. No synchronization overhead.
→ Fused inference kernel: Takes gate activations in TwELL format and performs up + down projections together. The hidden state is never written to global memory.
→ Hybrid sparse format for training: Routes rows into compact ELL or dense backup dynamically — handles the non-uniform sparsity patterns that make training hard without becoming brittle.
→ The training recipe: Two changes only — replace SiLU with ReLU, add L1 regularization at coefficient 2×10⁻⁵. Same LR, same optimizer, same batch size.
→ 2B model results on H100 PCIe:
🟢 +20.5% inference throughput
🟢 +21.9% training step throughput
🟢 −17.0% energy per token
🟢 Accuracy: 49.1% dense → 48.8% sparse
→ It scales the right way: Average non-zero activations drop from 39 (0.5B) to 24 (2B). Gains grow with model size — not shrink.

All kernels are open and released. So, basically, it's not about smaller models. It's about skipping the computation that was always wasted.

Full Analysis with Visuals/Guide: marktechpost.com/2026/05/11/sak…
Paper: arxiv.org/pdf/2603.23198
Repo: github.com/SakanaAI/spars…
Technical details: pub.sakana.ai/sparser-faster…

@SakanaAILabs @NVIDIAAI @nvidia
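The training recipe above is simple enough to sketch directly. The PyTorch snippet below shows the two changes on a plain (non-gated) FFN: swap SiLU for ReLU and add an L1 penalty on the post-activation hidden state at coefficient 2e-5. It is an illustration of the recipe, not Sakana/NVIDIA's code; the paper's models use gated FFNs, and the exact penalty normalization here is an assumption.

```python
# Sketch of the two-change recipe: ReLU instead of SiLU, plus mild L1 on activations.
import torch
import torch.nn as nn

L1_COEFF = 2e-5                                    # coefficient quoted above

class ReLUFFN(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        h = torch.relu(self.up(x))                 # change 1: SiLU -> ReLU (exact zeros appear)
        self.l1_penalty = L1_COEFF * h.abs().mean()  # change 2: L1 pressure on activations
        return self.down(h)

ffn = ReLUFFN()
x = torch.randn(8, 1024)
out = ffn(x)
task_loss = out.pow(2).mean()                      # stand-in for the real LM loss
loss = task_loss + ffn.l1_penalty                  # same LR, optimizer, batch size otherwise
loss.backward()
zero_frac = (torch.relu(ffn.up(x)) == 0).float().mean().item()
print(f"zero fraction in h: {zero_frac:.1%}")
```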
Julie Kallini ✨@JulieKallini·
Fast Byte Latent Transformer is accepted to ICML 2026! ⚡🥪 Byte-level LMs promise to free us from subword tokenizers, but decoding one byte at a time is super slow. We make BLT generation more efficient with BLT-D: text diffusion for parallel byte decoding. 1/
Marktechpost AI@Marktechpost·
Meta just made byte-level LLMs 92% cheaper to run at inference. No tokenizer. No subword vocabulary. Just raw bytes — and now, parallel generation.

Here's how BLT-Diffusion works:
> Standard BLT generates 1 byte at a time (slow)
> BLT-D generates a full block of bytes in parallel per step
> BLT-S uses BLT's own decoder as a speculative drafter — no extra model
> BLT-DV drafts via diffusion, verifies autoregressively — same weights

Result: up to 92% memory-bandwidth reduction vs BLT. Translation quality holds.

Full analysis: marktechpost.com/2026/05/11/met…
Paper: arxiv.org/pdf/2605.08044

@AIatMeta @JulieKallini @ArtidoroPagnoni @TomLimi @gargighosh @LukeZettlemoyer @XiaochuangHan @sriniiyer88 @ChrisGPotts @stanfordnlp
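To see the shape of the BLT-S / BLT-DV idea (draft a block cheaply, verify it autoregressively with the same weights), here is a generic draft-then-verify toy in Python. It is not Meta's algorithm or code: the drafter and verifier are random stand-ins and the acceptance rule is a simplified placeholder; it only illustrates why accepted draft bytes amortize the per-byte autoregressive cost.

```python
# Generic draft-then-verify toy (not BLT): propose a block of bytes, then check them
# one by one against an autoregressive verifier, falling back to the verifier on rejection.
import numpy as np

rng = np.random.default_rng(0)

def draft_block(prefix: bytes, k: int = 8) -> bytes:
    """Cheap drafter: propose k bytes in one shot (random stand-in for the fast path)."""
    return bytes(rng.integers(32, 127, size=k).tolist())

def next_byte_probs(prefix: bytes) -> np.ndarray:
    """Autoregressive verifier: distribution over the next byte (uniform stand-in)."""
    return np.full(256, 1 / 256)

def generate(prefix: bytes, n_bytes: int) -> bytes:
    out = bytearray(prefix)
    while len(out) - len(prefix) < n_bytes:
        for b in draft_block(bytes(out)):          # verify the drafted block left to right
            p = next_byte_probs(bytes(out))
            if rng.random() < min(1.0, p[b] * 256):  # toy acceptance rule, not the real one
                out.append(b)
            else:                                  # first rejection: sample from verifier, redraft
                out.append(int(rng.choice(256, p=p)))
                break
    return bytes(out[: len(prefix) + n_bytes])

print(generate(b"hello ", 16))
```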
Marktechpost AI retweeted
OpenAI@OpenAI·
Today we’re launching the OpenAI Deployment Company to help businesses build and deploy AI. It's majority-owned and controlled by OpenAI. It brings together 19 leading investment firms, consultancies, and system integrators to help organizations deploy frontier AI to production for business impact. openai.com/index/openai-l…
Sakana AI@SakanaAILabs·
How do we make LLMs faster and lighter? Don't force the GPU to adapt to sparsity. Reshape the sparsity to fit the GPU! ⚡️

Excited to share our new #ICML2026 paper in collaboration with @NVIDIA: "Sparser, Faster, Lighter Transformer Language Models". This work introduces new open-source GPU kernels and data formats for faster inference and training of sparse transformer language models:

Paper: arxiv.org/abs/2603.23198
Blog: pub.sakana.ai/sparser-faster…
Code: github.com/SakanaAI/spars…

While LLMs are undoubtedly powerful, they are increasingly expensive to train and deploy, with a large part of this cost coming from their feedforward layers. Yet, an interesting phenomenon occurs inside these layers: For any given token, only a small fraction of the hidden activations actually matter. The rest approximate zero, wasting computation. With ReLU and very mild L1 regularization, this sparsity can exceed 95% with little to no impact on downstream performance.

So, can we leverage this sparsity to make LLMs faster? The challenge is hardware. Modern GPUs are optimized for dense matrix multiplications. Traditional sparse formats introduce irregular memory access and overheads that cancel out their theoretical savings for GEMM operations.

Our contribution is twofold:
1/ We introduce TwELL (Tile-wise ELLPACK), a new sparse packing format designed to integrate directly in the same optimized tiled matmul kernels without disrupting execution.
2/ We develop custom CUDA kernels that fuse multiple sparse matmuls to maximize throughput and compress TwELL to a hybrid representation that minimizes activation sizes.

We used our kernels to train and benchmark sparse LLMs at billion-parameter scales, demonstrating >20% speedups and even higher savings in peak memory and energy.

This work will be presented at #ICML2026. Please check out our blog and technical paper for a deep dive!
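For readers who haven't met ELLPACK-style formats, here is a tiny NumPy sketch of the basic packing idea: each row's non-zeros become fixed-width value and column-index arrays that a GPU can stride over regularly. This is only the generic ELLPACK concept; TwELL's tile-wise layout, the kernel-epilogue integration, and the hybrid compression are described in the paper and are not reproduced here.

```python
# Generic ELLPACK packing sketch (not the TwELL layout): pad each row's non-zeros to a
# fixed width so the sparse matrix becomes two dense, regular arrays.
import numpy as np

def ellpack(mat: np.ndarray):
    """Pack each row's non-zeros into fixed-width values/indices arrays (zero-padded)."""
    rows, _ = mat.shape
    nnz_per_row = (mat != 0).sum(axis=1)
    width = int(nnz_per_row.max())                 # every row padded to the widest row
    values = np.zeros((rows, width), dtype=mat.dtype)
    indices = np.zeros((rows, width), dtype=np.int32)
    for r in range(rows):
        cols = np.flatnonzero(mat[r])
        values[r, : len(cols)] = mat[r, cols]
        indices[r, : len(cols)] = cols
    return values, indices

acts = np.maximum(np.random.randn(4, 16), 0)       # ReLU-style sparse activations
acts[acts < 1.0] = 0                               # exaggerate sparsity for the demo
vals, idx = ellpack(acts)
print(vals.shape, idx.shape)                       # compact, regular, GPU-friendly layout
```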