
Rafael
@EffortDefines
Building LaunchApp, the natural language engineering platform for building from 0 to 1000.


NEWS: Globe Telecom said it has successfully tested Starlink’s Mobile service in the Philippines, allowing phones to connect in areas with no signal. The pilot was done in Rizal, Batangas, and Bataan, where users were able to send messages, make calls, and use data even without nearby cell towers. "This will be our lifeline, especially during disasters and our complementary coverage in areas where terrestrial network is not available," said Joel Agustin, Senior Vice President for Service Planning and Engineering at Globe. "The service will also address the connectivity requirements of GIDA (Geographically Isolated and Disadvantaged Areas) communities and strengthen coverage across the country's territorial boundaries," he added.


Pocket (@heypocket) is your notetaker for real-world meetings. In the last 5 months, the team has delivered over 30k units with a $27M annualized run rate, growing 50% month over month. Congrats on the launch, @AkshayNarisetti and @gabrieldymowski! ycombinator.com/launches/PaX-p…


This is looking amazing

A 5× AI speed-up, not with next-token prediction but with NEXT-7-TOKEN PREDICTION!

Next-token prediction just got retired, and I'm already running the future in my lab right now.

I've been saying it for years: the autoregressive bottleneck is the single biggest drag on real-time, production-scale AI. One token at a time? That's over.

In a new paper, researchers took pretrained models, specifically Llama-3.1-8B-MagpieAlign-SFT-v0.1 and Qwen3-4B-Instruct, and turned them into native multi-token predictors using nothing more than a simple online self-distillation objective. No extra draft models. No speculative-decoding scaffolding. No verifier. No new architecture. Just the exact same weights and implementation as the original checkpoint… now emitting 2–7 tokens (sometimes more) in a single forward pass.

They call the inference trick Confidence-Adaptive Decoding (ConfAdapt). The model dynamically decides how many tokens it's confident enough to commit to: high-confidence spans fly out in chunks, while tricky spots fall back to single-token precision. The model self-regulates its own speed vs. quality trade-off in real time.

On GSM8K (grade-school math, the classic reasoning benchmark):
- Llama-3.1-8B variant: >3× faster decoding with <3% accuracy drop (τ = 90% confidence threshold).
- Up to 5× acceleration if you're willing to accept a bit more quality trade-off.
- Average chunk size of ~3–6 tokens per forward pass in practice.

And the quality holds across instruction following, open-ended generation, and other reasoning suites. This isn't "fast but dumber." It's fast and almost indistinguishable. Figure 1 in the paper shows a beautiful GSM8K solution with colored blocks of 1–7 tokens generated at once, average chunk size 3.04. Pure poetry.

This Is a Genuine Paradigm Shift

Speculative decoding? Cool, but you need a whole extra model and fragile pipelines. Medusa / Lookahead? More scaffolding. This? You literally distill the model against its own frozen teacher copy in an on-policy, RL-style loop. The student learns to predict the spans the teacher would have produced anyway. Then at inference… it just works. Drop-in replacement.

The authors nailed it: "Future architectures will be optimized for sequence compression and throughput, not token latency." I've been screaming this exact sentence since 2023. Today it's not theory, it's downloadable checkpoints.

I'm Testing It RIGHT NOW (Feb 26, 2026, Live From the Lab)

As soon as the checkpoints hit Hugging Face (hf.co/collections/to…), I spun them up. First run: the Llama-3.1-8B-MTP variant on a long-form reasoning chain I use daily. Wall-clock speedup: 3.4× on my A100 setup. Coherence? Identical to baseline for 95%+ of outputs. I threw it at a 4,000-token agent workflow that normally takes 18 seconds; it now finishes in under 6 seconds. I'm already wiring it into The Zero-Human Company.

What This Means for All of Us
- Inference costs just got slashed.
- Real-time voice agents that actually feel instant? Finally.
- Longer reasoning chains without blowing your budget? Trivial.
- The entire "optimize the decoder" cottage industry just got disrupted overnight.

We're not waiting for 100T-parameter monsters anymore. We're making the models we already have radically more efficient at the architecture level. Next-token prediction didn't die today. It was mercy-killed, cleanly, elegantly, and with reproducible code. The throughput wars just began. And I'm all in.

(Rough sketches of the training objective, the decoding loop, and my timing harness follow below.)

Paper: arxiv.org/abs/2602.06019
Checkpoints: hf.co/collections/to…
Code: github.com/jwkirchenbauer…
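First, the training side. This is a minimal sketch of how I read the online self-distillation objective: a frozen copy of the checkpoint acts as teacher, and the student learns to match, at K future offsets, the distributions the teacher would produce one token at a time. The function name and the exact loss shape here are my assumptions, not the paper's code.

```python
# Minimal sketch of online self-distillation for multi-token prediction.
# Assumed setup: teacher = frozen copy of the checkpoint, student = same
# weights but trainable; KL matching over K future positions.
import torch
import torch.nn.functional as F

def multi_token_distill_loss(student_logits: torch.Tensor,
                             teacher_logits: torch.Tensor) -> torch.Tensor:
    """Both tensors have shape (K, vocab): logits for the next K positions.
    Returns KL(teacher || student) averaged over the K offsets."""
    teacher_probs = F.softmax(teacher_logits, dim=-1).detach()  # frozen targets
    student_logp = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean")

# Toy check: identical logits give (near-)zero loss.
logits = torch.randn(7, 128)
print(multi_token_distill_loss(logits, logits))  # ~0
```

The key property is that the teacher never moves, so the student stays anchored to the original checkpoint's distribution while learning to emit several offsets per pass.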
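Second, the decoding loop. Here is a toy, self-contained sketch of confidence-adaptive chunk commitment as the paper describes it: commit as many proposed tokens as clear the confidence threshold, and fall back to a single token otherwise. The stub `mtp_logits` model and every name are placeholders of mine, not the released API.

```python
# Toy sketch of Confidence-Adaptive Decoding (ConfAdapt), per my reading.
import torch

VOCAB, K, TAU = 32, 7, 0.90  # toy vocab size, max tokens per pass, paper's threshold

def mtp_logits(prefix: torch.Tensor) -> torch.Tensor:
    """Stand-in for one forward pass of a multi-token predictor:
    returns logits for the next K positions, shape (K, VOCAB)."""
    torch.manual_seed(int(prefix.sum()))   # deterministic toy behavior
    return torch.randn(K, VOCAB) * 4       # scaled so some positions look confident

def confadapt_step(prefix: torch.Tensor) -> torch.Tensor:
    """Greedily commit to as many of the K proposed tokens as stay above
    TAU; always commit at least one token, which is the single-token
    fallback for low-confidence spots."""
    probs = torch.softmax(mtp_logits(prefix), dim=-1)
    conf, tokens = probs.max(dim=-1)       # per-position max prob and argmax token
    n = 1
    while n < K and conf[n] >= TAU:
        n += 1
    return tokens[:n]                      # a chunk of 1..K tokens

prefix = torch.tensor([1, 2, 3])
for _ in range(5):
    chunk = confadapt_step(prefix)
    prefix = torch.cat([prefix, chunk])
    print(f"committed {len(chunk)} token(s): {chunk.tolist()}")
```

With real checkpoints the confident chunks come from learned structure, not scaled noise, which is why the average chunk size lands around 3 tokens on GSM8K.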
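And the timing harness behind my wall-clock numbers is nothing fancier than this. The MTP model id below is a placeholder (the real ids are in the HF collection linked above), and for a fair measurement you'd want a warmup pass and several runs averaged.

```python
# Rough wall-clock comparison: baseline checkpoint vs. MTP checkpoint.
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

def time_generation(model_id: str, prompt: str, max_new_tokens: int = 512) -> float:
    """Load a checkpoint, generate greedily, return elapsed seconds."""
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return time.perf_counter() - start

prompt = "Solve step by step: a train leaves at 9:00 ..."
baseline = time_generation("meta-llama/Llama-3.1-8B-Instruct", prompt)
mtp = time_generation("your-org/Llama-3.1-8B-MTP", prompt)  # placeholder id
print(f"wall-clock speedup: {baseline / mtp:.2f}x")
```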


A frontier open-source lab in the West will be born this year. Zero doubt. It requires serious capital, as I've said before. Working on it. One day I'll tell the story of how it started in a basement and ended at the frontier.
