RunInfra

6 posts

RunInfra

@runinfrai

The inference platform that auto-optimizes your model by @rightnowai_co

Amman · Joined April 2026
5 Following · 65 Followers
Pinned Tweet
RunInfra @runinfrai ·
When you build on a closed API, you get whatever the provider decided is good enough: no kernel optimization, no custom quantization, no control over routing or scaling. Just a fixed model at a fixed price.

RunInfra is built around a different idea. You pick any open-source model and actually customize it for your use case. The agent handles GPU benchmarking, Triton kernel optimization, quantization, speculative decoding, and smart routing, all through a chat interface.

You are building and optimizing your own stack: runinfra.ai
1 reply · 6 retweets · 14 likes · 5.4K views
RunInfra retweeted
Jaber @Akashi203 ·
we just released RightNow-Arabic-0.5B-Turbo, the smallest open Arabic model on HuggingFace. it has 518M params, takes 398MB after quantization, and runs on a phone.

it beats Qwen2.5-0.5B and Falcon-H1-0.5B on Arabic benchmarks, ties Falcon-H1-1.5B on COPA-ar at 1/3 the size, and hits 635 t/s on a single H100.

this is a quick one before we release the full specialized-models series. more is coming soon!

we open sourced the weights and all the ablations on HuggingFace: huggingface.co/RightNowAI/Rig…

you can now deploy it and build a complete inference pipeline on @runinfrai (runinfra.ai)
0 replies · 4 retweets · 13 likes · 746 views
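As a rough sanity check on the sizes quoted above (my own arithmetic, not from the release): raw weight storage scales with bits per parameter, and the quoted 398MB sits between a pure 4-bit and a pure 8-bit footprint, which is what you would expect if some tensors (e.g. embeddings) are kept at higher precision.

```python
# Rough footprint arithmetic for a 518M-parameter model at common
# quantization widths. Illustrative only; the actual storage format of
# RightNow-Arabic-0.5B-Turbo isn't specified in the announcement.

PARAMS = 518_000_000

def size_mb(params: int, bits_per_param: float) -> float:
    """Raw weight storage in megabytes (1 MB = 1e6 bytes)."""
    return params * bits_per_param / 8 / 1e6

fp16 = size_mb(PARAMS, 16)   # ~1036 MB
int8 = size_mb(PARAMS, 8)    # ~518 MB
int4 = size_mb(PARAMS, 4)    # ~259 MB

print(f"fp16: {fp16:.0f} MB, int8: {int8:.0f} MB, int4: {int4:.0f} MB")
```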
RunInfra retweeted
Jaber @Akashi203 ·
runinfra (@runinfrai) isn't just llms anymore! voice, transcription, tts, embeddings: any model you bring, we build you a highly optimized inference pipeline for it.

just tested it with a voice agent. wrote the prompt, got back a fully tuned stt + llm + tts stack.

try it out: runinfra.ai
3 replies · 6 retweets · 19 likes · 1.3K views
RunInfra retweeted
Jaber @Akashi203 ·
been thinking about how wasteful LLM inference is at the token level. every token goes through every layer: "the" gets 32 matmuls, and a hard reasoning step also gets 32 matmuls. same compute for wildly different information content.

always a bit silly, but now it's actually expensive: reasoning models emit thousands of thinking tokens per query, and most are "ok", "so", "wait", "let me".

the fix is sitting right there in the representations. for most tokens the hidden state at ~layer 11 is already nearly identical to the final layer; the remaining layers barely move the output. you just need a cheap per-token signal to notice.

so we built TIDE: tiny MLP routers (~4MB) that sit on a frozen model and predict "has this token converged yet". post-training, no retraining, bolt it onto any HF causal LM. calibration is 2000 wikitext samples, under 3 min on one GPU.

deepseek r1 distill 8B on A100: 100% prefill exit rate, 7.2% lower latency, 99% of decode tokens exit early on a multi-step math problem with the answer unchanged.

8B is the floor. the methodology compounds with depth and output length: 70B+ has ~80 layers of redundancy, and inference-time-scaling models emit 10 to 100x more tokens per query. opus-class + long chain of thought is where the lever gets real.

paper: arxiv.org/abs/2603.21365
code: github.com/RightNow-AI/TI…

(this kind of kernel-level stuff is what we bake into @runinfrai by default, check it out: runinfra.ai)
19 replies · 49 retweets · 478 likes · 27.9K views
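A minimal sketch of the early-exit idea described in that thread. This is not the TIDE code: the convergence signal here is literally cosine similarity against the final hidden state, whereas the paper trains small routers to predict convergence from the intermediate state alone (so the final layers never have to run).

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d = 64
final = rng.normal(size=d)  # toy stand-in for the final-layer hidden state

# "easy" token: its mid-layer state is already nearly the final state
converged = final + 0.01 * rng.normal(size=d)
# "hard" token: its mid-layer state is still far from the final state
hard = rng.normal(size=d)

THRESHOLD = 0.99  # exit early once the representation has stopped moving

def should_exit(hidden: np.ndarray, final_state: np.ndarray,
                thresh: float = THRESHOLD) -> bool:
    # oracle version of TIDE's router: the real method replaces this
    # comparison with a tiny learned MLP that never sees the final layer
    return cosine(hidden, final_state) > thresh

print(should_exit(converged, final))  # True: skip the remaining layers
print(should_exit(hard, final))       # False: keep running layers
```

The same threshold-style decision is what makes 99% of decode tokens exit early while the occasional hard token still gets the full depth.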
RunInfra retweeted
Jaber @Akashi203 ·
we published autokernel on arxiv

inspired by @karpathy's autoresearch, we applied the same keep/revert agent loop to GPU kernel optimization. you give it any pytorch model, it profiles it, ranks bottlenecks by amdahl's law, writes triton or CUDA C++ replacements, and runs 300+ experiments overnight with no human in the loop

- 5.29x over pytorch eager on rmsnorm
- 2.82x on softmax
- beats torch.compile by 3.44x on softmax and 2.94x on cross entropy
- #1 on the vectorsum_v2 B200 leaderboard
- single-prompt triton FP4 matmul that beats CUTLASS by up to 2.15x

every candidate passes a 5-stage correctness harness before any speedup counts, and the whole thing runs at ~40 experiments/hour, so you wake up to a faster model

arxiv: arxiv.org/abs/2603.21331
github: github.com/RightNow-AI/au…
19 replies · 80 retweets · 664 likes · 88.1K views
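The keep/revert loop described above can be sketched like this. A toy simulation, not the autokernel code: `propose`, `check`, and `bench` are hypothetical stand-ins for the agent's codegen, its correctness harness, and the profiler.

```python
import random

def keep_revert_optimize(baseline_ms, propose, check, bench, budget=300):
    """Greedy keep/revert loop: accept a candidate kernel only if it
    passes correctness AND beats the current best; otherwise revert."""
    best_ms = baseline_ms
    kept = 0
    for _ in range(budget):
        candidate = propose()
        if not check(candidate):
            continue                      # failed correctness: revert
        t = bench(candidate)
        if t < best_ms:
            best_ms, kept = t, kept + 1   # keep: new best kernel
        # else: revert, keep the previous best
    return best_ms, kept

# toy stand-ins for codegen, the correctness harness, and profiling
rng = random.Random(42)
propose = lambda: rng.uniform(0.5, 2.0)   # candidate runtime factor
check = lambda c: rng.random() > 0.2      # ~80% pass correctness
bench = lambda c: 10.0 * c                # baseline kernel takes 10 ms

best, kept = keep_revert_optimize(10.0, propose, check, bench)
print(f"best: {best:.2f} ms after keeping {kept} improvements")
```

The key property is monotonicity: a bad or incorrect candidate can never make the model slower or wrong, because the loop always falls back to the previous best, which is what lets it run 300+ unattended experiments overnight.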