Inference

345 posts


@inference_net

Inference Research & Development

San Francisco, CA · Joined March 2024
7 Following · 29.7K Followers
Pinned Tweet
Inference @inference_net
Day Zero fine-tuning & hosting support for Nemotron 3 Super by @nvidia is now live.
Fine-tune on real production traces & deploy on high-performance infrastructure optimized for Nemotron 3 Super.
Your data, your weights, your performance edge.
Learn more: inference.net/blog/nemotron-…
Inference retweeted
george k @gk_kintu
Schematron 3B vs 8B: the 8B is slightly better, as expected, but both are able to ingest 100s of lines of raw, bloated HTML and output perfectly structured JSON exactly matching my Pydantic schema! Full breakdown of the models: youtu.be/F__eg5cvS_A @schematron @inference_net
Inference retweeted
Inference @inference_net
@cyrusnewday @dottxtai No constrained sampling. We've tested schemas with 50+ fields. We're releasing V2 of Schematron in a few weeks, which will be even more powerful. Would love to see if we can help with your use case; feel free to DM @samhogan
Cyrus @cyrusnewday
@inference_net Is the schema compliance using e.g. @dottxtai, or does the model legitimately produce JSON without any constraining? If so, that's really impressive. Curious how big the schemas you tested on were — we have some use cases that require intelligence and can have 100s of structured outputs
Inference @inference_net
We built the Kimi K2 of web extraction: meet Schematron. It's been getting a lot of love from teams we work with. This is what we heard again this week:

"We tested Schematron against smaller models for large-scale HTML schema extraction — it was more accurate and significantly faster. For our discovery endpoint pulling thousands of web pages in parallel, it's the first model that actually works at the quality and latency we need."

In short, Schematron is:
→ 98% of GPT-4.1 quality
→ 100% JSON schema compliance — zero hallucinated fields
→ 128K context — handles full raw HTML, no markdown conversion needed
→ 40–80x cheaper than frontier models
→ 10x faster — 0.54s per page vs 6s for GPT-5
→ Open-source on HuggingFace, runs on Ollama, OpenAI-compatible API

And we're cooking its next version — stay tuned 👀 inference.net/blog/schematro…
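For readers wondering what "100% JSON schema compliance — zero hallucinated fields" means mechanically, here is a minimal sketch of the property being claimed. This is not Schematron's code, and the field names are invented for illustration:

```python
import json

def compliant(output: dict, schema_fields: set) -> bool:
    # "Zero hallucinated fields": the output contains exactly the fields
    # the schema asked for -- nothing invented, nothing missing.
    return set(output) == schema_fields

# Hypothetical product-page schema (illustrative field names).
schema_fields = {"title", "price", "in_stock"}
extracted = json.loads('{"title": "Widget", "price": 9.99, "in_stock": true}')
hallucinated = {"title": "Widget", "price": 9.99, "in_stock": True, "rating": 5}

print(compliant(extracted, schema_fields))     # True
print(compliant(hallucinated, schema_fields))  # False: "rating" was never requested
```

A validator like this is what lets extraction pipelines fail fast on schema drift instead of silently storing invented fields.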
Inference @inference_net
The LLM Engineering Roadmap. If you want to start today, here's the roadmap👇

1️⃣ LLM Foundations
Start by understanding Python and LLM APIs and how they work. Learn prompt engineering, structured outputs, and tool use.
↳ Python/TypeScript Basics
↳ LLM APIs
↳ Prompt Engineering
↳ Structured Outputs
↳ Function Calling

2️⃣ Vector Stores
Before building anything, you need to understand how text becomes vectors. Learn embedding models, chunking strategies, and similarity search.
↳ Embedding Models (OpenAI Ada, Cohere, BGE)
↳ Vector Databases (Pinecone, Qdrant, ChromaDB, FAISS)
↳ Chunking Strategies
↳ Similarity Search

3️⃣ Retrieval-Augmented Generation (RAG)
This is how LLMs answer questions using your data. You learn how to retrieve context and feed it in correctly.
↳ Orchestration Frameworks (LangChain, LlamaIndex)
↳ Ingesting Documents
↳ Retrieval Methods (Dense, BM25, Hybrid)
↳ Reranking
↳ Prompt Templates

4️⃣ Advanced RAG
This step helps you understand how to make RAG reliable and accurate.
↳ Query Transformation
↳ HyDE
↳ Corrective RAG
↳ Self-RAG
↳ Graph RAG

5️⃣ Fine-Tuning
Sometimes prompts are not enough for a specialised use case. Fine-tuning will help you understand how models learn domain-specific behaviour.
↳ Data Preparation
↳ LoRA, QLoRA, DoRA
↳ SFT, DPO, RLHF
↳ Training Tools (Unsloth, Axolotl, HF TRL)

6️⃣ Inference Optimization
Once systems work, they need to be fast and affordable. This step focuses on performance and cost efficiency.
↳ Quantization (GGUF, GPTQ, AWQ)
↳ Serving Engines (vLLM, TGI, llama.cpp)
↳ KV Cache
↳ Flash Attention
↳ Speculative Decoding

7️⃣ Deployment
Models are useless if they stay in notebooks. Here you learn how to ship LLM systems to users.
↳ GPU Scheduling
↳ Cloud Platforms (AWS Bedrock, GCP Vertex AI)
↳ Docker, Kubernetes
↳ FastAPI, Streaming (SSE)

8️⃣ Observability
This step helps you track quality, latency, and cost.
↳ Tracing (LangSmith, Langfuse, Arize Phoenix)
↳ Latency (TTFT)
↳ Token Usage
↳ Cost Tracking

9️⃣ Agents
Agents allow LLMs to plan and use tools. Learn them to understand how LLMs solve multi-step, complex tasks.
↳ Frameworks (LangGraph, CrewAI, Autogen)
↳ Function Calling
↳ Memory Systems
↳ Patterns (ReAct, Plan-and-Execute, Multi-Agent)

🔟 Production & Security
Production LLM systems can fail in subtle ways. This step helps you prevent misuse, outages, and cost spikes.
↳ Prompt Injection Defense
↳ Guardrails (NeMo, Guardrails AI)
↳ Semantic Caching
↳ Fallbacks & Rate Limiting

♻️ Repost if you found this insightful
Follow us for more AI engineering content!
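The similarity-search step from the Vector Stores stage of the roadmap can be sketched in a few lines of plain Python. The vectors below are toy stand-ins for real embeddings, and the document IDs are invented:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product normalized by vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, corpus, k=2):
    # Rank (doc_id, vector) pairs by similarity to the query vector.
    scored = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Toy 3-dimensional "embeddings" (real ones have hundreds of dimensions).
corpus = [
    ("doc_a", [1.0, 0.0, 0.0]),
    ("doc_b", [0.9, 0.1, 0.0]),
    ("doc_c", [0.0, 0.0, 1.0]),
]
print(top_k([1.0, 0.0, 0.0], corpus, k=2))  # ['doc_a', 'doc_b']
```

Vector databases like the ones listed above do exactly this ranking, just with approximate-nearest-neighbor indexes so it scales past brute force.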
Inference retweeted
Sam Hogan 🇺🇸 @samhogan
We're welcoming @mikepollard_dev to @inference_net as our Founding DevRel Engineer! Mike and I won a pitch competition for my first company nearly 7 years ago. Life is long. When you find someone you love to work with, keep them close. You never know when your paths may cross.
Inference @inference_net
You're overpaying by $30,000/month running AI models at scale. Here's why (and how to fix it).

How OpenAI & Anthropic work: per-token pricing.
→ OpenAI (GPT-4o): $2.50 / $10 per million tokens
→ Anthropic (Sonnet 4.5): $3 / $15 per million tokens
At 1M queries/month: $30,000 - $38,000/mo

The problems:
1️⃣ You pay for capabilities you don't use. Frontier models are trained for everything. Your task needs maybe 1% of those capabilities. You're paying for the other 99%.
2️⃣ No economies of scale. Token #1: $0.003. Token #1,000,000: $0.003. Your costs never decrease.
3️⃣ Smaller frontier models and off-the-shelf open-source models mean worse quality. You're forced to choose between paying more or getting worse results.

The solution: dedicated GPUs + specialized models.
Instead of per-token pricing, rent dedicated GPUs at a fixed monthly cost. Then train custom models specialized for your specific task:
→ Distilled from frontier models and large open-source models (GPT-5, Claude, Gemini, Kimi, GLM)
→ Match or exceed frontier quality for your use case
→ 2-3x faster inference
At 1M queries/month: $8,600/mo. That's 71-77% cheaper with no quality sacrifice.

The biggest misconception is that "custom models can't match frontier quality." The reality: when specialized for your task, they can exceed frontier intelligence.

Most teams don't need "the smartest model in the world." They need the smartest model for one job, running on infrastructure they control, at a cost that actually scales.
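The per-token arithmetic above can be reproduced with a small calculator. The token counts per query below are assumptions chosen for illustration (the thread does not state a workload), so the results only roughly bracket the quoted $30k-$38k range:

```python
def monthly_cost(queries, tokens_in, tokens_out, price_in, price_out):
    # Prices are USD per million tokens, per the rates quoted above.
    return queries * (tokens_in * price_in + tokens_out * price_out) / 1e6

# Assumed workload: 4k input + 2k output tokens per query (illustrative).
gpt4o  = monthly_cost(1_000_000, 4_000, 2_000, 2.50, 10.00)
sonnet = monthly_cost(1_000_000, 4_000, 2_000, 3.00, 15.00)
print(gpt4o, sonnet)  # 30000.0 42000.0
```

The point the tweet is making falls out directly: at fixed per-token rates, cost scales linearly with query volume forever, whereas a dedicated-GPU deployment is a flat monthly number regardless of volume.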
Inference retweeted
Sam Hogan 🇺🇸 @samhogan
Today I'm incredibly excited to announce that @AmarSVS has joined me and @atbeme as a co-founder of @inference_net. Anyone who has worked with Amar knows he is an N=1 type of guy. His energy, raw horsepower, and dedication have allowed us to unlock exciting new opportunities and inspired the whole team. I look forward to many more years of partnership, ping pong, and late nights in the office.
Inference @inference_net
Claude 3.5 Haiku is getting deprecated even though it worked. Behind the scenes, a lot of teams did the same thing:
- Tested newer models
- Ran the evals
- Quietly rolled back to Haiku
Because nothing matched real production behavior. Now they're stuck.

The mistake is thinking the fix is "find the next model." It's not. With Inference.net, in < 1 week you can:
- Have a custom AI model for your use case
- Get the same outputs (or even more accurate)
- Keep prompts, workflows, integrations; no rewrites

What actually changes:
- Costs become predictable
- Latency stops being a lottery
- Deprecation risk disappears (you own the model)

This isn't open vs. closed. It's about freezing the behavior that already works, and moving on. Haiku's deprecation just made the dependency visible.
Inference retweeted
Sam Hogan 🇺🇸 @samhogan
AI Agents for DevOps 🤖 We had Claude build us a Slack bot with (read-only) access to all our prod infrastructure. It can access:
- K8s logs for all pods
- OTel traces & logging
- Grafana
- Postgres / ClickHouse
- Slack
- GitHub
2 hours to build. Works like a charm.
Inference retweeted
Sam Hogan 🇺🇸 @samhogan
50M impressions on X for Inference.net in the last 3 months. In Q1 2026 we will do ~100M impressions on X alone.
Inference @inference_net
@gdb Agree, inference is the most valuable emerging software category. Good insight.
Greg Brockman @gdb
inference is perhaps the most valuable emerging software category. as models get smarter and more economically valuable, compute will increasingly be spent drawing samples from the models.

if you'd like to work on inference at openai, reach out — gdb@openai.com. include a description of an exceptional team you've been a part of, and your contribution towards that team's goals. also indicate any experience in inference, large-scale system optimization, or other areas where you've built up domain expertise.

lots of exciting problems to work on, ranging from deeply understanding the model forward pass (including simulating/finding creative opportunities for optimization); to system-level efficiencies such as speculative decoding or kv offloading or workload-aware load balancing; to managing and making observable a massive fleet at scale.
Inference retweeted
Amar Singh @AmarSVS
This took a lot of trial and error to get right, particularly training the long-context summarizing models. The golden model ended up being hybrid attention, and it actually unlocked the ability to process the 100M papers we will release soon.
Sam Hogan 🇺🇸 @samhogan

We're introducing Project AELLA, in partnership with @laion_ai & @wyndlabs_ai AELLA is an open-science initiative to make scientific research accessible via structured summaries created by LLMs Available now: - Dataset of 100K summaries - 2 fine-tuned LLMs - 3d visualizer 👇

Inference retweeted
Sam Hogan 🇺🇸 @samhogan
Due to an unforeseen naming conflict, we are renaming Project AELLA to Project OSSAS (Open Source Summaries At Scale) Thank you to those who brought the context surrounding this name to our attention, and to our partners and the research community for their ongoing support.
Sam Hogan 🇺🇸 @samhogan

We're introducing Project AELLA, in partnership with @laion_ai & @wyndlabs_ai AELLA is an open-science initiative to make scientific research accessible via structured summaries created by LLMs Available now: - Dataset of 100K summaries - 2 fine-tuned LLMs - 3d visualizer 👇

Inference retweeted
Sam Hogan 🇺🇸 @samhogan
We're introducing Project AELLA, in partnership with @laion_ai & @wyndlabs_ai AELLA is an open-science initiative to make scientific research accessible via structured summaries created by LLMs Available now: - Dataset of 100K summaries - 2 fine-tuned LLMs - 3d visualizer 👇
Inference retweeted
Francesco Virga @francescodvirga
Inference verification that's actually functional and economical. One of the coolest projects I've gotten to be a part of; shoutout to @AmarSVS for leading the charge. Many more to come!
Inference @inference_net

Today, we release LOGIC: a novel method for verifying LLM inference in trustless environments.
- Detects model substitution, quantization, and decode-time attacks
- Works out of the box with @vllm_project, @sgl_project, @openrouter, and more (just needs logprobs)
- Robust across GPU types and hardware configurations
- Low computational overhead (~1% of total cost)
Blog: inference.net/blog/logic
Code: github.com/context-labs/l…

Inference @inference_net
We've been running LOGIC on our globally distributed, permissionless inference network for the last 3 months. Today, we feel confident in saying that LOGIC provides production-ready trust for open inference networks. Learn more: inference.net/blog/logic
Inference @inference_net
LOGIC verifies the statistical fingerprint of model outputs. Instead of recreating exact activations, we verify that token-level log-probability distributions match the claimed model.
- Operators provide top-k log-probs during generation
- We randomly sample decode positions and recompute them to verify
- Statistical testing (KS test) detects distribution mismatches
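A toy version of the KS-style check described above, in pure Python. This is not the LOGIC implementation, and the log-prob samples are made up; it only illustrates how a two-sample Kolmogorov-Smirnov statistic separates matching from mismatching distributions:

```python
def ks_statistic(sample_a, sample_b):
    # Two-sample KS statistic: the maximum gap between the
    # empirical CDFs of the two samples.
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a + b))
    def ecdf(sorted_sample, x):
        return sum(1 for v in sorted_sample if v <= x) / len(sorted_sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

# Log-probs reported by the operator vs. recomputed references (toy numbers).
claimed   = [-0.11, -0.09, -0.10, -0.12, -0.10]
reference = [-0.10, -0.11, -0.09, -0.12, -0.10]   # same model recomputed
swapped   = [-1.50, -1.40, -1.60, -1.45, -1.55]   # e.g. a substituted model

print(ks_statistic(claimed, reference))  # 0.0  -> distributions match
print(ks_statistic(claimed, swapped))    # 1.0  -> clear mismatch
```

In practice one would convert the statistic to a p-value (e.g. via `scipy.stats.ks_2samp`) and flag operators whose recomputed positions fall below a threshold.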