Inference

354 posts

@inference_net

Inference Research & Development. Train and deploy specialized LLMs for your apps and agents in minutes. Get started: https://t.co/hGMtZjT2ND

San Francisco, CA · Joined March 2024
4 Following · 29.6K Followers
Pinned Tweet
Inference @inference_net
The best production model is the one trained for the job.

Gravity Ads replaced a 70B model on Cerebras with a specialized 1B model trained for their actual workload.

Same quality, much faster and cheaper inference:
- p50: 152ms
- p99: 5.7x lower
- cost: ~10x lower
- model: 70x smaller

Great working with @trygravityai on this.

Case study: inference.net/case-study/gra…
5
5
33
4.8K
Inference retweeted
Sam Hogan 🇺🇸 @samhogan
Excited to launch Day One support for tracing the Cursor Agent SDK with @inference_net

3 lines of code is all you need to track agent performance across executions and iterate to perfection

Docs below 👇
Cursor@cursor_ai

We’re introducing the Cursor SDK so you can build agents with the same runtime, harness, and models that power Cursor. Run agents from CI/CD pipelines, create automations for end-to-end workflows, or embed agents directly inside your products.

7
6
47
9.1K
Inference retweeted
Sam Hogan 🇺🇸 @samhogan
We're releasing Schematron V2, a family of Specialized Language Models for converting messy HTML to structured JSON: frontier performance at 1/10th the cost

Schematron V2 was designed in partnership with some of the largest web-scraping companies in the world to meet the demands of their heaviest workloads

Schematron-V2-Turbo and Schematron-V2-Small are available today on @inference_net

Get started: docs.inference.net/workhorse-mode…
Sam Hogan 🇺🇸@samhogan

I found out today that two of the largest web scraping companies in the world are using a custom Llama 3 model we released last year to process millions of webpages per day.

Schematron-3b: HTML -> JSON parsing

Frontier quality at dirt-cheap prices.

huggingface.co/inference-net/…

6
8
73
13.4K
TBPN @tbpn
"Inference, if you look at it as a market, will be much, much bigger than cloud computing was pre-ChatGPT."

Lightspeed’s @buckymoore says inference is an underrated investment category in AI, and expects the market to break up into large, specialized platforms for each modality:

"The GPU supply crunch that we're seeing right now is largely, as @dylan522p has said on the show before, due to the fact that not only these consumer products, but also B2B products like Claude Code and Codex are just really taking off and creating insane demand for inference."

"We're talking hundreds of billions in spend every year. And if that's true, I think there will be very, very large inference platforms built in each modality."

"So there will be an inference platform for real-time video models, there will be an inference platform for open-source and custom language models, there will be an inference platform built specifically for long-running agents."

"So I think we're just going to see that industry, which today looks like one industry, break up into many because of how big it is and how much room for specialization there is."
11
15
205
33.3K
Inference @inference_net
Day Zero fine-tuning & hosting support for Nemotron 3 Super by @nvidia is now live

Fine-tune on real production traces & deploy on high-performance infrastructure optimized for Nemotron 3 Super

Your data, your weights, your performance edge

Learn more: inference.net/blog/nemotron-…
7
6
25
7.8K
Inference retweeted
Kintu k @gk_kintu
Schematron 3B vs. 8B

The 8B is slightly better, as expected, but both are able to ingest 100s of lines of raw, bloated HTML and output perfectly structured JSON exactly matching my Pydantic schema!

Full breakdown of the models: youtu.be/F__eg5cvS_A

@schematron @inference_net
0
1
3
3.5K
Inference retweeted
Inference @inference_net
@cyrusnewday @dottxtai No constrained sampling. We’ve tested schemas with 50+ fields. We’re releasing V2 of Schematron in a few weeks, which will be even more powerful. Would love to see if we can help with your use case, feel free to DM @samhogan
3
0
3
195
Cyrus @cyrusnewday
@inference_net Is the schema compliance enforced with something like @dottxtai, or does the model produce valid JSON without any constrained decoding? If so, that’s really impressive.

Curious how big the schemas you tested were. We have some use cases that require intelligence and can have 100s of structured outputs.
1
0
0
197
Inference @inference_net
We built the Kimi K2 of web extraction: meet Schematron.

It's been getting a lot of love from teams we work with. This is what we heard again this week:

"We tested Schematron against smaller models for large-scale HTML schema extraction, and it was more accurate and significantly faster. For our discovery endpoint pulling thousands of web pages in parallel, it's the first model that actually works at the quality and latency we need."

In short, Schematron is:
→ 98% of GPT-4.1 quality
→ 100% JSON schema compliance, zero hallucinated fields
→ 128K context: handles full raw HTML, no markdown conversion needed
→ 40-80x cheaper than frontier models
→ 10x faster: 0.54s per page vs 6s for GPT-5
→ Open-source on HuggingFace, runs on Ollama, OpenAI-compatible API

And we're cooking its next version, stay tuned 👀

inference.net/blog/schematro…
3
1
22
2.8K
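
As a rough illustration of what "100% JSON schema compliance, zero hallucinated fields" means in the Schematron post above, the sketch below checks a model's JSON output against a schema using only the Python standard library. The schema, the sample outputs, and the helper name `find_schema_violations` are all hypothetical and are not part of Schematron or the Inference API; a real pipeline would more likely rely on a full JSON Schema or Pydantic validator.

```python
import json

# Hypothetical schema for a product page; not taken from the Schematron docs.
PRODUCT_SCHEMA = {
    "properties": {"name": str, "price": float, "in_stock": bool},
    "required": ["name", "price"],
}


def find_schema_violations(raw_json: str, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the output is compliant."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    allowed = schema["properties"]
    # Fields the schema never declared are the "hallucinated" ones.
    for key in data:
        if key not in allowed:
            problems.append(f"unexpected field: {key}")
    # Required fields must all be present.
    for key in schema["required"]:
        if key not in data:
            problems.append(f"missing required field: {key}")
    # Present fields must have the declared Python type.
    for key, expected_type in allowed.items():
        if key in data and not isinstance(data[key], expected_type):
            problems.append(f"wrong type for field: {key}")
    return problems


good = '{"name": "Desk lamp", "price": 24.5, "in_stock": true}'
bad = '{"name": "Desk lamp", "price": "24.5", "color": "red"}'
print(find_schema_violations(good, PRODUCT_SCHEMA))
print(find_schema_violations(bad, PRODUCT_SCHEMA))
```

The second call flags the undeclared `color` field and the string-typed `price`, which is the kind of check a team could run over a sample of pages before trusting any extraction model at scale.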
Inference @inference_net
The LLM Engineering Roadmap.

If you want to start today, here's the roadmap👇

1️⃣ LLM Foundations
Start by understanding Python and LLM APIs and how they work. Learn prompt engineering, structured outputs, and tool use.
↳ Python/TypeScript Basics
↳ LLM APIs
↳ Prompt Engineering
↳ Structured Outputs
↳ Function Calling

2️⃣ Vector Stores
Before building anything, you need to understand how text becomes vectors. Learn embedding models, chunking strategies, and similarity search.
↳ Embedding Models (OpenAI Ada, Cohere, BGE)
↳ Vector Databases (Pinecone, Qdrant, ChromaDB, FAISS)
↳ Chunking Strategies
↳ Similarity Search

3️⃣ Retrieval-Augmented Generation (RAG)
This is how LLMs answer questions using your data. You learn how to retrieve context and feed it correctly.
↳ Orchestration Frameworks (LangChain, LlamaIndex)
↳ Ingesting Documents
↳ Retrieval Methods (Dense, BM25, Hybrid)
↳ Reranking
↳ Prompt Templates

4️⃣ Advanced RAG
This step helps you understand how to make RAG systems reliable and accurate.
↳ Query Transformation
↳ HyDE
↳ Corrective RAG
↳ Self-RAG
↳ Graph RAG

5️⃣ Fine-Tuning
Sometimes prompts are not enough for a specialised use case. Fine-tuning will help you understand how models learn domain-specific behaviour.
↳ Data Preparation
↳ LoRA, QLoRA, DoRA
↳ SFT, DPO, RLHF
↳ Training Tools (Unsloth, Axolotl, HF TRL)

6️⃣ Inference Optimization
Once systems work, they need to be fast and affordable. This step focuses on learning performance and cost efficiency.
↳ Quantization (GGUF, GPTQ, AWQ)
↳ Serving Engines (vLLM, TGI, llama.cpp)
↳ KV Cache
↳ Flash Attention
↳ Speculative Decoding

7️⃣ Deployment
Models are useless if they stay in notebooks. Here you learn how to ship LLM systems to users.
↳ GPU Scheduling
↳ Cloud Platforms (AWS Bedrock, GCP Vertex AI)
↳ Docker, Kubernetes
↳ FastAPI, Streaming (SSE)

8️⃣ Observability
This step helps you track quality, latency, and cost.
↳ Tracing (LangSmith, Langfuse, Arize Phoenix)
↳ Latency (TTFT)
↳ Token Usage
↳ Cost Tracking

9️⃣ Agents
Agents allow LLMs to plan and use tools. Learn them to understand how LLMs solve multi-step and complex tasks.
↳ Frameworks (LangGraph, CrewAI, AutoGen)
↳ Function Calling
↳ Memory Systems
↳ Patterns (ReAct, Plan-and-Execute, Multi-Agent)

🔟 Production & Security
Production LLM systems can fail in subtle ways. This step helps you prevent misuse, outages, and cost spikes.
↳ Prompt Injection Defense
↳ Guardrails (NeMo, Guardrails AI)
↳ Semantic Caching
↳ Fallbacks & Rate Limiting

♻️ Repost if you found this insightful

Follow us for more AI engineering content!
24
235
1.1K
50.3K
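
To make the "Similarity Search" item in step 2 of the roadmap above concrete, here is a minimal sketch of top-k retrieval by cosine similarity in plain Python. The three-dimensional vectors and document IDs are made up for illustration; in practice the vectors would come from an embedding model and the search would run inside a vector database rather than a dictionary.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the IDs of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy "embeddings" (hypothetical values, not produced by a real model).
index = {
    "pricing": [0.9, 0.1, 0.0],
    "latency": [0.1, 0.9, 0.1],
    "security": [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.0]

print(top_k(query, index, k=2))
```

This brute-force scan is linear in the number of stored vectors; libraries such as FAISS exist largely to avoid that cost with approximate indexes.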
Inference retweeted
Sam Hogan 🇺🇸 @samhogan
We're welcoming @mikepollard_dev to @inference_net as our Founding DevRel Engineer!

Mike and I won a pitch competition for my first company nearly 7 years ago

Life is long. When you find someone you love to work with, keep them close. You never know when your paths may cross
8
3
45
4.4K
Inference @inference_net
You're overpaying by $30,000/month running AI models at scale.

Here's why (and how to fix it)

How OpenAI & Anthropic work

Per-token pricing:
→ OpenAI (GPT-4o): $2.50 / $10 per million tokens (input / output)
→ Anthropic (Sonnet 4.5): $3 / $15 per million tokens (input / output)

At 1M queries/month: $30,000 - $38,000/mo

The problems:

1️⃣ You pay for capabilities you don't use
Frontier models are trained for everything. Your task needs maybe 1% of those capabilities. You're paying for the other 99%.

2️⃣ No economies of scale
Token #1: $0.000003
Token #1,000,000: $0.000003
Your costs never decrease.

3️⃣ Smaller frontier models and off-the-shelf open-source models mean worse quality
You're forced to choose between paying more and getting worse results.

The solution: Dedicated GPUs + Specialized Models

Instead of per-token pricing, rent dedicated GPUs at a fixed monthly cost. Then train custom models specialized for your specific task:
→ Distilled from frontier models and large open source models (GPT-5, Claude, Gemini, Kimi, GLM)
→ Match or exceed frontier quality for your use case
→ 2-3x faster inference

At 1M queries/month: $8,600/mo

That's 71-77% cheaper with no quality sacrifice.

And the biggest misconception is that "custom models can't match frontier quality."

The reality: When specialized for your task, they can exceed frontier intelligence.

Most teams don’t need “the smartest model in the world.” They need the smartest model for one job. Running on infrastructure they control. At a cost that actually scales.
9
11
51
7.2K
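
The arithmetic behind the cost post above can be reproduced with a short calculation. The post does not say how many tokens each query uses, so the figures of 10,000 input tokens and 500 output tokens per query below are an assumption, chosen because they happen to land on the quoted range; the per-million-token prices and the $8,600 dedicated cost are taken from the post itself.

```python
# Prices quoted in the post, in USD per million tokens: (input, output).
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "Sonnet 4.5": (3.00, 15.00),
}

# ASSUMPTION: tokens per query are not stated in the post; these values are
# illustrative and were picked because they reproduce the quoted totals.
INPUT_TOKENS_PER_QUERY = 10_000
OUTPUT_TOKENS_PER_QUERY = 500

QUERIES_PER_MONTH = 1_000_000
DEDICATED_MONTHLY_COST = 8_600  # fixed monthly figure quoted in the post


def monthly_api_cost(input_price: float, output_price: float) -> float:
    """Monthly spend in USD under per-token pricing."""
    per_query = (
        INPUT_TOKENS_PER_QUERY * input_price
        + OUTPUT_TOKENS_PER_QUERY * output_price
    ) / 1_000_000
    return per_query * QUERIES_PER_MONTH


for model, (input_price, output_price) in PRICES.items():
    api_cost = monthly_api_cost(input_price, output_price)
    savings = 1 - DEDICATED_MONTHLY_COST / api_cost
    print(f"{model}: ${api_cost:,.0f}/mo via API, {savings:.0%} cheaper on dedicated GPUs")
```

Under these assumptions the API bills come to about $30,000 and $37,500 per month, and the fixed $8,600 is roughly 71% and 77% cheaper respectively. A workload with shorter prompts would shrink the API bill and the gap with it, so the comparison is worth rerunning with real token counts.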
Inference retweeted
Sam Hogan 🇺🇸 @samhogan
Today I’m incredibly excited to announce that @AmarSVS has joined me and @atbeme as a co-founder of @inference_net

Anyone who has worked with Amar knows he is an N=1 type of guy. His energy, raw horsepower, and dedication have allowed us to unlock exciting new opportunities and inspired the whole team.

I look forward to many more years of partnership, ping pong, and late nights in the office.
23
18
113
18.2K
Inference @inference_net
Claude 3.5 Haiku is getting deprecated even though it worked.

Behind the scenes, a lot of teams did the same thing:
- Tested newer models
- Ran the evals
- Quietly rolled back to Haiku

Because nothing matched real production behavior. Now they’re stuck.

The mistake is thinking the fix is “find the next model.” It’s not.

With Inference.net, in < 1 week you can:
- Have a custom AI model for your use case
- Get the same outputs (or even more accurate ones)
- Keep prompts, workflows, integrations; no rewrites

What actually changes:
- Costs become predictable
- Latency stops being a lottery
- Deprecation risk disappears (you own the model)

This isn’t open vs closed. It’s about freezing the behavior that already works and moving on.

Haiku’s deprecation just made the dependency visible.
5
114
23
3.6K