Inference

345 posts


@inference_net

Inference Research & Development

San Francisco, CA · Joined March 2024
7 Following · 29.7K Followers
Pinned Tweet
Inference @inference_net
Day Zero fine-tuning & hosting support for Nemotron 3 Super by @nvidia is now live.
Fine-tune on real production traces & deploy on high-performance infrastructure optimized for Nemotron 3 Super.
Your data, your weights, your performance edge.
Learn more: inference.net/blog/nemotron-…
Inference retweeted
george k @gk_kintu
Schematron 3B vs 8B: the 8B is slightly better, as expected, but both are able to ingest 100s of lines of raw, bloated HTML and output perfectly structured JSON exactly matching my Pydantic schema! Full breakdown of the models: youtu.be/F__eg5cvS_A @schematron @inference_net
Inference retweeted
Inference @inference_net
@cyrusnewday @dottxtai No constrained sampling. We've tested schemas with 50+ fields. We're releasing V2 of Schematron in a few weeks, which will be even more powerful. Would love to see if we can help with your use case; feel free to DM @samhogan
Cyrus @cyrusnewday
@inference_net Is the schema compliance using e.g. @dottxtai, or does the model legitimately produce JSON without any constraining? If so, that's really impressive. Curious how big the schemas you tested on were — we have some use cases that require intelligence and can have 100s of structured outputs
Inference @inference_net
We built the Kimi K2 of web extraction: meet Schematron. It's been getting a lot of love from teams we work with. This is what we heard again this week:

"We tested Schematron against smaller models for large-scale HTML schema extraction — it was more accurate and significantly faster. For our discovery endpoint pulling thousands of web pages in parallel, it's the first model that actually works at the quality and latency we need."

In short, Schematron is:
→ 98% of GPT-4.1 quality
→ 100% JSON schema compliance — zero hallucinated fields
→ 128K context — handles full raw HTML, no markdown conversion needed
→ 40–80x cheaper than frontier models
→ 10x faster — 0.54s per page vs 6s for GPT-5
→ Open-source on HuggingFace, runs on Ollama, OpenAI-compatible API

And we're cooking its next version — stay tuned 👀 inference.net/blog/schematro…
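For readers wondering what "100% JSON schema compliance — zero hallucinated fields" means mechanically, here is a minimal sketch of the property being claimed. This is not Schematron's code, and the field names are invented for illustration:

```python
import json

def compliant(output: dict, schema_fields: set) -> bool:
    # "Zero hallucinated fields": the output contains exactly the fields
    # the schema asked for -- nothing invented, nothing missing.
    return set(output) == schema_fields

# Hypothetical product-page schema (illustrative field names).
schema_fields = {"title", "price", "in_stock"}
extracted = json.loads('{"title": "Widget", "price": 9.99, "in_stock": true}')
hallucinated = {"title": "Widget", "price": 9.99, "in_stock": True, "rating": 5}

print(compliant(extracted, schema_fields))     # True
print(compliant(hallucinated, schema_fields))  # False: "rating" was never requested
```

A validator like this is what lets extraction pipelines fail fast on schema drift instead of silently storing invented fields.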
Inference @inference_net
The LLM Engineering Roadmap. If you want to start today, here's the roadmap👇

1️⃣ LLM Foundations
Start by understanding Python and LLM APIs and how they work. Learn prompt engineering, structured outputs, and tool use.
↳ Python/TypeScript Basics
↳ LLM APIs
↳ Prompt Engineering
↳ Structured Outputs
↳ Function Calling

2️⃣ Vector Stores
Before building anything, you need to understand how text becomes vectors. Learn embedding models, chunking strategies, and similarity search.
↳ Embedding Models (OpenAI Ada, Cohere, BGE)
↳ Vector Databases (Pinecone, Qdrant, ChromaDB, FAISS)
↳ Chunking Strategies
↳ Similarity Search

3️⃣ Retrieval-Augmented Generation (RAG)
This is how LLMs answer questions using your data. You learn how to retrieve context and feed it in correctly.
↳ Orchestration Frameworks (LangChain, LlamaIndex)
↳ Ingesting Documents
↳ Retrieval Methods (Dense, BM25, Hybrid)
↳ Reranking
↳ Prompt Templates

4️⃣ Advanced RAG
This step helps you understand how to make RAG reliable and accurate.
↳ Query Transformation
↳ HyDE
↳ Corrective RAG
↳ Self-RAG
↳ Graph RAG

5️⃣ Fine-Tuning
Sometimes prompts are not enough for a specialised use case. Fine-tuning will help you understand how models learn domain-specific behaviour.
↳ Data Preparation
↳ LoRA, QLoRA, DoRA
↳ SFT, DPO, RLHF
↳ Training Tools (Unsloth, Axolotl, HF TRL)

6️⃣ Inference Optimization
Once systems work, they need to be fast and affordable. This step focuses on performance and cost efficiency.
↳ Quantization (GGUF, GPTQ, AWQ)
↳ Serving Engines (vLLM, TGI, llama.cpp)
↳ KV Cache
↳ Flash Attention
↳ Speculative Decoding

7️⃣ Deployment
Models are useless if they stay in notebooks. Here you learn how to ship LLM systems to users.
↳ GPU Scheduling
↳ Cloud Platforms (AWS Bedrock, GCP Vertex AI)
↳ Docker, Kubernetes
↳ FastAPI, Streaming (SSE)

8️⃣ Observability
This step helps you track quality, latency, and cost.
↳ Tracing (LangSmith, Langfuse, Arize Phoenix)
↳ Latency (TTFT)
↳ Token Usage
↳ Cost Tracking

9️⃣ Agents
Agents allow LLMs to plan and use tools. Learn them to understand how LLMs solve multi-step, complex tasks.
↳ Frameworks (LangGraph, CrewAI, Autogen)
↳ Function Calling
↳ Memory Systems
↳ Patterns (ReAct, Plan-and-Execute, Multi-Agent)

🔟 Production & Security
Production LLM systems can fail in subtle ways. This step helps you prevent misuse, outages, and cost spikes.
↳ Prompt Injection Defense
↳ Guardrails (NeMo, Guardrails AI)
↳ Semantic Caching
↳ Fallbacks & Rate Limiting

♻️ Repost if you found this insightful
Follow us for more AI engineering content!
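The similarity-search step from the Vector Stores stage of the roadmap can be sketched in a few lines of plain Python. The vectors below are toy stand-ins for real embeddings, and the document IDs are invented:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product normalized by vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, corpus, k=2):
    # Rank (doc_id, vector) pairs by similarity to the query vector.
    scored = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Toy 3-dimensional "embeddings" (real ones have hundreds of dimensions).
corpus = [
    ("doc_a", [1.0, 0.0, 0.0]),
    ("doc_b", [0.9, 0.1, 0.0]),
    ("doc_c", [0.0, 0.0, 1.0]),
]
print(top_k([1.0, 0.0, 0.0], corpus, k=2))  # ['doc_a', 'doc_b']
```

Vector databases like the ones listed above do exactly this ranking, just with approximate-nearest-neighbor indexes so it scales past brute force.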
Inference retweeted
Sam Hogan 🇺🇸 @samhogan
We're welcoming @mikepollard_dev to @inference_net as our Founding DevRel Engineer! Mike and I won a pitch competition for my first company nearly 7 years ago. Life is long. When you find someone you love to work with, keep them close. You never know when your paths may cross.
Inference @inference_net
You're overpaying by $30,000/month running AI models at scale. Here's why (and how to fix it).

How OpenAI & Anthropic work: per-token pricing.
→ OpenAI (GPT-4o): $2.50 / $10 per million tokens
→ Anthropic (Sonnet 4.5): $3 / $15 per million tokens
At 1M queries/month: $30,000 - $38,000/mo

The problems:
1️⃣ You pay for capabilities you don't use. Frontier models are trained for everything. Your task needs maybe 1% of those capabilities. You're paying for the other 99%.
2️⃣ No economies of scale. Token #1: $0.003. Token #1,000,000: $0.003. Your costs never decrease.
3️⃣ Smaller frontier models and off-the-shelf open-source models mean worse quality. You're forced to choose between paying more or getting worse results.

The solution: dedicated GPUs + specialized models.
Instead of per-token pricing, rent dedicated GPUs at a fixed monthly cost. Then train custom models specialized for your specific task:
→ Distilled from frontier models and large open-source models (GPT-5, Claude, Gemini, Kimi, GLM)
→ Match or exceed frontier quality for your use case
→ 2-3x faster inference
At 1M queries/month: $8,600/mo. That's 71-77% cheaper with no quality sacrifice.

The biggest misconception is that "custom models can't match frontier quality." The reality: when specialized for your task, they can exceed frontier intelligence.

Most teams don't need "the smartest model in the world." They need the smartest model for one job, running on infrastructure they control, at a cost that actually scales.
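The per-token arithmetic above can be reproduced with a small calculator. The token counts per query below are assumptions chosen for illustration (the thread does not state a workload), so the results only roughly bracket the quoted $30k-$38k range:

```python
def monthly_cost(queries, tokens_in, tokens_out, price_in, price_out):
    # Prices are USD per million tokens, per the rates quoted above.
    return queries * (tokens_in * price_in + tokens_out * price_out) / 1e6

# Assumed workload: 4k input + 2k output tokens per query (illustrative).
gpt4o  = monthly_cost(1_000_000, 4_000, 2_000, 2.50, 10.00)
sonnet = monthly_cost(1_000_000, 4_000, 2_000, 3.00, 15.00)
print(gpt4o, sonnet)  # 30000.0 42000.0
```

The point the tweet is making falls out directly: at fixed per-token rates, cost scales linearly with query volume forever, whereas a dedicated-GPU deployment is a flat monthly number regardless of volume.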
Inference retweeted
Sam Hogan 🇺🇸 @samhogan
Today I'm incredibly excited to announce that @AmarSVS has joined me and @atbeme as a co-founder of @inference_net. Anyone who has worked with Amar knows he is an N=1 type of guy. His energy, raw horsepower, and dedication have allowed us to unlock exciting new opportunities and inspired the whole team. I look forward to many more years of partnership, ping pong, and late nights in the office.
Inference @inference_net
Claude 3.5 Haiku is getting deprecated even though it worked. Behind the scenes, a lot of teams did the same thing:
- Tested newer models
- Ran the evals
- Quietly rolled back to Haiku
Because nothing matched real production behavior. Now they're stuck.

The mistake is thinking the fix is "find the next model." It's not. With Inference.net, in < 1 week you can:
- Have a custom AI model for your use case
- Get the same outputs (or even more accurate)
- Keep prompts, workflows, integrations; no rewrites

What actually changes:
- Costs become predictable
- Latency stops being a lottery
- Deprecation risk disappears (you own the model)

This isn't open vs. closed. It's about freezing the behavior that already works, and moving on. Haiku's deprecation just made the dependency visible.
Inference retweeted
Sam Hogan 🇺🇸 @samhogan
AI Agents for DevOps 🤖 We had Claude build us a Slack bot with (read-only) access to all our prod infrastructure. It can access:
- K8s logs for all pods
- OTel traces & logging
- Grafana
- Postgres / ClickHouse
- Slack
- GitHub
2 hours to build. Works like a charm.
Inference retweeted
Sam Hogan 🇺🇸 @samhogan
50M impressions on X for Inference.net in the last 3 months. In Q1 2026 we will do ~100M impressions on X alone.
Inference @inference_net
@gdb Agree, inference is the most valuable emerging software category. Good insight.
Greg Brockman @gdb
inference is perhaps the most valuable emerging software category. as models get smarter and more economically valuable, compute will increasingly be spent drawing samples from the models.

if you'd like to work on inference at openai, reach out — gdb@openai.com. include a description of an exceptional team you've been a part of, and your contribution towards that team's goals. also indicate any experience in inference, large-scale system optimization, or other areas where you've built up domain expertise.

lots of exciting problems to work on, ranging from deeply understanding the model forward pass (including simulating/finding creative opportunities for optimization); to system-level efficiencies such as speculative decoding or kv offloading or workload-aware load balancing; to managing and making observable a massive fleet at scale.
Inference retweeted
Amar Singh @AmarSVS
This took a lot of trial and error to get right, particularly training the long-context summarizing models. The golden model ended up being hybrid attention, and it actually unlocked the ability to process the 100M papers we will release soon.
Sam Hogan 🇺🇸 @samhogan

We're introducing Project AELLA, in partnership with @laion_ai & @wyndlabs_ai AELLA is an open-science initiative to make scientific research accessible via structured summaries created by LLMs Available now: - Dataset of 100K summaries - 2 fine-tuned LLMs - 3d visualizer 👇

Inference retweeted
Sam Hogan 🇺🇸 @samhogan
Due to an unforeseen naming conflict, we are renaming Project AELLA to Project OSSAS (Open Source Summaries At Scale) Thank you to those who brought the context surrounding this name to our attention, and to our partners and the research community for their ongoing support.
Sam Hogan 🇺🇸 @samhogan

We're introducing Project AELLA, in partnership with @laion_ai & @wyndlabs_ai AELLA is an open-science initiative to make scientific research accessible via structured summaries created by LLMs Available now: - Dataset of 100K summaries - 2 fine-tuned LLMs - 3d visualizer 👇

Inference retweeted
Sam Hogan 🇺🇸 @samhogan
We're introducing Project AELLA, in partnership with @laion_ai & @wyndlabs_ai AELLA is an open-science initiative to make scientific research accessible via structured summaries created by LLMs Available now: - Dataset of 100K summaries - 2 fine-tuned LLMs - 3d visualizer 👇
Inference retweeted
Francesco Virga @francescodvirga
Inference verification that's actually functional and economical. One of the coolest projects I've gotten to be a part of; shoutout to @AmarSVS for leading the charge. Many more to come!
Inference @inference_net

Today, we release LOGIC: a novel method for verifying LLM inference in trustless environments.
- Detects model substitution, quantization, and decode-time attacks
- Works out of the box with @vllm_project, @sgl_project, @openrouter, and more (just needs logprobs)
- Robust across GPU types and hardware configurations
- Low computational overhead (~1% of total cost)
Blog: inference.net/blog/logic
Code: github.com/context-labs/l…

Inference @inference_net
We've been running LOGIC on our globally distributed, permissionless inference network for the last 3 months. Today, we feel confident in saying that LOGIC provides production-ready trust for open inference networks. Learn more: inference.net/blog/logic
Inference @inference_net
LOGIC verifies the statistical fingerprint of model outputs. Instead of recreating exact activations, we verify that token-level log-probability distributions match the claimed model.
- Operators provide top-k log-probs during generation
- We randomly sample decode positions and recompute them to verify
- Statistical testing (KS test) detects distribution mismatches
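A toy version of the KS-style check described above, in pure Python. This is not the LOGIC implementation, and the log-prob samples are made up; it only illustrates how a two-sample Kolmogorov-Smirnov statistic separates matching from mismatching distributions:

```python
def ks_statistic(sample_a, sample_b):
    # Two-sample KS statistic: the maximum gap between the
    # empirical CDFs of the two samples.
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a + b))
    def ecdf(sorted_sample, x):
        return sum(1 for v in sorted_sample if v <= x) / len(sorted_sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

# Log-probs reported by the operator vs. recomputed references (toy numbers).
claimed   = [-0.11, -0.09, -0.10, -0.12, -0.10]
reference = [-0.10, -0.11, -0.09, -0.12, -0.10]   # same model recomputed
swapped   = [-1.50, -1.40, -1.60, -1.45, -1.55]   # e.g. a substituted model

print(ks_statistic(claimed, reference))  # 0.0  -> distributions match
print(ks_statistic(claimed, swapped))    # 1.0  -> clear mismatch
```

In practice one would convert the statistic to a p-value (e.g. via `scipy.stats.ks_2samp`) and flag operators whose recomputed positions fall below a threshold.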