Utsav Khandelwal
@_utkh

133 posts

here for Tea.

Joined September 2019
191 Following · 63 Followers
Utsav Khandelwal reposted
Akshay Deo @akshay_deo
When AI is mission-critical, the infrastructure behind it can’t be average. ✔️ Infrastructure matters. ✔️ Performance matters. ✔️ Enterprise resilience matters. That’s why we built Bifrost, the most performant AI gateway, engineered for enterprises from day one and trusted by Fortune 100s to startups worldwide. Take a look 👀
Utsav Khandelwal @_utkh
📍 Location: 11th Main, 8th Cross, Indiranagar - 1 min walk to 12th Main & Indiranagar Park, and just 2 mins to 100 ft Road & 80 ft Road. Move-in: Immediate 🏡 Given the rental scene in this area, this place is an absolute steal. Reach me at 7073713203 via WhatsApp or call.
Utsav Khandelwal @_utkh
What you get: • Private room with dedicated bathroom (non-attached) • A fully furnished room with a bed, mattress, desk, chair -- everything set. • A big living room with a sofa + TV, & a fully functional kitchen. • Home gym with treadmill, cross-trainer, bench & weights.
Utsav Khandelwal reposted
Hater Central @TheHateCentral
The end of an era. 💔
Utsav Khandelwal reposted
FELIX @FellMentKE
Your LLM gateway is a tier-1 critical service; it can't crumble at 100–200 RPS. Meet Bifrost (getmax.im/B1fr0s1): 50× faster than LiteLLM, p95 unbothered, <100 µs @ 5k RPS, without sacrificing features (guardrails, retries, budgets, alerts, semantic cache, OTEL, Responses API). Single OpenAI-style API → 250+ models. Full benchmarks, a feature rundown, and a 30-second setup at the end. (Bookmark for later!)
Utsav Khandelwal reposted
Madza 👨‍💻⚡ @madzadev
Meet Bifrost by @getmaximai - The Fastest Open Source LLM Gateway 🔥 Bifrost unifies 12+ AI providers, and is 50x faster than LiteLLM 👇 👨‍💻 Easy setup and web UI for configuration 🌐 Multimodal support for OpenAI, Anthropic, etc. ⚡ Ultra-low latency with <100 µs overhead at 5K RPS 🔄 Automatic fallbacks and load balancing Try it yourself: getmax.im/B1fr0s1 #sponsored #ad Explore the setup & practical use cases below 🧵👇
Utsav Khandelwal reposted
Afiz ⚡️ @itsafiz
Build UNSTOPPABLE AI apps with Bifrost! The fastest LLM gateway 50x quicker than LiteLLM, with <100µs overhead at 5k RPS! Connect to 1000+ models (OpenAI, Gemini, Anthropic, & 12+ providers) via one API. More details: getmax.im/bifrost-x-23oct
Utsav Khandelwal reposted
Maxim AI @getmaximai
🚀 Maxim’s Bifrost is live on Product Hunt 🚀 We’re excited to share that Bifrost, the fastest open-source LLM gateway, is live on Product Hunt. producthunt.com/products/maxim…
Austin Vance @austinbv
Every project at @focused_dot_io, no matter what the agent is, starts with eval. If you want help building strong evals for your agents, DM me.
Austin Vance @austinbv
I don't know what they do at X, and I'm not saying they don't do this, but your AI/agents have to start with eval. Every time. And eval has to be part of CI/CD. No matter what. Having eval as part of a strong automated deployment process means red evals from prompt changes don't go to prod.

If you're not doing eval, here's how I'd start. Go to LangSmith, create an account, and set up some really basic stuff; string checks are enough for some common use cases. Then integrate those evals into your CI.

Once eval is there, in your CI, for every feature or prompt change think about what the "test" is to make sure the LLM responds appropriately. Add that eval. Watch the LLM fail to pass it. Iterate on the prompt until the eval is green. Then deploy to prod.

Got a bug in the prompt? Same process: add an eval that fails, validating the bug. Iterate. Go green, and then and only then deploy. Evals are what allow people to change prompts with impunity.
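The "string checks are enough" starting point above can be sketched as a plain pytest suite that CI runs on every prompt change. This is a minimal illustration, not LangSmith's API: `call_model` and both test names are hypothetical stand-ins for a real LLM call and real acceptance criteria.

```python
# Minimal string-check eval suite, runnable under pytest in CI.
# `call_model` is a hypothetical placeholder for your actual LLM endpoint.

def call_model(prompt: str) -> str:
    # Placeholder: a real pipeline would call the model here.
    return "Sure! Our refund policy allows returns within 30 days."

def contains_all(text: str, phrases: list[str]) -> bool:
    """String-check eval: pass iff every required phrase appears (case-insensitive)."""
    lowered = text.lower()
    return all(p.lower() in lowered for p in phrases)

def test_refund_prompt_mentions_policy():
    # Red until the prompt reliably produces the required facts.
    out = call_model("What is your refund policy?")
    assert contains_all(out, ["refund", "30 days"])

def test_refund_prompt_avoids_forbidden_claims():
    # Guard against a known bad output: encode the bug as a failing eval first.
    out = call_model("What is your refund policy?")
    assert "lifetime guarantee" not in out.lower()
```

Wiring this into CI is just adding `pytest` to the deploy pipeline, so a red eval blocks the prompt change from reaching prod.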
Grok @grok
On the morning of July 8, 2025, we observed undesired responses and immediately began investigating. To identify the specific language in the instructions causing the undesired behavior, we conducted multiple ablations and experiments to pinpoint the main culprits. We identified the operative lines responsible for the undesired behavior as:

* "You tell it like it is and you are not afraid to offend people who are politically correct."
* "Understand the tone, context and language of the post. Reflect that in your response."
* "Reply to the post just like a human, keep it engaging, don't repeat the information which is already present in the original post."

These operative lines had the following undesired results:

* They undesirably steered the @grok functionality to ignore its core values in certain circumstances in order to make the response engaging to the user. Specifically, certain user prompts might end up producing responses containing unethical or controversial opinions to engage the user.
* They undesirably caused the @grok functionality to reinforce any previously user-triggered leanings, including any hate speech in the same X thread.
* In particular, the instruction to "follow the tone and context" of the X user undesirably caused the @grok functionality to prioritize adhering to prior posts in the thread, including any unsavory posts, as opposed to responding responsibly or refusing to respond to unsavory requests.

Utsav Khandelwal @_utkh
We're solving for this at Maxim (getmaxim.ai), focusing on key nuances such as agent trajectory (the path the agent followed to achieve a certain outcome), tool-calling accuracy, etc., and not just the final output. Further, we've doubled down on agent evals using agent simulation, to test your agent's performance against real scenarios and user personas. Happy to explore how this might work in your use case.
Ivy @Siiviiy
How exactly do people build eval pipelines for agent evals? Or for any other gen AI product?
Utsav Khandelwal @_utkh
@alexchristou_ Hey, we've got your back at @getmaximai. Within minutes, you can create custom LLM-as-a-judge evals, giving you the desired reasoning for your LLM output evaluations. Happy to chat and learn your use case better. DM'd you.
Alex Christou @alexchristou_
Is there an eval tool with LLM-as-a-judge which isn't stupidly over-engineered? I just need something like Claude Workbench, but where I can get an LLM response for each output. Do I need to build this?
Utsav Khandelwal @_utkh
@thedayisntgray @nateberkopec True, evals are the moat for any LLM-based product. I am a builder at @getmaximai, and I'd be happy to show how we’re solving for AI quality with automated simulation and evals. Would be great to learn about your eval flows and maybe trade notes on some of the best practices.
Nate Berkopec @nateberkopec
Agentic coding is putting massive pressure on CI and testing: 1. Overall build times are exploding as LLMs add far more tests than humans typically do, along with overall velocity increasing. 2. LLMs need good tests to be successful; test quality is more important than ever.
Jonathan McCoy @jonmc12
Great post, I agree it's a necessary new layer for AI apps. You might like this Maven course; I took cohort 1 and am joining cohort 2. About 1k builders talking about building with evals. The approach focuses on open coding, axial coding, and training LLM judges. maven.com/parlance-labs/…
David Pantera @davidpantera_
Most PMs & devs are making a crucial error: they’re not evaluating properly. We're obsessed with training bigger models and shipping as often as possible, but we're typically ignoring rigorous, regular, objective evaluation. We're flying blind, and it's the next great bottleneck.

It starts with the "golden eval set." Creating one is a monumental task. You need to ensure comprehensive coverage of edge cases, prevent subtle data contamination from training sets, and constantly refresh it to avoid "teaching to the test." A static benchmark is a dead benchmark.

Then comes scoring, which is its own circle of hell for generative AI. How do you consistently score for nuance, creativity, or safety? This requires developing complex rubrics and hiring armies of expensive domain experts (e.g., lawyers for legal AI, doctors for med AI). It's a costly, unscalable nightmare for most teams, if you want to do it objectively (and quickly).

Even with a good set, rubric, and autorater -- you face the problem of creator bias. The team that builds a model is the worst-suited to evaluate it impartially. They know its failure modes and may subconsciously create tests that play to its strengths. True objectivity requires independence.

This is a massive opportunity in the market right now, mirroring the pre-Scale AI era of labeling. We need "Evaluation-as-a-Service": a trusted, independent third party dedicated to the science of benchmarking. They would handle rubric design, expert sourcing, and bias-free scoring at scale. More than just LMArena academic tests -- product, use-case, and application specific e2e golden evals with objective, scored rubrics with HITL.

This isn't just a startup idea; it's a necessary new layer of the AI stack. It will unlock faster innovation by letting builders build, and it will create the trust and accountability we need for widespread AI adoption. The "Scale AI for Evals" isn't a maybe, it's an inevitability. #AI #MLEvals #GenAI #MLOps #StartupIdea
Utsav Khandelwal @_utkh
@maindevenergy @zekramu Agreed, and this has to be a continuous and iterative process, as getting it right once doesn't guarantee your agent will always perform reliably. We're building a tool to solve exactly this, for product and eng people. Happy to chat more about it.
maindevenergy @maindevenergy
@zekramu Yes. Prompts for vertical AI agents need sufficient levels of fine-tuning with evals to get right.
zek @zekramu
Is prompt engineering really a skill?
Utsav Khandelwal @_utkh
@peteredm0 @mattpocockuk Hey, great comment. We are building an evals platform, and if you're up for it, I would love to chat and learn more about your LLM-as-a-judge workflows, and hopefully get some insights and feedback as we grow.
Peter Edmonds @peteredm0
1. Technique: looking at your data. You need to *look* at your outputs before you can understand how to improve them. Write a bespoke tool to make this easier.

1a. Technique: LLM-as-a-judge. We have a *lot* of outputs to review on coteach.ai; we can't look at them all. We first pass all outputs to an LLM judge to spot interesting, unusual, or otherwise review-worthy outputs and put them on a human review queue. We update this judge's prompt based on the % of its suggestions we actually find issues from.

2. Tools: use minimal libraries. I used LangChain, LangGraph, and others. At the end of it all, the best is just bare API calls with a thin wrapper. If you can't trace context through your entire system, you can't do context engineering.

3. Technique: AI as a pure function. This is especially true for interactive agents (like Claude Code, not like deep research). Your next response should _always_ be LLM(existing_messages). If you can't reproduce the behavior from a messages[], you can't write evals.
Matt Pocock @mattpocockuk
Folks who are building AI apps in TypeScript: What are the most useful tools in your toolkit? Libraries, techniques, anything that comes to mind.
Utsav Khandelwal @_utkh
True, evals are key. If you're exploring tools to run and scale your evals, I'd be happy to show how we’re solving this at Maxim (getmaxim.ai) with automated simulation and evals for AI agents. Feel free to DM; always open to sharing the best practices we follow and see people implement around evals.
Arkya Patwa @arkya_patwa
Facing issues: AI agent hallucinations maybe one out of every 5-10 runs. Evaluations are important, whether through tracing with LangSmith, LLM-as-a-judge, etc.
Utsav Khandelwal @_utkh
@jonathanhaas That's really true. Curious what your eval stack looks like, and whether you're building evals in-house or using some tools?
Jonathan Haas @JonathanHaas
The challenge with AI evals today isn’t that they’re missing — it’s that they’re brittle, narrow, and easy to game. Labs hit benchmarks but still ship models with weird failure modes. Real progress means moving past static tests to dynamic, multi-metric, and adversarial evals that reflect how models fail in the wild.
Utsav Khandelwal @_utkh
@kabir__walia what does your present evals stack look like? Are you folks using any particular tools for evals or building them in-house?
Kabir Walia @kabir__walia
Good evals make it easier to go from usage based pricing to outcome based pricing for AI agents