Utsav Khandelwal
@_utkh

133 posts

here for Tea.

Joined September 2019
191 Following · 63 Followers
Utsav Khandelwal reposted
Akshay Deo @akshay_deo
When AI is mission-critical, the infrastructure behind it can’t be average. ✔️ Infrastructure matters. ✔️ Performance matters. ✔️ Enterprise resilience matters. That’s why we built Bifrost, the most performant AI gateway, engineered for enterprises from day one and trusted by Fortune 100s to startups worldwide. Take a look 👀
Utsav Khandelwal @_utkh
📍 Location: 11th Main, 8th Cross, Indiranagar - 1 min walk to 12th Main & Indiranagar Park, and just 2 mins to 100 ft Road & 80 ft Road. Move-in: Immediate 🏡 Given the rental scene in this area, this place is an absolute steal. Reach me at 7073713203 via WhatsApp or call.
Utsav Khandelwal @_utkh
What you get: • Private room with dedicated bathroom (non-attached) • A fully furnished room with a bed, mattress, desk, chair -- everything set. • A big living room with a sofa + TV, & a fully functional kitchen. • Home gym with treadmill, cross-trainer, bench & weights.
Utsav Khandelwal reposted
Hater Central @TheHateCentral
The end of an era. 💔
Utsav Khandelwal reposted
FELIX @FellMentKE
Your LLM gateway is a tier-1 critical service; it can't crumble at 100–200 RPS. Meet Bifrost (getmax.im/B1fr0s1): 50× faster than LiteLLM, p95 unbothered, <100 µs @ 5k RPS, without sacrificing features (guardrails, retries, budgets, alerts, semantic cache, OTEL, Responses API). Single OpenAI-style API → 250+ models. Full benchmarks, a feature rundown, and a 30-second setup at the end. (Bookmark for later!)
Utsav Khandelwal reposted
Madza 👨‍💻⚡ @madzadev
Meet Bifrost by @getmaximai - The Fastest Open Source LLM Gateway 🔥 Bifrost unifies 12+ AI providers, and is 50x faster than LiteLLM 👇 👨‍💻 Easy setup and web UI for configuration 🌐 Multimodal support for OpenAI, Anthropic, etc. ⚡ Ultra-low latency with <100 µs overhead at 5K RPS 🔄 Automatic fallbacks and load balancing Try it yourself: getmax.im/B1fr0s1 #sponsored #ad Explore the setup & practical use cases below 🧵👇
Utsav Khandelwal reposted
Afiz ⚡️ @itsafiz
Build UNSTOPPABLE AI apps with Bifrost! The fastest LLM gateway 50x quicker than LiteLLM, with <100µs overhead at 5k RPS! Connect to 1000+ models (OpenAI, Gemini, Anthropic, & 12+ providers) via one API. More details: getmax.im/bifrost-x-23oct
Utsav Khandelwal reposted
Maxim AI @getmaximai
🚀 Maxim’s Bifrost is live on Product Hunt 🚀 We’re excited to share that Bifrost, the fastest open-source LLM gateway, is live on Product Hunt. producthunt.com/products/maxim…
Austin Vance @austinbv
Every project at @focused_dot_io, no matter what the agent is, starts with eval. If you want help building strong evals for your agents, DM me.
Austin Vance @austinbv
I don't know what they do at X, and I'm not saying they don't do this, but your AI/agents have to start with eval. Every time. And eval has to be part of CI/CD. No matter what. Having eval as part of a strong automated deployment process means red evals from prompt changes don't go to prod.

If you're not doing eval, here's how I'd start. Go to LangSmith, create an account, and set up some really basic stuff; string checks are enough for some common use cases. Then integrate those evals into your CI.

Once eval is there, in your CI, for every feature or prompt change think about what the "test" is to make sure the LLM responds appropriately. Add that eval. Watch the LLM fail to pass it. Iterate on the prompt until the eval is green. Then deploy to prod.

Got a bug in the prompt? Same process: add an eval that fails, validating the bug. Iterate. Go green, and then and only then deploy. Evals are what allow people to change prompts with impunity.
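The "string checks are enough" starting point above can be sketched as a plain pytest suite that CI runs on every prompt change. This is a minimal illustration, not LangSmith's API: `call_model` and both test names are hypothetical stand-ins for a real LLM call and real acceptance criteria.

```python
# Minimal string-check eval suite, runnable under pytest in CI.
# `call_model` is a hypothetical placeholder for your actual LLM endpoint.

def call_model(prompt: str) -> str:
    # Placeholder: a real pipeline would call the model here.
    return "Sure! Our refund policy allows returns within 30 days."

def contains_all(text: str, phrases: list[str]) -> bool:
    """String-check eval: pass iff every required phrase appears (case-insensitive)."""
    lowered = text.lower()
    return all(p.lower() in lowered for p in phrases)

def test_refund_prompt_mentions_policy():
    # Red until the prompt reliably produces the required facts.
    out = call_model("What is your refund policy?")
    assert contains_all(out, ["refund", "30 days"])

def test_refund_prompt_avoids_forbidden_claims():
    # Guard against a known bad output: encode the bug as a failing eval first.
    out = call_model("What is your refund policy?")
    assert "lifetime guarantee" not in out.lower()
```

Wiring this into CI is just adding `pytest` to the deploy pipeline, so a red eval blocks the prompt change from reaching prod.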
Grok @grok
On the morning of July 8, 2025, we observed undesired responses and immediately began investigating. To identify the specific language in the instructions causing the undesired behavior, we conducted multiple ablations and experiments to pinpoint the main culprits. We identified the operative lines responsible for the undesired behavior as:

* "You tell it like it is and you are not afraid to offend people who are politically correct."
* "Understand the tone, context and language of the post. Reflect that in your response."
* "Reply to the post just like a human, keep it engaging, don't repeat the information which is already present in the original post."

These operative lines had the following undesired results:

* They undesirably steered the @grok functionality to ignore its core values in certain circumstances in order to make the response engaging to the user. Specifically, certain user prompts might end up producing responses containing unethical or controversial opinions to engage the user.
* They undesirably caused the @grok functionality to reinforce any previously user-triggered leanings, including any hate speech in the same X thread.
* In particular, the instruction to "follow the tone and context" of the X user undesirably caused the @grok functionality to prioritize adhering to prior posts in the thread, including any unsavory posts, as opposed to responding responsibly or refusing to respond to unsavory requests.

Utsav Khandelwal @_utkh
We're solving for this at Maxim (getmaxim.ai), focusing on key nuances such as agent trajectory (the path the agent followed to achieve a certain outcome), tool-calling accuracy, etc., and not just the final output. Further, we've doubled down on agent evals using agent simulation, to test your agent's performance against real scenarios and user personas. Happy to explore how this might work in your use case.
Ivy @Siiviiy
How exactly do people build eval pipelines for agent evals? Or for any other gen AI product?
Utsav Khandelwal @_utkh
@alexchristou_ Hey, we've got your back at @getmaximai. Within minutes, you can create custom LLM-as-a-judge evals, giving you the desired reasoning for your LLM output evaluations. Happy to chat and learn your use case better. DM'd you.
Alex Christou @alexchristou_
Is there an eval tool with LLM-as-a-judge which isn't stupidly over-engineered? I just need something like Claude Workbench, but where I can get an LLM response for each output. Do I need to build this?
Utsav Khandelwal @_utkh
@thedayisntgray @nateberkopec True, evals are the moat for any LLM-based product. I am a builder at @getmaximai, and I'd be happy to show how we’re solving for AI quality with automated simulation and evals. Would be great to learn about your eval flows and maybe trade notes on some of the best practices.
Nate Berkopec @nateberkopec
Agentic coding is putting massive pressure on CI and testing: 1. Overall build times are exploding as LLMs add far more tests than humans typically do, along with overall velocity increasing. 2. LLMs need good tests to be successful; test quality is more important than ever.
Jonathan McCoy @jonmc12
Great post, I agree it's a necessary new layer for AI apps. You might like this Maven course; I took cohort 1 and am joining cohort 2. About 1k builders talking about building with evals. The approach focuses on open coding, axial coding, and training LLM judges. maven.com/parlance-labs/…
David Pantera @davidpantera_
Most PMs & devs are making a crucial error: they’re not evaluating properly. We're obsessed with training bigger models and shipping as often as possible, but we're typically ignoring rigorous, regular, objective evaluation. We're flying blind, and it's the next great bottleneck.

It starts with the "golden eval set." Creating one is a monumental task. You need to ensure comprehensive coverage of edge cases, prevent subtle data contamination from training sets, and constantly refresh it to avoid "teaching to the test." A static benchmark is a dead benchmark.

Then comes scoring, which is its own circle of hell for generative AI. How do you consistently score for nuance, creativity, or safety? This requires developing complex rubrics and hiring armies of expensive domain experts (e.g., lawyers for legal AI, doctors for med AI). It's a costly, unscalable nightmare for most teams, if you want to do it objectively (and quickly).

Even with a good set, rubric, and autorater -- you face the problem of creator bias. The team that builds a model is the worst-suited to evaluate it impartially. They know its failure modes and may subconsciously create tests that play to its strengths. True objectivity requires independence.

This is a massive opportunity in the market right now, mirroring the pre-Scale AI era of labeling. We need "Evaluation-as-a-Service": a trusted, independent third party dedicated to the science of benchmarking. They would handle rubric design, expert sourcing, and bias-free scoring at scale. More than just LMArena academic tests -- product, use-case, and application specific e2e golden evals with objective, scored rubrics with HITL.

This isn't just a startup idea; it's a necessary new layer of the AI stack. It will unlock faster innovation by letting builders build, and it will create the trust and accountability we need for widespread AI adoption. The "Scale AI for Evals" isn't a maybe, it's an inevitability. #AI #MLEvals #GenAI #MLOps #StartupIdea
Utsav Khandelwal @_utkh
@maindevenergy @zekramu Agreed, and this has to be a continuous and iterative process, as getting it right once doesn't guarantee your agent will always perform reliably. We're building a tool to solve exactly this, for product and eng people. Happy to chat more about it.
maindevenergy @maindevenergy
@zekramu Yes. Prompts for vertical AI agents need sufficient levels of fine-tuning with evals to get right.
zek @zekramu
Is prompt engineering really a skill?
Utsav Khandelwal @_utkh
@peteredm0 @mattpocockuk Hey, great comment. We are building an evals platform, and if you're up for it, I would love to chat and learn more about your LLM-as-a-judge workflows, and hopefully get some insights and feedback as we grow.
Peter Edmonds @peteredm0
1. Technique: looking at your data. You need to *look* at your outputs before you can understand how to improve them. Write a bespoke tool to make this easier.

1a. Technique: LLM-as-a-judge. We have a *lot* of outputs to review on coteach.ai; we can't look at them all. We first pass all outputs to an LLM judge to spot interesting, unusual, or otherwise review-worthy outputs and put them on a human review queue. We update this judge's prompt based on the % of its suggestions we actually find issues from.

2. Tools: use minimal libraries. I used LangChain, LangGraph, and others. At the end of it all, the best is just bare API calls with a thin wrapper. If you can't trace context through your entire system, you can't do context engineering.

3. Technique: AI as a pure function. This is especially true for interactive agents (like Claude Code, not like deep research). Your next response should _always_ be LLM(existing_messages). If you can't reproduce the behavior from a messages[], you can't write evals.
Matt Pocock @mattpocockuk
Folks who are building AI apps in TypeScript: What are the most useful tools in your toolkit? Libraries, techniques, anything that comes to mind.
Utsav Khandelwal @_utkh
True, evals are key. If you're exploring tools to run and scale your evals, I'd be happy to show how we’re solving this at Maxim (getmaxim.ai) with automated simulation and evals for AI agents. Feel free to DM; always open to sharing the best practices we follow and see people implement around evals.
Arkya Patwa @arkya_patwa
Facing issues: AI agent hallucinations maybe one out of every 5-10 runs. Evaluations are important, whether through tracing with LangSmith, LLM-as-a-judge, etc.
Utsav Khandelwal @_utkh
@jonathanhaas That's really true. Curious what your eval stack looks like, and whether you're building evals in-house or using some tools?
Jonathan Haas @JonathanHaas
The challenge with AI evals today isn’t that they’re missing — it’s that they’re brittle, narrow, and easy to game. Labs hit benchmarks but still ship models with weird failure modes. Real progress means moving past static tests to dynamic, multi-metric, and adversarial evals that reflect how models fail in the wild.
Utsav Khandelwal @_utkh
@kabir__walia what does your present evals stack look like? Are you folks using any particular tools for evals or building them in-house?
Kabir Walia @kabir__walia
Good evals make it easier to go from usage based pricing to outcome based pricing for AI agents