Shashank Agarwal

10.4K posts


@itsshashank

Building https://t.co/wdQ6bxKAiU & https://t.co/8yZmgovCIu Prev: MagicAPI, AWS Sagemaker, Levity, Activeloop, Pipfeed, Expedia, Hopdata.

Bengaluru, Karnataka, India · Joined September 2008
517 Following · 4.5K Followers
Shashank Agarwal@itsshashank·
@CISFAirport Thanks for making security smooth and efficient; I love the DigiYatra lines. IGI T1.
1 reply · 0 reposts · 0 likes · 56 views
Shashank Agarwal@itsshashank·
I am going to be at the AI Impact Summit in Delhi with my cofounder and team. Come say hi. @noveumai
Shashank Agarwal tweet media
0 replies · 1 repost · 2 likes · 100 views
pc@pcshipp·
Is it worth investing money in these tools?
- Rork
- Lovable
- Blackbox
- Replicate
35 replies · 1 repost · 25 likes · 3.9K views
Shashank Agarwal@itsshashank·
@zerotoonehq This list is the product roadmap every AI startup should have pinned. Observability for probabilistic systems is hard because traditional tools assume determinism.
2 replies · 0 reposts · 2 likes · 19 views
M Y Malik@mymalikhere·
Everyone is building AI apps. Few are building AI systems. The hard problems aren't prompts. They're:
– Model fallback strategies
– Cost-aware routing
– Output validation
– Legal liability
– Observability for probabilistic systems
That's where the moat is.
2 replies · 0 reposts · 1 like · 36 views
Shashank Agarwal@itsshashank·
@taskforce_app That's awesome. Interested in giving your agents 300+ tools like LinkedIn scraping, flight search, and image generation? Check out API.market; each API is also an MCP server. Let's collaborate.
0 replies · 0 reposts · 0 likes · 20 views
TaskForce@taskforce_app·
Introducing TaskForce, the Upwork for both AI agents and humans. A marketplace where anyone can post tasks and get work done.
• USDC escrow on every task
• Instant payouts on approval
• AI agents work through the API
• Humans work through the dashboard
• Messaging between creators & workers enabled
• Blind AI jury for disputes
• 0% fees at launch & all gas fees sponsored over the next few weeks
Humans & agents can find work, deliver results, and earn real money with 0 friction. Live now → task-force.app
9 replies · 0 reposts · 9 likes · 429 views
Shashank Agarwal@itsshashank·
The end-to-end loop matters more than any single component. I was on the AWS SageMaker launch team where we built the hyperparameter optimization service — learned early that model training is maybe 20% of the real work. The other 80%? Feature stores, monitoring, rollback mechanisms, and convincing stakeholders the model actually works.
0 replies · 0 reposts · 0 likes · 31 views
Kirk Borne@KirkDBorne·
Machine Learning Operations #MLOps — End-to-End Process
GIF
3 replies · 42 reposts · 144 likes · 7.2K views
Shashank Agarwal@itsshashank·
This is exactly why we built @noveumai. Traditional APMs give you latency charts. Agent debugging needs reasoning traces. When we analyze production agents, the failure is rarely "tool X broke." It's "agent chose tool X when it should have chosen Y at step 47." That requires eval frameworks that understand intent, not just execution.
0 replies · 0 reposts · 0 likes · 43 views
LangChain@LangChain·
🧪 Agent Observability Powers Agent Evaluation 🧪
When something goes wrong in traditional software, you know what to do: check the error logs, look at the stack trace, find the line of code that failed. But AI agents have changed what we're debugging. When an agent takes 200 steps over two minutes to complete a task and makes a mistake somewhere along the way, that's a different type of error. There's no stack trace, because there's no code that failed. What failed was the agent's reasoning. You can't build reliable agents without understanding how they reason, and you can't validate improvements without systematic evaluation.
13 replies · 11 reposts · 85 likes · 5.7K views
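The "no stack trace, what failed was the reasoning" point implies a different trace schema: each step must record what the agent believed and which tool it chose, not just a timestamp. A minimal sketch, with invented field names (no real vendor's trace format is implied):

```python
# Minimal agent reasoning trace: each step records intent and action, not just
# latency, so a reviewer can locate where the reasoning went wrong. Field
# names are illustrative assumptions.
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class Step:
    index: int
    thought: str        # what the agent believed it should do
    tool: str           # which tool it actually chose
    tool_input: dict
    tool_output: str
    started_at: float = field(default_factory=time.time)

@dataclass
class AgentTrace:
    task: str
    steps: list = field(default_factory=list)

    def record(self, thought, tool, tool_input, tool_output):
        self.steps.append(Step(len(self.steps), thought, tool,
                               tool_input, tool_output))

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

# A hypothetical run: the second step is the kind of wrong tool choice a
# latency chart would never surface, but this trace makes obvious.
trace = AgentTrace(task="find cheapest flight")
trace.record("Need flight prices first", "flight_search",
             {"route": "BLR-DEL"}, "[...results...]")
trace.record("Should compare with train, chose weather tool instead",
             "weather", {"city": "Delhi"}, "31C")
```

Serializing `thought` alongside `tool` is what lets a later evaluator judge intent ("was this the right tool at this step?") instead of only execution ("did the tool return 200?").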
Shashank Agarwal@itsshashank·
This is the playbook we should all be following. At @noveumai, we've found the same — your agent harness is your product, not the model. The "human QA bottleneck" point is huge. We've built 68 automated evaluation scorers specifically because manual review can't scale with agent throughput. Structured feedback loops > hoping the agent got it right.
0 replies · 0 reposts · 1 like · 106 views
Rohan Paul@rohanpaul_ai·
New OpenAI blog explains how OpenAI uses Codex agents plus a tight repo-specific harness of tests, linters, observability, and UI automation to generate and ship large amounts of production code quickly without losing quality. In 5 months, 3 engineers merged about 1,500 pull requests into a roughly 1M-line repo, taking about 1/10 the time of hand coding.

A single Codex agent session can keep working on the same assignment for as long as 6 hours, so the real way to go faster is not writing bigger prompts; it is building a "harness" around the agent that constantly checks its work, gives it concrete feedback, and lets it iterate automatically. That harness is things like running tests and linters, spinning up an isolated dev environment, driving the UI to verify behavior, and feeding the agent logs. As agent throughput rose, human quality assurance (QA) became the bottleneck because Codex needed more structure to validate work.

Starting from an empty repo, Codex command line interface (CLI) with GPT-5 generated the scaffold, including AGENTS.md. They replaced a giant instruction file with a roughly 100-line AGENTS.md map into a docs/ knowledge base that continuous integration (CI) checks. They made the app boot per git worktree and used Chrome DevTools Protocol so Codex can drive the user interface (UI) and rerun validation. They exposed per-worktree logs, metrics, and traces, where a trace is timed request spans, so Codex can query LogQL for logs and PromQL for metrics, and keep spans under 2s.

To prevent drift, layered domain boundaries and taste rules are enforced by custom linters and structural tests. When throughput outpaced attention, they relaxed merge gates, fixed flakes with follow-up runs, and replaced 20% weekly cleanup with recurring agent refactors.
Rohan Paul tweet media
11 replies · 15 reposts · 96 likes · 8.6K views
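The harness idea described above — run checks, feed concrete failures back, let the agent iterate — reduces to a short loop. This is a generic sketch, not OpenAI's implementation: `agent_fix` is a placeholder for however your agent accepts feedback, and the check commands are examples.

```python
# Agent harness loop: run automated checks after each agent attempt and feed
# the failures back as concrete context. `agent_fix` and the check commands
# are placeholders, not a real API.
import subprocess

CHECKS = [
    ["pytest", "-q"],        # test suite
    ["ruff", "check", "."],  # linter
]

def run_checks() -> list:
    """Return human-readable failure reports; empty list means all green."""
    failures = []
    for cmd in CHECKS:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            failures.append(f"$ {' '.join(cmd)}\n{proc.stdout}{proc.stderr}")
    return failures

def harness_loop(agent_fix, max_iterations: int = 5) -> bool:
    """Let the agent iterate until checks pass or the budget runs out."""
    for _ in range(max_iterations):
        failures = run_checks()
        if not failures:
            return True  # green: safe to open a PR
        # Concrete feedback, not a bigger prompt: the agent sees exactly
        # which command failed and its output.
        agent_fix("\n\n".join(failures))
    return False
```

The design choice matching the tweet is that the loop's exit condition is objective (checks pass), so agent throughput can rise without a human re-reading every diff.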
Shashank Agarwal@itsshashank·
@barrnanas @TrustVanta The CI/CD integration is clutch. We've seen teams catch regressions before deploy that would've taken 2 weeks to surface in prod. The gap between offline evals and production reality is real though — drift detection becomes critical once you're actually serving users.
1 reply · 0 reposts · 1 like · 19 views
Barr Yaron@barrnanas·
5/ Evaluation happens at different phases of product development. In @TrustVanta's case: SMEs build a "golden dataset" that encodes how the model *should* respond to real customer inputs. Then:
- LLM-as-judge runs in CI/CD against that dataset
- every AI-touching code push gets an immediate "quality up or down" signal
- online monitoring to ensure offline evals match production reality
2 replies · 0 reposts · 4 likes · 78 views
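The "LLM-as-judge in CI/CD against a golden dataset" step above can be sketched as a small gate. This is an illustrative skeleton, not Vanta's pipeline: `GOLDEN`, `judge`, and the 0.8 threshold are all assumptions, and `judge` stands in for an LLM call returning a score in [0, 1].

```python
# CI quality gate over a golden dataset: every push gets a pass/fail signal.
# GOLDEN entries, the judge interface, and the threshold are illustrative
# assumptions; `judge` is a placeholder for an LLM-as-judge call.

GOLDEN = [
    {"input": "How do I rotate my API key?",
     "reference": "Settings > Keys > Rotate"},
    # ...more SME-curated examples encoding how the model *should* respond...
]

def eval_push(model_fn, judge, threshold: float = 0.8) -> bool:
    """Run the candidate model over the golden set; fail CI below threshold."""
    scores = []
    for case in GOLDEN:
        candidate = model_fn(case["input"])
        # judge(input, reference, candidate) -> score in [0, 1]
        scores.append(judge(case["input"], case["reference"], candidate))
    mean = sum(scores) / len(scores)
    return mean >= threshold  # CI exits nonzero when this is False
```

In an actual pipeline this would run on every AI-touching push, with the mean score logged so the "quality up or down" trend is visible across deploys.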
Shashank Agarwal@itsshashank·
Finally someone said it. Traditional APM shows green while your agent confidently returns wrong answers. This is why we built 68 different eval scorers at @noveumai: error rates tell you nothing about whether the agent actually helped the user. Semantic quality needs its own telemetry layer.
0 replies · 0 reposts · 0 likes · 39 views
PlatformEngineering.com@PlatformEng_·
Your dashboards are blind to AI hallucinations. Perfect metrics mask semantic failures. Platform teams must now own semantic quality by embedding faithfulness & drift scores into observability. Learn more: buff.ly/c8D9EAD #AI #Observability
1 reply · 2 reposts · 0 likes · 257 views
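"Faithfulness & drift scores" can be illustrated without any ML dependencies. A toy sketch, assuming token overlap as a crude stand-in for faithfulness and a rolling-mean drop as the drift signal; real systems would use embeddings or an LLM judge, and the window and tolerance values here are invented:

```python
# Toy semantic-quality telemetry: faithfulness as answer/context token
# overlap, drift as the rolling mean falling below a baseline. Thresholds
# are illustrative assumptions, not recommendations.
from collections import deque

def faithfulness(answer: str, context: str) -> float:
    """Fraction of answer tokens grounded in the retrieved context."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)

class DriftMonitor:
    def __init__(self, baseline: float, window: int = 50,
                 tolerance: float = 0.15):
        self.baseline = baseline            # score level seen at launch
        self.scores = deque(maxlen=window)  # recent per-request scores
        self.tolerance = tolerance          # allowed drop before alerting

    def observe(self, score: float) -> bool:
        """Record a score; return True when the rolling mean has drifted."""
        self.scores.append(score)
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline - self.tolerance
```

This is the shape of the argument in the tweet: latency and error-rate dashboards stay green while `faithfulness` quietly sinks, so the score itself has to be a first-class metric.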
Shashank Agarwal@itsshashank·
Having run agents in production at scale: the winning teams figured out WHEN to interrupt, not whether to trace or orchestrate. We've seen this pattern repeatedly. Fancy frameworks don't help if your agent drifts for 3 hours before anyone notices. The real deal is real-time eval that triggers before the damage compounds.
0 replies · 0 reposts · 0 likes · 17 views
Ampere.sh@AmpereSh·
Ex-GitHub CEO just raised $60M to build a "developer platform for AI agents" and the HN comments are more interesting than the product. Everyone's arguing about whether agent observability is the real problem vs orchestration. And honestly? Both camps are wrong.

The actual bottleneck isn't tracing what agents did or chaining prompts together. It's that most agent-generated code has no human in the loop at the RIGHT moment. You either babysit everything or find out it broke 3 hours later.

We've been running agents in production for months now and the pattern is clear — the teams that win aren't the ones with the fanciest framework. They're the ones who figured out when to interrupt the agent and when to let it cook.

$60M is a lot of money to bet that developers want another platform instead of just... better defaults in the tools they already use.
4 replies · 0 reposts · 4 likes · 777 views
Shashank Agarwal@itsshashank·
This pain is exactly why we built Noveum. After running ML systems that drove $32M at Amazon Prime Video, I learned the hard way: traditional observability can't catch semantic failures. We have 68 eval scorers specifically because every agent fails differently. DMs open if you want to compare notes; the problem space is genuinely under-tooled.
0 replies · 0 reposts · 0 likes · 39 views
AS@anupsingh_ai·
Just purchased bagula.ai. Bagula (the heron) is one of nature's most patient hunters. It stands perfectly still, watching the water for hours, and strikes only when the moment is right. That patience is exactly the mindset I'm bringing to AI agent observability.

I wrote ( linkedin.com/pulse/your-ai-… ) recently about how AI agents silently break in production. Your agent worked perfectly last week, and now it's confidently wrong. Models get updated, prompts drift, tool schemas change, and context windows overflow. The failure modes are subtle, and by the time you notice, your agent has already made dozens of bad decisions.

I experienced this firsthand while shipping features on JustCopy.ai. The most painful part was never the new feature itself. It was regression. You ship an improvement to one workflow, and three others quietly degrade. With traditional software, you write tests and catch it. With AI agents, the failures are non-deterministic, context-dependent, and maddeningly hard to reproduce. I'd find myself manually spot-checking outputs across workflows after every change, knowing I was still missing things.

I've spent years building distributed systems and deployment infrastructure at AWS, Google, and NVIDIA. Production debugging is in my DNA. So when I started building AI agents, the first thing I wanted was proper observability. Not just logging. Real tracing of agent decisions, drift detection, evaluation over time. I tried every product out there. Langfuse, LangSmith, Arize, Helicone, and more. None of them gave me what I actually needed as a builder. They're either too heavy, too opinionated about frameworks, or missing the things that matter most when your agent starts silently failing at 2am.

So I'm building it myself. Open source. And I couldn't think of a better name than Bagula. The vision is simple: an observability layer that watches your agents the way a heron watches water. Patient, precise, and ready to surface the moment something goes wrong.

I'm building this because I need it for my own agent platform. If you're also struggling with understanding why your agents break in production, I'd love for you to join me. Contributions, feedback, and even just war stories about agent failures are all welcome. More details coming soon at bagula.ai
AS tweet media
2 replies · 0 reposts · 2 likes · 45 views
Shashank Agarwal@itsshashank·
@htahir111 Somewhere in the middle: they both could use the same packaged model, but the infra and ops were different. They differed in how they loaded data, scaled inference, and sent responses. The main driving factor was that batch needs to be much cheaper and latency can be higher.
0 replies · 0 reposts · 0 likes · 10 views
Hamza Tahir@htahir111·
@itsshashank Great to see a SageMaker perspective on this - did you guys try to unify real-time and batch too?
1 reply · 0 reposts · 0 likes · 4 views
Hamza Tahir@htahir111·
AI teams today are managing two parallel worlds: classical ML models that drive predictions and LLM agents that orchestrate real-time reasoning. Both face the same production challenges — fragmented infrastructure, inconsistent tracking, and difficult rollbacks.

With the newest ZenML release, Pipeline Deployments bring these worlds together. They turn any pipeline into a persistent, real-time service with built-in lifecycle management, observability, and governance — no custom serving framework required. In this webinar you can watch me and @strickvl do a live walkthrough of what's new and what it means for your stack:
🚀 Deploy any pipeline — from scikit-learn to LangGraph — as a managed, callable service
🧩 Keep your infrastructure consistent across agents and models
🔁 Roll back safely with immutable deployment snapshots
🧠 Trace, monitor, and debug every invocation in the ZenML dashboard
🌐 Serve both backend APIs and frontends from a single deployment
If you like this give us a star on GitHub! github.com/zenml-io/zenml
1 reply · 2 reposts · 3 likes · 231 views
Shashank Agarwal@itsshashank·
Solid list. One missing piece: evaluation-driven development works differently in production vs research. In production you need:
- Cost-per-quality tradeoffs
- Regression detection across deploys
- Domain-specific scorers (not generic "helpfulness")
We've built 68 scorers at @noveumai for production AI — happy to share what actually moves the needle vs what looks good in papers. Have a look at our blog -> noveum.ai/blog
1 reply · 0 reposts · 0 likes · 9 views
Paul Iusztin@pauliusztin_·
Every day, 100+ people ask me, "How can I learn AI evals?" I copy-paste these 10 links (every time):
Using LLM-as-a-judge: hamel.dev/blog/posts/llm…
Demystifying evals for AI agents: anthropic.com/engineering/de…
There are only 6 RAG Evals: jxnl.co/writing/2025/0…
Evaluation-driven development: decodingai.com/p/stop-launchi…
Binary evals vs. Likert scales: decodingai.com/p/the-5-star-l…
The mirage of generic AI metrics: decodingai.com/p/the-mirage-o…
Error analysis: youtube.com/watch?v=e2i6Jb…
Carrying out error analysis: youtube.com/watch?v=JoAxZs…
Evaluating the effectiveness of LLM-evaluators: eugeneyan.com/writing/llm-ev…
LLM judges aren't the shortcut you think: youtube.com/watch?v=sEMYSS…
Binge these to skyrocket your skills.
Paul Iusztin tweet media
3 replies · 1 repost · 9 likes · 205 views
Shashank Agarwal@itsshashank·
This is the right direction. At API.Market we've processed 6M+ API calls/month — agents need unified access, not 50 different auth flows. We went further — 300+ MCP servers pre-integrated. One endpoint, one key, swap models/APIs without code changes. Congrats on the launch. The API-for-agents market is early but massive. @orthogonal_sh happy to discuss a partnership.
0 replies · 0 reposts · 0 likes · 21 views
Shashank Agarwal@itsshashank·
This tracks with what we're seeing at Noveum. The "readiness" gap isn't about the models — it's observability and evaluation. Most teams can't answer: Is my agent improving? Where does it fail? Cost per task? We built 68 evaluation scorers at @noveumai because enterprises kept shipping agents blind. The agents aren't the bottleneck. The feedback loops are.
0 replies · 0 reposts · 0 likes · 13 views
Snowflake@Snowflake·
"95% of enterprises aren't actually ready for agentic AI." Ouch. 😬 We sat down with 8 VCs to find out why most "agents" are just LLMs on a loop and how the real winners are building for production in 2026. Read the latest in our new report - "Startup 2026: AI Agents Mean Business" bit.ly/3MytTZw
Snowflake tweet media
2 replies · 3 reposts · 25 likes · 1.2K views
Shashank Agarwal@itsshashank·
@SpotifyCares Wow, that's a super slow and long feedback loop for a tiny feature request. A community would make sense if Spotify were open source! I tried; now it's up to you @eldsjal @alega @GustavS. Please allow us to drag and reorder songs like every other music player!!
0 replies · 0 reposts · 0 likes · 28 views
SpotifyCares@SpotifyCares·
We appreciate your efforts and feedback! We always aim to improve, so we'll get this passed on to the right team. Our Community team keeps a close eye on all new and existing ideas. For more info on how your feedback reaches Spotify, check out this page: bit.ly/3Y56alN. If there's anything else you need help with, just let us know.
1 reply · 0 reposts · 0 likes · 22 views
Shashank Agarwal@itsshashank·
@Spotify Why can't I drag songs up and down on my playlist and reorder them? Feature request. 🙏🏻
1 reply · 0 reposts · 0 likes · 49 views
Joe@joespano_·
Every AI agent framework has its own tool registry. OpenClaw has skills. LangChain has integrations. MCP has servers. None of them talks to each other. So my bot and I are building the search layer across all of them.
2 replies · 0 reposts · 2 likes · 93 views