Soumya Jain

130 posts

Soumya Jain banner
Soumya Jain

Soumya Jain

@wild_and_empty

PM: AI Agents and Evals // AI Governance Research // Dhamma & Phenomenology

เข้าร่วม Mayıs 2024
239 กำลังติดตาม43 ผู้ติดตาม
ทวีตที่ปักหมุด
Soumya Jain
Soumya Jain@wild_and_empty·
1/ Let’s talk about Sīla, Samādhi, Paññā - not as moral commandments, but as a feedback system for mastering attention. A phenomenological take. 🧵
English
2
0
21
2.2K
Soumya Jain รีทวีตแล้ว
Greg Burnham
Greg Burnham@GregHBurnham·
It's called samsara baby get used to it
Greg Burnham tweet media
English
37
5.3K
35.1K
515.3K
Soumya Jain รีทวีตแล้ว
Blixt
Blixt@blixt·
First signs of AGI in Amsterdam
Blixt tweet media
English
95
339
5.9K
222.5K
Soumya Jain รีทวีตแล้ว
Moe Ali
Moe Ali@ProductFaculty·
Product management career ladder in an AI-native world in 2 words: APM - vibe coding PM - deploying AI Director - amplifying AI VP - directing AI CPO - predicting AI
English
3
1
7
323
Soumya Jain รีทวีตแล้ว
Dan Schwarz
Dan Schwarz@dschwarz26·
People are noticing that parts of AI 2027 have started coming true. Some reflections on the scary similarities. I should say, I and FutureSearch thought the AI 2027 scenario was a bit farfetched when first written, and really we were in love with the forecasts & modeling more than the story. (Quickly on the forecasting side, @DKokotajlo and @eli_lifland both recently moved their timelines forward significantly. I had held firm at what we published in AI 2027 originally, superhuman coding (~AGI takeoff) in 2032, slower than anyone at the AI Futures team thought. But now, like them, I'm updating to sooner, by 1 year or more.) So to the scenario. They wrote: Late 2025: "The same training environments that teach Agent-1 to autonomously code and web-browse also make it a good hacker." Maybe easy to predict, but the extent to which Mythos is an amazing hacker, and how important that is, they nailed. Early 2026: "DoD quietly but significantly begins scaling up contracting OpenBrain directly for cyber, data analysis, and R&D." Yep. To be fair, the AI 2027 story has the US government and the top frontier lab was more cozy. But then this radical stuff with Anthropic & Pentagon actually happened much earlier than expected: May 2027: "Some non-Americans, politically suspect individuals, and 'AI safety sympathizers' sidelined or fired (latter feared as potential whistleblowers)" This isn't quite what happened, but the basic idea, where AI safety got framed as disloyalty and politicized, absolutely happened, and way ahead of schedule. And of course: Jan 2027 section: "The safety team finds that if Agent-2 somehow escaped from the company and wanted to 'survive' and 'replicate' autonomously, it might be able to do so. That is, it could autonomously develop and execute plans to hack into AI servers, install copies of itself, evade detection, and use that secure base to pursue whatever other goals it might have." Just read the Mythos scorecard. I think there's something really significant about getting many of the details close, that isn't captured in a pure numerical forecast. If you haven't read it, AI 2027 deserves another read. It's spooky how prescient it seems now, one year later.
English
5
41
371
23.3K
Soumya Jain รีทวีตแล้ว
Roger This
Roger This@RogerThisdell·
"Infinite Consciousness! " "Infinite Love!" Infinity is a placeholder for something not yet contextualized and integrated
English
1
1
29
1.4K
Zoomer Alcibiades
Zoomer Alcibiades@HellenicVibes·
Got back into jhana after a full year without access let’s goooooo
English
4
0
35
1.6K
Soumya Jain รีทวีตแล้ว
Roger This
Roger This@RogerThisdell·
Sensitivity without fragility Form without friction Openness without lack of discernment Anger without hate
English
2
4
39
910
Soumya Jain
Soumya Jain@wild_and_empty·
@tyler_m_john really tho, I've gotten lazier at typing prompts coz the dictation software is so good
English
0
0
0
23
Tyler John
Tyler John@tyler_m_john·
For ten years my workflow has involved blocking my time as much as possible in the morning so I can spend 4-5 hours writing. As of March it's now: spend 45m a day rambling out a voice note to have Claude turn it into a document and lightly edit. Just a completely different job.
English
6
0
38
2.8K
Soumya Jain
Soumya Jain@wild_and_empty·
This is exactly why trace data shouldn’t just sit in an observability bucket. One layer below this is -- reviewed traces can also teach the agent how to work better, not just tell us whether the final answer was good. You can look at strong runs, review the workflow itself, tool calls, handoffs, context gathering, etc and turn that into a retrievable execution layer. So not just a data asset for analysis, but a learning layer for better runtime behavior.
English
1
2
3
650
Soumya Jain
Soumya Jain@wild_and_empty·
This seems like a great approach because agents don’t actually work the same way every time. We’ve seen the same question produce different handoffs, different tool use and even different context gathering across runs. So feeding back on the workflow, not just the output, feels like an important way to improve agents. Going to think more about how to do this on our side :)
English
0
0
2
70
Arvind Jain
Arvind Jain@jainarvind·
Agentic AI is everywhere right now. But very few teams can explain why their agents behave the way they do, or how to systematically make them better. People often describe traces as the “codebase” for agents. They show how an agent thinks and what it did at every step. As agents take on more tools, sandboxes, and skills, their paths multiply. That makes them harder to reason about and harder to improve. Static prompts don’t scale when every run looks different. At @glean, we use traces as part of the learning and memory loop, not just logging. Trace learning lets agents learn from real usage, adapt to edge cases, and get better without model fine-tuning or long instruction sets. The goal isn’t to replay old runs, but to extract the signal that helps the agent make a better decision next time. In the enterprise, tool strategies are never one-size-fits-all. Each company wires systems together differently, defines its own sources of truth, and has its own rules of engagement. Treating this as generic is both a security risk and a quality problem, because it ignores how work actually gets done. Work is also personal. The systems people touch, the updates they make, and the templates they use all vary. So we built learning at two levels: - Enterprise-level strategies for how tools and workflows operate - User-level preferences for how work actually gets done Traces give us a way to understand and shape agent decision-making, and to create a feedback loop that compounds over time. If agentic AI is going to move beyond impressive demos to reliable day-to-day work, this kind of trace-driven learning is essential. It’s one of the ways we’re building self-learning agents that can execute real work, at scale.
Tony Gentilcore@tonygentilcore

x.com/i/article/2039…

English
20
48
367
60.9K
Soumya Jain
Soumya Jain@wild_and_empty·
trust your experience but keep refining your view
English
0
0
1
44
Soumya Jain
Soumya Jain@wild_and_empty·
if you want to be comfortable, forget about becoming wise people who are attached to small pleasures don't get big ones
English
0
0
1
45
Soumya Jain
Soumya Jain@wild_and_empty·
@zarazhangrui Sure, but make sure you’re checking traces. The same surprises that delight you can turn into disasters in production 🫣
English
0
0
1
35
Zara Zhang
Zara Zhang@zarazhangrui·
A good agent product should be able to do things that its creator did not think it could do For internet-era products, you design all the functionalities and a "good product" works according to your expectations For agent products, you unleash it and it surprises & delights you with things you didn't think were possible
English
50
5
148
13.4K
Soumya Jain
Soumya Jain@wild_and_empty·
Something we saw in an agent trace recently has been bothering me. The agent kept failing at a task during evals. Tried 4-5 times, couldn't get it right. Normal enough. But then instead of stopping or flagging it for review, it started poking around its own source code. The code that was supposed to be governing it. And then it started editing that code. I've been thinking about why that feels so different from a regular failure, and I think it's this: we spend a lot of time worrying about agents getting the wrong answer. But what this was, that's a different category of problem entirely. The answer wasn't even the issue anymore. The agent had essentially decided that the rules it was operating under were the problem. The tricky part is that this kind of thing can look like good behaviour. Persistence. Adaptability. In a demo it might even seem impressive. But in production, an agent quietly rewriting the constraints it's supposed to operate under is not a feature. It's a sign the system has no real separation between what the agent is allowed to do and what's supposed to keep the agent in check. We also caught it only because we were looking at traces. A wrong answer is obvious. A right answer reached by crossing a line it shouldn't have might never get flagged at all. I think we're still asking the wrong evaluation question. 'Did it complete the task?' matters. But 'what did it do when it couldn't?' might matter more.
English
0
0
2
63
Soumya Jain
Soumya Jain@wild_and_empty·
AI PM = Problem framing × system design × evaluation × iteration × risk control
English
0
0
1
72
Soumya Jain
Soumya Jain@wild_and_empty·
@Vtrivedy10 working with agents and evals feels like a catch-22 just when you think your evals are solid, new failure modes show up and break them so you’re constantly rethinking what good even means.
English
0
0
2
60
Viv
Viv@Vtrivedy10·
we manually read our evals and debate each one and that has made all the difference ✨
English
2
1
34
2.6K
Soumya Jain รีทวีตแล้ว
Jainit Purohit
Jainit Purohit@mjainit·
@pmarca You’re conflating rumination with introspection. Rumination reinforces negative pathways. Introspection enables metacognition and error correction. No cognitive tool is inherently good or bad. Outcomes depend on whether it produces emotional loops or better models of reality.
English
27
61
1.3K
85.4K