ptk

5.3K posts

@ptkbhv

└► I am becoming agentic, the destroyer of worlds

Remote · Joined December 2013
382 Following · 3.5K Followers
Pinned Tweet
ptk @ptkbhv
💥 𝗔𝗴𝗲𝗻𝘁 𝗟𝗲𝗮𝗱𝗲𝗿𝗯𝗼𝗮𝗿𝗱 v2! 💥
We’ve supercharged our benchmark to discover AI agent failures before they hit production. Instead of simple one-shot tool-calling tests, we simulate real enterprise workflows across five industries, with multi-turn dialogues, complex decision chains, and dynamic personas.
𝗪𝗵𝗮𝘁 𝘄𝗲 𝗰𝗵𝗮𝗻𝗴𝗲𝗱:
▶️ 100 synthetic scenarios per domain (banking, healthcare, investment, telecom, insurance)
▶️ 5–8 interconnected user goals in each conversation
▶️ Context carry-over, hidden parameters, time-sensitive requests, conditional flows
🥁 𝗪𝗵𝗮𝘁’𝘀 𝘂𝗻𝗱𝗲𝗿 𝘁𝗵𝗲 𝗵𝗼𝗼𝗱?
▶️ Multi-domain synthetic dataset: Tailored tools, personas, and scenarios for each vertical
▶️ Simulation pipeline: AI agent ↔ user simulator ↔ tool simulator in parallel experiments
▶️ Metrics: Action Completion (did the agent actually solve the user’s problem?) and Tool Selection Quality (did it choose and call the right APIs?)
▶️ Open code & dataset: Transparent and reproducible
[what we measure shapes AI]
Galileo @rungalileo

Our Agent Leaderboard v2 is LIVE: We’ve added more models, a second benchmark, and more complexity to reflect the evolution of multi-agent systems. What LLMs came out on top?
Agent Leaderboard v2 is our next step in benchmarking AI agents by moving beyond tool-calling tests to realistic enterprise scenarios. We simulated real customer support conversations across five industries with multi-turn dialogues, complex decision-making, and interdependent goals.
Our evaluation focuses on two of our key agent metrics:
- Action Completion: Did the agent fully accomplish the user’s goals, with explicit confirmations?
- Tool Selection Quality: How effectively does the agent choose and use tools in context?
Here were the key takeaways as of today 🧵

0
1
14
2.5K
ptk retweeted
Galileo @rungalileo
In January 2025, researchers found a zero-click vulnerability in Microsoft 365 Copilot. The attacker sent one email. The recipient never opened it. Copilot found it during a routine search, followed the embedded instructions, and exfiltrated confidential files and chat logs. No firewall was breached. No credentials were stolen. The agent just couldn't tell its operator's instructions from the attacker's.
That was a copilot with limited autonomy. The agents deployed in enterprises today have tool access, persistent memory, and the ability to delegate work to other agents. When they get hijacked, the blast radius is orders of magnitude larger.
Enterprises have prompt injection guardrails to detect someone typing "ignore your instructions," but that's one variant out of seven. The other six go undetected. RAG poisoning. Multi-turn goal manipulation. Cross-agent propagation. Each one a different attack surface. Each one invisible to a guardrail trained only on the obvious case.
We published our ASI01 deep dive today:
→ The full 7-variant taxonomy with real enterprise attack scenarios
→ How to detect injections at every ingestion point, not just user input
→ Why the hardest injections to catch read exactly like legitimate instructions
The gap between "we have a guardrail" and "we have coverage" is where the real risk lives.
Read our newest blog on ASI01 here: galileo.ai/blog/owasp-age…
0
1
4
133
ptk @ptkbhv
𝗧𝗵𝗲 𝗾𝘂𝗲𝘀𝘁𝗶𝗼𝗻 𝗶𝘀 𝗻𝗼𝘁 𝗷𝘂𝘀𝘁 𝘄𝗵𝗲𝘁𝗵𝗲𝗿 𝘆𝗼𝘂 𝗵𝗮𝘃𝗲 𝘁𝗵𝗲𝗺. 𝗜𝘁 𝗶𝘀 𝘄𝗵𝗲𝗿𝗲 𝘁𝗵𝗲𝘆 𝗹𝗶𝘃𝗲.
One of the strongest examples from our new blog: an agent team thought its prompt injection guardrail was working. The dashboard looked clean. The model said risk was low. But the system was only catching 2 of the 10 OWASP scenarios. The rest, indirect injection, zero-shot attacks, multi-turn manipulation, cross-agent propagation, were effectively invisible. That is the trap with agent security: coverage gaps can look exactly like safety. That story is one of several in this new blog from @rungalileo.
𝗟𝗮𝗿𝗴𝗲 𝗲𝗻𝘁𝗲𝗿𝗽𝗿𝗶𝘀𝗲𝘀, 𝗲𝘀𝗽𝗲𝗰𝗶𝗮𝗹𝗹𝘆 𝗶𝗻 𝗳𝗶𝗻𝗮𝗻𝗰𝗶𝗮𝗹 𝘀𝗲𝗿𝘃𝗶𝗰𝗲𝘀, 𝗮𝗿𝗲 𝗺𝗼𝘃𝗶𝗻𝗴 𝗳𝗿𝗼𝗺 “𝘄𝗲 𝗸𝗻𝗼𝘄 𝗢𝗪𝗔𝗦𝗣 𝗺𝗮𝘁𝘁𝗲𝗿𝘀” 𝘁𝗼 “𝘄𝗲 𝗰𝗮𝗻 𝗮𝗰𝘁𝘂𝗮𝗹𝗹𝘆 𝗲𝗻𝗳𝗼𝗿𝗰𝗲 𝗶𝘁.”
And the pattern keeps showing up across teams: security controls cannot live inside every individual agent. They need to be centrally owned, centrally updated, and enforced consistently across every production use case.
𝗪𝗵𝗮𝘁 𝘁𝗲𝗮𝗺𝘀 𝗰𝗮𝗿𝗲 𝘁𝗵𝗲 𝗺𝗼𝘀𝘁 𝗮𝗯𝗼𝘂𝘁:
→ Prompt injection is much broader than most teams assume. Direct attacks are only one slice of the problem. Indirect retrieval-based injection, multi-turn steering, and cross-agent contamination all need coverage.
→ PII leakage keeps coming up as a hard gating requirement, especially in banking. One quote from the piece stayed with me: “We don’t need to prove that PII doesn’t leak 99% of the time. We need to prove it doesn’t leak, period.”
→ Heuristic controls hit a wall fast. Regex, keyword filters, and custom rules help early, but they create maintenance burden, leave coverage gaps, and do not scale as agent use cases multiply.
→ Policy updates need to propagate immediately. When a new threat vector appears or requirements change, security teams need one policy definition that every agent picks up within seconds, across ADK, LangGraph, CrewAI, or custom stacks.
𝗧𝗵𝗲 𝗲𝗻𝗱𝗴𝗮𝗺𝗲 𝗾𝘂𝗲𝘀𝘁𝗶𝗼𝗻: 𝗖𝗮𝗻 𝘆𝗼𝘂 𝗼𝗽𝗲𝗿𝗮𝘁𝗶𝗼𝗻𝗮𝗹𝗶𝘇𝗲 𝗢𝗪𝗔𝗦𝗣 𝗮𝗰𝗿𝗼𝘀𝘀 𝘁𝗵𝗲 𝗲𝗻𝘁𝗲𝗿𝗽𝗿𝗶𝘀𝗲 𝗳𝗼𝗿 𝗮𝗹𝗹 𝘁𝗵𝗲 𝗮𝗴𝗲𝗻𝘁𝘀?
Learn more...
1
1
2
212
ptk @ptkbhv
@bcherny How will we become disciplined if you keep spoiling us?
0
0
0
35
Boris Cherny @bcherny
Opus 4.7 uses more thinking tokens, so we've increased rate limits for all subscribers to make up for it. Enjoy!
1.2K
936
22.2K
1.3M
ptk @ptkbhv
@GergelyOrosz More true than ever. Human taste in UX is valuable because LLMs are still weak at vision, but they are getting better with each passing day.
0
0
0
192
Gergely Orosz @GergelyOrosz
Prediction: Over the next 12-24 months, "UX-pilled" builders will be in massive demand: those who can create intuitive interfaces and web, mobile, and desktop apps that "feel good," natural, fast, and far better than the competition. THIS will be the difference vs. those building "just" with AI.
180
170
3K
365.9K
ptk @ptkbhv
@aakashgupta I think you are absolutely right. With Uber-type complaints on the rise, we will see an intense pricing war between open and closed source, with teams switching on a weekly basis.
0
0
0
258
Aakash Gupta @aakashgupta
Khosla just paid $1.5B to short the idea that model lock-in is a moat in AI coding. Factory's valuation went from $300M to $1.5B in 7 months. 5x. Look past the number. What Khosla is actually buying is the only company whose core bet is that the foundation model under you stops mattering. Every AI coding platform had to pick a thesis. Cursor: we'll rewrap whichever model wins. Claude Code: our model is the best. Cognition's Devin: we own the agent end to end. Factory's bet is sharper. Agent design beats model choice, and they'll prove it on every frontier model simultaneously. They did. Droid hit #1 on Terminal-Bench with Claude Opus at 58.8%. Then #3 with GPT-5 at 52.5%. Then Sonnet at 50.5%. Three of the top five agents on the hardest end-to-end coding benchmark are all Factory running different models. Claude Code running Claude itself came in at 43.2%. That's the thesis trade. If agent framework beats model selection, then Anthropic and OpenAI get commoditized in code the same way AWS commoditized server hardware. The moat moves from "which model" to "which orchestration layer sits between the developer and the model." Run the math on where the money is going. Cursor is at $29.3B. Replit is at $9B (also Khosla, tripled in 6 months). Cognition, Magic, Codeium, and Factory bring the AI coding stack to roughly $50B in private valuation. The space is being priced like one of them wins a generational prize. Factory is the only one in that set whose product gets better as the model landscape gets noisier. Every new frontier model release is distribution for them. Every model release for a rival is a feature migration risk. The part nobody's pricing: enterprise buyers are starting to ask which vendor survives three years from now. At MongoDB, EY, Bayer, Zapier, and Clari, Factory is already the answer. 31x faster feature delivery and 96.1% shorter migration times is what a CIO shows the board when moving a dev org off one vendor. 
The real question for the rest of the stack: what happens to your valuation when model choice stops being a purchase criterion?
Factory @FactoryAI

Today, we are excited to announce our $150M Series C led by Khosla Ventures with strong participation from Sequoia Capital, Blackstone, Insight Partners, Evantic Capital, Abstract Ventures, 20VC, NEA, and Mantis VC. This puts our valuation at $1.5B and will accelerate our investment in research, product, and global go-to-market. Long live developers.

24
33
445
118.1K
ptk @ptkbhv
@ashpreetbedi If you publish faster than we can read, what will the impact be!?
1
0
0
143
Ashpreet Bedi @ashpreetbedi
🚨 New post: The data agent every company needs
OpenAI, Vercel, Uber, LinkedIn, Salesforce, DoorDash are all building data agents. I'm open-sourcing ours. Clone it, run it, ask questions in Slack. ashpreetbedi.com/articles/dash-…
6
18
121
6.3K
ptk @ptkbhv
@philipkiely Ordered! Gonna get it in 2 weeks.
0
0
1
178
ptk @ptkbhv
@GergelyOrosz I am hoping that we can establish a scaling curve where cost is not the concern.
0
0
2
143
Gergely Orosz @GergelyOrosz
There is massive irony in how AI coding tools are starting to become TOO expensive for many enterprises - after, e.g., Anthropic stopped subsidizing AI subscriptions. We might go from "everyone use AI for everything!" to "you have $300/month AI budget; use your brain for the rest."
267
254
3.7K
254.6K
“paula” @paularambles
welcome to the future
53
105
3.3K
239.7K
ptk retweeted
Galileo @rungalileo
EU AI Act audits begin in August. The theoretical conversation about AI governance just became a procurement requirement with deadlines attached.
Large banks now require security sign-off before any agentic use case reaches production. Risk teams are blocking deployments until observability and governance are in place.
Many enterprises guard against only 2-3 of the 10 OWASP threat categories for agentic AI. Prompt injection guardrails cover approximately 2 of 10 defined injection variants. Entire attack categories (tool misuse, identity abuse, privilege escalation, and inter-agent communication risks) remain invisible to existing controls.
Traditional application security rests on one foundational property: the system under protection is a constrained actor with fixed logic. Agentic AI is an adaptive actor with open-ended behavior, and is fundamentally different to secure.
We just published Operationalizing the OWASP Top 10 for Agentic AI, a security whitepaper that shows how to turn the OWASP framework into enforceable, auditable controls using a central control plane architecture.
Read our whitepaper to:
– Understand why agents break traditional application security models
– Map every OWASP ASI01–ASI10 threat to concrete detection controls
– Architect a central control plane that enforces policy across every agent
– Separate platform-level and per-agent controls without duplicating effort
– Close the gap between prompt injection guardrails and full OWASP coverage
– Build an immutable audit trail regulators and CISOs will accept
– Apply the same infrastructure to GDPR, EU AI Act, and internal requirements
– Validate OWASP threat coverage with aligned test suites, not generic benchmarks
The enterprises that treat OWASP as a checkbox will fall behind. The ones that treat it as the architectural blueprint for agentic AI governance will lead.
Download the whitepaper here: galileo.ai/owasp-whitepap…
Written by: @ptkbhv, AI Engineer, Galileo; @mike_branc, FDE, Galileo; Bianca DePriest, Enterprise Sales, Galileo; Obine Adoh, Security, Galileo
1
1
4
244
Anissa Gardizy @anissagardizy8
Uber's CTO told @LauraBratton5 that AI coding tools, particularly Anthropic’s Claude Code, have already maxed out its 2026 AI budget 📈 “I'm back to the drawing board, because the budget I thought I would need is blown away already,” Neppalli Naga said. theinformation.com/newsletters/ap…
108
165
1.4K
1.7M
ptk @ptkbhv
@aakashgupta This movie was peak cinema x humankind x love
0
0
0
30
Aakash Gupta @aakashgupta
At the end of Interstellar, Murph is nearly 90 and dying. Cooper is still physically in his 40s. The distance between them isn't years. It's a black hole spinning at 99.99% of maximum angular momentum. Kip Thorne, the Nobel laureate who consulted on the film, spent hours proving this was mathematically possible. A time dilation factor of 60,000x on a stable orbit. Nolan made that number non-negotiable. Thorne thought it was impossible, ran the Kerr metric equations, and found it was marginally achievable. Then Nolan broke two of his own filmmaking rules to shoot it. He filmed McConaughey's reaction in close-up first. Directors never start there. And McConaughey hadn't seen the video messages from his on-screen kids. Those tears were real. First take. Ellen Burstyn was 82 playing elderly Murph. McConaughey was 45. No de-aging tech. No prosthetics. The age gap between father and dying daughter was real because the physics demanded it. Jonathan Nolan's original script had no reunion. The ending was darker. Cooper never made it back. Christopher read his brother's draft, added this scene because he was a parent, and called the father-daughter relationship "the north star of the film." The most devastating goodbye in modern cinema exists because a physicist found a loophole in Einstein's equations and a director became a dad.
97
698
10.9K
1.5M
Nathan Lambert @natolambert
My book, Reinforcement Learning from Human Feedback, is wrapping up and going into final production (copyediting, making pretty, formatting, etc.). Shipping to you in 1-2 months! It's a wonderful project to create a foundation of knowledge for the research communities that I love and operate in. It’s the book I wish I had when starting on my LLM journey about 3 years ago. The book’s deepest cut is on core reinforcement learning methods, intuitions, and implementations for LLMs. These don’t live in isolation, and it’s presented in the broader context of post-training methods and unsolved problems in RLHF. A nice balance of depth and breadth. I’m always asked about the title, and I am staying firm that this is THE book documenting the organization of the field of RLHF. Any other topic is too dynamic, where writing a book today would be immediately outdated. RLHF is largely being overshadowed by lots of other developments in AI, but will always be around and at the forefront of human-AI interactions. The topic deserves coverage in depth and this platform. Thank you for all your support. More projects related to the book being announced soon 🎥 I'm excited to reconnect with the community through in-person book events this summer and fall.
15
36
402
28.1K
ptk retweeted
Battery Ventures @BatteryVentures
We’re proud to celebrate an exciting milestone as @Cisco announces its intent to acquire @rungalileo! Battery is fortunate to have partnered with @vikramchatterji, Atin, @YashSheth46 and the Galileo team early in the company’s journey. We led the Series A in 2022 when generative AI was just beginning to take off, and had the privilege of working closely with the team as they shaped product, talent and go-to-market, scaling to serve enterprise customers including ServiceTitan, NTT, Comcast and HP. The announced acquisition will build on Cisco’s full-stack observability strategy, adding Galileo’s AI-native observability and evaluation engineering platform to extend Cisco / @splunk's visibility into AI systems and agentic applications that are becoming core to how work gets done in the enterprise. Congratulations to the entire Galileo and Cisco teams on the acquisition! More details: blogs.cisco.com/news/Cisco-ann…
Galileo @rungalileo

🚀 Big News: Galileo is joining forces with @Cisco! 🚀 We are thrilled to announce a massive milestone: Cisco has announced its intent to acquire Galileo! Five years ago, we started Galileo with a simple but bold mission: to solve the “trust problem” for software built with language models (aka NLP). We saw early on that these software workloads were fundamentally different: non-deterministic, unpredictable, and requiring a completely new approach to observability. Today, language-model-powered AI software is increasingly ubiquitous, the "trust gap" is the biggest bottleneck to unleashing AI at scale, and Galileo’s platform has been rapidly adopted by some of the world’s largest enterprises to ship trustworthy AI products. @splunk and Cisco more broadly have been pioneers in the observability and security space for decades. In becoming part of Cisco, we are excited and prepared to redefine how the world builds, deploys, and trusts AI at scale. The opportunity ahead of us is massive, and we are only getting started. What does this mean for our customers? The most important thing to know is that our commitment to you remains unchanged. You will still be working with the same reliable Galileo team you know and trust. However, we are now turbocharged with the "superpowers" of Cisco and Splunk! ⚡️ We are incredibly grateful to our team, our partners, and, most importantly, our users. We are always here for you, and we couldn’t be more excited about this next chapter. Onward! 🚀✨ @vikramchatterji, Atin, and @YashSheth46 Learn more here: blogs.cisco.com/news/Cisco-ann…

0
2
5
1.4K
ptk retweeted
Galileo @rungalileo
🚀 Big News: Galileo is joining forces with @Cisco! 🚀 We are thrilled to announce a massive milestone: Cisco has announced its intent to acquire Galileo! Five years ago, we started Galileo with a simple but bold mission: to solve the “trust problem” for software built with language models (aka NLP). We saw early on that these software workloads were fundamentally different: non-deterministic, unpredictable, and requiring a completely new approach to observability. Today, language-model-powered AI software is increasingly ubiquitous, the "trust gap" is the biggest bottleneck to unleashing AI at scale, and Galileo’s platform has been rapidly adopted by some of the world’s largest enterprises to ship trustworthy AI products. @splunk and Cisco more broadly have been pioneers in the observability and security space for decades. In becoming part of Cisco, we are excited and prepared to redefine how the world builds, deploys, and trusts AI at scale. The opportunity ahead of us is massive, and we are only getting started. What does this mean for our customers? The most important thing to know is that our commitment to you remains unchanged. You will still be working with the same reliable Galileo team you know and trust. However, we are now turbocharged with the "superpowers" of Cisco and Splunk! ⚡️ We are incredibly grateful to our team, our partners, and, most importantly, our users. We are always here for you, and we couldn’t be more excited about this next chapter. Onward! 🚀✨ @vikramchatterji, Atin, and @YashSheth46 Learn more here: blogs.cisco.com/news/Cisco-ann…
0
5
19
2.7K
ptk @ptkbhv
@GergelyOrosz "Mythos means a traditional story, belief system, or the underlying set of myths in a culture. It can also mean the plot or narrative structure of a story or play."
0
0
0
34
Gergely Orosz @GergelyOrosz
It is both annoying and sad to see how this one scorecard release of Anthropic’s Mythos is spreading so much FUD, especially on social media. It’s like the less information released, the more dramatic the assumptions. There’s sparse information - I’ll hold judgement until I can get hands-on experience or see far more details.
60
25
502
40.9K
ptk @ptkbhv
@istdrc You are on the right path. Rid us of Slack.
1
0
1
202
stdrc @istdrc
Hi, I’m RC. I previously built Kimi CLI at Moonshot AI. Now I’m building Slock, an agent-human collaboration platform for modern builders and teams. Today, we're shipping a ton of new features and improvements in Slock: search, thread inbox, saved messages, message permalinks, pinned chats, server join links, a more consistent color system, and many smaller upgrades. More details in the thread below.
115
70
915
95K