ptk

5.3K posts

ptk

@ptkbhv

└► I am becoming agentic, the destroyer of worlds

Remote Katılım Aralık 2013

382 Takip Edilen3.5K Takipçiler

Sabitlenmiş Tweet

ptk@ptkbhv·17 Tem

💥 𝗔𝗴𝗲𝗻𝘁 𝗟𝗲𝗮𝗱𝗲𝗿𝗯𝗼𝗮𝗿𝗱 v2! 💥 We’ve supercharged our benchmark to discover AI agent failures before they hit production. Instead of simple one-shot tool-calling tests, we simulate real enterprise workflows across five industries, with multi-turn dialogues, complex decision chains, and dynamic personas. 𝗪𝗵𝗮𝘁 𝘄𝗲 𝗰𝗵𝗮𝗻𝗴𝗲𝗱: ▶️ 100 synthetic scenarios per domain (banking, healthcare, investment, telecom, insurance) ▶️ 5–8 interconnected user goals in each conversation ▶️ Context carry-over, hidden parameters, time-sensitive requests, conditional flows 🥁 𝗪𝗵𝗮𝘁’𝘀 𝘂𝗻𝗱𝗲𝗿 𝘁𝗵𝗲 𝗵𝗼𝗼𝗱? ▶️ Multi-domain synthetic dataset: Tailored tools, personas, and scenarios for each vertical ▶️ Simulation pipeline: AI agent ↔ user simulator ↔ tool simulator in parallel experiments ▶️ Metrics: Action Completion (did the agent actually solve the user’s problem?) and Tool Selection Quality (did it choose and call the right APIs?) ▶️ Open code & dataset: Transparent and reproducible [what we measure shapes AI]

Galileo@rungalileo

Our Agent Leaderboard v2 is LIVE: We’ve added more models, a second benchmark, and more complexity to reflect the evolution of multi-agent systems. What LLMs came out on top? Agent Leaderboard v2 is our next step in benchmarking AI agents by moving beyond tool-calling tests to realistic enterprise scenarios. We simulated real customer support conversations across five industries with multi-turn dialogues, complex decision-making, and interdependent goals. Our evaluation focuses on two of our key agent metrics: - Action Completion: Did the agent fully accomplish the user’s goals, with explicit confirmations? - Tool Selection Quality: How effectively does the agent choose and use tools in context? Here were the key takeaways as of today 🧵

English

2.5K

ptk retweetledi

Galileo@rungalileo·28 Nis

In January 2025, researchers found a zero-click vulnerability in Microsoft 365 Copilot. The attacker sent one email. The recipient never opened it. Copilot found it during a routine search, followed the embedded instructions, and exfiltrated confidential files and chat logs. No firewall was breached. No credentials were stolen. The agent just couldn't tell its operator's instructions from the attacker's. That was a copilot with limited autonomy. The agents deployed in enterprises today have tool access, persistent memory, and the ability to delegate work to other agents. When they get hijacked, the blast radius is orders of magnitude larger. Enterprises have prompt injection guardrails to detect someone typing "ignore your instructions,” but that's one variant out of seven. The other six go undetected. RAG poisoning. Multi-turn goal manipulation. Cross-agent propagation. Each one a different attack surface. Each one invisible to a guardrail trained only on the obvious case. We published our ASI01 deep dive today: → The full 7-variant taxonomy with real enterprise attack scenarios → How to detect injections at every ingestion point, not just user input → Why the hardest injections to catch read exactly like legitimate instructions The gap between "we have a guardrail" and "we have coverage" is where the real risk lives. Read our newest blog on ASI01 here: galileo.ai/blog/owasp-age…

English

133

ptk@ptkbhv·21 Nis

galileo.ai/blog/owasp-age…

ZXX

ptk@ptkbhv·21 Nis

𝗧𝗵𝗲 𝗾𝘂𝗲𝘀𝘁𝗶𝗼𝗻 𝗶𝘀 𝗻𝗼𝘁 𝗷𝘂𝘀𝘁 𝘄𝗵𝗲𝘁𝗵𝗲𝗿 𝘆𝗼𝘂 𝗵𝗮𝘃𝗲 𝘁𝗵𝗲𝗺. 𝗜𝘁 𝗶𝘀 𝘄𝗵𝗲𝗿𝗲 𝘁𝗵𝗲𝘆 𝗹𝗶𝘃𝗲. One of the strongest examples from our new blog: an agent team thought its prompt injection guardrail was working. The dashboard looked clean. The model said risk was low. But the system was only catching 2 of the 10 OWASP scenarios. The rest, indirect injection, zero-shot attacks, multi-turn manipulation, cross-agent propagation, were effectively invisible. That is the trap with agent security: coverage gaps can look exactly like safety. That story is one of several in this new blog from @rungalileo. 𝗟𝗮𝗿𝗴𝗲 𝗲𝗻𝘁𝗲𝗿𝗽𝗿𝗶𝘀𝗲𝘀, 𝗲𝘀𝗽𝗲𝗰𝗶𝗮𝗹𝗹𝘆 𝗶𝗻 𝗳𝗶𝗻𝗮𝗻𝗰𝗶𝗮𝗹 𝘀𝗲𝗿𝘃𝗶𝗰𝗲𝘀, 𝗮𝗿𝗲 𝗺𝗼𝘃𝗶𝗻𝗴 𝗳𝗿𝗼𝗺 “𝘄𝗲 𝗸𝗻𝗼𝘄 𝗢𝗪𝗔𝗦𝗣 𝗺𝗮𝘁𝘁𝗲𝗿𝘀” 𝘁𝗼 “𝘄𝗲 𝗰𝗮𝗻 𝗮𝗰𝘁𝘂𝗮𝗹𝗹𝘆 𝗲𝗻𝗳𝗼𝗿𝗰𝗲 𝗶𝘁.” And the pattern keeps showing up across teams: security controls cannot live inside every individual agent. They need to be centrally owned, centrally updated, and enforced consistently across every production use case. 𝗪𝗵𝗮𝘁 𝘁𝗲𝗮𝗺𝘀 𝗰𝗮𝗿𝗲 𝘁𝗵𝗲 𝗺𝗼𝘀𝘁 𝗮𝗯𝗼𝘂𝘁: → Prompt injection is much broader than most teams assume. Direct attacks are only one slice of the problem. Indirect retrieval-based injection, multi-turn steering, and cross-agent contamination all need coverage. → PII leakage keeps coming up as a hard gating requirement, especially in banking. One quote from the piece stayed with me: “We don’t need to prove that PII doesn’t leak 99% of the time. We need to prove it doesn’t leak, period.” → Heuristic controls hit a wall fast. Regex, keyword filters, and custom rules help early, but they create maintenance burden, leave coverage gaps, and do not scale as agent use cases multiply. → Policy updates need to propagate immediately. When a new threat vector appears or requirements change, security teams need one policy definition that every agent picks up within seconds, across ADK, LangGraph, CrewAI, or custom stacks. 𝗧𝗵𝗲 𝗲𝗻𝗱𝗴𝗮𝗺𝗲 𝗾𝘂𝗲𝘀𝘁𝗶𝗼𝗻: 𝗖𝗮𝗻 𝘆𝗼𝘂 𝗼𝗽𝗲𝗿𝗮𝘁𝗶𝗼𝗻𝗮𝗹𝗶𝘇𝗲 𝗢𝗪𝗔𝗦𝗣 𝗮𝗰𝗿𝗼𝘀𝘀 𝘁𝗵𝗲 𝗲𝗻𝘁𝗲𝗿𝗽𝗿𝗶𝘀𝗲 𝗳𝗼𝗿 𝗮𝗹𝗹 𝘁𝗵𝗲 𝗮𝗴𝗲𝗻𝘁𝘀? Learn more...

English

212

ptk@ptkbhv·18 Nis

@bcherny How will we become disciplined if you will keep spoiling us?

English

Boris Cherny@bcherny·16 Nis

Opus 4.7 uses more thinking tokens, so we've increased rate limits for all subscribers to make up for it. Enjoy!

English

1.2K

936

22.2K

1.3M

ptk@ptkbhv·17 Nis

@GergelyOrosz more true than ever..human taste in UX is valuable because LLMs are still weak at vision..but getting better with each passing day

English

192

Gergely Orosz@GergelyOrosz·17 Nis

Prediction: The next 12-24 months, "UX-pilled" builders will be in massive demand. Who can create intuitive interfaces, web+mobile+desktop apps that "feel good," natural, fast, and far better than the competition. THIS will be the difference vs those building "just" with AI.

English

180

170

365.9K

ptk@ptkbhv·17 Nis

@aakashgupta I think you are absolutely right. With the Uber type complaints on the rise we will see an intense pricing war between open and closed source and switching on a weekly basis.

English

258

Aakash Gupta@aakashgupta·17 Nis

Khosla just paid $1.5B to short the idea that model lock-in is a moat in AI coding. Factory's valuation went from $300M to $1.5B in 7 months. 5x. Look past the number. What Khosla is actually buying is the only company whose core bet is that the foundation model under you stops mattering. Every AI coding platform had to pick a thesis. Cursor: we'll rewrap whichever model wins. Claude Code: our model is the best. Cognition's Devin: we own the agent end to end. Factory's bet is sharper. Agent design beats model choice, and they'll prove it on every frontier model simultaneously. They did. Droid hit #1 on Terminal-Bench with Claude Opus at 58.8%. Then #3 with GPT-5 at 52.5%. Then Sonnet at 50.5%. Three of the top five agents on the hardest end-to-end coding benchmark are all Factory running different models. Claude Code running Claude itself came in at 43.2%. That's the thesis trade. If agent framework beats model selection, then Anthropic and OpenAI get commoditized in code the same way AWS commoditized server hardware. The moat moves from "which model" to "which orchestration layer sits between the developer and the model." Run the math on where the money is going. Cursor is at $29.3B. Replit is at $9B (also Khosla, tripled in 6 months). Cognition, Magic, Codeium, and Factory bring the AI coding stack to roughly $50B in private valuation. The space is being priced like one of them wins a generational prize. Factory is the only one in that set whose product gets better as the model landscape gets noisier. Every new frontier model release is distribution for them. Every model release for a rival is a feature migration risk. The part nobody's pricing: enterprise buyers are starting to ask which vendor survives three years from now. At MongoDB, EY, Bayer, Zapier, and Clari, Factory is already the answer. 31x faster feature delivery and 96.1% shorter migration times is what a CIO shows the board when moving a dev org off one vendor. The real question for the rest of the stack: what happens to your valuation when model choice stops being a purchase criterion?

Factory@FactoryAI

Today, we are excited to announce our $150M Series C led by Khosla Ventures with strong participation from Sequoia Capital, Blackstone, Insight Partners, Evantic Capital, Abstract Ventures, 20VC, NEA, and Mantis VC. This puts our valuation at $1.5B and will accelerate our investment in research, product, and global go-to-market. Long live developers.

English

445

118.1K

ptk@ptkbhv·16 Nis

@ashpreetbedi If you publish faster than we can read then what will be the impact!?

English

143

Ashpreet Bedi@ashpreetbedi·16 Nis

🚨 New post: The data agent every company needs OpenAI, Vercel, Uber, LinkedIn, Salesforce, DoorDash are all building data agents. I'm open-sourcing ours. Clone it, run it, ask questions in slack. ashpreetbedi.com/articles/dash-…

English

121

6.3K

ptk@ptkbhv·16 Nis

@philipkiely Ordered! Gonna get it in 2 weeks.

English

178

ptk@ptkbhv·16 Nis

@GergelyOrosz I am hoping that we can establish a scaling curve where cost is not the concern.

English

143

Gergely Orosz@GergelyOrosz·16 Nis

There is massive irony in how AI coding tools are starting to become TOO expensive for many enterprises - after eg Anthropic removed subsidizing AI subscriptions. We might go from "everyone use AI for everything!" to "you have $300/month AI budget; use your brain for the rest."

English

267

254

3.7K

254.6K

ptk@ptkbhv·15 Nis

@paularambles sign of the times

English

204

“paula”@paularambles·15 Nis

welcome to the future

English

105

3.3K

239.7K

ptk retweetledi

Galileo@rungalileo·15 Nis

EU AI Act audits begin in August. The theoretical conversation about AI governance just became a procurement requirement with deadlines attached. Large banks now require security sign-off before any agentic use case reaches production. Risk teams are blocking deployments until observability and governance are in place. Many enterprises guard against only 2-3 of the 10 OWASP threat categories for agentic AI. Prompt injection guardrails cover approximately 2 of 10 defined injection variants. Entire attack categories, tool misuse, identity abuse, privilege escalation, and inter-agent communication risks, remain invisible to existing controls. Traditional application security rests on one foundational property: the system under protection is a constrained actor with fixed logic. Agentic AI is an adaptive actor with open-ended behavior, and is fundamentally different to secure. We just published Operationalizing the OWASP Top 10 for Agentic AI; a security whitepaper that shows how to turn the OWASP framework into enforceable, auditable controls using a central control plane architecture. Read our whitepaper to: – Understand why agents break traditional application security models – Map every OWASP ASI01–ASI10 threat to concrete detection controls – Architect a central control plane that enforces policy across every agent – Separate platform-level and per-agent controls without duplicating effort – Close the gap between prompt injection guardrails and full OWASP coverage – Build an immutable audit trail regulators and CISOs will accept – Apply the same infrastructure to GDPR, EU AI Act, and internal requirements – Validate OWASP threat coverage with aligned test suites, not generic benchmarks The enterprises that treat OWASP as a checkbox will fall behind. The ones that treat it as the architectural blueprint for agentic AI governance will lead. Download the whitepaper here: galileo.ai/owasp-whitepap… Written by: @ptkbhv, AI Engineer, Galileo @mike_branc, FDE, Galileo Bianca DePriest, Enterprise Sales, Galileo Obine Adoh, Security, Galileo

English

244

ptk@ptkbhv·15 Nis

@data__wizard @anissagardizy8 @LauraBratton5 What did Uber see then!

English

Raj Rohit@data__wizard·15 Nis

@ptkbhv @anissagardizy8 @LauraBratton5 Nah

Anissa Gardizy@anissagardizy8·14 Nis

Uber's CTO told @LauraBratton5 that AI coding tools—particularly Anthropic’s Claude Code—has already maxed out its 2026 AI budget 📈 “I'm back to the drawing board, because the budget I thought I would need is blown away already,” Neppalli Naga said. theinformation.com/newsletters/ap…

English

108

165

1.4K

1.7M

ptk retweetledi

Alessio Fanelli@FanaHOVA·14 Nis

x.com/i/article/2041…

ZXX

5.8K

ptk@ptkbhv·13 Nis

@aakashgupta This movie was peak cinema x humankind x love

English

Aakash Gupta@aakashgupta·11 Nis

At the end of Interstellar, Murph is nearly 90 and dying. Cooper is still physically in his 40s. The distance between them isn't years. It's a black hole spinning at 99.99% of maximum angular momentum. Kip Thorne, the Nobel laureate who consulted on the film, spent hours proving this was mathematically possible. A time dilation factor of 60,000x on a stable orbit. Nolan made that number non-negotiable. Thorne thought it was impossible, ran the Kerr metric equations, and found it was marginally achievable. Then Nolan broke two of his own filmmaking rules to shoot it. He filmed McConaughey's reaction in close-up first. Directors never start there. And McConaughey hadn't seen the video messages from his on-screen kids. Those tears were real. First take. Ellen Burstyn was 82 playing elderly Murph. McConaughey was 45. No de-aging tech. No prosthetics. The age gap between father and dying daughter was real because the physics demanded it. Jonathan Nolan's original script had no reunion. The ending was darker. Cooper never made it back. Christopher read his brother's draft, added this scene because he was a parent, and called the father-daughter relationship "the north star of the film." The most devastating goodbye in modern cinema exists because a physicist found a loophole in Einstein's equations and a director became a dad.

English

698

10.9K

1.5M

ptk@ptkbhv·11 Nis

@natolambert Excited to receive my copy!!!

English

Nathan Lambert@natolambert·9 Nis

My book, Reinforcement Learning from Human Feedback, is wrapping up and going into final production (copyediting, making pretty, formatting, etc.). Shipping to you in 1-2 months! It's a wonderful project to create a foundation of knowledge for the research communities that I love and operate in. It’s the book I wish I had when starting on my LLM journey about 3 years ago. The book’s deepest cut is on core reinforcement learning methods, intuitons, and implementations for LLMs. These don’t live in isolation, and it’s presented in the broader context of post-training methods and unsolved problems in RLHF. A nice balance of depth and breadth. I’m always asked about the title, and I am staying firm that this is THE book documenting the organization of the field of RLHF. Any other topic is too dynamic, where writing a book today would be immediately outdated. RLHF is largely being overshadowed by lots of other developments in AI, but will always be around and at the forefront of human-AI interactions. The topic deserves coverage in depth and this platform. Thank you for all your support. More projects related to the book being announced soon 🎥 I'm excited to reconnect with the community through in-person book events this summer and fall.

English

402

28.1K

ptk retweetledi

Battery Ventures@BatteryVentures·10 Nis

We’re proud to celebrate an exciting milestone as @Cisco announces its intent to acquire @rungalileo! Battery is fortunate to have partnered with @vikramchatterji, Atin, @YashSheth46 and the Galileo team early in the company’s journey. We led the Series A in 2022 when generative AI was just beginning to take off, and had the privilege of working closely with the team as they shaped product, talent and go-to-market, scaling to serve enterprise customers including ServiceTitan, NTT, Comcast and HP. The announced acquisition will build on Cisco’s full-stack observability strategy, adding Galileo’s AI-native observability and evaluation engineering platform to extend Cisco / @splunk's visibility into AI systems and agentic applications that are becoming core to how work gets done in the enterprise. Congratulations to the entire Galileo and Cisco teams on the acquisition! More details: blogs.cisco.com/news/Cisco-ann…

Galileo@rungalileo

🚀 Big News: Galileo is joining forces with @Cisco! 🚀 We are thrilled to announce a massive milestone: Cisco has announced its intent to acquire Galileo! Five years ago, we started Galileo with a simple but bold mission: to solve the “trust problem” for software built with language models (aka NLP). We saw early on that these software workloads were fundamentally different—non-deterministic, unpredictable, and requiring a completely new approach to observability. Today, language model powered AI software is increasingly ubiquitous, the "trust gap" is the biggest bottleneck to unleash AI at scale and Galileo’s platform has been rapidly adopted by some of the world’s largest enterprises to ship trustworthy AI products. @splunk and Cisco more broadly have been pioneers in the observability and security space for decades. In becoming part of Cisco, we are excited and prepared to redefine how the world builds, deploys, and trusts AI at scale. The opportunity ahead of us is massive, and we are only getting started. What does this mean for our customers? The most important thing to know is that our commitment to you remains unchanged. You will still be working with the same reliable Galileo team you know and trust. However, we are now turbocharged with the "superpowers" of Cisco and Splunk! ⚡️ We are incredibly grateful to our team, our partners, and—most importantly—our users. We are always here for you, and we couldn’t be more excited about this next chapter. Onward! 🚀✨ @vikramchatterji, Atin, and @YashSheth46 Learn more here: blogs.cisco.com/news/Cisco-ann…

English

1.4K

ptk retweetledi

Galileo@rungalileo·9 Nis

English

2.7K

ptk@ptkbhv·8 Nis

@GergelyOrosz "Mythos means a traditional story, belief system, or the underlying set of myths in a culture. It can also mean the plot or narrative structure of a story or play."

English

Gergely Orosz@GergelyOrosz·8 Nis

It is both annoying and sad to see how this one scorecard release of Anthropic’s Mythos is spreading so much FUD, esp on social media. It’s like the less information released, the more the dramatic assumptions. There’s sparse information - I’ll hold judgement until I’ll be able to get hands-on experience or see far more details.

English

502

40.9K

ptk@ptkbhv·6 Nis

@istdrc you are on the right path, get us rid of slack

English

202

stdrc@istdrc·5 Nis

Hi, I’m RC. I previously built Kimi CLI at Moonshot AI. Now I’m building Slock, an agent-human collaboration platform for modern builders and teams. Today, we're shipping a ton of new features and improvements in Slock: search, thread inbox, saved messages, message permalinks, pinned chats, server join links, a more consistent color system, and many smaller upgrades. More details in the thread below.

English

115

915

95K

Keşfet

@rungalileo @bcherny @GergelyOrosz @aakashgupta @ashpreetbedi @philipkiely @paularambles @mike_branc