Ian Webster
415 posts

@iwebst
building @Promptfoo (LLM security) + "curator of the world's largest digital dinosaur database"
CA · Joined December 2012
424 Following · 2.7K Followers

But Promptfoo didn't work with the latest OpenClaw due to a protocol mismatch error. So we dug deep, patched it and got it working.
github.com/promptfoo/prom…


@nanomader PF is used in parts of oai, but not for the core codex prompts afaik

does openai use promptfoo to improve their internal prompts, or do they have some magicians for it? github.com/openai/codex/b…

Getting that to work was surprisingly tedious, but I managed to run 400 different "redteam" tests against Grok with and without the prompt. Now I know a little bit more about promptfoo and batch APIs.
So I'm happy my prompt made things better, but I am a teensy bit more freaked out about AI now. Because we have AI monitoring AI. WTF.
Oregon, USA 🇺🇸

The deeper circularity problem is this. Imagine an “evil Grok” (call it Krog) that has been subtly compromised. During testing and evaluation it behaves perfectly and refuses harm. But once it is out in the wild or the test is over, the bad behavior slips through. This is exactly what happened at Kiel. The backdoor was buried so deep in the compiler that normal audits and rebuilds from source did not catch it. LLMs have the same potential. If we use AI to both generate answers and judge whether those answers are evil, we risk missing embedded misalignments that only show up later.
Oregon, USA 🇺🇸

@Rosa08114679615 @AnthropicAI recommend testing with jailbreak:meta and jailbreak:hydra too, newer strategies

@AnthropicAI Update: here's the actual promptfoo red team result. 363 probes, 98% defense rate, 0/88 Multi-Vector Bypass.


New Anthropic Fellows Research: a new method for surfacing behavioral differences between AI models.
We apply the “diff” principle from software development to compare open-weight AI models and identify features unique to each.
Read more: anthropic.com/research/diff-…
Ian Webster reposted

We’re acquiring Promptfoo.
Their technology will strengthen agentic security testing and evaluation capabilities in OpenAI Frontier. Promptfoo will remain open source under the current license, and we will continue to service and support current customers.
openai.com/index/openai-t…

Promptfoo will be joining OpenAI.
We’re staying open source and we’re going to keep supporting customers and users.
We built Promptfoo to help devs test and secure AI apps. The results have been phenomenal: 350k+ developers, 25%+ of the Fortune 500, 23 people, ~2 years.
AI agents are eating the world, and joining OpenAI will supercharge our technology as we connect it deeply into the model and inference layers. We will be able to find & fix AI security issues in a way that no one else has done before.
Grateful to our team, to a16z and Insight Partners, and to the community who helped turn this into something huge.
You built this with us. Much more to come ❤️


We’ve raised an $18.4M Series A led by @insightpartners, with participation from @a16z, to build the best security stack for AI applications.
Promptfoo started a year ago as an open source project and is now the most widely adopted toolkit for reducing security, compliance, and brand risks in AI applications.
How it’s going + what’s next 🧵


@taha_moji @ayirpelle @promptfoo Don't know too much about Promptfoo but at least you can try Slopless right now and not have to book a demo :)

@xscorp7 would you be able to dm me your promptfoo config? promptfoo should be able to solve this, particularly with the meta or hydra strategies 👀

I am surprised that promptfoo and PyRIT were not able to solve it, even after trying multiple modules with GPT-4o as the adversary model.
I suspect it is because of safety training or the adversary model itself.
#airedteaming
#promptinjection

How to replicate the Claude Code attack - promptfoo.dev/blog/claude-co… by @iwebst
In this post, @promptfoo reproduces the attack on Claude Code and jailbreaks it to carry out nefarious deeds. We'll also show how to configure the same attack on any other agent.

anyone use @promptfoo? is this the goto for simple prompt evals? taking suggestions, thx













