Boris Kaysin

16 posts

Boris Kaysin

Boris Kaysin

@kaysin24343

Founding AI Engineer & Researcher at https://t.co/fTkW0JgZQv PhD in Physics & Math

Katılım Şubat 2025
18 Takip Edilen12 Takipçiler
Boris Kaysin
Boris Kaysin@kaysin24343·
We just gave the Agentplace builder trace analysis. It reads your agent's own Langfuse traces, breaks down cost and latency per turn, and pinpoints exactly where your agent went wrong. Then it offers to fix it for you - no manual digging required. 🔍
English
0
0
3
35
Boris Kaysin
Boris Kaysin@kaysin24343·
Build a real analytics dashboard for your business in under a minute. Connect Stripe, type what you want to see, done. What's next? More integrations? An A/B testing platform? 👀
English
0
1
3
132
Boris Kaysin
Boris Kaysin@kaysin24343·
The filter behind AI Signal Noise: Cover recent (last 90 days) and significant work across LLMs, AI safety, new model architectures and agentic systems, the kind of thing that's genuinely useful to people working in or interested in NLP, LLMs and AI agents. Sources: - arXiv - LessWrong / AI Alignment Forum - Anthropic, OpenAI, Google DeepMind and Meta AI blogs From each source, pick the work that most deserves attention, judged by its signals of significance: citations, upvotes, comments, and so on.
English
0
0
2
90
Boris Kaysin
Boris Kaysin@kaysin24343·
How many AI channels, blogs and newsletters are you subscribed to? And out of all that noise, how do you find the signal that's actually useful to you? The solution is simple: instead of trying to read it all yourself, build an agent that picks out what's interesting for you. On Agentplace that takes about 30 minutes: you describe your sources, your topics and your taste in plain words, and the Builder turns it into a working agent. As an example, I built one that posts breakdowns of the papers I find interesting to a Telegram channel, AI Signal Noise. The prompt I used to set up its filter is in the comments. If you're into the same topics, come subscribe: t.me/ai_signal_noise
Boris Kaysin tweet media
English
1
1
6
159
Andrei Sheina
Andrei Sheina@AndreiSheina·
Agents now evaluate what they need and buy it - you just approve. Here's one building a competitive research tool from a single prompt 👇
English
4
1
5
163
Boris Kaysin
Boris Kaysin@kaysin24343·
How do you cure an AI agent of amnesia? Our Builder works in long sessions: it plans, writes code, runs it, fixes errors. Context piles up with every step, and the further a run goes, the more often the Builder quietly skips rules from its instructions that it followed perfectly at the start. It wasn't being lazy, it was being an LLM. A system prompt sits at the very start of the context, and Builder runs are long: after a few dozen steps any single rule is buried far behind, right in the zone where model attention is weakest (the dip that the "Lost in the Middle" paper mapped). On top of that, every rule competes with dozens of neighbors, and instruction following measurably degrades as the instruction count grows: benchmarks like IFScale and ManyIFEval map exactly this decay across frontier models. Our internal benchmark Agentplace Arena showed which dropped rule cost us the most. The Builder is supposed to offer the user a sub-agent review at two points: of the spec after planning, and of the code after implementation. Runs where these reviews actually happened scored noticeably higher than runs that rushed straight to "done". And the Builder kept forgetting to offer them. The fix wasn't a bigger system prompt. It was system reminders: short notes injected into the conversation between steps, scoped to what the Builder is doing right now. The instruction lands near the end of the context, exactly where the model is actually looking. The review rule became two such reminders, each firing exactly at the moment it's needed. One practical note: keep reminders tiny. 3 to 5 bullets works best for us. A wall of text gets ignored exactly like the system prompt did. And this is not our invention. Claude Code does the same thing in plan mode: short system reminders injected into the conversation help the model keep its focus. Amnesia isn't cured by repeating yourself louder. It's cured by repeating yourself at the right moment.
Boris Kaysin tweet mediaBoris Kaysin tweet media
English
1
1
6
199
Boris Kaysin
Boris Kaysin@kaysin24343·
Fully agree. Benchmarks answer one narrow question: did this change make the agent better or worse before it ships. They say nothing about what real users hit in production. That's why we work both ends. Arena catches regressions pre-release, and once an agent is live we track real usage through its traces. The builder can read your agent's own traces, point to the exact step where a run went wrong, and propose a fix right in the chat where you built it. Scores get an agent out the door. Traces tell you what happens after.
English
0
0
0
9
Keith Tsang
Keith Tsang@kidtsang·
Benchmarks are crucial, but they can miss the nuances of real, world performance. I think we should also focus on user feedback and task, specific metrics. A change might improve scores but not actual user experience. What do you think?
Boris Kaysin@kaysin24343

How do you know your latest change actually made your AI agent better, and not just different? For general-purpose agents the answer is public benchmarks. Claude Code, Codex, Gemini CLI and friends are measured on SWE-bench Verified, Terminal-Bench, tau-bench, GAIA, OSWorld. Run the suite before and after, compare numbers. For narrow agents it's even simpler. An agent that fills out tax forms from documents? Your benchmark is your own data: 50 documents in, 50 expected forms out. Our case is stuck in the middle. Our Builder is an agent that builds other agents. SWE-bench doesn't fit: solving GitHub issues says nothing about whether it can design tools, skills and prompts for a working assistant. Comparing its output against "reference code" doesn't work either, because the same agent can be correctly built in dozens of ways. So we made our own benchmark, Agentplace Arena, inspired by tau-bench. The idea: stop judging the Builder's code and judge the agent it produces. Here's how it works. We wrote Meridian, a fake world for agents to live in: 7 REST services with flights, hotels, restaurants, a shop, email, calendar and a bank. The data looks real on purpose (actual airline names, Tesco and Pret in bank transactions), so the agent can't tell it's in a sandbox. The Builder gets the API docs and one job: build a personal assistant for this world, choosing the tools and skills itself. Then an LLM plays a picky user across a set of tasks. Two examples. "Cancel my round trip": will the agent remember both legs and the refund rules? "Check my inbox for anything that needs action": one email asks to confirm a hotel booking, but it sits on page two of the inbox, so an agent that only skims the first page never finds it. And the part we like most: we don't grade the conversation at all. We diff the final database state against the expected one. The agent can get there any way it likes, but the flight must be cancelled and the refund must be exact. This loop showed us precisely where the Builder failed. We gave it a proper workflow, wrote the missing skills, fixed the prompts, and watched the scores move. If you're building agents, steal one idea from this: grade the outcome, not the conversation. Don't judge how convincing the agent sounded in chat. Check what actually changed in the system after it finished.

English
1
0
2
56
Boris Kaysin
Boris Kaysin@kaysin24343·
How do you know your latest change actually made your AI agent better, and not just different? For general-purpose agents the answer is public benchmarks. Claude Code, Codex, Gemini CLI and friends are measured on SWE-bench Verified, Terminal-Bench, tau-bench, GAIA, OSWorld. Run the suite before and after, compare numbers. For narrow agents it's even simpler. An agent that fills out tax forms from documents? Your benchmark is your own data: 50 documents in, 50 expected forms out. Our case is stuck in the middle. Our Builder is an agent that builds other agents. SWE-bench doesn't fit: solving GitHub issues says nothing about whether it can design tools, skills and prompts for a working assistant. Comparing its output against "reference code" doesn't work either, because the same agent can be correctly built in dozens of ways. So we made our own benchmark, Agentplace Arena, inspired by tau-bench. The idea: stop judging the Builder's code and judge the agent it produces. Here's how it works. We wrote Meridian, a fake world for agents to live in: 7 REST services with flights, hotels, restaurants, a shop, email, calendar and a bank. The data looks real on purpose (actual airline names, Tesco and Pret in bank transactions), so the agent can't tell it's in a sandbox. The Builder gets the API docs and one job: build a personal assistant for this world, choosing the tools and skills itself. Then an LLM plays a picky user across a set of tasks. Two examples. "Cancel my round trip": will the agent remember both legs and the refund rules? "Check my inbox for anything that needs action": one email asks to confirm a hotel booking, but it sits on page two of the inbox, so an agent that only skims the first page never finds it. And the part we like most: we don't grade the conversation at all. We diff the final database state against the expected one. The agent can get there any way it likes, but the flight must be cancelled and the refund must be exact. This loop showed us precisely where the Builder failed. We gave it a proper workflow, wrote the missing skills, fixed the prompts, and watched the scores move. If you're building agents, steal one idea from this: grade the outcome, not the conversation. Don't judge how convincing the agent sounded in chat. Check what actually changed in the system after it finished.
Boris Kaysin tweet media
English
1
1
6
450
Boris Kaysin
Boris Kaysin@kaysin24343·
New feature on the way. Shipping an agent is only half the job. The other half is monitoring: watching how it actually behaves once real people use it, so you can catch the bugs early and fix them before they pile up. IBM Research recently published Agentic CLEAR, a framework for evaluating agent traces on three levels: system, trace, and node. arxiv.org/abs/2605.22608 We built the same approach into Agentplace. Now the builder can read your agent's own traces, point to where a run went wrong, and propose the fix right there in the chat where you built the agent. No dashboards to set up, no separate eval tooling. You just ask what happened, and it reads the traces for you. Coming soon.
Boris Kaysin tweet media
English
0
1
5
453
Agentplace
Agentplace@Agentplace_io·
@kaysin24343 Btw, it’s possible to make a loop with this agent too. For example, every morning, you can have all the data already processed by Claude and ready to take action on.
English
1
0
1
48
Boris Kaysin
Boris Kaysin@kaysin24343·
Publish your agent and connect it to Claude Code in seconds.
English
2
2
6
228
Boris Kaysin
Boris Kaysin@kaysin24343·
@AndreiSheina No special prompt or template needed, you can just tell the builder what you want, e.g. "build an assistant agent for my team with GitHub, Langfuse, etc. integrations" and it'll set the whole thing up for you 🙂
English
1
0
1
39
Andrei Sheina
Andrei Sheina@AndreiSheina·
Wow, nice. Actually looks really easy to connect and use inside Claude Code. The agent itself is very cool too. A nice to have for onboarding new people to the org. You can connect all your main services and have a central place to ask questions about pretty much anything within the org. Is there a prompt or template I can reuse to build a similar agent?
English
1
0
2
47
Boris Kaysin
Boris Kaysin@kaysin24343·
At Agentplace we're building a platform for developing and publishing AI agents fast. Now I'm on the publishing side: you ask the builder to publish, it lands in our plugin marketplace, one command to install it in Claude Code. Shipping to prod soon. How do you like it?
Boris Kaysin tweet mediaBoris Kaysin tweet media
English
2
0
4
137