Quesma

99 posts

@QuesmaOrg

Make AI agents production-ready through independent evaluation and training.

Joined January 2024
14 Following · 125 Followers
Pinned Tweet
Quesma (@QuesmaOrg)
Recently we built OTelBench – a benchmark to test how well LLMs handle OpenTelemetry instrumentation. We tested 14 models. The best (Claude Opus 4.5) hit only 29%. These weren't trick questions, just a small subset of typical SRE tasks. Link here: quesma.com/blog/introduci…
Replies 0 · Reposts 0 · Likes 3 · Views 907
Quesma retweeted
Piotr Migdal (@pmigdal)
AI + Ghidra by NSA = reverse-engineering fun. I am speaking at @AITinkerers Warsaw, 4th Mar 2026. One of my favorite event series - by and for the creators community. Vibe-resurrecting an old game from binaries 👾 and vibe-hardware-ing an LED backpack 🎒🌈.
Replies 1 · Reposts 2 · Likes 7 · Views 229
Quesma retweeted
Piotr Migdal (@pmigdal)
Claude can code, but can it read machine code? We gave AI agents access to Ghidra (a decompiler by the NSA) and tasked them with finding hidden backdoors in servers - working solely from binaries, without any access to source code. See our BinaryAudit: quesma.com/blog/introduci…
Replies 75 · Reposts 181 · Likes 1.5K · Views 231K
Quesma retweeted
Ryan Marten (@ryanmart3n)
Great to see the community releasing benchmarks in @harborframework now. These are invaluable resources for collectively building the most useful agents.
Jacek Migdal (@jakozaur)

@ryanmart3n Last week @QuesmaOrg released “terminal-bench-sre-part-1”, called OTelBench, in Harbor. Another release is coming soon, maybe even next week.

Replies 1 · Reposts 1 · Likes 9 · Views 1.7K
Quesma (@QuesmaOrg)
Finally, an AI that can draw a map without getting lost. Nano Banana Pro uses tools to create factually correct infographics - and it's a game-changer. quesma.com/blog/nano-bana…
Replies 0 · Reposts 2 · Likes 1 · Views 232
Quesma (@QuesmaOrg)
An interesting use case for AWS Lambda that we explored: sandboxing AI-generated code. We tried WebAssembly first but hit a wall, so we scrapped that experiment in favor of AWS Lambda with Docker containers in an isolated VPC. Full writeup from @pmigdal: awsfundamentals.com/blog/sandboxin…
Tobias Schmidt (@tpschmidt_)

Lambda has tons of use cases, but one I've missed: using it as some kind of sandbox for running AI-generated code. Lambda's isolation and scaling are a solid fit for this problem.

Replies 0 · Reposts 0 · Likes 1 · Views 160
Quesma retweeted
AISecHub (@AISecHub)
The security paradox of local LLMs - quesma.com/blog/local-llm… by @jakozaur at @QuesmaOrg. If you’re running a local LLM for privacy and security, you need to read this. Our research on gpt-oss-20b (for OpenAI’s Red‑Teaming Challenge) shows that local models are much more prone to being tricked than frontier models. When attackers prompt them to include vulnerabilities, local models comply with up to a 95% success rate. These local models are smaller and less capable of recognizing when someone is trying to trick them. #AISecurity #LLMSecurity #LocalLLM #GenAI #MLOps #ModelRisk #DataPrivacy #AIPrivacy #PromptInjection #AIThreats #AIGovernance #EdgeAI
Replies 0 · Reposts 4 · Likes 8 · Views 327
Quesma (@QuesmaOrg)
Cost-efficiency crown: @OpenAI. Across difficulties, OpenAI models dominate the Pareto frontier of cost. GPT-5-mini (high reasoning) is a great price/perf pick; GPT-4.1 is the fastest with solid wins.
Replies 1 · Reposts 0 · Likes 2 · Views 129
Quesma (@QuesmaOrg)
Can AI compile 22-year-old code? We built CompileBench to find out. We know that LLMs can vibe-code or even win IOI, but what about dependency hell or legacy build systems? (image based on XKCD 2347)
Replies 1 · Reposts 0 · Likes 4 · Views 184
Quesma (@QuesmaOrg)
Our blog post is second on Hacker News. Enjoy!
Replies 1 · Reposts 2 · Likes 10 · Views 2.9K
Quesma (@QuesmaOrg)
At #IcebergSummit 2025, Ryan Blue unveiled Iceberg beyond Java, plus the path to Table Spec V3 and forward to V4. Przemysław Delewski’s new blog post covers Fokko Driesprong on PyIceberg, Matt Topol on Go, and Julien Le Dem on modular DBs. An essential read for next-gen data platforms. Link👇
Replies 1 · Reposts 0 · Likes 3 · Views 189