ragas
@ragas_io

183 posts

Supercharge Your LLM Application Evaluations 🚀 Github: https://t.co/f0LNX3PjIW Discord: https://t.co/uaw1hwwaB9

Joined March 2024
1 Following · 1K Followers
ragas retweeted
ikka
ikka@Shahules786·
(1/n) Today, we’re releasing Cloning Bench. Labs are paying 6-7 figures for clones of web apps to do web/computer-use-based RL training. At @VibrantLabsAI, our fundamental goal is to automate the creation of RL environments. For web/CUAs, one way we do that is by using coding agents and a custom harness to automatically generate the simulation environment. We tested Codex, Gemini, Claude Code, and GLM with our harness on their ability to recreate a Slack workspace and benchmarked their performance. We have published our methods, results, and analysis here today: vibrantlabs.com/blog/cloning-b…
ikka tweet media
7 replies · 12 reposts · 142 likes · 12.8K views

ragas retweeted
ikka
ikka@Shahules786·
There are 3 elements to improving models: 1) Architecture 2) Compute 3) Data No one is changing (1), (2) is actively being solved by the compute giants. Now what’s left is (3), which has effectively become 2026’s “pickaxes in a gold rush.” Today, the choke point is fully human-created data. We at @VibrantLabsAI believe AGI will not be achieved by human data alone, so we’re laser-focused on synthesizing as much as possible to advance models to the next frontier.
ikka tweet media
1 reply · 4 reposts · 8 likes · 555 views

ragas retweeted
ikka
ikka@Shahules786·
Last week, we did an internal deep dive into enterprise environments/benchmarks like τ²-Bench and CoreCraft. This type of high-fidelity RL env is becoming increasingly popular as frontier labs push their models toward more and more agentic capabilities.
ikka tweet media
1 reply · 2 reposts · 9 likes · 631 views

ragas retweeted
Tarun Jain
Tarun Jain@TRJ_0751·
Writing prompts by hand is just guessing. Let the DSPy optimizer find the best one for you. Here is a full report that:
> builds a traditional RAG with @OpenAI and @qdrant_engine
> optimizes prompts with the @DSPyOSS MIPROv2 optimizer
> traces everything with @weave_wb
> evaluates the baseline and optimized RAG systems using @ragas_io
🔗 report: wandb.ai/ai-team-articl…
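The "let the optimizer find the best prompt" idea can be sketched as a toy search. This is a hypothetical stand-in, not DSPy's actual API: MIPROv2 additionally proposes instruction candidates and tunes few-shot demos with Bayesian search rather than brute-force scoring, and the metric would typically be a ragas score over generated answers.

```python
# Toy prompt search: score each candidate instruction on a dev set and keep
# the best one. Illustrative only; names are hypothetical.

def optimize_prompt(candidates, devset, score_fn):
    """Return the (prompt, avg_score) pair that maximizes the dev-set metric.

    candidates: list of instruction strings to try
    devset:     list of evaluation examples
    score_fn:   callable (prompt, example) -> float, higher is better
    """
    scored = [
        (prompt, sum(score_fn(prompt, ex) for ex in devset) / len(devset))
        for prompt in candidates
    ]
    return max(scored, key=lambda pair: pair[1])
```

The point of the pattern is that the metric, not intuition, decides which instruction ships; swapping the brute-force loop for a smarter proposer is what optimizers like MIPROv2 automate.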
Tarun Jain tweet media
0 replies · 2 reposts · 6 likes · 196 views

ragas retweeted
ikka
ikka@Shahules786·
PA Bench - our first public benchmark on multi-tab web agents - is on the front page of HN now 🔥
ikka tweet media
0 replies · 4 reposts · 7 likes · 401 views

ragas retweeted
ikka
ikka@Shahules786·
How good are coding agents at cloning web apps? Check out how Claude Code (+ our harness) clones a Slack workspace completely from scratch using only recordings of the real version. 🔥
GIF
1 reply · 2 reposts · 8 likes · 433 views

ragas retweeted
Vibrant Labs
Vibrant Labs@VibrantLabsAI·
The top frontier labs are paying tiny startups millions of dollars for RL environments: newsletter.semianalysis.com/p/rl-environme…

Since most experts agree that RL post-training is driving the next wave of major model advancements, the data budgets for these labs have grown more than anyone could have predicted. Browser use is a major vertical, and clones of popular consumer/enterprise websites (think: Amazon, Salesforce, Epic, etc.) are in high demand. Many companies in this space use overseas human labor to build these environments.

At Vibrant Labs, we’re instead taking the approach of automating the creation of post-training data and environments. We built a harness that uses coding agents to clone any web application from screen recordings of the workflows we want to train on.

So with all of the hype around building clones of websites, we decided to do a benchmark. Later this week, we will release Cloning Bench, a benchmark that uses our harness and state-of-the-art coding agents (Codex, Claude Code, etc.) to measure how well they perform at web cloning tasks. Stay tuned for more.
Vibrant Labs tweet media
0 replies · 2 reposts · 6 likes · 651 views

ragas retweeted
ikka
ikka@Shahules786·
Browser agents are becoming a hit at the consumer level, as most ad-hoc tasks people do daily through browsers can now be automated. But are the models actually good at doing any of it reliably? To evaluate this, we built PABench - a personal assistant benchmark requiring 2+ tabs to complete real-world tasks. (1/n)
ikka tweet media
1 reply · 2 reposts · 6 likes · 447 views

ragas retweeted
Vibrant Labs
Vibrant Labs@VibrantLabsAI·
We are releasing our first public benchmark: PA Bench. PA Bench is a first-of-its-kind web/computer-use agent benchmark focused on the types of tasks normally done by a personal assistant (especially multi-tab, long-horizon workflows).
Vibrant Labs tweet media
1 reply · 3 reposts · 6 likes · 1.7K views

ragas retweeted
Vibrant Labs
Vibrant Labs@VibrantLabsAI·
OS-Genesis: 1/ Most agent scaling is throttled by the cost of human time. OS-Genesis took a much more scalable path by using Reverse Task Synthesis. Instead of recording a user completing a task, they started from a terminal state and worked backwards to hypothesize the intent.
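The Reverse Task Synthesis idea described above can be sketched with a toy example. All names here are hypothetical (the real OS-Genesis pipeline uses an LLM to annotate GUI interaction traces); the sketch only shows the direction of inference: from an observed terminal state back to a plausible instruction.

```python
# Toy sketch of reverse task synthesis: instead of recording
# "instruction -> actions -> state", start from the terminal state, diff it
# against the initial state, and hypothesize an instruction that explains
# the change.

def diff_states(initial, terminal):
    """Return {field: (before, after)} for every field that changed."""
    return {
        k: (initial.get(k), v)
        for k, v in terminal.items()
        if initial.get(k) != v
    }

def synthesize_instruction(initial, terminal):
    """Hypothesize a task instruction from the state change alone."""
    changes = diff_states(initial, terminal)
    parts = [f"set {field} to {after!r}" for field, (_, after) in changes.items()]
    return "Task: " + " and ".join(parts)
```

The scalability win is that no human has to sit in the loop per trajectory: any reachable terminal state becomes a candidate training task.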
1 reply · 2 reposts · 3 likes · 165 views

ragas retweeted
Pavan Belagatti™🥑
Pavan Belagatti™🥑@Pavan_Belagatti·
You are about to deploy your first RAG application. It works fine in your local environment, but you aren’t sure it will perform the same once real users start using it. That’s a sign of a weak link.

Some obvious questions follow: How do we ensure consistent performance in production? What metrics should we track? How do we verify whether the RAG system’s responses are actually good?

The answer is simple: you need a proper evaluation framework to systematically measure, validate, and improve your RAG application. In the image below, you can see that we are using the RAGAS framework to make sure our RAG system/application produces contextually relevant, high-quality, and ethical responses. Note that not only the responses are evaluated: the RAGAS framework evaluates the system at multiple stages. Retriever quality is assessed using contextual precision, recall, and relevance by comparing the query with the retrieved contexts. Generator quality is evaluated using answer relevancy and faithfulness by grounding the generated response against the retrieved documents, enabling holistic RAG evaluation.

So next time you deploy a RAG application, make sure you use an evaluation framework to safeguard your application. This is my hands-on guide to evaluating RAG applications in minutes using @ragas_io - youtu.be/-69Fx8F9ma4
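The retriever/generator split described above can be illustrated with a simplified, self-contained sketch. These are not the actual ragas implementations, which use LLM judges to grade relevance and claim support; the toy string-matching versions below only show what each metric measures.

```python
# Toy versions of RAG evaluation metrics, for illustration only.

def context_precision(retrieved, relevant):
    """Retriever: share of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(c in relevant for c in retrieved) / len(retrieved)

def context_recall(retrieved, relevant):
    """Retriever: share of relevant chunks the retriever surfaced."""
    if not relevant:
        return 0.0
    return sum(c in retrieved for c in relevant) / len(relevant)

def faithfulness(claims, contexts):
    """Generator: share of answer claims supported by some retrieved context
    (here crudely approximated by verbatim containment)."""
    if not claims:
        return 0.0
    supported = sum(
        any(claim.lower() in ctx.lower() for ctx in contexts)
        for claim in claims
    )
    return supported / len(claims)
```

A low context score points at the retriever; a low faithfulness score points at the generator hallucinating beyond its contexts, which is exactly the multi-stage diagnosis the tweet describes.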
YouTube video
0 replies · 1 repost · 2 likes · 117 views

ragas retweeted
LangChain OSS
LangChain OSS@LangChain_OSS·
LangChain Community Spotlight: 🧠 HMLR: Long-Term Memory for AI Agents, made by the LangChain community. HMLR adds long-term memory to AI agents as a LangGraph drop-in: perfect RAGAS scores on the hardest benchmarks using GPT-4.1-mini, and it maintains context across days/weeks without token bloat. 📦 pip install hmlr 🔗 github.com/Sean-V-Dev/HML…
LangChain OSS tweet media
3 replies · 17 reposts · 133 likes · 10.9K views

ragas retweeted
Masaki Yatsu
Masaki Yatsu@yatsu·
Releasing the code for my RSS reader: github.com/buun-ch/buun-c… Really, I’d like it to collect and store information more broadly, not just RSS. I built it as something that lets me experiment with LLM workflow/RAG-style ideas while being useful enough day-to-day that I’ll keep maintaining it.
1 reply · 1 repost · 2 likes · 108 views

ragas retweeted
FloTorch
FloTorch@flo_torch_ai·
PGVector RAG evaluation—done right. Compare retrieval, KNN, and models using Ragas metrics in one place. Measure what works. 🔗 flotorch.ai #RAG #GenAI #AIInfrastructure
0 replies · 1 repost · 1 like · 66 views
ragas
ragas@ragas_io·
Also squashed some bugs:
- Fixed DiskCacheBackend pickling issues
- Lazy tokenizer init (no more surprise network calls at import)
...and more.

💬 Join Discord: discord.gg/5djav8GGNZ
0 replies · 0 reposts · 1 like · 40 views
ragas
ragas@ragas_io·
⚙️ System prompts everywhere InstructorLLM and LiteLLMStructuredLLM now support system prompts. More control, better customization. Thanks @c_g_aswin for the implementation! ✨
1 reply · 0 reposts · 3 likes · 54 views
ragas
ragas@ragas_io·
🚀 Ragas v0.4.3 is live! DSPyOptimizer with MIPROv2, llms.txt generation, system prompt support, and DSPy caching are here 🧵 👇
ragas tweet media
1 reply · 2 reposts · 2 likes · 192 views

ragas retweeted
Pavan Belagatti™🥑
Pavan Belagatti™🥑@Pavan_Belagatti·
Your AI agents are failing in production, but you don’t know why? Here’s how and why to evaluate AI agents rigorously 👇 Dive in: anthropic.com/engineering/de… BTW, this is my hands-on guide to evaluating RAG applications in minutes using the RAGAS framework: youtu.be/-69Fx8F9ma4
YouTube video
Pavan Belagatti™🥑 tweet media
1 reply · 2 reposts · 2 likes · 133 views