ragas
@ragas_io

183 posts

Supercharge Your LLM Application Evaluations 🚀 Github: https://t.co/f0LNX3PjIW Discord: https://t.co/uaw1hwwaB9

Joined March 2024
1 Following · 1K Followers
ragas retweeted
ikka
ikka@Shahules786·
(1/n) Today, we’re releasing Cloning Bench. Labs are paying 6-7 figures for clones of web apps to do web/computer-use-based RL training. At @VibrantLabsAI, our fundamental goal is to automate the creation of RL environments. For web/CUAs, one way we do that is by using coding agents and a custom harness to automatically generate the simulation environment. We tested Codex, Gemini, Claude Code, and GLM with our harness on their ability to recreate a Slack workspace and benchmarked their performance. We have published our methods, results, and analysis here today: vibrantlabs.com/blog/cloning-b…
ikka tweet media
7 replies · 12 reposts · 142 likes · 12.8K views

ragas retweeted
ikka
ikka@Shahules786·
There are 3 elements to improving models: 1) Architecture 2) Compute 3) Data No one is changing (1), (2) is actively being solved by the compute giants. Now what’s left is (3), which has effectively become 2026’s “pickaxes in a gold rush.” Today, the choke point is fully human-created data. We at @VibrantLabsAI believe AGI will not be achieved by human data alone, so we’re laser-focused on synthesizing as much as possible to advance models to the next frontier.
ikka tweet media
1 reply · 4 reposts · 8 likes · 555 views

ragas retweeted
ikka
ikka@Shahules786·
Last week, we did an internal deep dive into enterprise environments/benchmarks like τ²-Bench and CoreCraft. This type of high-fidelity RL env is becoming increasingly popular as frontier labs push their models toward more and more agentic capabilities.
ikka tweet media
1 reply · 2 reposts · 9 likes · 631 views

ragas retweeted
Tarun Jain
Tarun Jain@TRJ_0751·
Writing prompts by hand is just guessing. Let the DSPy optimizer find the best one for you. Here is a full report that:
> builds a traditional RAG with @OpenAI and @qdrant_engine
> optimizes prompts with the @DSPyOSS MIPROv2 optimizer
> traces everything with @weave_wb
> evaluates the baseline and optimized RAG systems using @ragas_io
🔗 report: wandb.ai/ai-team-articl…
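The "let the optimizer find the best prompt" idea can be sketched as a toy search. This is a hypothetical stand-in, not DSPy's actual API: MIPROv2 additionally proposes instruction candidates and tunes few-shot demos with Bayesian search rather than brute-force scoring, and the metric would typically be a ragas score over generated answers.

```python
# Toy prompt search: score each candidate instruction on a dev set and keep
# the best one. Illustrative only; names are hypothetical.

def optimize_prompt(candidates, devset, score_fn):
    """Return the (prompt, avg_score) pair that maximizes the dev-set metric.

    candidates: list of instruction strings to try
    devset:     list of evaluation examples
    score_fn:   callable (prompt, example) -> float, higher is better
    """
    scored = [
        (prompt, sum(score_fn(prompt, ex) for ex in devset) / len(devset))
        for prompt in candidates
    ]
    return max(scored, key=lambda pair: pair[1])
```

The point of the pattern is that the metric, not intuition, decides which instruction ships; swapping the brute-force loop for a smarter proposer is what optimizers like MIPROv2 automate.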
Tarun Jain tweet media
0 replies · 2 reposts · 6 likes · 196 views

ragas retweeted
ikka
ikka@Shahules786·
PA Bench - our first public benchmark on multi-tab web agents - is on the front page of HN now 🔥
ikka tweet media
0 replies · 4 reposts · 7 likes · 401 views

ragas retweeted
ikka
ikka@Shahules786·
How good are coding agents at cloning web apps? Check out how Claude Code (+ our harness) clones a Slack workspace completely from scratch using only recordings of the real version. 🔥
GIF
1 reply · 2 reposts · 8 likes · 433 views

ragas retweeted
Vibrant Labs
Vibrant Labs@VibrantLabsAI·
The top frontier labs are paying tiny startups millions of dollars for RL environments: newsletter.semianalysis.com/p/rl-environme…

Since most experts agree that RL post-training is driving the next wave of major model advancements, the data budgets for these labs have grown more than anyone could have predicted. Browser use is a major vertical, and clones of popular consumer/enterprise websites (think: Amazon, Salesforce, Epic, etc.) are in high demand. Many companies in this space use overseas human labor to build these environments.

At Vibrant Labs, we’re instead taking the approach of automating the creation of post-training data and environments. We built a harness that uses coding agents to clone any web application from screen recordings of the workflows we want to train on.

So with all of the hype around building clones of websites, we decided to do a benchmark. Later this week, we will release Cloning Bench, a benchmark that uses our harness and state-of-the-art coding agents (Codex, Claude Code, etc.) to measure how well they perform at web cloning tasks. Stay tuned for more.
Vibrant Labs tweet media
0 replies · 2 reposts · 6 likes · 651 views

ragas retweeted
ikka
ikka@Shahules786·
Browser agents are becoming a hit at the consumer level, as most ad-hoc tasks people do daily through browsers can now be automated. But are the models actually good at doing any of it reliably? To evaluate this, we built PABench - a personal assistant benchmark requiring 2+ tabs to complete real-world tasks. (1/n)
ikka tweet media
1 reply · 2 reposts · 6 likes · 447 views

ragas retweeted
Vibrant Labs
Vibrant Labs@VibrantLabsAI·
We are releasing our first public benchmark: PA Bench. PA Bench is a first-of-its-kind web/computer-use agent benchmark focused on the types of tasks normally done by a personal assistant (especially multi-tab, long-horizon workflows).
Vibrant Labs tweet media
1 reply · 3 reposts · 6 likes · 1.7K views

ragas retweeted
Vibrant Labs
Vibrant Labs@VibrantLabsAI·
OS-Genesis: 1/ Most agent scaling is throttled by the cost of human time. OS-Genesis took a much more scalable path by using Reverse Task Synthesis. Instead of recording a user completing a task, they started from a terminal state and worked backwards to hypothesize the intent.
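The Reverse Task Synthesis idea described above can be sketched with a toy example. All names here are hypothetical (the real OS-Genesis pipeline uses an LLM to annotate GUI interaction traces); the sketch only shows the direction of inference: from an observed terminal state back to a plausible instruction.

```python
# Toy sketch of reverse task synthesis: instead of recording
# "instruction -> actions -> state", start from the terminal state, diff it
# against the initial state, and hypothesize an instruction that explains
# the change.

def diff_states(initial, terminal):
    """Return {field: (before, after)} for every field that changed."""
    return {
        k: (initial.get(k), v)
        for k, v in terminal.items()
        if initial.get(k) != v
    }

def synthesize_instruction(initial, terminal):
    """Hypothesize a task instruction from the state change alone."""
    changes = diff_states(initial, terminal)
    parts = [f"set {field} to {after!r}" for field, (_, after) in changes.items()]
    return "Task: " + " and ".join(parts)
```

The scalability win is that no human has to sit in the loop per trajectory: any reachable terminal state becomes a candidate training task.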
1 reply · 2 reposts · 3 likes · 165 views

ragas retweeted
Pavan Belagatti™🥑
Pavan Belagatti™🥑@Pavan_Belagatti·
You are about to deploy your first RAG application. It works fine in your local environment, but you aren’t sure it will perform the same once real users start using it. That’s a sign of a weak link.

Some obvious questions follow: How do we ensure consistent performance in production? What metrics should we track? How do we verify whether the RAG system’s responses are actually good?

The answer is simple: you need a proper evaluation framework to systematically measure, validate, and improve your RAG application. In the image below, you can see that we are using the RAGAS framework to make sure our RAG system/application produces contextually relevant, high-quality, and ethical responses. Note that not only the responses are evaluated: the RAGAS framework evaluates the system at multiple stages. Retriever quality is assessed using contextual precision, recall, and relevance by comparing the query with the retrieved contexts. Generator quality is evaluated using answer relevancy and faithfulness by grounding the generated response against the retrieved documents, enabling holistic RAG evaluation.

So next time you deploy a RAG application, make sure you use an evaluation framework to safeguard your application. This is my hands-on guide to evaluating RAG applications in minutes using @ragas_io - youtu.be/-69Fx8F9ma4
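The retriever/generator split described above can be illustrated with a simplified, self-contained sketch. These are not the actual ragas implementations, which use LLM judges to grade relevance and claim support; the toy string-matching versions below only show what each metric measures.

```python
# Toy versions of RAG evaluation metrics, for illustration only.

def context_precision(retrieved, relevant):
    """Retriever: share of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(c in relevant for c in retrieved) / len(retrieved)

def context_recall(retrieved, relevant):
    """Retriever: share of relevant chunks the retriever surfaced."""
    if not relevant:
        return 0.0
    return sum(c in retrieved for c in relevant) / len(relevant)

def faithfulness(claims, contexts):
    """Generator: share of answer claims supported by some retrieved context
    (here crudely approximated by verbatim containment)."""
    if not claims:
        return 0.0
    supported = sum(
        any(claim.lower() in ctx.lower() for ctx in contexts)
        for claim in claims
    )
    return supported / len(claims)
```

A low context score points at the retriever; a low faithfulness score points at the generator hallucinating beyond its contexts, which is exactly the multi-stage diagnosis the tweet describes.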
YouTube video
0 replies · 1 repost · 2 likes · 117 views

ragas retweeted
LangChain OSS
LangChain OSS@LangChain_OSS·
LangChain Community Spotlight: 🧠 HMLR: Long-Term Memory for AI Agents, made by the LangChain community. HMLR adds long-term memory to AI agents as a LangGraph drop-in: perfect RAGAS scores on the hardest benchmarks using GPT-4.1-mini, and it maintains context across days/weeks without token bloat. 📦 pip install hmlr 🔗 github.com/Sean-V-Dev/HML…
LangChain OSS tweet media
3 replies · 17 reposts · 133 likes · 10.9K views

ragas retweeted
Masaki Yatsu
Masaki Yatsu@yatsu·
Releasing the code for my RSS reader: github.com/buun-ch/buun-c… Really, I’d like it to collect and store information more broadly, not just RSS. I built it as something that lets me experiment with LLM workflow/RAG-style ideas while being useful enough day-to-day that I’ll keep maintaining it.
1 reply · 1 repost · 2 likes · 108 views

ragas retweeted
FloTorch
FloTorch@flo_torch_ai·
PGVector RAG evaluation—done right. Compare retrieval, KNN, and models using Ragas metrics in one place. Measure what works. 🔗 flotorch.ai #RAG #GenAI #AIInfrastructure
0 replies · 1 repost · 1 like · 66 views
ragas
ragas@ragas_io·
Also squashed some bugs:
- Fixed DiskCacheBackend pickling issues
- Lazy tokenizer init (no more surprise network calls at import)
...and more.

💬 Join Discord: discord.gg/5djav8GGNZ
0 replies · 0 reposts · 1 like · 40 views
ragas
ragas@ragas_io·
⚙️ System prompts everywhere InstructorLLM and LiteLLMStructuredLLM now support system prompts. More control, better customization. Thanks @c_g_aswin for the implementation! ✨
1 reply · 0 reposts · 3 likes · 54 views
ragas
ragas@ragas_io·
🚀 Ragas v0.4.3 is live! DSPyOptimizer with MIPROv2, llms.txt generation, system prompt support, and DSPy caching are here 🧵 👇
ragas tweet media
1 reply · 2 reposts · 2 likes · 192 views

ragas retweeted
Pavan Belagatti™🥑
Pavan Belagatti™🥑@Pavan_Belagatti·
Your AI agents are failing in production, but you don’t know why? Here’s how and why to evaluate AI agents rigorously 👇 Dive in: anthropic.com/engineering/de… BTW, this is my hands-on guide to evaluating RAG applications in minutes using the RAGAS framework: youtu.be/-69Fx8F9ma4
YouTube video
Pavan Belagatti™🥑 tweet media
1 reply · 2 reposts · 2 likes · 133 views