Zeno

62 posts

@try_zeno

AI evaluation platform

San Francisco, CA. Joined January 2023
7 Following, 180 Followers
Pinned Tweet
Zeno @try_zeno
We've teamed up with @AiEleuther to make it super easy to visualize your evaluation results in Zeno! Try it out the next time you run a benchmark: github.com/EleutherAI/lm-…
Zeno retweeted
TwelveLabs (twelvelabs.io) @twelve_labs
@a13xba @cmuhcii @a13xba will give a presentation about @try_zeno, an interactive AI evaluation platform for exploring, debugging, and sharing how your AI systems perform. (co-founded with @a_a_cabrera) twitter.com/CarnegieMellon…
Carnegie Mellon University @CarnegieMellon

An @SCSatCMU team has released a new interactive platform for data management and machine learning (ML) evaluation called Zeno. It empowers users to explore, visualize and analyze data and ML model performance across custom use cases. cmu.is/zeno

Zeno @try_zeno
We just sent out the first issue of Zeno's Notes, our newsletter on AI evaluation. In case you're not on the recipient list yet, read it here: zenoml.com/blog/newslette…
Zeno retweeted
Alex Cabrera @a_a_cabrera
Lots of predictions of synthetic data for AI being big this year. Decided to look at the OG Alpaca dataset: hub.zenoml.com/project/f192ed… Impressive for being GPT-4 generated w/ 1 prompt, but begs the question of how to generate diverse, OOD data
Zeno retweeted
Alex Cabrera @a_a_cabrera
In case you missed it over the break - you can now visualize the outputs of any Eleuther LM Eval Harness run in @try_zeno with one command! python scripts/zeno_visualize.py
Zeno @try_zeno

We've teamed up with @AiEleuther to make it super easy to visualize your evaluation results in Zeno! Try it out the next time you run a benchmark: github.com/EleutherAI/lm-…

Zeno retweeted
Graham Neubig @gneubig
Google’s Gemini recently made waves as a major competitor to OpenAI’s GPT. Exciting! But we wondered: How good is Gemini really? At CMU, we performed an impartial, in-depth, and reproducible study comparing Gemini, GPT, and Mixtral. Paper: arxiv.org/abs/2312.11444 🧵
Zeno retweeted
Shuyan Zhou @shuyanzh36
Since the initial release, we have significantly improved the usability of WebArena, accuracy of the evaluation, and provided interactive result analysis with @try_zeno I am attending #NeurIPS2023 , say hi 👋 if you are interested in AI agent, code gen and their evaluations!
Shuyan Zhou @shuyanzh36

🤖There have been recent exciting demos of agents that navigate the web and perform tasks for us. But how well do they work in practice? 🔊To answer this, we built WebArena, a realistic and reproducible web environment with 4+ real-world web apps for benchmarking useful agents🧵

Zeno retweeted
Alex @a13xba
Since some of you might be wondering whether Mamba 2.8B can serve as a drop-in replacement of some of the larger models, we've compared the Mamba model family to some of the most popular 7B models in @try_zeno Report: hub.zenoml.com/report/2443/Ma… 🧵 1/5
Albert Gu @_albertgu

Quadratic attention has been indispensable for information-dense modalities such as language... until now. Announcing Mamba: a new SSM arch. that has linear-time scaling, ultra long context, and most importantly--outperforms Transformers everywhere we've tried. With @tri_dao 1/

Zeno retweeted
Graham Neubig @gneubig
Recently there were some great results from the new Mamba architecture (arxiv.org/abs/2312.00752) by @_albertgu and @tri_dao. We did a bit of third-party validation, and 1. The results are reproducible 2. Mamba 2.8B is competitive w/ some 7B models (!) 3. Mistral is still strong
Alex @a13xba

Since some of you might be wondering whether Mamba 2.8B can serve as a drop-in replacement of some of the larger models, we've compared the Mamba model family to some of the most popular 7B models in @try_zeno Report: hub.zenoml.com/report/2443/Ma… 🧵 1/5

Zeno retweeted
Alex Cabrera @a_a_cabrera
Google just released Gemini, their long-awaited GPT-4 competitor. Their report shows comparison across multiple common benchmarks, but how reliable are these results? 🧵 on potential issues with the benchmark scores
Zeno retweeted
Alex @a13xba
Awesome blogpost by our friends @huggingface and @AiEleuther and a demonstration of how @try_zeno can be used to systematically spot issues with benchmark results. Give it a read! Also: Zeno Report: hub.zenoml.com/report/1255/DR… Zeno Project: hub.zenoml.com/project/2f5dec…
Clémentine Fourrier 🍊 is off till Dec 2026 hiking @clefourrier

⚠️ We are removing DROP from the Open LLM Leaderboard! With leaderboard evaluation data openly shared on 2000+ models, we did a deep dive with our friends @AiEleuther and @try_zeno, & found out that its original implementation is unfair to many models 😱 huggingface.co/blog/leaderboa…

Zeno retweeted
Alex Cabrera @a_a_cabrera
We loved collaborating with the @huggingface and @AiEleuther teams to investigate the odd behavior on the DROP benchmark! Check out the blog post and supporting Zeno report & project: Report: hub.zenoml.com/report/1255/DR… Project: hub.zenoml.com/project/2f5dec…
Clémentine Fourrier 🍊 is off till Dec 2026 hiking @clefourrier

⚠️ We are removing DROP from the Open LLM Leaderboard! With leaderboard evaluation data openly shared on 2000+ models, we did a deep dive with our friends @AiEleuther and @try_zeno, & found out that its original implementation is unfair to many models 😱 huggingface.co/blog/leaderboa…

Zeno @try_zeno
Zeno now supports 3D 🧊 data! We've uploaded over 1M @allen_ai ObjaverseXL models to a Zeno project to showcase how you can explore 3D data in a Zeno Project: hub.zenoml.com/project/d7fddd…