OpenCompass

80 posts

@OpenCompassX

OpenCompass focuses on the evaluation and analysis of large language models and vision-language models. GitHub: https://t.co/zF7ycuTXxs

China · Joined April 2024
47 Following · 276 Followers
Pinned Tweet
OpenCompass @OpenCompassX
OpenCompass is now on Twitter (@X), focusing on the evaluation and analysis of Large Language Models and Vision-Language Models. Welcome to star our project: github.com/open-compass/o…
OpenCompass @OpenCompassX
😉Introducing VideoScience-Bench, a benchmark designed to evaluate undergraduate-level scientific understanding in video models. 🥰It comprises 200 carefully curated prompts spanning 14 topics and 103 concepts in physics and chemistry. 👇 hub.opencompass.org.cn/daily-benchmar…
OpenCompass @OpenCompassX
😀A new study finds that long prompts cause a fidelity–diversity trade-off in leading T2I models: more detail but reduced diversity. 😉To evaluate this issue, the authors introduce LPD-Bench and propose PromptMoG, a training-free approach that enhances diversity by sampling prompt embeddings from a Gaussian mixture. 😉Experiments on SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image show that PromptMoG improves long-prompt diversity without causing semantic drift. 👇 hub.opencompass.org.cn/daily-benchmar…
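The Gaussian-mixture idea behind PromptMoG can be illustrated with a toy sketch. This is not the paper's implementation: the embedding dimensionality, the isotropic noise scale `sigma`, and the uniform component weights below are all assumptions for illustration — PromptMoG's actual construction over the T2I model's prompt embeddings is defined in the paper.

```python
import random

def sample_prompt_embeddings(embs, n_samples=4, sigma=0.05, seed=None):
    """Toy Gaussian-mixture sampler over prompt embeddings: each mixture
    component is an isotropic Gaussian centered on one of the given
    embeddings; a sample picks a component uniformly at random, then
    perturbs it with Gaussian noise. (Illustrative sketch only.)"""
    rng = random.Random(seed)
    out = []
    for _ in range(n_samples):
        center = rng.choice(embs)                       # uniform mixture weights
        out.append([x + rng.gauss(0.0, sigma) for x in center])
    return out

# Hypothetical: three embeddings (d=8) of paraphrases of the same long prompt
base = [[0.1] * 8, [0.2] * 8, [0.3] * 8]
samples = sample_prompt_embeddings(base, n_samples=5, sigma=0.05, seed=1)
print(len(samples), len(samples[0]))  # 5 8
```

Sampling distinct perturbed embeddings for each generation is what restores diversity while the mixture centers keep the samples anchored to the original prompt's semantics.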
OpenCompass @OpenCompassX
😀Introducing SpatialSky-Bench, a comprehensive benchmark designed to evaluate the spatial intelligence of VLMs in UAV navigation. 😉Two categories: Environmental Perception and Scene Understanding. 😉13 subcategories, including bounding boxes, color, distance, height, and landing-safety analysis, among others. 🥰SpatialSky-Bench is now part of the Daily Benchmark. Explore more: hub.opencompass.org.cn/daily-benchmar… #VLM #UAV #AI
OpenCompass @OpenCompassX
🥰Introducing VP-Bench, a benchmark for assessing MLLMs' capability in VP perception and utilization. 😊Two-stage evaluation framework: 1. Examines models' ability to perceive VPs in natural scenes, using 30k visualized prompts spanning eight shapes and 355 attribute combinations. 2. Investigates the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios. 👏VP-Bench is now part of the Daily Benchmark. Explore more: hub.opencompass.org.cn/daily-benchmar…
OpenCompass @OpenCompassX
🥰Introducing OutSafe-Bench, the first comprehensive content-safety evaluation suite designed for the multimodal era. 😍Covers 4 modalities: 18,000+ bilingual (ZH/EN) text prompts, 4,500 images, 450 audio clips, and 450 videos. 👏OutSafe-Bench is now part of the Daily Benchmark. 👇Explore more: hub.opencompass.org.cn/daily-benchmar…
OpenCompass @OpenCompassX
🚀 OpenCompass Daily Benchmark is live! ✅ Daily updates of the latest AI evaluation papers ✅ AI-powered smart summaries ✅ Available in English & Chinese 😍Stay ahead of AI trends, key insights, and cutting-edge research—all in one place! 🔗 hub.opencompass.org.cn/daily-benchmar…
OpenCompass @OpenCompassX
🚀 Introducing #CompassVerifier: a unified and robust answer verifier for #LLMs evaluation and #RLVR!
✨LLM progress is bottlenecked by weak evaluation. Looking for an alternative to rule-based verifiers? CompassVerifier handles multiple domains, including math, science, and reasoning, as well as various answer types such as multiple-choice questions, sequences, and multiple subproblems.
🔍 Introducing #VerifierBench: a challenging dataset for evaluating the verification capabilities of different models! We collected over 1 million LLM responses using the #OpenCompass framework and selected the most challenging examples through multiple rounds of screening and manual annotation.
🏠 Main Page: open-compass.github.io/CompassVerifier
📄 Paper: arxiv.org/pdf/2508.03686
💻 Code: github.com/open-compass/C…
🤗 Model & Datasets: @huggingface huggingface.co/collections/op…
OpenCompass @OpenCompassX
🥳#StructFlowBench is a structurally annotated multi-turn benchmark that leverages a structure-driven generation paradigm to enhance the simulation of complex dialogue scenarios. 🥳StructFlowBench is now part of the #CompassHub! 😉Feel free to download and explore it—available for public use. 🤗hub.opencompass.org.cn/dataset-detail…
OpenCompass @OpenCompassX
😉#VBench is a comprehensive benchmark that evaluates video-generation quality across 16 dimensions, and also provides a dataset of human preference annotations. 🥳VBench is now part of the #CompassHub! Feel free to download and explore it—available for public use. 🤗hub.opencompass.org.cn/dataset-detail…
OpenCompass @OpenCompassX
🥰VLM²-Bench is the first comprehensive benchmark that evaluates vision-language models' (#VLMs) ability to visually link matching cues across multi-image sequences and videos. The benchmark consists of 9 subtasks with over 3,000 test cases. 🥳VLM²-Bench is now part of the #CompassHub! Feel free to download and explore it—available for public use. ☺️hub.opencompass.org.cn/dataset-detail…
OpenCompass @OpenCompassX
🌟 Exciting News! CompassArena is now back with some major updates:
- **Judge Copilot**: an LLM-as-a-Judge tool for model comparisons. 🤖
- **Enhanced Statistical Model**: improved Bradley-Terry accuracy by addressing confounding variables. 📊
- **20+ New LLMs**: a global mix of commercial and open-source models. 🌍
Discover the latest CompassArena today! huggingface.co/spaces/opencom…
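For context, the Bradley-Terry model mentioned above rates models from pairwise battle outcomes, assuming P(i beats j) = pi_i / (pi_i + pi_j). CompassArena's enhanced estimator isn't shown in the tweet; this is a minimal sketch of the classic MM fit on hypothetical battle records (model names and counts are made up):

```python
from collections import defaultdict

def bradley_terry(battles, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via the
    standard MM update: pi_i <- wins_i / sum_j n_ij / (pi_i + pi_j),
    where n_ij is the number of games between i and j."""
    models = {m for pair in battles for m in pair}
    wins = defaultdict(int)
    n = defaultdict(int)  # games per unordered pair
    for w, l in battles:
        wins[w] += 1
        n[frozenset((w, l))] += 1
    pi = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            denom = sum(n[frozenset((i, j))] / (pi[i] + pi[j])
                        for j in models if j != i and n[frozenset((i, j))])
            new[i] = wins[i] / denom if denom else pi[i]
        total = sum(new.values())                  # normalize for stability
        pi = {m: v * len(models) / total for m, v in new.items()}
    return pi

# Hypothetical arena data: A beats B 7-3, A beats C 8-2, B beats C 6-4
battles = ([("A", "B")] * 7 + [("B", "A")] * 3 + [("A", "C")] * 8 +
           [("C", "A")] * 2 + [("B", "C")] * 6 + [("C", "B")] * 4)
ratings = bradley_terry(battles)
# Expect the fitted strengths to order A > B > C
```

The "confounding variables" fix presumably extends this basic likelihood with covariates (e.g. answer length or style); the sketch shows only the vanilla model.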
OpenCompass retweeted
meng shao @shao__meng
Do large language models have stable reasoning ability? "Research from Shanghai AI Laboratory @OpenCompassX reveals a key problem through a novel evaluation method: although LLMs may perform well in a single test (e.g., OpenAI's latest model reaches 66.5% single-run accuracy), in scenarios requiring consistently stable output, almost all models' performance drops sharply (typically by more than 60%), showing that current AI models' 'thinking ability' is still not stable or reliable." The paper currently ranks first on HF Papers for the day and is drawing a lot of attention @_akhaliq 👍
1. Background and problem
· Although LLMs perform well on complex tasks such as mathematical reasoning, there is a gap between benchmark scores and real-world effectiveness
· Existing evaluation methods cannot truly reflect a model's stability in practical use
2. Main innovations
· G-Pass@k metric: evaluates a model's ability to keep producing correct answers across multiple attempts
· LiveMathBench: collects the latest math-competition problems to ensure timeliness and difficulty
3. Key findings
· General-purpose models
- Gemini-1.5-Pro performs best: 49.2% single-run, dropping to 26.8% under the strict criterion
- GPT-4 and Claude-3.5 are similar: around 40% single-run, both falling below 20% under strict testing
· Math-specialized models
- Qwen2.5-Math-72B has the highest potential: up to 50.4% single-run, dropping to 26.8% in the stability test
- Other specialized models fare poorly: DeepSeek-Math and NuminaMath both fall below 6% under strict testing
· o1-series highlights
- OpenAI o1-mini is the most stable: 66.5% single-run, still 42.0% under strict testing
- QwQ-32B follows closely: maintaining 33.3% under strict testing, showing better reasoning stability
4. Core takeaways
· Stability challenge
- Almost all models degrade significantly under strict testing
- Average performance decay exceeds 60%
- Even specially optimized math models struggle to stay stable
· Limits of scale
- Increasing parameter count does not significantly improve reasoning stability
- e.g., the 72B and 32B Qwen models perform similarly
· Future directions
- Evaluation should emphasize stability rather than single-run performance
- The o1 series' architectural innovations may offer new ways to improve stability
- Evaluation methods need to move closer to real application scenarios
OpenCompass retweeted
Haodong Duan @KennyUTC
OpenCompass has established a leaderboard to evaluate the complex reasoning capability of LMMs, consisting of four advanced multi-modal math reasoning benchmarks. Currently, Gemini-2.0-Flash holds 1st place. DM me to suggest more benchmarks and models for this LB.
OpenCompass @OpenCompassX
🚀 Shocking: o1-mini scores just 15.6% on AIME under strict, real-world metrics. 🚨
📈 Introducing G-Pass@k: a metric that reveals LLMs' performance consistency across trials.
🌐 LiveMathBench: challenging LLMs with contemporary math problems while minimizing data leakage.
🔍 Our research shows LLMs have significant room for improvement in practical reasoning tasks.
🔗 Dive deeper:
- Arxiv: arxiv.org/abs/2412.13147
- HF: huggingface.co/papers/2412.13…
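Based on the description above, G-Pass@k with threshold tau asks whether at least ceil(tau*k) of k sampled generations are correct. A minimal sketch of a hypergeometric-tail estimator in the style of the standard pass@k estimator — the exact definition of G-Pass@k is in the paper, so treat this form as an assumption:

```python
from math import ceil, comb

def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Probability that, drawing k of n generations without replacement
    (c of the n being correct), at least ceil(tau*k) draws are correct.
    A hypergeometric tail, analogous to the pass@k estimator."""
    m = ceil(tau * k)
    return sum(comb(c, j) * comb(n - c, k - j)
               for j in range(m, min(c, k) + 1)) / comb(n, k)

# 8 generations, 4 correct: only one fully-correct 4-draw out of C(8,4)=70
print(g_pass_at_k(8, 4, 4, 1.0))   # 1/70 ≈ 0.0143
print(g_pass_at_k(8, 8, 4, 0.5))   # every generation correct → 1.0
```

At tau = 1.0 the metric demands every one of the k samples be correct, which is why stability scores collapse relative to single-run accuracy: a model that is right only most of the time is penalized heavily.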