Rubio Huang

32 posts

@HuangRubio

Joined September 2019
300 Following · 45 Followers
Rubio Huang retweeted
Rui-Jie (Ridger) Zhu @RidgerZhu
Thrilled to release our new paper, "Scaling Latent Reasoning via Looped Language Models." TL;DR: we scale looped language models to 2.6 billion parameters, pretrained on more than 7 trillion tokens. The resulting model is on par with SOTA language models of 2-3x its size.
23 replies · 150 reposts · 685 likes · 173.9K views
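The looped-LM recipe above can be sketched in a few lines of plain Python: a single weight-tied block is applied repeatedly, so effective depth (and latent reasoning compute) grows while the parameter count stays fixed. This is an illustrative toy under that one assumption, not the paper's actual architecture; all names here are made up.

```python
# Toy illustration of a looped language model: one shared block,
# applied repeatedly, gives more compute without more parameters.

def make_block(dim):
    # a fixed toy "layer" whose weights are shared by every loop iteration
    w = [[0.1 * ((i + j) % 3) for j in range(dim)] for i in range(dim)]
    def block(x):
        # linear map plus a residual connection
        return [sum(w[i][j] * x[j] for j in range(dim)) + x[i]
                for i in range(dim)]
    return block, dim * dim  # the block and its parameter count

def looped_forward(x, block, loops):
    # unrolled depth = loops; parameters = one block's worth, regardless
    for _ in range(loops):
        x = block(x)
    return x

block, n_params = make_block(4)
shallow = looped_forward([1.0, 0.0, 0.0, 0.0], block, loops=1)
deep = looped_forward([1.0, 0.0, 0.0, 0.0], block, loops=8)
# 'deep' used 8x the compute of 'shallow' with the same n_params
```

The point of the exercise: making the model "reason longer" is a matter of raising `loops` at inference time, not of training a larger network.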
Rubio Huang retweeted
Daoguang Zan @zandaoguang
🔥 Can your LLM fix bugs beyond Python? Meet our Multi-SWE-bench, the first multilingual benchmark for issue resolution. Not just Python, but Java, TS, JS, Go, Rust, C, and C++ 🧩
💥 1,632 real-world issues
✅ Verified by 68 engineers
📦 Dockerized, reproducible, battle-tested
🧠 Covers easy, medium, and hard bug fixes
📊 Designed to benchmark LLMs as true dev agents
To scale beyond benchmarks, we also launch Multi-SWE-RL:
🎮 An open-source RL community building interactive training environments for LLMs as autonomous agents.
🌱 4,723 containerized issue-resolving tasks, 7 languages, and counting.
🤝 We invite the community to contribute, expand, and shape the future of software-native RL.
It took us a year to build. Now let's see what your model can do.
🏆 Leaderboard: multi-swe-bench.github.io
📄 Paper: arxiv.org/abs/2504.02605
🧬 Code: github.com/multi-swe-benc…
📚 Multi-SWE-bench dataset: huggingface.co/datasets/ByteD…
🎮 Multi-SWE-RL dataset: huggingface.co/datasets/ByteD…
#LLM #RL #SWEbench #OpenAI #Anthropic #DeepSeek #Doubao
8 replies · 10 reposts · 46 likes · 13.1K views
Rubio Huang retweeted
Ge Zhang @GeZhang86038849
[1/n] Super excited to announce SuperGPQA!!! We spent more than half a year to finally get it done! SuperGPQA is a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. It is also the largest high-quality benchmark annotation effort built through human-LLM collaboration! We thank ByteDance Inc. and 2077.AI for their sponsorship!
Resources:
Website: supergpqa.github.io
Huggingface: huggingface.co/datasets/m-a-p…
Github: github.com/SuperGPQA/Supe…
Paper: arxiv.org/abs/2502.14739
HF Paper: huggingface.co/papers/2502.14…
5 replies · 47 reposts · 213 likes · 29.2K views
Rubio Huang retweeted
Ge Zhang @GeZhang86038849
[1/n] 🎉 We are very pleased to introduce FineFineWeb, currently the largest open-source, fully automatic fine-grained classification effort for web data. Specifically, our contributions are as follows:
🔪 We decompose the entire deduplicated version of FineWeb into 67 categories, each with a significant amount of seed data.
🧮 We conduct a correlation analysis between vertical categories, as well as between vertical categories and common benchmarks, and also provide a distribution analysis of URLs and other content.
🧑‍⚖️ We provide test sets for PPL evaluation based on the 67 selected vertical domains of FineFineWeb, offered as a "small cup" (validation) and a "medium cup" (test).
🪙 We provide all the materials for the full fastText and BERT training pipeline.
📅 We will give suggestions on data proportioning based on our dataset (based on RegMix; coming soon in our report! Due to tight computing resources, it will arrive as soon as possible.)
7 replies · 45 reposts · 161 likes · 24.3K views
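The "small cup"/"medium cup" splits above are for perplexity (PPL) evaluation, which reduces to a one-liner: exponentiate the mean negative log-likelihood per token on a domain's test slice. A minimal sketch, with made-up log-probabilities standing in for a real model's outputs:

```python
import math

def perplexity(token_logprobs):
    # PPL = exp(mean negative log-likelihood per token)
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# hypothetical per-token natural-log probabilities on one domain slice
logprobs = [-2.1, -0.4, -1.3, -0.9, -3.0]
ppl = perplexity(logprobs)  # lower PPL = better fit to that domain
```

Comparing per-domain PPL across the 67 verticals is what such test sets enable: a model trained on more data from a domain should score a lower PPL on that domain's slice.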
Rubio Huang retweeted
Ge Zhang @GeZhang86038849
[1/n] 🔥 Happy to introduce FullStack Bench: a comprehensive evaluation dataset focusing on full-stack programming across 16 languages and more than 11 real-world application domains, such as data analysis, software engineering, and machine learning. Is your CodeLLM a full-stack coder, or just a LeetCode nerd? It's time to put your code LLMs to the test!!! 📝
11 replies · 34 reposts · 135 likes · 46.5K views
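A harness for a benchmark like this usually boils down to functional correctness: each task pairs a prompt with hidden tests, and a completion counts only if every test passes. A hypothetical, dependency-free sketch (the real benchmark's task format and scoring certainly differ):

```python
# Toy pass@1-style scorer: a task passes only if the model's
# completion satisfies every hidden test case.

tasks = [
    {"prompt": "return the sum of a list", "tests": [([1, 2, 3], 6), ([], 0)]},
    {"prompt": "return the largest element", "tests": [([3, 1, 2], 3)]},
]

def model_completion(prompt):
    # stand-in for an LLM call; returns a callable "solution"
    return sum if "sum" in prompt else max

def pass_at_1(tasks):
    passed = 0
    for task in tasks:
        fn = model_completion(task["prompt"])
        if all(fn(inp) == out for inp, out in task["tests"]):
            passed += 1
    return passed / len(tasks)

score = pass_at_1(tasks)  # fraction of tasks solved on the first try
```

Extending this shape to 16 languages mostly means swapping the in-process call for sandboxed compilation and execution per language, which is where the engineering effort of such benchmarks goes.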
Rubio Huang retweeted
Ge Zhang @GeZhang86038849
[1/n] ### Discover AutoKaggle: Revolutionizing Data Science Competitions with Multi-Agent Collaboration!
🚀 Introducing AutoKaggle, a multi-agent framework designed to automate the full spectrum of data science competitions on Kaggle! From background understanding to model prediction, AutoKaggle takes on all phases, boosting efficiency and reducing manual overhead.
💡 Highlights of AutoKaggle:
🛠️ Phase-based workflow: six key phases spanning understanding, EDA, cleaning, feature engineering, and model building.
🤖 Five specialized agents: Reader, Planner, Developer, Reviewer, Summarizer.
🔁 Iterative debugging & unit testing for robust, correct code generation.
📊 Built-in ML tools library to handle data cleaning, feature engineering, and modeling.
🤤 Flexible customization support for the ML tools library lets you drive the workflow however you want.
7 replies · 35 reposts · 153 likes · 15.1K views
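The phase-based workflow with iterative debugging described above can be sketched as a gated loop: each phase's output must pass review (unit-test style) before the pipeline advances, with a bounded number of retries. Everything here is illustrative; the real framework's phases and agent interfaces differ.

```python
# Toy phase-gated pipeline: a "developer" produces work, a "reviewer"
# gates it, and failures trigger a bounded debug-and-retry loop.

PHASES = ["understanding", "eda", "cleaning", "feature_engineering", "model_building"]

def developer(phase, attempt):
    # stand-in for an LLM agent; "fails" once on cleaning to show the retry path
    return {"phase": phase, "ok": attempt > 0 or phase != "cleaning"}

def reviewer(result):
    # unit-test-style gate deciding whether the phase may complete
    return result["ok"]

def run_pipeline(max_retries=2):
    log = []
    for phase in PHASES:
        for attempt in range(max_retries + 1):
            if reviewer(developer(phase, attempt)):
                log.append((phase, attempt))
                break
        else:
            raise RuntimeError(f"phase {phase} failed after {max_retries} retries")
    return log

log = run_pipeline()  # cleaning succeeds on its second attempt (attempt index 1)
```

The `for`/`else` retry idiom is the whole trick: a phase either converges within its retry budget or halts the run, so downstream phases never see unreviewed artifacts.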
Rubio Huang retweeted
Ge Zhang @GeZhang86038849
[1/n] ### Exploring the Boundaries of AI Reasoning: the Launch of KOR-Bench
🚀 To more accurately assess large models' reasoning in new, unfamiliar areas, we're thrilled to introduce the all-new KOR-Bench (Knowledge-Orthogonal Reasoning Benchmark)!
### 💡 Highlights of KOR-Bench:
> 5 categories (🔢 Operation, 🔍 Logic, 🔐 Cipher, 🧩 Puzzle, 📖 Counterfactual) assess reasoning from multiple perspectives, using 25 custom rules 📜 with 10 problem ❓ instances each, ensuring the rules are orthogonal to pre-training data.
> Minimizes reliance on pre-trained knowledge by testing large language models' ability to solve new rule-driven questions from fresh rule descriptions, ensuring a fairer evaluation of models' true reasoning skills.
> Encourages models to break out of traditional frameworks and adapt to non-standard challenges, revealing abilities in reading comprehension, on-the-fly learning, knowledge transfer, logical reasoning, and problem-solving.
🔗 #Reasoning #KORBench #LargeLanguageModels #Benchmark
3 replies · 14 reposts · 53 likes · 5.3K views
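What "knowledge-orthogonal" means in practice: an item states a freshly invented rule, and the model must apply it, so memorized facts don't help. A made-up Cipher-style example (not an actual benchmark instance):

```python
# Invented rule: shift each lowercase letter forward by its 1-based
# position in the word, wrapping past 'z'. Solving this requires only
# the rule text, not anything from pre-training.

RULE = "shift each letter forward by its 1-based position, wrapping past 'z'"

def apply_rule(word):
    return "".join(
        chr((ord(ch) - ord("a") + i) % 26 + ord("a"))
        for i, ch in enumerate(word, start=1)
    )

question = f"Rule: {RULE}. Encode 'abc'."
answer = apply_rule("abc")  # 'a'+1, 'b'+2, 'c'+3 -> "bdf"
```

Because the rule is novel, a model's score on such items tracks its ability to follow instructions and reason, rather than its recall of standard ciphers like ROT13.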
Rubio Huang retweeted
JB @IAMJBDEL
HuggingFace Paper-central now hosts open-source leaderboards. This is like an h-index, but for 🤗 artifacts. Discover the authors whose papers have attracted the most open-source artifacts (datasets, models, or Spaces), and the most active contributors who have developed artifacts associated with papers.
2 replies · 15 reposts · 44 likes · 35.3K views
Rubio Huang retweeted
Yizhi Li @yizhilll
Exciting news! We're thrilled to introduce OmniBench: a groundbreaking benchmark for evaluating omni-language models (OLMs) that can process visual, acoustic, and textual inputs simultaneously! 🖼️🔊📝 huggingface.co/papers/2409.15… #Multimodal #LLM
1 reply · 8 reposts · 15 likes · 2.6K views
Rubio Huang retweeted
Wenhu Chen @WenhuChen
A sad truth about evaluation: if you make a private test set for your benchmark, people just won't adopt it. We host our official MMMU private test set on EvalAI (eval.ai/web/challenges…), but everyone still reports the validation score. I've found it's similar for MathVista, where everyone just reports the testmini score.
9 replies · 11 reposts · 196 likes · 83.2K views
Alexander Kolesnikov @__kolesnikov__
I am at CVPR; DM me if you want to meet in person.
5 replies · 1 repost · 16 likes · 9.5K views
Rubio Huang retweeted
Siwei Wu (吴思为) @siweiwu7
1/ Excited to announce the release of our new paper, "SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval". This benchmark comprises 530K meticulously curated image-text pairs extracted from scientific documents (arXiv papers). arxiv.org/abs/2401.13478
1 reply · 14 reposts · 30 likes · 9.3K views