BigCode

272 posts

BigCode

@BigCodeProject

Open and responsible research and development of large language models for code. #BigCodeProject run by @huggingface + @ServiceNowRSRCH

Bergabung Ağustos 2022

3 Mengikuti9.2K Pengikut

Tweet Disematkan

BigCode@BigCodeProject·28 Şub

Introducing: StarCoder2 and The Stack v2 ⭐️ StarCoder2 is trained with a 16k token context and repo-level information for 4T+ tokens. All built on The Stack v2 - the largest code dataset with 900B+ tokens. All code, data and models are fully open! hf.co/bigcode/starco…

English

202

665

222.4K

BigCode me-retweet

Terry Yue Zhuo@terryyuezhuo·8 Eki

It’s so much fun working with the other 39 community members on this project! Start to try out various frontier models in BigCodeArena today.

BigCode@BigCodeProject

Introducing BigCodeArena, a human-in-the-loop platform for evaluating code through execution. Unlike current open evaluation platforms that collect human preferences on text, it enables interaction with runnable code to assess functionality and quality across any language.

English

128

47.4K

BigCode@BigCodeProject·8 Eki

- For more details, please check out the blog: huggingface.co/blog/bigcode/a… - Try recent LLMs (e.g., Qwen3 series and DeepSeek-V3.2) on BigCodeArena now: huggingface.co/spaces/bigcode… - Paper Link: drive.google.com/file/d/1gt5Ws0… - GitHub: github.com/bigcode-projec…

English

797

BigCode@BigCodeProject·8 Eki

BigCodeArena cannot be built without the support of the BigCode community. We are grateful for the huge credits provided by the @e2b team. We thank @hyperbolic_labs, @nvidia, and @Alibaba_Qwen for providing the model inference endpoints.

English

957

BigCode@BigCodeProject·8 Eki

English

43.9K

BigCode me-retweet

Terry Yue Zhuo@terryyuezhuo·7 Eki

BigCodeBench @BigCodeProject evaluation framework has been fully upgraded! Just pip install -U bigcodebench With v0.2.0, it's now much easier to use compared to the previous v0.1.* versions. The new version adopts the @Gradio Client API interface from @huggingface Spaces by default, w/o the need for local environment setup, and can be replaced with a custom API if desired. Moreover, the latest version no longer requires running separate commands for each stage (like generate, sanitize, and evaluate), significantly simplifying the workflow. The new version also features Batch Inference — running the LLMs on the BigCodeBench-Full set now takes under 5mins for generation and execution! BTW, the benchmark data has been updated to v0.1.2, improving task instructions and test examples. Some of the updates in this release were inspired by EvalPlus @JiaweiLiu_ . A big thank you for the continued maintenance of EvalPlus and the strong support for BigCodeBench 🤗

English

6.2K

BigCode me-retweet

Josh@JoshPurtell·5 Eyl

Evaluating LM agents has come a long way since gpt-4 released in March of 2023. We now have SWE-Bench, (Visual) Web Arena, and other evaluations that tell us a lot about how the best models + architectures do on hard and important tasks. There's still lots to do, though 🧵

English

12K

BigCode me-retweet

Terry Yue Zhuo@terryyuezhuo·3 Eyl

People may think BigCodeBench @BigCodeProject is nothing more than a straightforward coding benchmark, but it is not. BigCodeBench is a rigorous testbed for LLM agents using code to solve complex and practical challenges. Each task demands significant reasoning capabilities for selecting appropriate library APIs and logically connecting them to craft a program. Rather than merely providing high-level instructions, each task comes with detailed requirements to evaluate the model's ability to adhere to all aspects. While language models typically perform well on short and simple tasks, they often struggle with longer and more complex problems (e.g., BigCodeBench-Hard). A model needs to perform well on BigCodeBench before it can be used for agentic software development.

English

6.6K

BigCode me-retweet

Qian Liu@sivil_taram·23 Ağu

By popular demand, I have released the StarCoder2 code documentation dataset, please check it out ⬇️ hf.co/datasets/Sivil…

English

BigCode me-retweet

Arjun Guha@ArjunGuha·21 Ağu

This work will appear at OOPSLA 2024. New since last year: the StarCoder2 LLM from @BigCodeProject uses MultiPL-T as part of its pretraining corpus.

Arjun Guha@ArjunGuha

LLMs are great at programming tasks... for Python and other very popular PLs. But, they are often unimpressive at artisanal PLs, like OCaml or Racket. We've come up with a way to significantly boost LLM performance of on low-resource languages. If you care about them, read on!

English

1.6K

BigCode me-retweet

Terry Yue Zhuo@terryyuezhuo·19 Ağu

Today, we are happy to announce the beta mode of real-time Code Execution for BigCodeBench @BigCodeProject, which has been integrated into our Hugging Face leaderboard. We understand that setting up a dependency-based execution environment can be cumbersome, even with the built-in Docker image and Dockerfile. To make the evaluation process more reproducible, we've built an interactive environment for you, with guidance from the @Gradio team! (Special thanks to @evilpingwin 🤗) Please note: (1) The execution process might be slightly slower than what you experience on a local machine, as we are using the basic CPU option. There are some compatibility issues with the upgraded CPU environment, and we are currently exploring solutions. (2) Four tasks in the full set require some tricky setup, which has resulted in a pass rate of 99.6%. We will work to fix these in the next iterations :)

Terry Yue Zhuo@terryyuezhuo

In the past few months, we’ve seen SOTA LLMs saturating basic coding benchmarks with short and simplified coding tasks. It's time to enter the next stage of coding challenge under comprehensive and realistic scenarios! -- Here comes BigCodeBench, benchmarking LLMs on solving practical and challenging programming tasks! So, can LLMs solve these tasks? - Not yet! 🏆 Pass @1: Humans ace 97%, GPT-4o only hits 50-60%, but DeepSeek-Coder-V2 is tighy at its heels! Check out our leaderboard, data, code, and paper: bigcode-bench.github.io 1/🧵

English

20.3K

BigCode@BigCodeProject·17 Tem

Releasing BigCodeBench-Hard: a subset of more challenging and user-facing tasks. BigCodeBench-Hard provides more accurate model performance evaluations and we also investigate some recent model updates. Read more: huggingface.co/blog/terryyz/b… Leaderboard: huggingface.co/spaces/bigcode…

English

35.6K

BigCode me-retweet

Rajiv Shah@rajistics·25 Haz

BigCodeBench dataset🌸 Use it as inspiration when building your Generative AI evaluations. BigCodeBench h/t: @BigCodeProject @terryyuezhuo @lvwerra @clefourrier @huggingface (to name just a few of the people involved)

English

1.9K

BigCode me-retweet

Terry Yue Zhuo@terryyuezhuo·19 Haz

Ppl are curious about the performance of DeepSeek-Coder-V2-Lite on BigCodeBench. We've added its results, along with a few other models, to the leaderboard! huggingface.co/spaces/bigcode… DeepSeek-Coder-V2-Lite-Instruct is a beast indeed, similar to Magicoder-S-DS-6.7B, but with only 2.4B activated parameters! 🤯 We also update all the code generation results here: github.com/bigcode-projec… Feel free to submit a PR if you want to see other models on BigCodeBench 🤗github.com/bigcode-projec…

BigCode@BigCodeProject

Introducing 🌸BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks! BigCodeBench goes beyond simple evals like HumanEval and MBPP and tests LLMs on more realistic and challenging coding tasks.

English

5.6K

BigCode me-retweet

Philipp Schmid@_philschmid·19 Haz

It is time to deprecate HumanEval! 🧑🏻‍💻 @BigCodeProject just released BigCodeBench, a new benchmark to evaluate LLMs on challenging and complex coding tasks focused on realistic, function-level tasks that require the use of diverse libraries and complex reasoning! 👀 🧩 Contains 1,140 tasks with 5.6 test cases each, covering 139 libraries in Python. 📊 Uses Pass@1 with greedy decoding and Elo rating for comprehensive evaluation. 🏆 Best model is GPT-4o 61.1%, followed by DeepSeek-Coder-V2. 🥈 Best open Model is DeepSeek-Coder-V2 with 59.7%, better than Claude 3 Opus or Gemini. 👥 Tasks are created in a three-stage process, including synthetic data generation and cross-validation by humans. 🧱 Evaluation framework and Docker images available for easy reproduction 🔜 Plans to extend to multilingualism. Blog: hf.co/blog/leaderboa… Leaderboard: huggingface.co/spaces/bigcode… Code: github.com/bigcode-projec…

English

240

35.9K

BigCode me-retweet

Terry Yue Zhuo@terryyuezhuo·18 Haz

In the past few months, we’ve seen SOTA LLMs saturating basic coding benchmarks with short and simplified coding tasks. It's time to enter the next stage of coding challenge under comprehensive and realistic scenarios! -- Here comes BigCodeBench, benchmarking LLMs on solving practical and challenging programming tasks! So, can LLMs solve these tasks? - Not yet! 🏆 Pass@1: Humans ace 97%, GPT-4o only hits 50-60%, but DeepSeek-Coder-V2 is tighy at its heels! Check out our leaderboard, data, code, and paper: bigcode-bench.github.io 1/🧵

BigCode@BigCodeProject

English

123

46.8K

BigCode@BigCodeProject·18 Haz

We release leaderboard, dataset, code, and paper: - 🤓 Blog: hf.co/blog/leaderboa… - 🌐 Website: bigcode-bench.github.io - 🏆 Leaderboard: huggingface.co/spaces/bigcode… - 📚 Dataset: huggingface.co/datasets/bigco… - 🛠️ Code: github.com/bigcode-projec… - 📄 Paper: github.com/bigcode-bench/…

English

2.1K

BigCode@BigCodeProject·18 Haz

BigCodeBench contains 1,140 function-level tasks to challenge LLMs to follow instructions and compose multiple function calls as tools from 139 Python libraries. To evaluate LLMs rigorously, each programming task encompasses 5.6 test cases with an average branch coverage of 99%.

English

2.1K

BigCode@BigCodeProject·18 Haz

GIF

English

212

102.1K

Jelajahi

@e2b @hyperbolic_labs @nvidia @Alibaba_Qwen @Gradio @huggingface @JiaweiLiu_ @evilpingwin