
GitStart
21 posts

@GitStartHQ
We have officially moved to @GitStart. Find us there!


no it can't

LongCodeBench provides a benchmark using real GitHub data, evaluating models on code comprehension and repair at scales up to one million tokens.

Methods Explored in this Paper 🔧:

→ LongCodeBench includes LongCodeQA for comprehension and LongSWE-Bench for bug fixing, using real GitHub issues.
→ LongCodeQA uses multiple-choice questions derived from GitHub issue discussions about repository content.
→ LongSWE-Bench tests models by requiring patch generation for bugs, with evaluation based on passing unit tests.
→ The benchmark uses data from 108 repositories, 1043 instances, spanning context lengths from 32 thousand to 1 million tokens.

📌 Claude 3.5 Sonnet LongSWE-Bench accuracy falls from 29% at 32K context to 3% at 256K.
📌 Qwen2.5 LongCodeQA accuracy drops from 70.2% at 512K context to 40% at 1M.
📌 Long-context coding remains a critical weakness; open-source models largely fail LongSWE-Bench beyond 32K.

----------------------------

Paper - arxiv.org/abs/2505.07897v1
Paper Title: "LongCodeBench: Evaluating Coding LLMs at 1M Context Windows"

today we are introducing codex. it is a software engineering agent that runs in the cloud and does tasks for you, like writing a new feature or fixing a bug. you can run many tasks in parallel.


Here’s the incredible story of Wajiha, an Afghan woman who went from sewing clothes by the piece to becoming a software engineer @GitStart 1/5