Mouxiang Chen

8 posts

Mouxiang Chen
@MouxiangC

Ph.D. student @ZJU_China | LLM | code generation | time series | unbiased ranking & recommendation

Joined November 2021
13 Following · 18 Followers
Mouxiang Chen reposted
Zhongxin Liu@Zhongxin_Liu·
How far are we from natural language programming? We introduce NoCode-bench, a benchmark of 634 real-world feature-addition tasks, from software documentation changes to validated code changes, toward answering this question. Even top models struggle to implement new features from software docs, facing challenges such as cross-file edits, large-codebase comprehension, and accurate tool use. Paper: arxiv.org/abs/2507.18130 Code: github.com/NoCode-bench/N… Hugging Face: huggingface.co/NoCode-bench Leaderboard: nocodebench.org
Mouxiang Chen@MouxiangC·
@main_horse @teortaxesTex If combined with MoE, I would do an additional aggregation before each routing, so that the activation weights are the same across the different streams. This may be more cost-efficient.
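The aggregation Mouxiang describes could look something like the sketch below (function names and tensor shapes are illustrative assumptions, not code from the ParScale repo): pool the hidden states of the P parallel streams before the router, so all streams make one shared expert selection and expert retrieval cost stays near 1x instead of up to Px.

```python
import numpy as np

def shared_routing_logits(stream_hidden, w_router):
    """Hypothetical sketch: average hidden states across the P parallel
    streams before computing router logits, so every stream activates the
    same experts."""
    # stream_hidden: (P, tokens, dim); w_router: (dim, n_experts)
    pooled = stream_hidden.mean(axis=0)          # aggregate across streams
    logits = pooled @ w_router                   # one shared routing decision
    # broadcast the shared decision back so all P streams pick identical experts
    return np.broadcast_to(logits, (stream_hidden.shape[0],) + logits.shape)

rng = np.random.default_rng(0)
P, T, D, E = 4, 8, 16, 6
h = rng.standard_normal((P, T, D))
w = rng.standard_normal((D, E))
logits = shared_routing_logits(h, w)
# all streams now share one expert selection per token
assert all(np.array_equal(logits[0], logits[p]) for p in range(P))
```

This directly addresses the failure mode in the reply above: without the pooling step, per-stream prefixes would push each stream toward different experts.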
main@main_horse·
@teortaxesTex wait, no. it shouldn't work well w/ MoE. uniq kv prefix per stream -> token emb differs per stream (after 1st attn layer) -> routing differs per stream -> close to Px more experts retrieved 😬 i did not spot this issue while reading the paper earlier, thanks
Mouxiang Chen@MouxiangC·
@johnowhitaker @huybery If you compare N=3B/P=1 and N=3B/P=2, the shape of the loss curve is similar, so I believe the introduction of prefixes may not be closely related to the overfitting dynamics.
Jonathan Whitaker@johnowhitaker·
@huybery A question for you re: the (lack of) overfitting on repeated epochs for PARSCALE: could this be because the trained prefixes act a little like data augmentation, introducing a bit of variation in the data rather than having it exactly the same each epoch?
Binyuan Hui@huybery·
Parameter and inference-time scaling have already demonstrated that more compute brings more intelligence. 🤔 But is there a new way to scale compute? The answer might be yes! We propose Parallel Scaling—increasing parallel computation during training and inference. As an exploratory study, we theoretically propose a new scaling law and validate it through pre-training, which shows that a model with P parallel streams is similar to scaling the parameters by O(log P), while showing superior inference efficiency. 📄 arxiv.org/abs/2505.10475 🤖 github.com/QwenLM/ParScale 🤗hf.co/spaces/ParScal…
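The parallel-scaling idea in this announcement can be sketched as follows (a toy illustration under my own assumptions, not the actual ParScale implementation): run the same model P times on differently-prefixed copies of the input, then combine the P outputs with learned aggregation weights.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def parscale_forward(x, model, prefixes, agg_logits):
    """Toy sketch of parallel scaling: P streams share the model's
    parameters and differ only in a learned per-stream prefix; their
    outputs are mixed by a learned softmax over streams."""
    streams = [model(np.concatenate([p, x])) for p in prefixes]  # P parallel runs
    weights = softmax(agg_logits)                                # (P,) mixing weights
    return sum(w * s for w, s in zip(weights, streams))

# toy "model": mean over the (prefixed) sequence, just to make the sketch runnable
model = lambda seq: seq.mean()
x = np.array([1.0, 2.0, 3.0])
prefixes = [np.array([0.0]), np.array([10.0])]   # learned per-stream prefixes
out = parscale_forward(x, model, prefixes, np.zeros(2))  # equal mixing weights
```

The key point of the scaling law is that P such streams behave roughly like an O(log P) increase in parameters, while the extra compute is parallel and therefore cheap at inference time.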
Mouxiang Chen@MouxiangC·
@johnowhitaker @huybery Of course, this is also a reasonable hypothesis 🤔 My previous hypothesis was that smaller-parameter models have weaker memorization capabilities, and therefore are less likely to overfit by excessively memorizing the fine-grained features of the dataset.
Mouxiang Chen@MouxiangC·
@TheodoreGalanos @huybery Thank you for your support! The inference code is directly applicable to existing Qwen2.5 dense checkpoints (setting P=1 by default). We may release training code in the future, using the Hugging Face Trainer, which was also effective in my early experiments.
Theodore Galanos@TheodoreGalanos·
@huybery Amazing work! Thank you for inference code as well. I wonder, is this directly applicable to any existing open source checkpoint (assuming parscale only at end)? Do you plan to share training code for that perhaps? Small scale experiments across the community would be awesome.
Mouxiang Chen reposted
Zhongxin Liu@Zhongxin_Liu·
How to select the best LLM-generated solution? In our @ASE_conf paper, we discuss the theoretically optimal strategy for selecting the best LLM-generated solutions (e.g., unreliable code) based on LLM-generated validators (e.g., unreliable test cases). 👉🏻 Paper: huggingface.co/papers/2409.08… 👉🏻 Code: github.com/ZJU-CTAG/B4 1) We establish an optimal strategy for this problem within a Bayesian framework. 2) We show that identifying the best solution can be framed as an integer programming problem, and propose an efficient approach called B4 to approximate this optimal (yet uncomputable) strategy. 3) B4 significantly surpasses existing heuristics, achieving a relative performance improvement of up to 50% in the most challenging scenarios.
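For context on the problem B4 tackles, here is one of the simple consensus heuristics it is compared against (this is NOT the B4 algorithm itself, which frames selection as integer programming under a Bayesian prior): weight each unreliable generated test by how many candidate solutions pass it, then pick the solution with the highest weighted pass score.

```python
def select_solution(pass_matrix):
    """Heuristic baseline: a test that many solutions agree on is treated
    as more trustworthy, and solutions are scored by the total weight of
    the tests they pass."""
    n_tests = len(pass_matrix[0])
    # weight of test j = number of solutions that pass it
    test_weight = [sum(row[j] for row in pass_matrix) for j in range(n_tests)]
    scores = [sum(w for ok, w in zip(row, test_weight) if ok)
              for row in pass_matrix]
    return max(range(len(pass_matrix)), key=scores.__getitem__)

# 3 candidate solutions x 4 generated tests (True = solution passes test)
passes = [
    [True,  True,  False, False],
    [True,  True,  True,  False],
    [False, False, False, True ],
]
best = select_solution(passes)  # solution 1: highest agreement-weighted score
```

Heuristics like this can be fooled when many wrong solutions agree with each other; the paper's point is that the Bayesian-optimal strategy handles exactly that unreliability in a principled way.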
Mouxiang Chen reposted
Chenghao Liu@ChenghaoLiu15·
🚀 We’re thrilled to introduce VisionTS, a groundbreaking time series forecasting foundation model built from rich, high-quality natural images without any time-series training, showing superior accuracy compared to SOTAs like Moirai and TimesFM. arxiv.org/abs/2408.17253
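The core framing behind using an image model for forecasting can be sketched as follows (a simplified illustration under my own assumptions, not the VisionTS code): fold a 1-D series into a 2-D array by stacking consecutive periods as rows, so a pretrained image model can read the series as grayscale-like pixels.

```python
import numpy as np

def series_to_image(series, period):
    """Sketch: reshape a 1-D series into rows of one period each,
    dropping any incomplete trailing period."""
    n = len(series) // period * period          # drop the incomplete tail
    return np.asarray(series[:n], dtype=float).reshape(-1, period)

daily = np.sin(np.arange(21) * 2 * np.pi / 7)   # toy weekly-periodic series
img = series_to_image(daily, period=7)          # shape (3, 7): 3 weeks x 7 days
```

With a periodic series, the rows of the resulting "image" are nearly identical, which is exactly the kind of local 2-D regularity an image model is pretrained to exploit.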