Mouxiang Chen

8 posts

Mouxiang Chen
@MouxiangC

Ph.D. student @ZJU_China | LLM | code generation | time series | unbiased ranking & recommendation

Joined November 2021
13 Following · 18 Followers
Mouxiang Chen reposted
Zhongxin Liu@Zhongxin_Liu·
How far are we from natural language programming? We introduce NoCode-bench, a benchmark of 634 real-world feature-addition tasks, from software documentation changes to validated code changes, toward answering this question. Even top models struggle to implement new features from software docs, facing challenges such as cross-file edits, large-codebase comprehension, and accurate tool use. Paper: arxiv.org/abs/2507.18130 Code: github.com/NoCode-bench/N… Hugging Face: huggingface.co/NoCode-bench Leaderboard: nocodebench.org
Mouxiang Chen@MouxiangC·
@main_horse @teortaxesTex If combined with MoE, I would do an additional aggregation before each routing, so that the activation weights are the same across the different streams. This may be more cost-efficient.
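The aggregation Mouxiang describes could look something like the sketch below (function names and tensor shapes are illustrative assumptions, not code from the ParScale repo): pool the hidden states of the P parallel streams before the router, so all streams make one shared expert selection and expert retrieval cost stays near 1x instead of up to Px.

```python
import numpy as np

def shared_routing_logits(stream_hidden, w_router):
    """Hypothetical sketch: average hidden states across the P parallel
    streams before computing router logits, so every stream activates the
    same experts."""
    # stream_hidden: (P, tokens, dim); w_router: (dim, n_experts)
    pooled = stream_hidden.mean(axis=0)          # aggregate across streams
    logits = pooled @ w_router                   # one shared routing decision
    # broadcast the shared decision back so all P streams pick identical experts
    return np.broadcast_to(logits, (stream_hidden.shape[0],) + logits.shape)

rng = np.random.default_rng(0)
P, T, D, E = 4, 8, 16, 6
h = rng.standard_normal((P, T, D))
w = rng.standard_normal((D, E))
logits = shared_routing_logits(h, w)
# all streams now share one expert selection per token
assert all(np.array_equal(logits[0], logits[p]) for p in range(P))
```

This directly addresses the failure mode in the reply above: without the pooling step, per-stream prefixes would push each stream toward different experts.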
main@main_horse·
@teortaxesTex wait, no. it shouldn't work well w/ MoE. uniq kv prefix per stream -> token emb differs per stream (after 1st attn layer) -> routing differs per stream -> close to Px more experts retrieved 😬 i did not spot this issue while reading the paper earlier, thanks
Mouxiang Chen@MouxiangC·
@johnowhitaker @huybery If you compare N=3B/P=1 and N=3B/P=2, the shape of the loss curve is similar, so I believe the introduction of prefixes may not be closely related to the overfitting dynamics.
Jonathan Whitaker@johnowhitaker·
@huybery A question for you re: the (lack of) overfitting on repeated epochs for PARSCALE: could this be because the trained prefixes act a little like data augmentation, introducing a bit of variation in the data rather than having it exactly the same each epoch?
Binyuan Hui@huybery·
Parameter and inference-time scaling have already demonstrated that more compute brings more intelligence. 🤔 But is there a new way to scale compute? The answer might be yes! We propose Parallel Scaling—increasing parallel computation during training and inference. As an exploratory study, we theoretically propose a new scaling law and validate it through pre-training, which shows that a model with P parallel streams is similar to scaling the parameters by O(log P), while showing superior inference efficiency. 📄 arxiv.org/abs/2505.10475 🤖 github.com/QwenLM/ParScale 🤗hf.co/spaces/ParScal…
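The parallel-scaling idea in this announcement can be sketched as follows (a toy illustration under my own assumptions, not the actual ParScale implementation): run the same model P times on differently-prefixed copies of the input, then combine the P outputs with learned aggregation weights.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def parscale_forward(x, model, prefixes, agg_logits):
    """Toy sketch of parallel scaling: P streams share the model's
    parameters and differ only in a learned per-stream prefix; their
    outputs are mixed by a learned softmax over streams."""
    streams = [model(np.concatenate([p, x])) for p in prefixes]  # P parallel runs
    weights = softmax(agg_logits)                                # (P,) mixing weights
    return sum(w * s for w, s in zip(weights, streams))

# toy "model": mean over the (prefixed) sequence, just to make the sketch runnable
model = lambda seq: seq.mean()
x = np.array([1.0, 2.0, 3.0])
prefixes = [np.array([0.0]), np.array([10.0])]   # learned per-stream prefixes
out = parscale_forward(x, model, prefixes, np.zeros(2))  # equal mixing weights
```

The key point of the scaling law is that P such streams behave roughly like an O(log P) increase in parameters, while the extra compute is parallel and therefore cheap at inference time.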
Mouxiang Chen@MouxiangC·
@johnowhitaker @huybery Of course, this is also a reasonable hypothesis 🤔 My previous hypothesis was that smaller-parameter models have weaker memorization capabilities, and therefore are less likely to overfit by excessively memorizing the fine-grained features of the dataset.
Mouxiang Chen@MouxiangC·
@TheodoreGalanos @huybery Thank you for your support! The inference code is directly applicable to existing Qwen2.5 dense checkpoints (setting P=1 by default). We may release training code in the future, using the Hugging Face Trainer, which was also effective in my early experiments.
Theodore Galanos@TheodoreGalanos·
@huybery Amazing work! Thank you for inference code as well. I wonder, is this directly applicable to any existing open source checkpoint (assuming parscale only at end)? Do you plan to share training code for that perhaps? Small scale experiments across the community would be awesome.
Mouxiang Chen reposted
Zhongxin Liu@Zhongxin_Liu·
How to select the best LLM-generated solution? In our @ASE_conf paper, we discuss the theoretically optimal strategy for selecting the best LLM-generated solutions (e.g., unreliable code) based on LLM-generated validators (e.g., unreliable test cases). 👉🏻 Paper: huggingface.co/papers/2409.08… 👉🏻 Code: github.com/ZJU-CTAG/B4 1) We establish an optimal strategy for this problem within a Bayesian framework. 2) We show that identifying the best solution can be framed as an integer programming problem, and propose an efficient approach called B4 to approximate this optimal (yet uncomputable) strategy. 3) B4 significantly surpasses existing heuristics, achieving a relative performance improvement of up to 50% in the most challenging scenarios.
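For context on the problem B4 tackles, here is one of the simple consensus heuristics it is compared against (this is NOT the B4 algorithm itself, which frames selection as integer programming under a Bayesian prior): weight each unreliable generated test by how many candidate solutions pass it, then pick the solution with the highest weighted pass score.

```python
def select_solution(pass_matrix):
    """Heuristic baseline: a test that many solutions agree on is treated
    as more trustworthy, and solutions are scored by the total weight of
    the tests they pass."""
    n_tests = len(pass_matrix[0])
    # weight of test j = number of solutions that pass it
    test_weight = [sum(row[j] for row in pass_matrix) for j in range(n_tests)]
    scores = [sum(w for ok, w in zip(row, test_weight) if ok)
              for row in pass_matrix]
    return max(range(len(pass_matrix)), key=scores.__getitem__)

# 3 candidate solutions x 4 generated tests (True = solution passes test)
passes = [
    [True,  True,  False, False],
    [True,  True,  True,  False],
    [False, False, False, True ],
]
best = select_solution(passes)  # solution 1: highest agreement-weighted score
```

Heuristics like this can be fooled when many wrong solutions agree with each other; the paper's point is that the Bayesian-optimal strategy handles exactly that unreliability in a principled way.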
Mouxiang Chen reposted
Chenghao Liu@ChenghaoLiu15·
🚀 We’re thrilled to introduce VisionTS, a groundbreaking time series forecasting foundation model built from rich, high-quality natural images without any time-series training, showing superior accuracy compared to SOTAs like Moirai and TimesFM. arxiv.org/abs/2408.17253
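The core framing behind using an image model for forecasting can be sketched as follows (a simplified illustration under my own assumptions, not the VisionTS code): fold a 1-D series into a 2-D array by stacking consecutive periods as rows, so a pretrained image model can read the series as grayscale-like pixels.

```python
import numpy as np

def series_to_image(series, period):
    """Sketch: reshape a 1-D series into rows of one period each,
    dropping any incomplete trailing period."""
    n = len(series) // period * period          # drop the incomplete tail
    return np.asarray(series[:n], dtype=float).reshape(-1, period)

daily = np.sin(np.arange(21) * 2 * np.pi / 7)   # toy weekly-periodic series
img = series_to_image(daily, period=7)          # shape (3, 7): 3 weeks x 7 days
```

With a periodic series, the rows of the resulting "image" are nearly identical, which is exactly the kind of local 2-D regularity an image model is pretrained to exploit.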