xyou (@XYOU)
479 posts · Berlin, Germany · Joined June 2009 · 1.5K Following · 755 Followers
xyou@XYOU·
@MatthewBerman Sure about this? Given the current reproducibility crisis in ML research, I doubt that humans would achieve a much higher replication score.
Matthew Berman@MatthewBerman·
Which model won? Turns out Claude 3.5 Sonnet leads the pack, achieving a ~21% replication score on PaperBench! This is impressive, but it shows there's still a gap compared to human PhD-level experts.
Matthew Berman@MatthewBerman·
.@OpenAI dropped a new research paper showing AI agents are now capable of replicating cutting-edge AI research papers from scratch. This is one step closer to the Intelligence Explosion: AI that can discover new science and improve itself. Here’s what they learned: 🧵
xyou@XYOU·
4/ In academia, the work is very different. PhD students or even undergraduates do most of the actual research work. But as a PhD student, you need to decide whether to prioritize the project work over your own PhD work (papers and thesis).
xyou@XYOU·
3/ LLMs and other foundation models are no longer research artifacts but products. Frontier models are developed by dedicated teams of 100+ people specialized across the whole stack, from low-level hardware optimization and data to ML and UX.
Yifei Hu@hu_yifei·
I am currently working on an end-to-end OCR pipeline for research papers. Open Research Assistant needs a high-quality OCR pipeline to work properly, so I really have to solve the OCR problem before making more progress on the OpenRA project. Good news: paper OCR will be solved soon.
xyou@XYOU·
@gui_penedo @pjox13 That’s even better. I will share the data with you as soon as it’s ready!
Guilherme Penedo@gui_penedo·
@XYOU @pjox13 We're happy to run a training under the same conditions. You can find details on the model setup (we haven't posted the exact training script yet) and the exact eval code in our blog post.
Guilherme Penedo@gui_penedo·
We keep getting new pretraining datasets 🔥 Congratulations to the Matrix team for such a strong dataset!
xyou@XYOU·
@gui_penedo @pjox13 We will release a filtered version of Colossal OSCAR soon. Are your training and evaluation scripts available somewhere? I would love to do the comparison with that version.
xyou@XYOU·
@mark_cummins For Germany, we have ~50B tokens of court decisions, but those are only the publicly available ones and represent ~1% of all court decisions. However, you won't need all of them for LLM training due to the high duplicate ratio. @mlissner might have the US numbers.
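The "high duplicate ratio" point is why corpora like this get near-deduplicated before training. Below is a minimal, dependency-free sketch of the common MinHash approach; the example documents and the shingle/permutation sizes are illustrative assumptions, not from the thread.

```python
import hashlib
import re

def shingles(text, n=5):
    """Word 5-grams, a common unit for near-duplicate detection."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash(shingle_set, num_perm=64):
    """Signature = per-seed minimum of a salted hash over the shingles."""
    return [
        min(int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set)
        for seed in range(num_perm)
    ]

def est_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the court finds that the defendant is liable for the damages claimed"
doc_b = "the court finds that the defendant is liable for the damages alleged"
sim = est_jaccard(minhash(shingles(doc_a)), minhash(shingles(doc_b)))
print(f"estimated Jaccard similarity: {sim:.2f}")  # drop one doc above a chosen threshold
```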
Mark Cummins@mark_cummins·
@XYOU One other thing I forgot to include was court documents. Seems like you might know about that. Do you have any data on how many publicly accessible court documents exist?
xyou@XYOU·
@gui_penedo Awesome work. Will the remaining models also be released? And from your experience, what model and data size do you need to see a significant difference in performance?
Guilherme Penedo@gui_penedo·
We have just released 🍷 FineWeb: 15 trillion tokens of high quality web data. We filtered and deduplicated all CommonCrawl between 2013 and 2024. Models trained on FineWeb outperform RefinedWeb, C4, DolmaV1.6, The Pile and SlimPajama!
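For anyone wanting to poke at the release, the dataset streams directly from the Hugging Face Hub. A small sketch: the "HuggingFaceFW/fineweb" repo and "sample-10BT" config names are taken from the public FineWeb dataset card, so verify them there before relying on this.

```python
from datasets import load_dataset

# Stream a small sample config instead of downloading the full 15T-token dump.
fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                  split="train", streaming=True)

for row in fw.take(3):  # IterableDataset.take yields the first n rows
    print(row["text"][:120])
```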
xyou@XYOU·
@yoavgo "collected" 😎
(((ل()(ل() 'yoav))))👾@yoavgo·
"15T tokens collected from publicly available sources". what does "publicly available source" even mean?
OcciGlot@occiglot·
We have some great new evaluation results to share that were provided by the community. The German Occiglot model is the best in class on ScandEval: scandeval.com/german-nlg/ And our Spanish model achieves SOTA results in lexical word understanding.
Zengyi Qin@qinzytech·
Training LLMs can be much cheaper than previously thought. 0.1 million USD is sufficient for training LLaMA2-level LLMs🤯 While @OpenAI and @Meta use billions of dollars to train theirs, you can also train yours with much less money. Introducing our open-source project JetMoE: research.myshell.ai/jetmoe A thread 🧵
xyou@XYOU·
@BramVanroy @VSC_HPC If your cluster uses Slurm, you can catch the kill signal and save a checkpoint before it hits. See this script for an example; lines 14 and 293-300 do the magic: gist.github.com/malteos/71635c…
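The pattern from the gist, as a minimal standalone sketch: ask Slurm to send a warning signal ahead of the time limit (e.g. `#SBATCH --signal=SIGUSR1@90` in the sbatch script), trap it in the training process, and checkpoint at the next safe point. `save_checkpoint` here is a hypothetical stand-in for your own saving logic.

```python
import signal
import sys

stop_requested = False

def _on_sigusr1(signum, frame):
    # Only set a flag; saving mid-step from inside a signal handler is unsafe.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGUSR1, _on_sigusr1)

def save_checkpoint(step):
    print(f"checkpoint saved at step {step}")  # replace with real saving logic

for step in range(1_000_000):
    # ... one training step ...
    if stop_requested:
        save_checkpoint(step)
        sys.exit(0)
```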
xyou@XYOU·
@SebastianB929 OpenGPT-X is an official government-funded research project. Occiglot is a loose group of individuals from different organizations without any formal ties. We call it a research collective; you may also call it simply a Discord server. And yes, the website needs to be improved.
SebastianBoo@SebastianB929·
@XYOU Is Occiglot a research project like OpenGPT-X, or how can I get a better understanding of it? The project page is a bit confusing :)
xyou@XYOU·
@ZedDou1 @occiglot As mentioned in the readme, we suspect that this is due to the benchmarks being machine translated from English and based on English prompts.
Jordan@ZedDou1·
@occiglot Nice to see multilinguality addressed more and more, great work! I do have a question though: how would you explain the gap in the evals across the 5 languages between your models (base and instruct) and the Mistral models, which are mostly English? 🤔
OcciGlot@occiglot·
Today, we are announcing Occiglot! A large-scale collaborative research collective focusing on open-source European LLMs. We invite anybody working on multilingual datasets, benchmarks, or models to get in touch/join our discord. occiglot.github.io/occiglot/posts…
xyou@XYOU·
@BramVanroy Have you tried tensor parallelism on the embedding layer? If I remember correctly, BLOOM used this for its large vocab. @StasBekman
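For context, "tensor parallelism on the embedding layer" means sharding the embedding table over the vocabulary dimension, as in Megatron-LM and BLOOM. A toy single-process sketch of the idea; a real implementation replaces the commented line with an all-reduce across ranks.

```python
import torch
import torch.nn as nn

class VocabParallelEmbedding(nn.Module):
    """Each rank stores vocab_size / world_size rows of the table.
    Tokens outside a rank's shard contribute zeros; summing the partial
    outputs across ranks recovers the full embedding lookup."""

    def __init__(self, vocab_size, dim, rank, world_size):
        super().__init__()
        shard = vocab_size // world_size
        self.start, self.end = rank * shard, (rank + 1) * shard
        self.weight = nn.Parameter(torch.randn(shard, dim) * 0.02)

    def forward(self, ids):
        mask = (ids >= self.start) & (ids < self.end)          # tokens in this shard
        local = (ids - self.start).clamp(0, self.weight.size(0) - 1)
        out = self.weight[local] * mask.unsqueeze(-1)           # zeros elsewhere
        # distributed version: torch.distributed.all_reduce(out)
        return out
```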
xyou@XYOU·
@BramVanroy @ph_singer There is a high correlation between the weights of Mistral and Mixtral. So this seems pretty likely.
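A quick way to check a claim like this yourself, sketched under the assumption that both checkpoints are loaded and share layer names; the layer name in the comment is illustrative.

```python
import torch

def weight_corr(a: torch.Tensor, b: torch.Tensor) -> float:
    """Pearson correlation between two flattened weight tensors."""
    stacked = torch.stack([a.flatten().float(), b.flatten().float()])
    return torch.corrcoef(stacked)[0, 1].item()

# hypothetical usage with two loaded checkpoints:
# name = "model.layers.0.self_attn.q_proj.weight"
# print(weight_corr(mistral.state_dict()[name], mixtral.state_dict()[name]))
# a value near 1.0 suggests the weights share an initialization
```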
xyou@XYOU·
@robertomasymas @burkov Check out "progressive growing". People have already done something similar with BERT models.
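For the mechanics: depth up-scaling builds a deeper model from two overlapping copies of a pretrained layer stack and then continues pre-training. A toy sketch of that construction; the numbers match SOLAR's reported 32-to-48-layer setup, but the function itself is illustrative, not SOLAR's code.

```python
import copy
import torch.nn as nn

def depth_upscale(layers: nn.ModuleList, trim: int = 8) -> nn.ModuleList:
    """Concatenate a bottom copy (all but the top `trim` layers) with a
    top copy (all but the bottom `trim` layers): 32 layers with trim=8
    yields 48. Continued pre-training then heals the seam."""
    bottom = [copy.deepcopy(l) for l in list(layers)[: len(layers) - trim]]
    top = [copy.deepcopy(l) for l in list(layers)[trim:]]
    return nn.ModuleList(bottom + top)
```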
Roberto Tomás C 🍉@robertomasymas·
@burkov “SOLAR-10.7B incorporates the innovative Upstage Depth Up-Scaling. We then integrated Mistral 7B weights into the upscaled layers, and finally, continued pre-training for the entire model.” Honest question: how do you start with pretrained weights from a model of a different size?