xyou (@XYOU)
479 posts · Berlin, Germany · Joined June 2009 · 1.5K Following · 755 Followers
xyou@XYOU·
@MatthewBerman Sure about this? Given the current reproducibility crisis in ML research, I doubt that humans would achieve a much higher replication score.
Matthew Berman@MatthewBerman·
Which model won? Turns out Claude 3.5 Sonnet leads the pack, achieving a ~21% replication score on PaperBench! This is impressive, but it shows there's still a gap compared to human PhD-level experts.
Matthew Berman@MatthewBerman·
.@OpenAI dropped a new research paper showing AI agents are now capable of replicating cutting-edge AI research papers from scratch. This is one step closer to the Intelligence Explosion: AI that can discover new science and improve itself. Here’s what they learned: 🧵
xyou@XYOU·
4/ In academia, the work is very different. PhD students or even undergraduates do most of the actual research work. But as a PhD student, you need to decide whether to prioritize the project work over your own PhD work (papers and thesis).
xyou@XYOU·
3/ LLMs and other foundation models are no longer research artifacts but products. Frontier models are developed by dedicated teams of 100+ people specialized across the whole stack, from low-level hardware optimization and data to ML and UX.
Yifei Hu@hu_yifei·
I am currently working on an end-to-end OCR pipeline for research papers. Open Research Assistant needs a high-quality OCR pipeline to work properly, so I really have to solve the OCR problem before making more progress on the OpenRA project. Good news: paper OCR will be solved soon.
xyou@XYOU·
@gui_penedo @pjox13 That’s even better. I will share the data with you as soon as it’s ready!
Guilherme Penedo@gui_penedo·
@XYOU @pjox13 We're happy to run a training under the same conditions. You can find details on the model setup (we haven't posted the exact training script yet) and the exact eval code in our blog post.
Guilherme Penedo@gui_penedo·
We keep getting new pretraining datasets 🔥 Congratulations to the Matrix team for such a strong dataset!
xyou@XYOU·
@gui_penedo @pjox13 We will release a filtered version of Colossal OSCAR soon. Are your training and evaluation scripts available somewhere? I would love to do the comparison with that version.
xyou@XYOU·
@mark_cummins For Germany, we have ~50B tokens of court decisions, but those are only the publicly available ones and represent ~1% of all court decisions. However, you won't need all of them for LLM training due to the high duplicate ratio. @mlissner might have the US numbers.
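The "high duplicate ratio" point is why corpora like this get near-deduplicated before training. Below is a minimal, dependency-free sketch of the common MinHash approach; the example documents and the shingle/permutation sizes are illustrative assumptions, not from the thread.

```python
import hashlib
import re

def shingles(text, n=5):
    """Word 5-grams, a common unit for near-duplicate detection."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash(shingle_set, num_perm=64):
    """Signature = per-seed minimum of a salted hash over the shingles."""
    return [
        min(int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set)
        for seed in range(num_perm)
    ]

def est_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the court finds that the defendant is liable for the damages claimed"
doc_b = "the court finds that the defendant is liable for the damages alleged"
sim = est_jaccard(minhash(shingles(doc_a)), minhash(shingles(doc_b)))
print(f"estimated Jaccard similarity: {sim:.2f}")  # drop one doc above a chosen threshold
```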
Mark Cummins@mark_cummins·
@XYOU One other thing I forgot to include was court documents. Seems like you might know about that. Do you have any data on how many publicly accessible court documents exist?
xyou@XYOU·
@gui_penedo Awesome work. Will the remaining models also be released? And from your experience, what model and data size do you need to see a significant difference in performance?
Guilherme Penedo@gui_penedo·
We have just released 🍷 FineWeb: 15 trillion tokens of high quality web data. We filtered and deduplicated all CommonCrawl between 2013 and 2024. Models trained on FineWeb outperform RefinedWeb, C4, DolmaV1.6, The Pile and SlimPajama!
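For anyone wanting to poke at the release, the dataset streams directly from the Hugging Face Hub. A small sketch: the "HuggingFaceFW/fineweb" repo and "sample-10BT" config names are taken from the public FineWeb dataset card, so verify them there before relying on this.

```python
from datasets import load_dataset

# Stream a small sample config instead of downloading the full 15T-token dump.
fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                  split="train", streaming=True)

for row in fw.take(3):  # IterableDataset.take yields the first n rows
    print(row["text"][:120])
```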
xyou@XYOU·
@yoavgo "collected" 😎
(((ل()(ل() 'yoav))))👾@yoavgo·
"15T tokens collected from publicly available sources". what does "publicly available source" even mean?
OcciGlot@occiglot·
We have some great new evaluation results to share that were provided by the community. The German Occiglot model is the best in class on ScandEval: scandeval.com/german-nlg/ And our Spanish model achieves SOTA results in lexical word understanding.
Zengyi Qin@qinzytech·
Training LLMs can be much cheaper than previously thought. 0.1 million USD is sufficient for training LLaMA2-level LLMs🤯 While @OpenAI and @Meta use billions of dollars to train theirs, you can also train yours with much less money. Introducing our open-source project JetMoE: research.myshell.ai/jetmoe A thread 🧵
xyou@XYOU·
@BramVanroy @VSC_HPC If your cluster uses Slurm, you can catch the kill signal and save a checkpoint before it hits. See this script for an example; lines 14 and 293-300 do the magic: gist.github.com/malteos/71635c…
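The pattern from the gist, as a minimal standalone sketch: ask Slurm to send a warning signal ahead of the time limit (e.g. `#SBATCH --signal=SIGUSR1@90` in the sbatch script), trap it in the training process, and checkpoint at the next safe point. `save_checkpoint` here is a hypothetical stand-in for your own saving logic.

```python
import signal
import sys

stop_requested = False

def _on_sigusr1(signum, frame):
    # Only set a flag; saving mid-step from inside a signal handler is unsafe.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGUSR1, _on_sigusr1)

def save_checkpoint(step):
    print(f"checkpoint saved at step {step}")  # replace with real saving logic

for step in range(1_000_000):
    # ... one training step ...
    if stop_requested:
        save_checkpoint(step)
        sys.exit(0)
```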
xyou@XYOU·
@SebastianB929 OpenGPT-X is an official government-funded research project. Occiglot is a loose group of individuals from different organizations without any formal ties. We call it a research collective; you may also call it simply a Discord server. And yes, the website needs to be improved.
SebastianBoo@SebastianB929·
@XYOU Is Occiglot a research project like OpenGPT-X, or how can I get a better understanding of it? The project page is a bit confusing :)
xyou@XYOU·
@ZedDou1 @occiglot As mentioned in the readme, we suspect that this is due to the benchmarks being machine translated from English and based on English prompts.
Jordan@ZedDou1·
@occiglot Nice to see multilinguality addressed more and more, great work! I do have a question though: how would you explain the gap in the evals across the 5 languages between your models (base and instruct) and the Mistral models, which are mostly English? 🤔
OcciGlot@occiglot·
Today, we are announcing Occiglot! A large-scale collaborative research collective focusing on open-source European LLMs. We invite anybody working on multilingual datasets, benchmarks, or models to get in touch/join our discord. occiglot.github.io/occiglot/posts…
xyou@XYOU·
@BramVanroy Have you tried tensor parallelism on the embedding layer? If I remember correctly, BLOOM used this for its large vocab. @StasBekman
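For context, "tensor parallelism on the embedding layer" means sharding the embedding table over the vocabulary dimension, as in Megatron-LM and BLOOM. A toy single-process sketch of the idea; a real implementation replaces the commented line with an all-reduce across ranks.

```python
import torch
import torch.nn as nn

class VocabParallelEmbedding(nn.Module):
    """Each rank stores vocab_size / world_size rows of the table.
    Tokens outside a rank's shard contribute zeros; summing the partial
    outputs across ranks recovers the full embedding lookup."""

    def __init__(self, vocab_size, dim, rank, world_size):
        super().__init__()
        shard = vocab_size // world_size
        self.start, self.end = rank * shard, (rank + 1) * shard
        self.weight = nn.Parameter(torch.randn(shard, dim) * 0.02)

    def forward(self, ids):
        mask = (ids >= self.start) & (ids < self.end)          # tokens in this shard
        local = (ids - self.start).clamp(0, self.weight.size(0) - 1)
        out = self.weight[local] * mask.unsqueeze(-1)           # zeros elsewhere
        # distributed version: torch.distributed.all_reduce(out)
        return out
```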
xyou@XYOU·
@BramVanroy @ph_singer There is a high correlation between the weights of Mistral and Mixtral. So this seems pretty likely.
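A quick way to check a claim like this yourself, sketched under the assumption that both checkpoints are loaded and share layer names; the layer name in the comment is illustrative.

```python
import torch

def weight_corr(a: torch.Tensor, b: torch.Tensor) -> float:
    """Pearson correlation between two flattened weight tensors."""
    stacked = torch.stack([a.flatten().float(), b.flatten().float()])
    return torch.corrcoef(stacked)[0, 1].item()

# hypothetical usage with two loaded checkpoints:
# name = "model.layers.0.self_attn.q_proj.weight"
# print(weight_corr(mistral.state_dict()[name], mixtral.state_dict()[name]))
# a value near 1.0 suggests the weights share an initialization
```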
xyou@XYOU·
@robertomasymas @burkov Check out "progressive growing". People have already done something similar with BERT models.
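For the mechanics: depth up-scaling builds a deeper model from two overlapping copies of a pretrained layer stack and then continues pre-training. A toy sketch of that construction; the numbers match SOLAR's reported 32-to-48-layer setup, but the function itself is illustrative, not SOLAR's code.

```python
import copy
import torch.nn as nn

def depth_upscale(layers: nn.ModuleList, trim: int = 8) -> nn.ModuleList:
    """Concatenate a bottom copy (all but the top `trim` layers) with a
    top copy (all but the bottom `trim` layers): 32 layers with trim=8
    yields 48. Continued pre-training then heals the seam."""
    bottom = [copy.deepcopy(l) for l in list(layers)[: len(layers) - trim]]
    top = [copy.deepcopy(l) for l in list(layers)[trim:]]
    return nn.ModuleList(bottom + top)
```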
Roberto Tomás C 🍉@robertomasymas·
@burkov “SOLAR-10.7B incorporates the innovative Upstage Depth Up-Scaling. We then integrated Mistral 7B weights into the upscaled layers, and finally, continued pre-training for the entire model.” Honest question: how do you start with pretrained weights from a model of a different size?