malteos

241 posts

malteos

@XYOU

Berlin, Germany · Joined June 2009

1.5K Following · 752 Followers

malteos@XYOU·7 May

@RishiBommasani @percyliang The analogy for cloud vs local would be restaurant vs takeout. At the restaurant you better behave otherwise you get kicked out. At home you eat your food however you want.

rishi@RishiBommasani·5 May

I like the analogy. Notably, in the restaurant world, only one of these is even afforded the word "open": option 3 is an "open kitchen" restaurant. (I don't think all such restaurants would appreciate the customer shouting at the chef, but let's put that aside.)
Though maybe there is some mismatch in the analogy:
- Option 1 is just "you get the food". The analogue is "you get the model". This probably collapses open weight with everything less open than it, since we don't distinguish weights vs. API in food as far as I can imagine, and certainly there is no local vs. cloud distinction for food.
- Option 2 is "you get the food and recipe". I think this is a bit of a mismatch with open source, since a recipe is transparency (i.e. information about how to build) but not the actual ingredients themselves (whereas you might/do have the dataset in some stronger sense with open source).
But it is worth noting that in both cases you are not given the cooking infrastructure or compute infrastructure to consume the ingredients and produce the food.
One other subtlety is that open kitchen restaurants are not fully open due to constraints: chefs do prep work so that the cook time in front of the diner is of reasonable length (e.g. an omakase restaurant needs to prepare rice in advance). That's fine because the customer doesn't need 100% open and to see every gory detail, but it is not fine for researchers.

1.7K

Percy Liang@percyliang·5 May

I find myself repeatedly explaining the difference between open-weight (DeepSeek), open-source (Olmo), open-development (Marin). Let's see if this restaurant analogy helps:
- Open-weight: food is made behind closed doors, server brings you the dish
- Open-source: food is made behind closed doors, server brings you the dish and the recipe
- Open-development: you see the chef make the dish in the kitchen (and can shout suggestions while it's cooking)!

914

75.9K

malteos@XYOU·4 Apr

@MatthewBerman Sure about this? Given the current reproducibility crisis in ML research, I doubt that humans would achieve a much higher replication score.

Matthew Berman@MatthewBerman·3 Apr

Which model won? Turns out Claude 3.5 Sonnet leads the pack, achieving a ~21% replication score on PaperBench! This is impressive, but, it shows there's still a gap compared to human PhD-level experts.

4.3K

Matthew Berman@MatthewBerman·3 Apr

.@OpenAI dropped a new research paper showing AI agents are now capable of replicating cutting-edge AI research papers from scratch. This is one step closer to the Intelligence Explosion: AI that can discover new science and improve itself. Here’s what they learned: 🧵

149

1.3K

190.2K

malteos@XYOU·9 Feb

4/ In academia, the work is very different. PhD students or even undergraduates are the ones doing most of the actual research work. But as a PhD student, you need to decide whether to prioritize the project work over your own PhD work (papers and thesis).

201

malteos@XYOU·9 Feb

3/ LLMs and other foundation models are no longer research artifacts but products. Frontier models are developed by dedicated teams of 100+ people specialized across the whole stack (from low-level hardware optimization through data to ML and UX topics).

232

malteos@XYOU·14 Jul

@hu_yifei Did you already try Grobid? github.com/kermitt2/grobid

Yifei Hu@hu_yifei·13 Jul

I am currently working on an end-to-end OCR pipeline for research papers. Open Research Assistant needs a high-quality OCR pipeline to work properly, so I really have to solve the OCR problem before making more progress in the OpenRA project. Good news: paper OCR will be solved soon.

121

17.9K

malteos@XYOU·13 Jun

@gui_penedo @pjox13 That’s even better. I will share the data with you as soon as it’s ready!

Guilherme Penedo@gui_penedo·13 Jun

@XYOU @pjox13 We're happy to run a training under the same conditions, but you can find details on the model setup (we haven't posted the exact training script yet) and the exact eval code in our blog post

Guilherme Penedo@gui_penedo·12 Jun

We keep getting new pretraining datasets 🔥 Congratulations to the Matrix team for such a strong dataset!

18.1K

malteos@XYOU·13 Jun

@gui_penedo @pjox13 We will release a filtered version of Colossal OSCAR soon. Are your training and evaluation scripts available somewhere? I would love to do the comparison with that version.

Guilherme Penedo@gui_penedo·13 Jun

@pjox13 @XYOU Interesting! Is the dataset obtained with this config released somewhere?

malteos@XYOU·13 May

@mark_cummins For Germany, we have ~50B tokens of court decisions, but those are only the publicly available ones, which represent ~1% of all court decisions. However, you won't need all of them for LLM training because of the high duplicate ratio. @mlissner might have the US numbers.

Mark Cummins@mark_cummins·12 May

@XYOU One other thing I forgot to include was court documents. Seems like you might know about that. Do you have any data on how many publicly accessible court documents exist?

293

malteos@XYOU·21 Apr

@gui_penedo Awesome work. Will the remaining models also be released? And from your experience what model and data size do you need to see a significant difference in performance?

505

Guilherme Penedo@gui_penedo·21 Apr

We have just released 🍷 FineWeb: 15 trillion tokens of high quality web data. We filtered and deduplicated all CommonCrawl between 2013 and 2024. Models trained on FineWeb outperform RefinedWeb, C4, DolmaV1.6, The Pile and SlimPajama!

323

1.6K

607.6K

malteos@XYOU·18 Apr

@yoavgo "collected" 😎

307

(((ل()(ل() 'yoav))))👾@yoavgo·18 Apr

"15T tokens collected from publicly available sources". what does "publicly available source" even mean?

26.6K

malteos@XYOU·9 Apr

@saattrupdan @SebastianB929 @occiglot Do you have the whole eval setup in containers? If so, I could help with compute.

OcciGlot@occiglot·8 Apr

We have some great new evaluation results to share, provided by the community. The German Occiglot model is best in class on ScandEval. scandeval.com/german-nlg/ And our Spanish model achieves SOTA results in lexical word understanding.

1.9K

malteos@XYOU·9 Apr

@SebastianB929 @occiglot Pinging @saattrupdan who did the evals.

SebastianBoo@SebastianB929·9 Apr

@occiglot Any way to get Mixtral in the benchmark?

malteos@XYOU·5 Apr

@qinzytech @OpenAI @Meta Great work! Will the pretraining code be open source?

847

Zengyi Qin@qinzytech·4 Apr

Training LLMs can be much cheaper than previously thought. 0.1 million USD is sufficient for training LLaMA2-level LLMs🤯 While @OpenAI and @Meta use billions of dollars to train theirs, you can also train yours with much less money. Introducing our open-source project JetMoE: research.myshell.ai/jetmoe A thread 🧵

165

879

246.7K

malteos@XYOU·1 Apr

@BramVanroy @VSC_HPC If your cluster uses Slurm, you can catch the kill signal and save a checkpoint before the job is killed. See this script for an example. Lines 14 and 293-300 do the magic. gist.github.com/malteos/71635c…

102
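
The Slurm tip above (catch the kill signal, then checkpoint) can be sketched in Python roughly as follows. This is a minimal illustration and not the linked gist's method: it assumes the job is configured so that the Python process receives `SIGUSR1` or `SIGTERM` ahead of the time limit (for example via sbatch's `--signal` option), and the names `StopFlag`, `save_checkpoint`, and `train` are hypothetical helpers invented for this sketch.

```python
import json
import os
import signal


class StopFlag:
    """Records that a termination signal arrived; the training loop polls it."""

    def __init__(self, signals=(signal.SIGTERM, signal.SIGUSR1)):
        self.requested = False
        for sig in signals:
            # signal.signal must be called from the main thread.
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler tiny: only set a flag, do the I/O in the loop.
        self.requested = True


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so the checkpoint is never half-written."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def train(num_steps, ckpt_path, flag, on_step=None):
    """Toy loop (assumption: a real one would run optimizer steps here)."""
    state = {"step": 0}
    for step in range(1, num_steps + 1):
        state["step"] = step  # stand-in for one training step
        if on_step is not None:
            on_step(step)
        if flag.requested:
            break  # stop early so the checkpoint is written before the kill
    save_checkpoint(ckpt_path, state)
    return state
```

Setting only a flag in the handler and saving from the loop avoids writing a checkpoint in the middle of a training step; a resumed job would load the saved step and continue from there.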

malteos@XYOU·25 Mar

@SebastianB929 Opengptx is an official government funded research project. Occiglot is a loose group of individuals from different organizations without any formal ties. We call it a research collective. You may also call it simply a discord server. And yes, the website needs to be improved.

SebastianBoo@SebastianB929·25 Mar

@XYOU Is Occiglot a research project like opengptx or how can I get a better understanding of it? The project page is a bit confusing :)

malteos@XYOU·25 Mar

@ZedDou1 @occiglot As mentioned in the readme, we suspect that this is due to the benchmarks being machine translated from English and based on English prompts.

Jordan@ZedDou1·8 Mar

@occiglot Nice to see multilinguality more and more addressed, great work! I do have a question though, how would you explain the gap in the evals in the 5 languages between your models (base and instruct) and the Mistral models which are mostly English? 🤔

OcciGlot@occiglot·7 Mar

Today, we are announcing Occiglot! A large-scale collaborative research collective focusing on open-source European LLMs. We invite anybody working on multilingual datasets, benchmarks, or models to get in touch/join our discord. occiglot.github.io/occiglot/posts…

180

31.8K

malteos@XYOU·24 Mar

@BramVanroy Have you tried tensor parallelism on the embedding layer? If I remember correctly, BLOOM used this with its large vocab. @StasBekman

143
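
To make the suggestion above concrete, here is a toy, single-process sketch of tensor parallelism on an embedding layer, in the vocab-parallel style: each shard owns a contiguous slice of the vocabulary, returns zeros for token ids it does not own, and the partial results are summed. Everything here is an illustrative assumption (plain Python lists instead of GPU tensors, a summing function instead of a real all-reduce, hypothetical function names); it is not BLOOM's actual code.

```python
def shard_vocab(table, num_shards):
    """Split embedding rows into contiguous vocab ranges, one per shard."""
    vocab_size = len(table)
    if vocab_size % num_shards != 0:
        # Assumption for simplicity: the vocab is padded to be divisible.
        raise ValueError("vocab size must be divisible by num_shards")
    per_shard = vocab_size // num_shards
    return [(start, table[start:start + per_shard])
            for start in range(0, vocab_size, per_shard)]


def local_lookup(shard_start, shard_rows, token_ids, dim):
    """Look up only the ids this shard owns; emit zero vectors for the rest."""
    out = []
    for token_id in token_ids:
        local_id = token_id - shard_start
        if 0 <= local_id < len(shard_rows):
            out.append(list(shard_rows[local_id]))
        else:
            out.append([0.0] * dim)
    return out


def all_reduce_sum(partials):
    """Stand-in for an all-reduce: element-wise sum over the shard outputs."""
    num_tokens, dim = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(dim)]
            for i in range(num_tokens)]


def vocab_parallel_embedding(table, token_ids, num_shards):
    """Full lookup assembled from per-shard partial lookups."""
    dim = len(table[0])
    shards = shard_vocab(table, num_shards)
    partials = [local_lookup(start, rows, token_ids, dim)
                for start, rows in shards]
    return all_reduce_sum(partials)
```

Because exactly one shard owns each token id, the sum reproduces the ordinary lookup while each shard stores only a fraction of the embedding table, which is the point when the vocabulary is large.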

malteos@XYOU·17 Jan

@BramVanroy @ph_singer There is a high correlation between the weights of Mistral and Mixtral. So this seems pretty likely.

malteos@XYOU·18 Dec

@robertomasymas @burkov Check out "progressive growing". People did something similar already with BERT models.

Roberto Tomás C 🍉@robertomasymas·17 Dec

@burkov “SOLAR-10.7B incorporates the innovative Upstage Depth Up-Scaling. We then integrated Mistral 7B weights into the upscaled layers, and finally, continued pre-training for the entire model.” Honest question: how do you start with pretrained weights from a model of different size?

2.8K
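
One way to picture an answer to the question above is layer duplication: build the deeper model by stacking two overlapping copies of the smaller model's layers and then continue pre-training. The sketch below only does the index bookkeeping. The numbers (a 32-layer base, 8 layers dropped from each copy, 48 layers in the result) are assumptions brought in for illustration and are not stated in this thread, and `depth_up_scale` is a hypothetical helper, not Upstage's code.

```python
def depth_up_scale(layers, drop):
    """Stack two copies of a layer list, dropping `drop` layers at the seam.

    The first copy loses its last `drop` layers, the second copy loses its
    first `drop` layers, and the two are concatenated.
    """
    if not 0 <= drop < len(layers):
        raise ValueError("drop must be in [0, len(layers))")
    first_copy = layers[:len(layers) - drop]
    second_copy = layers[drop:]
    return first_copy + second_copy


# Layers are represented by names here; with real weights each reused layer
# would need its own copy so the two halves can diverge during training.
base_layers = [f"layer_{i}" for i in range(32)]
scaled_layers = depth_up_scale(base_layers, drop=8)
```

Every layer in the deeper model starts from pretrained weights, so only the seam between the two copies is unfamiliar to the network, which is what the continued pre-training is meant to repair.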

BURKOV@burkov·17 Dec

SOLAR: an 11B model that beats every open model, including Mixtral, Yi-34B, Llama 2 70B, and Falcon 180B: huggingface.co/upstage/SOLAR-…

656

621.9K
