Max Idahl

44 posts

@maxidahl

AI Research @ ellamind

Joined January 2017

675 Following · 73 Followers

Alexander Doria@Dorialexander·1d

Actually we have started to use Propella internally to curate Common Corpus subcollections and run comparisons with other pretraining datasets. Really filling a missing piece of training/synthetic infra in Europe. huggingface.co/ellamind/prope…

Alexander Doria@Dorialexander

Great annotation work from @ellamindAI / OpenEuroLLM on French-Science-Commons less than 24 hours after release!

Max Idahl@maxidahl·23h

@Noah64165746 @Dorialexander huggingface.co/datasets/ellam… here is the eval data, enjoy

Noah@Noah64165746·1d

@Dorialexander I'm suspicious of the lack of published annotation data from Gemini 3 Pro. Why do you think they did not publish it?

Max Idahl@maxidahl·2d

Annotations on @huggingface: hf.co/datasets/opene… Could maybe be useful for the exploration tool at french-science-commons.pleias.dev @Dorialexander ?

Max Idahl@maxidahl·2d

Annotations are already available. Looks to be very good data. Now go ahead and curate the best seed docs for synth data.

Alexander Doria@Dorialexander

And new data release: French-Science-Commons, the largest scientific corpus in French in open access including 1.25 million documents/42 million pages re-digitized with VLM (dots ocr).

Max Idahl@maxidahl·4d

@fujikanaeda In case you are interested in speedrunning a German version in collab with @ellamindAI , hit me up. We can take care of the locale work and also got some B200 compute to spare.

Eric W. Tramel@fujikanaeda·4d

Keep an eye out for Nemotron Personas coming to your locale in the future! Try out Personas for diversifying & grounding your synthetic data today in NeMo Data Designer, too :)

Eric W. Tramel@fujikanaeda·4d

Nemotron Personas France combines many of the things that make me so enthusiastic about Nvidia & open-source AI:
- Open Data
- Awesome AI Startup collaboration with @pleiasfr
- NeMo Data Designer
- Nemotron 3 Super NVFP4
- Nvidia GB200 Hardware
Was nice to contribute to this!

Alexander Doria@Dorialexander

Breaking: @pleiasfr and @nvidia release the first open synthetic dataset for personas in Europe: Nemotron-Personas-France. 1M synthetic French persons, with rich imaginary lives grounded on (complex) demographic distribution.

Max Idahl@maxidahl·6d

@julien_c Next up:
- deleting entire rows & cols
- update & delete entries via duckdb wasm using the SQL-console
pretty please?

Julien Chaumond@julien_c·13 Mar

Dataset Editing has landed for Parquet Datasets on the HF Hub ✍️

Max Idahl@maxidahl·12 Mar

@llm_wizard @fujikanaeda would love to have checkpoints as repo revisions, e.g., every 10k training steps or so. What would be perfect is a dense series of checkpoints from early-stage training, so one can compare different data mixes, say when training up to the first 500B tokens

Chris 🇨🇦@llm_wizard·12 Mar

@maxidahl @fujikanaeda huggingface.co/nvidia/NVIDIA-… What do you think we're missing outside of this?

Eric W. Tramel@fujikanaeda·12 Mar

i wonder what Joe Nemotron is up to today

Max Idahl@maxidahl·12 Mar

@vanstriendaniel @DataPolars @huggingface Now you only need a better filter based on Dutch annotations here: huggingface.co/datasets/opene…

Daniel van Strien@vanstriendaniel·11 Mar

74GB of Dutch PDFs, filtered and written back to the Hub - without touching local disk! Hub is your disk! I built a PoC adding sink_parquet for @DataPolars to stream writes to @huggingface's new Storage Buckets via Xet. Constant memory, ~18 min on a 2-vCPU machine.

Max Idahl@maxidahl·21 Feb

Details here: github.com/EleutherAI/lm-…

Max Idahl@maxidahl·21 Feb

Running GPQA with lm-evaluation-harness? 79 examples are affected by regex preprocessing that should not be there. This affects 18% in GPQA and 20% in GPQA-diamond. For some, all answer choices become identical! For others, chemical nomenclature is destroyed.
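
To make the failure mode concrete, here is a minimal sketch of how a text-cleaning regex can collapse distinct answer choices. It assumes the preprocessing strips square-bracketed spans; the post does not state the exact pattern, so `strip_brackets` and the example choices are hypothetical illustrations, not the harness's actual code or real GPQA items.

```python
import re


def strip_brackets(text: str) -> str:
    """Hypothetical cleaning step: drop every [...] span, then collapse double spaces."""
    text = re.sub(r"\[.*?\]", "", text.strip())
    return text.replace("  ", " ")


# Illustrative answer choices (not taken from GPQA): the bracketed
# descriptor is the only thing that distinguishes the three names.
choices = [
    "bicyclo[2.2.1]heptane",
    "bicyclo[3.1.1]heptane",
    "bicyclo[4.1.0]heptane",
]

cleaned = [strip_brackets(choice) for choice in choices]

print(cleaned)            # every entry has lost its bracketed descriptor
print(len(set(cleaned)))  # number of distinct choices left after cleaning
```

Under this assumption, bracket notation that carries meaning in chemistry is treated as markup and removed, so the cleaned choices can no longer be told apart.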

Max Idahl@maxidahl·16 Feb

Next up: Produce more annotations, do lots of filtering, and run lots of ablations to identify which properties are most important for model training.

Max Idahl@maxidahl·16 Feb

Cross-Language Content Variation in FineWeb-2
Different language, different data. We profiled FineWeb-2 across 6 languages and found that quality dimensions vary dramatically: 34.6% of German documents are heavy/pure marketing vs. only 20.3% in Spanish. Information density shows similarly large gaps. One-size-fits-all filtering won't work; multilingual curation needs language-specific strategies.

Max Idahl@maxidahl·16 Feb

propella-1 report is now up on arXiv: arxiv.org/abs/2602.12414 Pretraining dataset insights included🧵👇

Max Idahl@maxidahl

Time to propel open LLM training data curation to the next level. Releasing propella-1: small multilingual LLMs that annotate text documents for dataset curation at scale. 🧵👇

Max Idahl@maxidahl·31 Jan

@Dorialexander Doubt

Alexander Doria@Dorialexander·30 Jan

i’m going to turn into gary marcus.

Alexander Doria@Dorialexander·30 Jan

great comms but "first AI-planned drive"???

Anthropic@AnthropicAI

On December 8, the Perseverance rover safely trundled across the surface of Mars. This was the first AI-planned drive on another planet. And it was planned by Claude.

Max Idahl reposted

Alexander Doria@Dorialexander·28 Jan

Forget about mid/post-training, this is the year of synthetic pretraining.

Max Idahl@maxidahl·24 Jan

The ease of filtering pretraining datasets with propella-annotations and @duckdb. Here is the quality filter opus came up with:
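
The filter attached to the post is not preserved in this capture. As a rough idea of what annotation-based filtering looks like, the sketch below applies a SQL predicate over per-document annotations. Everything in it is an assumption: the column names, label values, and thresholds are hypothetical and not the real propella schema, and it runs on Python's built-in `sqlite3` rather than DuckDB so that it has no dependencies. With DuckDB, a similar `WHERE` clause would typically be pointed at the annotation Parquet files.

```python
import sqlite3

# Hypothetical annotation rows:
# (doc_id, information_density, commercial_bias, contains_pii)
ROWS = [
    ("doc-1", "high", "none", 0),
    ("doc-2", "low", "heavy", 0),
    ("doc-3", "high", "none", 1),
    ("doc-4", "medium", "none", 0),
]

# Illustrative quality filter: keep dense, non-commercial, PII-free documents.
QUALITY_FILTER = """
    SELECT doc_id
    FROM annotations
    WHERE information_density IN ('high', 'medium')
      AND commercial_bias = 'none'
      AND contains_pii = 0
    ORDER BY doc_id
"""


def select_documents(rows):
    """Load annotation rows into an in-memory table and apply the filter."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE annotations ("
            "doc_id TEXT, information_density TEXT, "
            "commercial_bias TEXT, contains_pii INTEGER)"
        )
        conn.executemany("INSERT INTO annotations VALUES (?, ?, ?, ?)", rows)
        return [row[0] for row in conn.execute(QUALITY_FILTER)]
    finally:
        conn.close()


kept = select_documents(ROWS)
print(kept)
```

Because the annotations are plain columns, the curation rule stays a short, reviewable query that can be changed without re-annotating any documents.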

Max Idahl@maxidahl·19 Jan

propella-annotations for pleias/SYNTH now available: hf.co/datasets/opene…
TLDR: It's a great dataset.
- mostly analytical + explanatory content (as expected)
- high information density
- mostly evergreen information
- zero commercial bias
- zero PII
@Dorialexander @pleiasfr
