Max Idahl

44 posts

@maxidahl

AI Research @ ellamind

Joined January 2017
675 Following · 73 Followers
Noah @Noah64165746
@Dorialexander I'm suspicious of the lack of published annotation data from Gemini 3 Pro. Why do you think they did not publish it?
Max Idahl @maxidahl
@fujikanaeda In case you are interested in speedrunning a German version in collab with @ellamindAI, hit me up. We can take care of the locale work and also have some B200 compute to spare.
Eric W. Tramel @fujikanaeda
Keep an eye out for Nemotron Personas coming to your locale in the future! Try out Personas for diversifying & grounding your synthetic data today in NeMo Data Designer, too :)
Eric W. Tramel @fujikanaeda
Nemotron Personas France combines many of the things that make me so enthusiastic about Nvidia & open-source AI:
- Open Data
- Awesome AI Startup collaboration with @pleiasfr
- NeMo Data Designer
- Nemotron 3 Super NVFP4
- Nvidia GB200 Hardware
Was nice to contribute to this!
Alexander Doria @Dorialexander

Breaking: @pleiasfr and @nvidia release the first open synthetic dataset for personas in Europe: Nemotron-Personas-France. 1M synthetic French persons, with rich imaginary lives grounded on (complex) demographic distribution.

Max Idahl @maxidahl
@julien_c Next up:
- deleting entire rows & cols
- update & delete entries via duckdb wasm using the SQL-console
pretty please?
Julien Chaumond @julien_c
Dataset Editing has landed for Parquet Datasets on the HF Hub ✍️
Max Idahl @maxidahl
@llm_wizard @fujikanaeda would love to have checkpoints as repo revisions, e.g., every 10k training steps or so. What would be perfect is a dense series of checkpoints from early-stage training, so one can compare different data mixes, say when training up to the first 500B tokens.
Eric W. Tramel @fujikanaeda
i wonder what Joe Nemotron is up to today
Daniel van Strien @vanstriendaniel
74GB of Dutch PDFs, filtered and written back to the Hub, without touching local disk! Hub is your disk! I built a PoC adding sink_parquet for @DataPolars to stream writes to @huggingface's new Storage Buckets via Xet. Constant memory, ~18 min on a 2-vCPU machine.
[Image]
Max Idahl @maxidahl
Running GPQA with lm-evaluation-harness? 79 examples are affected by regex preprocessing that should not be there. This affects 18% of GPQA and 20% of GPQA-diamond. For some examples, all answer choices become identical! For others, chemical nomenclature is destroyed.
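To illustrate the failure mode described above, here is a minimal sketch. The bracket-stripping pattern, the `preprocess` helper, and the answer choices are all assumptions made up for illustration; they are not taken from the harness source. The sketch only shows how a regex of this kind could make chemically distinct answers collapse into one string.

```python
import re

# Assumed pattern (illustrative only): remove anything inside square brackets.
BRACKETED = re.compile(r"\[.*?\]")


def preprocess(text: str) -> str:
    """Hypothetical cleanup step: drop bracketed spans, collapse double spaces."""
    text = BRACKETED.sub("", text.strip())
    return text.replace("  ", " ")


# Hypothetical answer choices that differ only inside the brackets.
choices = [
    "bicyclo[2.2.1]heptane",
    "bicyclo[3.1.1]heptane",
    "bicyclo[4.1.0]heptane",
    "bicyclo[3.2.0]heptane",
]

cleaned = [preprocess(c) for c in choices]
print(cleaned)
print("distinct choices after preprocessing:", len(set(cleaned)))
```

Under these assumptions every choice is reduced to the same string, so a model can no longer tell the options apart.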
Max Idahl @maxidahl
Next up: Produce more annotations, do lots of filtering, and run lots of ablations to identify which properties are most important for model training.
Max Idahl @maxidahl
Cross-Language Content Variation in FineWeb-2
Different language, different data. We profiled FineWeb-2 across 6 languages and found that quality dimensions vary dramatically: 34.6% of German documents are heavy/pure marketing vs. only 20.3% in Spanish. Information density shows similarly large gaps. One-size-fits-all filtering won't work; multilingual curation needs language-specific strategies.
[Image]
Alexander Doria @Dorialexander
i’m going to turn into gary marcus.
Max Idahl retweeted
Alexander Doria @Dorialexander
Forget about mid/post-training, this is the year of synthetic pretraining.
[Image]
Max Idahl @maxidahl
The ease of filtering pretraining datasets with propella-annotations and @duckdb. Here is the quality filter opus came up with:
[Image]
Max Idahl @maxidahl
propella-annotations for pleias/SYNTH now available: hf.co/datasets/opene…
TLDR: It's a great dataset.
- mostly analytical + explanatory content (as expected)
- high information density
- mostly evergreen information
- zero commercial bias
- zero PII
@Dorialexander @pleiasfr