Alexander Doria

45.1K posts

@Dorialexander

building open ai infrastructure @pleiasfr

Joined April 2011
4K Following · 22.4K Followers
Pinned Tweet
Alexander Doria @Dorialexander
Breaking: @pleiasfr and @nvidia release the first open synthetic dataset for personas in Europe: Nemotron-Personas-France. 1M synthetic French persons, with rich imaginary lives grounded in (complex) demographic distributions.
English
33
90
714
76.8K
Alexander Doria @Dorialexander
The project was led in cooperation with the EU infrastructure OPERAS and the Chaire du Québec sur la découvrabilité des contenus scientifiques, with support from @MinistereCC.
Français
1
0
4
537
Alexander Doria @Dorialexander
And a new data release: French-Science-Commons, the largest open-access scientific corpus in French, including 1.25 million documents (42 million pages) re-digitized with a VLM (dots ocr).
English
4
18
123
6.8K
Alexander Doria @Dorialexander
@omooretweets @a16z @Kantrowitz I’m not sure you’re really addressing the issue of high SaaS margins evaporating in the context of transient/last-mile use cases. That seems hardly scalable and under constant pressure from the next synth environment.
English
0
0
2
283
Olivia Moore @omooretweets
How can AI application startups compete with the big labs and incumbents? I shared some of our thoughts on this @a16z with @Kantrowitz 👇
English
26
15
162
21.7K
Alexander Doria @Dorialexander
Meanwhile multiple people in the field do seem to be prompting/tuning decoders all day and, yeah, it must feel narrow.
English
0
0
9
688
Alexander Doria @Dorialexander
While attending an NLP conference today, I realized you’re actually much less *into* LLMs while building LLMs. It’s regex in tokenizer, encoders to filter data at scale, knowledge graphs for synth pipelines, rule-based systems to backtranslate… There’s a whole village.
English
5
0
73
3.2K
Alexander Doria @Dorialexander
@leavittron I really have no idea (maybe except for actual ablations?), and I'm scouting every single specialized benchmark I can find to monitor the next SYNTH iteration. Everything, everywhere, all at once.
English
0
0
1
59
Matthew Leavitt @leavittron
@Dorialexander Strong agree. I'm curious whether there's ever a scenario in which you DON'T want to put end task-relevant data earlier into training (assuming you have unique tokens constituting at least ~0.5% of your pretraining budget)
English
1
0
2
65
umumu @umi33563
@Dorialexander hmm, would there be a point (and a market) to train specialized enterprise SLMs from scratch then?
English
3
0
3
87
Alexander Doria @Dorialexander
@IanBaer Oh yes totally. I like to complain all the time, can’t deny.
English
1
0
1
404
Ian @IanBaer
@Dorialexander You are angry with Claude because you are angry with yourself. Free yourself from this, brother.
English
1
0
2
440
Alexander Doria @Dorialexander
I feel it's very healthy to be regularly annoyed/angry at Claude. Otherwise you end up like Garry Tan.
English
16
16
534
15.1K