Javier de la Rosa @[email protected]

16.4K posts

Research Scientist (NLP) at @Nasjonalbibl AI-Lab. Formerly @UNED, @stanfordCIDR, @CulturePlex. «no remarkable incidents»

Madrid, Spain · Joined April 2007
946 Following · 1K Followers
Javier de la Rosa @[email protected] retweeted
Siva Reddy @sivareddyg
McGill University (@mcgillu) has many open faculty and postdoctoral positions with generous funding packages, thanks to Impact+ grants, which are investing $2 billion to attract global talent to Canada 🇨🇦🇨🇦🇨🇦.
Associate/Full Professor: $8 million startup package
Assistant Professor: $600K startup package
Postdoc: $70K (starting salary)
If you are interested and work in the space of AI/ML/NLP/LLMs, please reach out to me. #AI #NLProc #ML
45 replies · 297 retweets · 1.4K likes · 194.9K views
Javier de la Rosa @[email protected] retweeted
Mistral AI @MistralAI
Full stack devs, SWEs, MLEs, forward deployed engineers, research engineers, applied scientists: we are hiring! Join us and tackle cutting-edge challenges including physical AI, time series, material sciences, cybersecurity and many more. Positions available in Paris, London, Singapore, Amsterdam, NYC, SF, or remote. jobs.lever.co/mistral
89 replies · 100 retweets · 1.2K likes · 154.4K views
Daniel van Strien @vanstriendaniel
DeepSeek-OCR just got @vllm_project support 🚀
Currently processing @natlibscot's 27,915-page handbook collection with one command:
Processing at ~350 images/sec on A100
Using @huggingface Jobs + @astral_sh uv - zero-setup batch OCR!
Will share final time + cost when done!
16 replies · 41 retweets · 443 likes · 58.1K views
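Under the hood, a batch OCR pass like this over vLLM's offline API boils down to a few lines. A minimal sketch, not van Strien's actual script: the prompt string is an assumption to check against the DeepSeek-OCR model card, and the pages/*.jpg paths are hypothetical.

```python
# Hedged sketch: offline batch OCR with vLLM. The model ID is the public
# DeepSeek-OCR release; the prompt format and paths are assumptions.
from pathlib import Path

from PIL import Image
from vllm import LLM, SamplingParams

image_paths = sorted(Path("pages").glob("*.jpg"))  # hypothetical page scans

llm = LLM(model="deepseek-ai/DeepSeek-OCR", trust_remote_code=True)
params = SamplingParams(temperature=0.0, max_tokens=4096)

# vLLM batches and schedules these requests itself, which is where
# throughput numbers like ~350 images/sec on an A100 would come from.
requests = [
    {"prompt": "<image>\nFree OCR.",  # assumed prompt template
     "multi_modal_data": {"image": Image.open(p)}}
    for p in image_paths
]

for path, out in zip(image_paths, llm.generate(requests, params)):
    path.with_suffix(".txt").write_text(out.outputs[0].text)
```

The zero-setup part would come from running this as a uv script on Hugging Face Jobs, so dependencies resolve at launch rather than in a prebuilt image.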
Daniel van Strien @vanstriendaniel
465 people. 122 languages. 58,185 annotations! FineWeb-C v1 is complete! Communities worldwide have built their own educational quality datasets, proving that we don't need to wait for big tech to support languages. Huge thanks to all who contributed! huggingface.co/blog/davanstri…
4 replies · 28 retweets · 107 likes · 20.1K views
Javier de la Rosa @[email protected] retweeted
NLP_SINAI @NLP_SINAI
Would you like to join the team that will build the next LLMs in Spanish? We are looking for computer engineers eager to get involved in an exciting, transformative project. More info here: linkedin.com/posts/nlp-sina…
0 replies · 9 retweets · 6 likes · 574 views
Lucas Beyer (bl16) @giffmana
"wow 0.06% per book, so with just 1667 books we should get 100%!" You're either: (a) poor at stats (b) never ran experiments (c) intentionally obtuse/just memeing. I'll give you the benefit of the doubt and assume it's (c). Think about it: what experiment needs to be conducted to come to such number? You need to train the same model twice, with only a single book removed as difference. But a single pair of runs doesn't mean much. Do the same pair of runs with a different init seed, or different data ordering seed, or different dataset mix, or... and you will most likely get a difference >0.06% for each run. Just look at these two figures below from "ResNet Strikes Back" showing 100 identical ResNet ImageNet trainings only changing seeds. 0.5% score range in one metric and 1.0% score range in another metric. You would need hundreds of runs with and hundreds of runs without one book to be able to reliably measure that book's impact (below the base "noise" level) while removing the other sources of variation in results. That would be very interesting, but also crazy expensive. And the result would differ per book. And differ per model scale. And differ per training duration. And differ per data mixture. And differ per eval looked at. So even ONE such (crazy expensive) experiment wouldn't mean much in general. So what they are saying is, a single book's influence is below the noise level. But again, even this would depend a lot on the setting. If the eval was "how good is model at niche topic X" and there's only two existing write-ups of topic X one of which being the book, the impact would probably be more than 0.06%. Btw, this is mostly a comment on people's reaction to their statement, not on their statement itself.
[media: two figures from "ResNet Strikes Back" showing the score spread across 100 seed-only reruns]
Andrew Curran @AndrewCurran_

Interesting legal argument from META; the use of a single book for pretraining boosts model performance by 'less than 0.06%.' Therefore, taken individually, a work has no economic value as training data.

18 replies · 14 retweets · 301 likes · 47.4K views
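The "hundreds of runs" claim checks out with a textbook two-sample power calculation. A minimal sketch, assuming seed-to-seed variation is roughly Gaussian and that the quoted 0.5%/1.0% ranges over 100 reruns correspond to standard deviations of about 0.1/0.2 accuracy points (an assumption; only the ranges are given):

```python
# How many training runs per group (with vs. without one book) are
# needed to detect a 0.06-point effect under seed-level noise?
from math import ceil
from statistics import NormalDist

def runs_per_group(sigma, delta, alpha=0.05, power=0.8):
    """Standard two-sample z-test sample size for detecting a mean
    difference `delta` when each run's score has noise std `sigma`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96
    z_beta = NormalDist().inv_cdf(power)           # ~0.84
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

delta = 0.06  # claimed per-book effect, in accuracy points
for sigma in (0.1, 0.2):  # assumed noise std per metric (see above)
    n = runs_per_group(sigma, delta)
    print(f"sigma={sigma} pts -> {n} runs per group ({2 * n} total)")
# sigma=0.1 pts -> 44 runs per group (88 total)
# sigma=0.2 pts -> 175 runs per group (350 total)
```

So even under the milder noise estimate the experiment needs on the order of a hundred runs, and several hundred under the noisier metric, before a per-book effect separates from seed noise.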
Michael Hu @michahu8
@versae that's awesome, love that. please reach out here or @ my nyu email with what you find!!
1 reply · 0 retweets · 1 like · 106 views
Michael Hu @michahu8
Training on a little 🤏 formal language BEFORE natural language can make pretraining more efficient! How and why does this work? The answer lies… Between Circuits and Chomsky. 🧵1/6👇
23 replies · 125 retweets · 929 likes · 132.6K views
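The mechanism being teased is a data-ordering curriculum: a short formal-language warm-up phase before the natural-language corpus. A minimal sketch of what such a schedule could look like; the 5% fraction, the Dyck-string suggestion, and the names are illustrative assumptions, not the paper's actual recipe:

```python
# Hedged sketch of a two-phase pretraining curriculum: a small amount
# of formal-language data first, then natural language for the rest.
from itertools import islice

def curriculum(formal_docs, natural_docs, total_docs, formal_fraction=0.05):
    """Yield documents in curriculum order: formal-language warm-up,
    then natural-language text. Only the ordering matters here."""
    n_formal = int(total_docs * formal_fraction)
    yield from islice(formal_docs, n_formal)
    yield from islice(natural_docs, total_docs - n_formal)

# Usage: feed the stream to an ordinary LM pretraining loop.
# formal_docs could be, e.g., synthetic Dyck-language (balanced-bracket)
# strings; natural_docs a web-text corpus.
```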
Javier de la Rosa @[email protected]
@michahu8 Awesome! I didn't get to the appendices yet 😅 Thanks for pointing that out. I'll be testing it on languages other than English, including extremely low-resource ones.
1 reply · 0 retweets · 1 like · 19 views
Javier de la Rosa @[email protected] retweeted
Hanna Hajishirzi @HannaHajishirzi
Excited to drive innovation and push the boundaries of open, scientific AI research & development! 🚀 Join us at @allen_ai to shape the future of OLMo, Molmo, Tulu, and more. We're hiring at all levels—apply now! 👇 #AI #Hiring
Research Engineer: job-boards.greenhouse.io/thealleninstit…
Research Scientist: job-boards.greenhouse.io/thealleninstit…
Young Investigator: job-boards.greenhouse.io/thealleninstit…
2 replies · 16 retweets · 61 likes · 57.2K views
Javier de la Rosa @[email protected] retweeted
Manu Romero @mrm8488
🚀 We're Hiring Applied AI Engineers! 🚀 Do you write clean, efficient Python? Are you familiar with AI frameworks? Do you thrive in a collaborative team? If that sounds like you, DM me now! Let's build the future of AI together. 💡🤖
0 replies · 6 retweets · 30 likes · 1.8K views
Javier de la Rosa @[email protected]
@maballesterosv In general, yes. But it depends on the specific capability expected of an LLM nowadays. Most of these models are base models (pre-training only), with no ability to follow instructions or hold a dialogue (post-training).
0 replies · 0 retweets · 1 like · 45 views
Mike Ballesteros @maballesterosv
@versae Really interesting work, Javier. Congratulations. If I've understood correctly, it's the quality of the "edited" material (linguistic richness, coherence, and rigor) that works the magic, right?
1 reply · 0 retweets · 1 like · 31 views