ParaCrawl

52 posts

@ParaCrawl

Joined December 2018
5 Following, 234 Followers
ParaCrawl (@ParaCrawl)
Hi there, three new bonus ParaCrawl language pairs have just been released: English-Azerbaijani, English-Tajik, and English-Armenian. Go to the ParaCrawl website, scroll down to Bonus Languages (Low-Resource), and download your preferred version: paracrawl.eu
ParaCrawl retweeted
HPLT (@hplt_eu)
HPLT News and Tools!!! If you are interested in filtering your datasets for quality and using them to train MT and LLMs, this thread is for you 👇
ParaCrawl retweeted
HPLT (@hplt_eu)
Interested in Open and Community-Driven MT initiatives? CrowdMT is for you! 🎙️Invited speakers from Wikimedia Foundation and Apertium announced. 📜Accepted papers and abstracts announced. Time to register at events.tuni.fi/eamt23/registr… Details: hplt-project.org/events
ParaCrawl (@ParaCrawl)
And Icelandic!
ParaCrawl (@ParaCrawl)
New parallel (en-*) and monolingual corpora from #MaCoCu have just been released. Included languages: Albanian, Bosnian, Bulgarian, Croatian, Macedonian, Maltese, Montenegrin, Serbian, Slovene, Turkish.
Quoted post by Taja Kuzman Pungeršek (@TajaKuzman):

We've published new #MaCoCu web corpora for 11 under-resourced languages! 56 million documents, 17 BILLION words (monolingual corpora) and 580 million words (English-X parallel corpora) were just uploaded to the CLARIN.SI repository (clarin.si/repository/xml…) 🥳

ParaCrawl retweeted
Prompsit (@Prompsit)
#MT people: the submission deadline for the CrowdMT workshop, where you can present work on Open Source and Community-Driven MT, has been extended to 21st April 2023! Abstracts and papers wanted! You are also wanted in Tampere, for the whole #EAMT23 conference or at least for this workshop on the 15th of June!
ParaCrawl (@ParaCrawl)
@aihkas @BramVanroy @Nils_Reimers @huggingface @Reverso_ Hi @aihkas, sorry for the late reply. The ParaCrawl website has a "Notice and take down policy" section with a contact e-mail. Anonymized versions of the ParaCrawl corpora (ROAM) were released to avoid these issues. We will make sure that your personal data gets removed, if it is still present.
Nils Reimers (@Nils_Reimers)
@BramVanroy @huggingface @ParaCrawl Not sure about ParaCrawl. More familiar with e.g. WikiMatrix, CCMatrix, NLLG etc. These are all on sentence level. ParaCrawl 2016 is also only available for 5 languages: Not really exciting. Not sure if there are newer paracrawl corpora.
ParaCrawl (@ParaCrawl)
A new ParaCrawl parallel corpus is available! 🌍 languages: Polish-Czech 🎒 size: 24 million sentences 🗒️ license: CC0 🎯 location: paracrawl.eu bonus section 🧐 more info: paracrawl.eu/moredata
ParaCrawl retweeted
Prompsit (@Prompsit)
Indeed, this is the first data release of the #Macocu effort. You will find both monolingual and bilingual (with English) corpora on the ELRC-Share and CLARIN repositories and on the website. Insights coming soon! Most of the code is also ready for you to try out!
Quoted post by Clarin.si (@ClarinSlovenia):

Massive AND high-quality corpora for Bulgarian, Croatian, Slovene, Macedonian, Icelandic, Maltese and Turkish, collected by the #MaCoCu project, are now available in our repository! Check them out and share the word: ➡️macocu.eu ➡️clarin.si/repository/xml…

ParaCrawl (@ParaCrawl)
@vince62s Hi, publications are coming soon, but see the MT results in the attached image (spoiler: all BLEU scores go up in v9). Also, yes, v9 and all the other versions are shuffled.
[Attached image: MT results]
Vincent Nguyen (@vince62s)
@ParaCrawl Great work! Did you publish the MT results based on v9? Also, can you please tell me whether v9 is pre-shuffled or not? Cheers.
ParaCrawl (@ParaCrawl)
Summer was for work! The #ParaCrawl v9 corpora are now done and, once again, bigger than the previous ones!🤩 Extrinsic evaluation through MT is almost finished and, according to both the older BLEU metric and the newer COMET metric, the quality of the MT output improves! 🥳 We will share the corpora and more results soon!🕑
ParaCrawl (@ParaCrawl)
We're back with more language resources: an English-Ukrainian parallel corpus with approx. 13M sentence pairs has been released. More info and downloads: paracrawl.eu/news/item/17-e… Please spread the word and use it!
ParaCrawl (@ParaCrawl)
@jjon1910 Please try again. It was not you, but a typo in a script.😑 Thanks for reporting the issue and for your interest in being the first downloader! 🤩
Josef Jon (@jjon1910)
@ParaCrawl None of the links seem to work for me (NoSuchKey). Is that something on my side, or is the upload not finished yet?
ParaCrawl (@ParaCrawl)
Very clear TODO from #ParaCrawl's last stakeholder board meeting: we need better language identification, especially for closely related languages and for under-resourced ones. Such a basic thing! We are trying to improve current results by mixing fastText and Hunspell, take a look👇