Kushal Tatariya

20 posts

@KushalTatariya

I am a PhD student at KU Leuven in multilingual and low-resource NLP. I love talking languages!

Leuven, Belgium · Joined August 2013
114 Following · 57 Followers

Kushal Tatariya @KushalTatariya
Moreover, we advocate for a shift in perspective from seeking a general definition of data quality towards a more language- and task-specific one. Ultimately, we aim for this study to serve as a guide to using Wikipedia for pretraining in a multilingual setting. 5/5

Kushal Tatariya @KushalTatariya
We evaluate the downstream impact of quality filtering on Wikipedia by training tiny monolingual pretrained models for each Wikipedia. We find that data quality pruning is an effective means of resource-efficient training without hurting performance, especially for low-resource languages (LRLs). 4/5

Kushal Tatariya @KushalTatariya
Our work on quality estimation for non-English Wikipedia articles is finally out in the wild 👀. It spread before we had the chance to publicise it haha, but watch out for our upcoming thread on this next week!
WikiResearch @WikiResearch

"How Good is Your Wikipedia?": a critical analysis of Wikipedia content quality beyond English, revealing widespread issues such as a high percentage of one-line articles and duplicate articles. (Tatariya et al., 2024) arxiv.org/pdf/2411.05527


Kushal Tatariya @KushalTatariya
This was my first venture into language model interpretability, and I've learnt a lot of cool things during this project. I hope everyone finds it an interesting read!

Kushal Tatariya @KushalTatariya
Additionally, we examine variants of PIXEL trained with different text rendering strategies, discovering that introducing certain orthographic constraints at the input level can facilitate earlier learning of surface-level features.

Kushal Tatariya retweeted
LAGoM NLP @lagom_nlp
Leuven goes to Leiden in 10 days! We'll be presenting two posters about data quality of non-English Wikipedias and about typologically informed language sampling 👀 see you there!
Suzan Verberne 🤹‍♀️ @suzan

10 days until #CLIN34! The 34th edition of Computational Linguistics in the Netherlands, held at @UniLeiden this year. We present the program on clin34.leidenuniv.nl. Registration is still possible!


Kushal Tatariya retweeted
Raj Dabre @prajdabre
The camera ready version is now up! arxiv.org/abs/2310.19567 We hope to present this at ACL next year. To summarize our contributions:
1. The first ever benchmark for Creole NLP
2. 8 NLP tasks and 28 Creoles
3. Human generated/checked data
Hopefully this is used as a starting point for future work on Creoles.
Raj Dabre @prajdabre

CreoleVal has been accepted to TACL! Hearty congratulations to all authors, especially @heather_nlp who grinded hard to get this work complete! Updated manuscript to go up soon. This marks my 4th paper on Creoles!


Kushal Tatariya @KushalTatariya
Spoiler: We find that PLMs are more influenced by Hindi words when predicting negative emotions, and by English words when predicting positive emotions. Moreover, the PLMs may overgeneralise this learning to examples where it does not apply.

Kushal Tatariya @KushalTatariya
We use LIME and token-level language ID to examine the effect of language on emotion prediction across 3 PLMs finetuned on a Hinglish emotion classification dataset.