Ritesh Sarkhel

141 posts

@sarkhelritesh

I mine multimodal data | PhD | Retweets, Likes, Replies are not endorsements | Opinions are personal

Joined January 2011
343 Following · 44 Followers
Ritesh Sarkhel retweeted
Kyunghyun Cho @kchonyc
hi rufus ...
[3 images]
Indonesia
0
1
9
3.4K
Vivek Gupta @keviv9
I’m beyond excited to share some amazing news: I've accepted an Assistant Professor position at @SCAI_ASU Arizona State University @ASU in Tempe, AZ! 🌵🎓 I'll be starting this thrilling new chapter from Fall 2024. Phoenix @CityofPhoenixAZ, here I come!🚀🌞(Mr ->Dr. ->Prof.) -1/3
English
89
11
437
37.5K
Ritesh Sarkhel @sarkhelritesh
@yunyao_li Happy to help on either task. Please feel free to DM if you're still looking for emergency reviewers.
English
0
0
1
189
Yunyao Li @yunyao_li
Dear all, I need a few emergency reviewers for manuscripts related to (1) handwriting recognition; (2) common NLP tasks (e.g. QA, NER). If you have strong publications in these areas and have bandwidth to review within the next 2-3 weeks, please DM me. Thanks.
English
7
5
22
6.2K
Ritesh Sarkhel retweeted
Mert @mertdumenci
you: i use Claude 3 Opus for coding me: i use the Amazon Shopping app for coding
[image]
English
38
581
5.9K
512.5K
Ritesh Sarkhel retweeted
fly51fly @fly51fly
[CL] Noise-Aware Training of Layout-Aware Language Models
R Sarkhel, X Ren, L B Costa, G Su, V Perot, Y Xie, E Koukoumidis, A Nandi [Google & The Ohio State University] (2024)
arxiv.org/abs/2404.00488
- The paper proposes a Noise-Aware Training (NAT) method to train layout-aware language models for information extraction from visually rich documents in a scalable way.
- NAT utilizes weakly labeled documents supplemented with limited human-labeled documents to train the model, avoiding expensive human annotation effort.
- To prevent performance degradation due to noisy weak labels, NAT estimates the confidence of each training sample and incorporates it as an uncertainty measure during training.
- Experiments show NAT-trained models outperform transfer learning baselines in terms of macro F1 score while requiring significantly less human labeling effort.
- Key aspects of NAT include sample reweighting, weight thresholding, noise-aware loss, and sequential fine-tuning on corpora augmented with weak and synthetic labels.
[4 images]
English
1
5
7
743
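The sample reweighting and weight thresholding named in the NAT summary above can be illustrated with a small sketch. This is not the paper's formulation: the function name `noise_aware_loss`, the `threshold` default, and the choice to use the confidence score directly as the sample weight are all assumptions made for illustration.

```python
import math
from typing import Sequence


def noise_aware_loss(
    probs: Sequence[Sequence[float]],
    labels: Sequence[int],
    confidences: Sequence[float],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> float:
    """Confidence-weighted cross-entropy over weakly labeled samples.

    probs[i] holds the predicted class probabilities for sample i,
    labels[i] its (possibly noisy) label, and confidences[i] an
    estimated confidence in that label, in [0, 1].
    """
    total = 0.0
    weight_sum = 0.0
    for p, y, c in zip(probs, labels, confidences):
        # Weight thresholding: drop samples whose label confidence is too low.
        if c < threshold:
            continue
        # Sample reweighting: scale each sample's loss by its confidence.
        total += -c * math.log(max(p[y], 1e-12))
        weight_sum += c
    # Normalize by the total weight; return 0.0 if every sample was dropped.
    return total / weight_sum if weight_sum > 0 else 0.0


if __name__ == "__main__":
    probs = [[0.9, 0.1], [0.2, 0.8]]
    labels = [0, 0]            # second label disagrees with the prediction
    confidences = [1.0, 0.3]   # and is low-confidence, so it is dropped
    print(noise_aware_loss(probs, labels, confidences))
```

In this toy input only the first sample contributes, so a likely mislabeled sample does not pull the loss upward.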
Ritesh Sarkhel @sarkhelritesh
NAT introduces a systematic way to train layout-aware #LLMs on noisy documents. It reduces labeling cost w.o. drop in perf and works in multi-lingual settings out-of-the-box. Thank you for the shout out @_akhaliq. This was truly a labor of love during my time @GoogleAI
AK @_akhaliq

Noise-Aware Training of Layout-Aware Language Models A visually rich document (VRD) utilizes visual features along with linguistic cues to disseminate information. Training a custom extractor that identifies named entities from a document requires a large number of

English
0
0
0
91
Ritesh Sarkhel @sarkhelritesh
@kdd_news It also gives a peek of 🏋 ShopBench 🏋, a massive #LLM evaluation benchmark curated in-house to mimic the nuances of real-world online shopping complexities.
English
1
0
0
50
Ritesh Sarkhel @sarkhelritesh
📢 We're hosting a @kdd_news Cup competition & giving away cash prizes ✨ The Massively Multi-Task Online Shopping Challenge invites #LLM researchers to try their hands on a set of tasks that has an outsized impact on online shopping experience (1/n) #LLM #GenAI #LLM #AI #Amzn
English
1
0
0
91
Pratyush Maini @pratyushmaini
An exciting data curation paper came out from Google. I had to call @goyalsachin007 because the results challenged my prior beliefs about web scraped data quite dramatically. Read his thread to see what we think is happening. 👀 Into the age of pre-train like you fine-tune (1/n)
Sachin Goyal @ ICLR’26 🇧🇷🏖️ @goyalsachin007

"Reducing LLM training data by 90%? Misleading! 🚫 It's simply aligning pretraining with downstream evaluation tasks or downstream finetuning style. 1. Authors use FLANT5 for curation, which is already finetuned on many tasks used for downstream evaluation in this work. (1/n)

English
2
8
53
21.5K
Ritesh Sarkhel retweeted
Ash Jogalekar @curiouswavefn
1/n: There are some academic papers that are so brilliantly and so accessibly written and so universal in scope that they transcend disciplines and stand as timeless testaments to both great thinking and great writing. Here's a short personal selection:
English
97
1.3K
8.1K
1.4M
Ritesh Sarkhel retweeted
Rob Donnelly @RobDonnelly47
Friends don't let friends make bad charts! Chenxin Li, pulled together a lot of great advice for data visualization, with clear "do this, not that" examples for each item. Here are a few of my favorites, see the link below for more.
[3 images]
English
27
1.1K
5.2K
573.5K