
Thomas O'Brien
@listenaddress
Making research tools




How exercising with family and friends can boost physical performance as well as improve mental health 🏃 @oxsocsci's Dr @Arran_Davis shares the science behind the benefits of exercising together. 🎬 | @bbcideas








📚 Awesome Information Retrieval 🔍

I’ve compiled a list of some of my favorite IR papers from the past few years. If you’re new to the field and want to understand how Transformer-based retrieval models work before building your RAG application, this should serve as a great starting point.

You'll read about techniques like:
- Late interaction with ColBERT
- Hard negative mining with ANCE
- Knowledge distillation with MarginMSE
- Sparse lexical expansion with SPLADE
- Synthetic query generation with InPars
- Generative information retrieval with DSI
- Masked auto-encoder pre-training with RetroMAE
- Instruction tuning with TART
- Large-scale IR pre-training with E5
- Multi-purpose models with BGE-M3
- LLM adaptation techniques with LLM2Vec

... to name a few. It was a challenge, but I narrowed it down to 16 essential papers, with many more included in the extended version. Without further yapping, here is the list:

- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (Khattab et al., 2020) arxiv.org/abs/2004.12832
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020) arxiv.org/abs/2005.11401
- Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval (Xiong et al., 2020) arxiv.org/abs/2007.00808
- Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation (Hofstätter et al., 2020) arxiv.org/abs/2010.02666
- BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models (Thakur et al., 2021) arxiv.org/abs/2104.08663
- SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking (Formal et al., 2021) arxiv.org/abs/2107.05720
- InPars: Data Augmentation for Information Retrieval using Large Language Models (Bonifacio et al., 2022) arxiv.org/abs/2202.05144
- Transformer Memory as a Differentiable Search Index (Tay et al., 2022) arxiv.org/abs/2202.06991
- RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder (Xiao et al., 2022) arxiv.org/abs/2205.12035
- Promptagator: Few-shot Dense Retrieval From 8 Examples (Dai et al., 2022) arxiv.org/abs/2209.11755
- MTEB: Massive Text Embedding Benchmark (Muennighoff et al., 2022) arxiv.org/abs/2210.07316
- Task-aware Retrieval with Instructions (Asai et al., 2022) arxiv.org/abs/2211.09260
- Text Embeddings by Weakly-Supervised Contrastive Pre-training (Wang et al., 2022) arxiv.org/abs/2212.03533
- BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation (Chen et al., 2024) arxiv.org/abs/2402.03216
- Generative Representational Instruction Tuning (Muennighoff et al., 2024) arxiv.org/abs/2402.09906
- LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders (BehnamGhader et al., 2024) arxiv.org/abs/2404.05961
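If you want a feel for what "late interaction" means before reading the ColBERT paper: instead of comparing one query vector to one document vector, ColBERT keeps a vector per token and scores with MaxSim (each query token takes its best match over document tokens, then sums). Here's a toy NumPy sketch of just that scoring step — the random "token embeddings" and dimensions are made up for illustration; the real model produces them with a fine-tuned BERT encoder:

```python
import numpy as np

def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token embedding,
    take its maximum cosine similarity over all document token embeddings,
    then sum over query tokens (MaxSim)."""
    # Normalize rows so plain dot products become cosine similarities.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sim = q @ d.T                          # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())    # best match per query token, summed

# Toy example: 3 query tokens vs. two 5-token documents, 4-dim embeddings.
rng = np.random.default_rng(0)
query = rng.normal(size=(3, 4))
doc_a = rng.normal(size=(5, 4))
doc_b = rng.normal(size=(5, 4))
print(maxsim_score(query, doc_a))
print(maxsim_score(query, doc_b))
```

The point of this design is that document token embeddings can be precomputed and indexed offline; only the cheap MaxSim runs at query time, which is what makes ColBERT efficient compared to full cross-encoders.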
