Gullal S. Cheema

700 posts

@Gullal7

Research Assistant @l3s_luh. Previously Marie Skłodowska-Curie ESR (PhD) @TIBHannover, Germany. MUWS Workshop: https://t.co/XrHoXlMWsx. Views are personal.

Hanover, Lower Saxony · Joined September 2018
166 Following · 69 Followers
Gullal S. Cheema retweeted
Nathan Lambert (@natolambert)
Some papers I was reading while figuring this out:
Apr 2025 — R2E-Gym (AgentGym) arxiv.org/abs/2504.07164
Apr 2025 — SWE-smith arxiv.org/abs/2504.21798
May 2025 — RandomWorld arxiv.org/abs/2506.11045
May 2025 — Reasoning Gym arxiv.org/abs/2505.24760
Jun 2025 — random-crypto arxiv.org/abs/2506.02048
Jan 2026 — Endless Terminals arxiv.org/abs/2601.16443
Feb 2026 — Agent World Model (AWM) arxiv.org/abs/2602.10090
4 replies · 33 reposts · 263 likes · 13.1K views
Gullal S. Cheema retweeted
Thomas Wolf (@Thom_Wolf)
3 trillion tokens finely distilled from more than a petabyte of PDF files. We've just released FinePDFs, the latest addition to the FineWeb datasets.
[attached image]
19 replies · 80 reposts · 604 likes · 123.7K views
Gullal S. Cheema retweeted
Thomas Wolf (@Thom_Wolf)
This is huge. Continuing our foundational work to enable anyone to train state-of-the-art AI models, we're thrilled to release « FinePDFs »: 3T tokens of textual data that until now were locked away in PDFs, arguably some of the highest-quality publicly available data out there. We gathered FinePDFs to create the largest permissively licensed corpus sourced exclusively from PDFs. Amazingly challenging infra and processing work, h/t to the FineWeb team.
Hynek Kydlíček (@HKydlicek)

We are releasing 📄 FinePDFs: the largest PDF dataset, spanning over half a billion documents!
- Long context: documents are 2x longer than web text
- 3T tokens from high-demand domains like legal and science
- Heavily improves over SoTA when mixed with FW-EDU & DCLM web corpora
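A minimal sketch of how one might peek at a corpus like this with the Hugging Face `datasets` library. The hub ID `HuggingFaceFW/finepdfs`, the default config, and the `text` field name are assumptions for illustration, not details confirmed by the tweets above.

```python
# Minimal sketch: streaming a few documents from a large PDF-derived corpus.
# Assumption: the corpus lives on the Hugging Face Hub under an ID like
# "HuggingFaceFW/finepdfs" and each record carries a "text" field; both are
# illustrative names, not confirmed by the announcement above.
from datasets import load_dataset

ds = load_dataset("HuggingFaceFW/finepdfs", split="train", streaming=True)

for i, doc in enumerate(ds):
    print(doc.get("text", "")[:200])  # peek at the first 200 characters
    if i >= 2:
        break
```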

18 replies · 44 reposts · 488 likes · 70.7K views
Gullal S. Cheema retweeted
Andi Marafioti (@andimarafioti)
Fuck it. Today, we open-source FineVision: the finest curation of datasets for VLMs, over 200 sources!
> 20% improvement across 10 benchmarks
> 17M unique images
> 10B answer tokens
> New capabilities: GUI navigation, pointing, counting
FineVision 10x's open-source VLMs.
[attached image]
23 replies · 114 reposts · 942 likes · 131K views
Gullal S. Cheema retweeted
Femke Plantinga (@femke_plantinga)
Stop optimizing your retrieval. Fix your chunking first. It's not your embedding model, prompt engineering, or vector database. It's your chunking strategy creating invisible walls between your users and the information they need.
Chunking is the pre-processing step of splitting texts into smaller pieces, or "chunks". Each chunk becomes the unit of information that gets vectorized and stored in your vector database.
Here are 6 essential chunking techniques you need to know:
→ Fixed-Size Chunking weaviate.io/learn/knowledg…
→ Recursive Chunking weaviate.io/learn/knowledg…
→ Document-Based Chunking weaviate.io/learn/knowledg…
→ Semantic Chunking weaviate.io/learn/knowledg…
→ LLM-Based Chunking weaviate.io/learn/knowledg…
→ Late Chunking weaviate.io/blog/late-chun…
💡 Pro tips:
• There's no one-size-fits-all chunking strategy.
• Your choice affects both information retrieval and the amount of contextual information provided to your RAG system.
• Start simple with fixed-size chunking, then experiment based on your specific use case.
→ Dealing with technical documentation? Try document-based chunking.
→ Working with conversational data? Semantic chunking might be your best bet.
A brief introduction to chunking: weaviate.io/developers/aca…
[attached image]
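To make the first of those techniques concrete, here is a minimal, self-contained sketch of fixed-size chunking with overlap in Python. The chunk size and overlap values are illustrative defaults, not recommendations from the thread.

```python
# Minimal sketch of fixed-size chunking with overlap: split a text into
# windows of `chunk_size` characters, sliding forward by chunk_size - overlap.
# The sizes below are illustrative, not values recommended in the thread.
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

if __name__ == "__main__":
    sample = "Chunking splits long documents into retrievable units. " * 40
    chunks = fixed_size_chunks(sample)
    print(len(chunks), "chunks; first chunk:", chunks[0][:60], "...")
```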
12 replies · 182 reposts · 999 likes · 64.4K views
Gullal S. Cheema retweeted
Jean de Dieu Nyandwi (@Jeande_d)
Reinforcement Learning of Large Language Models, Spring 2025 (UCLA). A great set of new lectures on reinforcement learning for LLMs. Covers a wide range of topics related to RL x LLMs, such as basics/foundations, test-time compute, RLHF, and RL with verifiable rewards (RLVR).
[attached images]
6 replies · 227 reposts · 1.3K likes · 76K views
Gullal S. Cheema retweeted
Daniel Khashabi 🕊️ (@DanielKhashabi)
What's really going on inside LLMs when they handle non-English queries? @BafnaNiyati's recent work introduces the **translation barrier hypothesis**, a framework for understanding multilingual model behavior. This hypothesis says that:
(1) Multilingual generation internally follows a "task-solving" → "translation" cascade.
(2) Translation failure *despite task-solving success* accounts for a large part of the overall failures. That is, the model often solves the task but fails to articulate the answer.
Highlighting a key result in the figure: when we inspect intermediate layers, we see that models often solve the task in the wrong (off-target) language; that is, they show high off-target accuracy early on. Only in the later layers does the answer get translated into the intended language.
Paper: huggingface.co/papers/2506.22…
[attached image]
Niyati Bafna (@BafnaNiyati)

📢 When LLMs solve tasks with a mid-to-low-resource input/target language, their output quality is poor. We know that. But can we pin down what breaks inside the LLM? We introduce the 💥translation barrier hypothesis💥 for failed multilingual generation. arxiv.org/abs/2506.22724
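A rough sketch of the kind of intermediate-layer inspection described above (a logit-lens-style probe): decode each layer's hidden state at the final position through the output embedding and look at the top token per layer, to see in which language the intermediate "answer" appears. The model ID below and the decision to reuse the unembedding for every layer (skipping the final norm) are assumptions for illustration, not the paper's exact protocol.

```python
# Rough logit-lens sketch: project each layer's hidden state at the last
# position through the output embedding and print the top token per layer.
# Model ID is illustrative; the final layer norm is skipped for brevity,
# so this is an approximation, not the paper's exact methodology.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"  # assumption: any small multilingual causal LM
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

prompt = "Question (answer in Hindi): What is the capital of France? Answer:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

unembed = model.get_output_embeddings()
# hidden_states is a tuple of (num_layers + 1) tensors, each [batch, seq, hidden]
for layer, h in enumerate(out.hidden_states):
    logits = unembed(h[:, -1, :])
    top_id = logits.argmax(dim=-1)[0].item()
    print(f"layer {layer:2d}: {tok.decode([top_id])!r}")
```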

0 replies · 6 reposts · 12 likes · 2K views
Gullal S. Cheema retweeted
himanshu (@himanshustwts)
Went through this, but don't just skim over it. Every question is a good research paper and worth a read.
[attached image]
14 replies · 141 reposts · 2.4K likes · 226.1K views
Gullal S. Cheema retweeted
Unsloth AI (@UnslothAI)
We made a guide on mastering LoRA hyperparameters, so you can learn to fine-tune LLMs correctly! Learn to:
• Train smarter models with fewer hallucinations
• Choose optimal learning rates, epochs, LoRA rank, and alpha
• Avoid overfitting & underfitting
🔗 docs.unsloth.ai/get-started/fi…
[attached image]
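As a concrete reference point, here is a minimal sketch of the knobs the guide covers, expressed as a PEFT `LoraConfig` plus two trainer settings. The particular values (r=16, alpha=16, lr=2e-4, 3 epochs) are common illustrative defaults, not recommendations taken from the Unsloth docs linked above.

```python
# Minimal sketch of the hyperparameters the guide discusses, using the PEFT
# library. Values are common illustrative defaults, not Unsloth's advice.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                      # LoRA rank: capacity of the low-rank update
    lora_alpha=16,             # scaling factor; effective scale is alpha / r
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_knobs = {
    "learning_rate": 2e-4,     # too high risks divergence, too low underfits
    "num_train_epochs": 3,     # more epochs raise the risk of overfitting
}
print(lora_config, training_knobs)
```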
13 replies · 128 reposts · 677 likes · 25.1K views
Gullal S. Cheema retweeted
Aran Komatsuzaki (@arankomatsuzaki)
From Bytes to Ideas: Language Modeling with Autoregressive U-Nets. Presents an autoregressive U-Net that processes raw bytes and learns hierarchical token representations. Matches strong BPE baselines, with deeper hierarchies demonstrating promising scaling trends.
[attached image]
3 replies · 54 reposts · 355 likes · 59.7K views
Gullal S. Cheema retweeted
Sagnik (@saagnikkk)
🚨 Paper Alert: "RL Finetunes Small Subnetworks in Large Language Models". From DeepSeek V3 Base to DeepSeek R1 Zero, a whopping 86% of parameters were NOT updated during RL training 😮😮 And this isn't a one-off. The pattern holds across RL algorithms and models. 🧵 A deep dive
[attached image]
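A minimal sketch of how one could check a claim like this on any pair of checkpoints: compare parameters before and after RL training and report the fraction left numerically unchanged. The tolerance and the file paths are placeholders; this is not the paper's exact methodology.

```python
# Sketch: fraction of parameters left unchanged between two checkpoints,
# e.g. a base model and its RL-finetuned version. The tolerance is an
# arbitrary choice here, not the threshold used in the paper.
import torch

def unchanged_fraction(state_a: dict, state_b: dict, atol: float = 0.0) -> float:
    total, same = 0, 0
    for name, pa in state_a.items():
        pb = state_b.get(name)
        if pb is None or pa.shape != pb.shape:
            continue  # skip parameters that do not line up across checkpoints
        eq = torch.isclose(pa, pb, atol=atol, rtol=0.0)
        total += eq.numel()
        same += eq.sum().item()
    return same / max(total, 1)

# Usage (hypothetical checkpoint paths):
# base = torch.load("base_model.pt")          # state_dict of the base model
# rl   = torch.load("rl_finetuned_model.pt")  # state_dict after RL training
# print(f"{unchanged_fraction(base, rl):.1%} of parameters unchanged")
```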
17 replies · 132 reposts · 907 likes · 192.5K views
Gullal S. Cheema retweeted
Kenneth Stanley (@kenneth0stanley)
Could a major opportunity to improve representation in deep learning be hiding in plain sight? Check out our new position paper: Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis.
The idea stems from a little-known observation about networks trained to output a single image: when they are discovered through an unconventional open-ended search process, their representations are incredibly elegant and exhibit astonishing modular decomposition. In contrast, when SGD (successfully) learns to output the same image, its underlying representation is fractured and entangled - an absolute mess!
This stark difference in the underlying representation of the same "good" output behavior carries deep lessons for deep learning. It shows you cannot judge a book by its cover - an LLM with all the right responses could similarly be a mess under the hood. But also, surprisingly, it shows us that it doesn't have to be this way! Without the unique examples in this paper that were discovered through open-ended search, we might assume neural representation has to be a mess. These results show that is clearly untrue. We can now imagine something better because we can actually see it is possible.
We give several reasons why this matters: generalization, creativity, and learning are all potentially impacted. The paper shows examples to back up these concerns, but in brief, there is a key insight: representation is not only important for what you're able to do now, but for where you can go from there. The ability to imagine something new (and where your next step in weight space can bring you) depends entirely upon how you represent the world. Generalization, creativity, and learning itself depend upon this critical relationship.
Notice the difference in appearance between the images near the skull in weight space shown in the top-left and top-right image strips of the attached graphic. The difference in semantics is stark.
The insight that representation could be better opens up a lot of new paths and opportunities for investigation. It raises new urgency to understand the representation underlying foundation models and LLMs while exposing all kinds of novel avenues for potentially improving them, from making learning processes more open-ended to manipulating architectures and algorithms.
Don't mistake this paper as providing comfort for AI pessimists. By exposing a novel set of stark and explicit differences between conventional learning and something different, it can act as an accelerator of progress rather than a tool of pessimism. At the least, the discussion it provokes should be quite illuminating.
[attached image]
50 replies · 159 reposts · 990 likes · 163.7K views
Gullal S. Cheema retweeted
Kyunghyun Cho (@kchonyc)
it's been more than a decade since KD was proposed, and i've been using it all along .. but why does it work? too many speculations but no simple explanation.
@_sungmin_cha and i decided to see if we can come up with the simplest working description of KD in this work. we ended up with a very simple explanation for any mixture distribution, starting from a mixture of gaussians. the key hypothesis was that using lower-entropy, approximate sampling from a teacher results in a higher-precision but lower-recall student.
since an autoregressive LM is nothing but an infinite cascade of mixture distributions, we confirm this with SmolLM (thanks, @huggingface!)
this is probably not the complete picture of KD, but i can definitely sleep better after writing down and confirming this minimal working explanation.
as an extra takeaway, this implies that our eval tends to be overly precision-focused. we should really think about what we lose in terms of recall, as this directly relates to what we miss out on, and for whom, when we build these large-scale, general-purpose models.
[attached images]
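For readers who have not used it, the standard KD objective under discussion is a KL term between temperature-softened teacher and student distributions. A minimal, generic sketch (not the specific setup of the paper above):

```python
# Minimal sketch of the standard knowledge-distillation loss: KL divergence
# between temperature-softened teacher and student next-token distributions.
# Generic formulation, not the exact setup of the paper discussed above.
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            temperature: float = 2.0) -> torch.Tensor:
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean reduction plus the T^2 factor keeps gradient magnitudes
    # comparable across temperatures
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature ** 2

# Toy usage with [batch, vocab]-shaped logits
student = torch.randn(4, 32000)
teacher = torch.randn(4, 32000)
print(kd_loss(student, teacher).item())
```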
8 replies · 48 reposts · 380 likes · 48.3K views
Gullal S. Cheema retweeted
DailyPapers (@HuggingPapers)
J1 just launched on Hugging Face: a reinforcement learning recipe for training Thinking-LLM-as-a-Judge models. It trains J1-Llama-8B and J1-Llama-70B, which outperform existing models.
[attached image]
3 replies · 14 reposts · 58 likes · 10.9K views
Gullal S. Cheema retweeted
Kevin Patrick Murphy (@sirbayes)
I am pleased to announce a new version of my RL tutorial. Major update to the LLM chapter (e.g. DPO, GRPO, thinking), minor updates to the MARL and MBRL chapters and various sections (e.g. offline RL, DPG, etc.). Enjoy! arxiv.org/abs/2412.05265
[attached image]
22 replies · 435 reposts · 2.4K likes · 122.1K views
Gullal S. Cheema retweeted
Yangyi Chen (@YangyiChen6666)
🐂🍺Introducing our recent preprint: Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training! We present PRIOR, a simple vision-language pre-training algorithm that addresses the challenge of irrelevant textual content in image-caption pairs. PRIOR enhances pre-training by implementing differential weighting in the next token prediction (NTP) loss function, effectively prioritizing image-related tokens during training.
[attached image]
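The core mechanism described above, per-token weighting of the next-token-prediction loss, fits in a few lines. The sketch below simply takes precomputed per-token weights as input (how those weights are derived for image-related tokens is the paper's contribution), so it is an illustrative reduction, not the exact PRIOR algorithm.

```python
# Sketch of a per-token weighted next-token-prediction loss: each target
# token's cross-entropy is scaled by a weight (e.g. higher for image-related
# tokens). The weights are an input here, so this is only an illustrative
# reduction of the PRIOR idea, not the paper's algorithm.
import torch
import torch.nn.functional as F

def weighted_ntp_loss(logits: torch.Tensor,    # [batch, seq, vocab]
                      targets: torch.Tensor,   # [batch, seq]
                      weights: torch.Tensor) -> torch.Tensor:  # [batch, seq]
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    return (per_token * weights).sum() / weights.sum().clamp(min=1e-8)

# Toy usage
logits = torch.randn(2, 5, 100)
targets = torch.randint(0, 100, (2, 5))
weights = torch.rand(2, 5)  # e.g. larger weights for image-related tokens
print(weighted_ntp_loss(logits, targets, weights).item())
```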
3 replies · 28 reposts · 147 likes · 16.5K views
Gullal S. Cheema retweeted
Rohan Paul (@rohanpaul_ai)
Adapting pretrained LLMs for vision tasks often degrades their language abilities or requires full retraining. X-Fusion introduces a dual-tower design: it freezes the LLM weights and adds a trainable vision tower. This enables multimodal tasks while preserving the original language skills.
Methods explored in this paper 🔧:
→ X-Fusion uses separate, trainable vision weights in each layer alongside frozen language layers.
→ It processes vision tokens conditioned on text for generation, and extracts visual features for understanding.
→ Outputs from the text and vision blocks combine selectively. Vision blocks can be initialized from language layers.
→ Training uses an autoregressive language loss (weight 0.2) and an image diffusion loss (weight 1.0).
📌 The frozen LLM core preserves strong language skills, vital for coherent multimodal reasoning.
📌 The decoupled vision tower allows flexible, modality-specific architectural changes without altering the LLM.
📌 A clean image-to-text data strategy directly boosts unified model performance across tasks.
Paper: arxiv.org/abs/2504.20996v1
Paper title: "X-Fusion: Introducing New Modality to Frozen LLMs"
[attached image]
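The training objective described in the summary (autoregressive text loss weighted 0.2 plus diffusion loss weighted 1.0) is a simple weighted sum. A schematic sketch, with the two loss terms assumed to be computed elsewhere:

```python
# Schematic sketch of the combined objective described above: a text (LM)
# loss and a vision (diffusion) loss mixed with the weights reported in the
# summary (0.2 for text, 1.0 for diffusion). The loss terms themselves are
# assumed to be computed elsewhere; this only shows the combination.
import torch

TEXT_LOSS_WEIGHT = 0.2
DIFFUSION_LOSS_WEIGHT = 1.0

def x_fusion_total_loss(lm_loss: torch.Tensor,
                        diffusion_loss: torch.Tensor) -> torch.Tensor:
    return TEXT_LOSS_WEIGHT * lm_loss + DIFFUSION_LOSS_WEIGHT * diffusion_loss

# Toy usage with placeholder scalar losses
print(x_fusion_total_loss(torch.tensor(2.3), torch.tensor(0.8)).item())
```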
7 replies · 33 reposts · 142 likes · 8.1K views