Gullal S. Cheema

700 posts

@Gullal7

Research Assistant @l3s_luh. Previously Marie Skłodowska-Curie ESR (PhD) @TIBHannover, Germany. MUWS Workshop: https://t.co/XrHoXlMWsx. Views are personal.

Hanover, Lower Saxony · Joined September 2018
166 Following · 69 Followers
Gullal S. Cheema retweeted
Nathan Lambert (@natolambert)
Some papers I was reading while figuring this out:
Apr 2025 — R2E-Gym (AgentGym) arxiv.org/abs/2504.07164
Apr 2025 — SWE-smith arxiv.org/abs/2504.21798
May 2025 — RandomWorld arxiv.org/abs/2506.11045
May 2025 — Reasoning Gym arxiv.org/abs/2505.24760
Jun 2025 — random-crypto arxiv.org/abs/2506.02048
Jan 2026 — Endless Terminals arxiv.org/abs/2601.16443
Feb 2026 — Agent World Model (AWM) arxiv.org/abs/2602.10090
4 replies · 33 reposts · 263 likes · 13.1K views
Gullal S. Cheema retweeted
Thomas Wolf (@Thom_Wolf)
3 trillion tokens finely distilled from more than a petabyte of PDF files. We've just released FinePDFs, the latest addition to the FineWeb datasets.
[attached image]
19 replies · 80 reposts · 604 likes · 123.7K views
Gullal S. Cheema retweeted
Thomas Wolf (@Thom_Wolf)
This is huge. Continuing our foundational work to enable anyone to train state-of-the-art AI models, we're thrilled to release « FinePDFs »: 3T tokens of textual data that until now were locked away in PDFs, arguably some of the highest-quality publicly available data out there. We gathered FinePDFs to create the largest permissively licensed corpus sourced exclusively from PDFs. Amazingly challenging infra and processing work, h/t to the FineWeb team.
Hynek Kydlíček (@HKydlicek)

We are releasing 📄 FinePDFs: the largest PDF dataset, spanning over half a billion documents!
- Long context: documents are 2x longer than web text
- 3T tokens from high-demand domains like legal and science
- Heavily improves over SoTA when mixed with FW-EDU & DCLM web corpora
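A minimal sketch of how one might peek at a corpus like this with the Hugging Face `datasets` library. The hub ID `HuggingFaceFW/finepdfs`, the default config, and the `text` field name are assumptions for illustration, not details confirmed by the tweets above.

```python
# Minimal sketch: streaming a few documents from a large PDF-derived corpus.
# Assumption: the corpus lives on the Hugging Face Hub under an ID like
# "HuggingFaceFW/finepdfs" and each record carries a "text" field; both are
# illustrative names, not confirmed by the announcement above.
from datasets import load_dataset

ds = load_dataset("HuggingFaceFW/finepdfs", split="train", streaming=True)

for i, doc in enumerate(ds):
    print(doc.get("text", "")[:200])  # peek at the first 200 characters
    if i >= 2:
        break
```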

18 replies · 44 reposts · 488 likes · 70.7K views
Gullal S. Cheema retweeted
Andi Marafioti (@andimarafioti)
Fuck it. Today, we open-source FineVision: the finest curation of datasets for VLMs, over 200 sources!
> 20% improvement across 10 benchmarks
> 17M unique images
> 10B answer tokens
> New capabilities: GUI navigation, pointing, counting
FineVision 10x's open-source VLMs.
[attached image]
23 replies · 114 reposts · 942 likes · 131K views
Gullal S. Cheema retweeted
Femke Plantinga (@femke_plantinga)
Stop optimizing your retrieval. Fix your chunking first. It's not your embedding model, prompt engineering, or vector database. It's your chunking strategy creating invisible walls between your users and the information they need.
Chunking is the pre-processing step of splitting texts into smaller pieces, or "chunks". Each chunk becomes the unit of information that gets vectorized and stored in your vector database.
Here are 6 essential chunking techniques you need to know:
→ Fixed-Size Chunking weaviate.io/learn/knowledg…
→ Recursive Chunking weaviate.io/learn/knowledg…
→ Document-Based Chunking weaviate.io/learn/knowledg…
→ Semantic Chunking weaviate.io/learn/knowledg…
→ LLM-Based Chunking weaviate.io/learn/knowledg…
→ Late Chunking weaviate.io/blog/late-chun…
💡 Pro tips:
• There's no one-size-fits-all chunking strategy.
• Your choice affects both information retrieval and the amount of contextual information provided to your RAG system.
• Start simple with fixed-size chunking, then experiment based on your specific use case.
→ Dealing with technical documentation? Try document-based chunking.
→ Working with conversational data? Semantic chunking might be your best bet.
A brief introduction to chunking: weaviate.io/developers/aca…
[attached image]
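To make the first of those techniques concrete, here is a minimal, self-contained sketch of fixed-size chunking with overlap in Python. The chunk size and overlap values are illustrative defaults, not recommendations from the thread.

```python
# Minimal sketch of fixed-size chunking with overlap: split a text into
# windows of `chunk_size` characters, sliding forward by chunk_size - overlap.
# The sizes below are illustrative, not values recommended in the thread.
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

if __name__ == "__main__":
    sample = "Chunking splits long documents into retrievable units. " * 40
    chunks = fixed_size_chunks(sample)
    print(len(chunks), "chunks; first chunk:", chunks[0][:60], "...")
```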
12 replies · 182 reposts · 999 likes · 64.4K views
Gullal S. Cheema retweeted
Jean de Dieu Nyandwi (@Jeande_d)
Reinforcement Learning of Large Language Models, Spring 2025 (UCLA). A great set of new lectures on reinforcement learning for LLMs. Covers a wide range of topics related to RL x LLMs, such as basics/foundations, test-time compute, RLHF, and RL with verifiable rewards (RLVR).
[attached images]
6 replies · 227 reposts · 1.3K likes · 76K views
Gullal S. Cheema retweeted
Daniel Khashabi 🕊️ (@DanielKhashabi)
What's really going on inside LLMs when they handle non-English queries? @BafnaNiyati's recent work introduces the **translation barrier hypothesis**, a framework for understanding multilingual model behavior. This hypothesis says that:
(1) Multilingual generation internally follows a "task-solving" → "translation" cascade.
(2) Translation failure *despite task-solving success* accounts for a large part of the overall failures. That is, the model often solves the task but fails to articulate the answer.
Highlighting a key result in the figure: when we inspect intermediate layers, we see that models often solve the task in the wrong (off-target) language; that is, they show high off-target accuracy early on. Only in the later layers does the answer get translated into the intended language.
Paper: huggingface.co/papers/2506.22…
[attached image]
Niyati Bafna (@BafnaNiyati)

📢 When LLMs solve tasks with a mid-to-low-resource input/target language, their output quality is poor. We know that. But can we pin down what breaks inside the LLM? We introduce the 💥translation barrier hypothesis💥 for failed multilingual generation. arxiv.org/abs/2506.22724
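A rough sketch of the kind of intermediate-layer inspection described above (a logit-lens-style probe): decode each layer's hidden state at the final position through the output embedding and look at the top token per layer, to see in which language the intermediate "answer" appears. The model ID below and the decision to reuse the unembedding for every layer (skipping the final norm) are assumptions for illustration, not the paper's exact protocol.

```python
# Rough logit-lens sketch: project each layer's hidden state at the last
# position through the output embedding and print the top token per layer.
# Model ID is illustrative; the final layer norm is skipped for brevity,
# so this is an approximation, not the paper's exact methodology.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"  # assumption: any small multilingual causal LM
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

prompt = "Question (answer in Hindi): What is the capital of France? Answer:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

unembed = model.get_output_embeddings()
# hidden_states is a tuple of (num_layers + 1) tensors, each [batch, seq, hidden]
for layer, h in enumerate(out.hidden_states):
    logits = unembed(h[:, -1, :])
    top_id = logits.argmax(dim=-1)[0].item()
    print(f"layer {layer:2d}: {tok.decode([top_id])!r}")
```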

0 replies · 6 reposts · 12 likes · 2K views
Gullal S. Cheema retweeted
himanshu (@himanshustwts)
Went through this, but don't just skim over it. Every question is a good research paper and worth a read.
[attached image]
14 replies · 141 reposts · 2.4K likes · 226.1K views
Gullal S. Cheema retweeted
Unsloth AI (@UnslothAI)
We made a guide on mastering LoRA hyperparameters, so you can learn to fine-tune LLMs correctly! Learn to:
• Train smarter models with fewer hallucinations
• Choose optimal learning rates, epochs, LoRA rank, and alpha
• Avoid overfitting & underfitting
🔗 docs.unsloth.ai/get-started/fi…
[attached image]
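As a concrete reference point, here is a minimal sketch of the knobs the guide covers, expressed as a PEFT `LoraConfig` plus two trainer settings. The particular values (r=16, alpha=16, lr=2e-4, 3 epochs) are common illustrative defaults, not recommendations taken from the Unsloth docs linked above.

```python
# Minimal sketch of the hyperparameters the guide discusses, using the PEFT
# library. Values are common illustrative defaults, not Unsloth's advice.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                      # LoRA rank: capacity of the low-rank update
    lora_alpha=16,             # scaling factor; effective scale is alpha / r
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_knobs = {
    "learning_rate": 2e-4,     # too high risks divergence, too low underfits
    "num_train_epochs": 3,     # more epochs raise the risk of overfitting
}
print(lora_config, training_knobs)
```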
13 replies · 128 reposts · 677 likes · 25.1K views
Gullal S. Cheema retweeted
Aran Komatsuzaki (@arankomatsuzaki)
From Bytes to Ideas: Language Modeling with Autoregressive U-Nets. Presents an autoregressive U-Net that processes raw bytes and learns hierarchical token representations. Matches strong BPE baselines, with deeper hierarchies demonstrating promising scaling trends.
[attached image]
3 replies · 54 reposts · 355 likes · 59.7K views
Gullal S. Cheema retweeted
Sagnik (@saagnikkk)
🚨 Paper Alert: "RL Finetunes Small Subnetworks in Large Language Models". From DeepSeek V3 Base to DeepSeek R1 Zero, a whopping 86% of parameters were NOT updated during RL training 😮😮 And this isn't a one-off. The pattern holds across RL algorithms and models. 🧵 A deep dive
[attached image]
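A minimal sketch of how one could check a claim like this on any pair of checkpoints: compare parameters before and after RL training and report the fraction left numerically unchanged. The tolerance and the file paths are placeholders; this is not the paper's exact methodology.

```python
# Sketch: fraction of parameters left unchanged between two checkpoints,
# e.g. a base model and its RL-finetuned version. The tolerance is an
# arbitrary choice here, not the threshold used in the paper.
import torch

def unchanged_fraction(state_a: dict, state_b: dict, atol: float = 0.0) -> float:
    total, same = 0, 0
    for name, pa in state_a.items():
        pb = state_b.get(name)
        if pb is None or pa.shape != pb.shape:
            continue  # skip parameters that do not line up across checkpoints
        eq = torch.isclose(pa, pb, atol=atol, rtol=0.0)
        total += eq.numel()
        same += eq.sum().item()
    return same / max(total, 1)

# Usage (hypothetical checkpoint paths):
# base = torch.load("base_model.pt")          # state_dict of the base model
# rl   = torch.load("rl_finetuned_model.pt")  # state_dict after RL training
# print(f"{unchanged_fraction(base, rl):.1%} of parameters unchanged")
```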
17 replies · 132 reposts · 907 likes · 192.5K views
Gullal S. Cheema retweeted
Kenneth Stanley (@kenneth0stanley)
Could a major opportunity to improve representation in deep learning be hiding in plain sight? Check out our new position paper: Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis.
The idea stems from a little-known observation about networks trained to output a single image: when they are discovered through an unconventional open-ended search process, their representations are incredibly elegant and exhibit astonishing modular decomposition. In contrast, when SGD (successfully) learns to output the same image, its underlying representation is fractured and entangled - an absolute mess!
This stark difference in the underlying representation of the same "good" output behavior carries deep lessons for deep learning. It shows you cannot judge a book by its cover - an LLM with all the right responses could similarly be a mess under the hood. But also, surprisingly, it shows us that it doesn't have to be this way! Without the unique examples in this paper that were discovered through open-ended search, we might assume neural representation has to be a mess. These results show that is clearly untrue. We can now imagine something better because we can actually see it is possible.
We give several reasons why this matters: generalization, creativity, and learning are all potentially impacted. The paper shows examples to back up these concerns, but in brief, there is a key insight: representation is not only important for what you're able to do now, but for where you can go from there. The ability to imagine something new (and where your next step in weight space can bring you) depends entirely upon how you represent the world. Generalization, creativity, and learning itself depend upon this critical relationship.
Notice the difference in appearance between the images near the skull in weight space shown in the top-left and top-right image strips of the attached graphic. The difference in semantics is stark.
The insight that representation could be better opens up a lot of new paths and opportunities for investigation. It raises new urgency to understand the representation underlying foundation models and LLMs while exposing all kinds of novel avenues for potentially improving them, from making learning processes more open-ended to manipulating architectures and algorithms.
Don't mistake this paper as providing comfort for AI pessimists. By exposing a novel set of stark and explicit differences between conventional learning and something different, it can act as an accelerator of progress rather than a tool of pessimism. At the least, the discussion it provokes should be quite illuminating.
[attached image]
50 replies · 159 reposts · 990 likes · 163.7K views
Gullal S. Cheema retweeted
Kyunghyun Cho (@kchonyc)
it's been more than a decade since KD was proposed, and i've been using it all along .. but why does it work? too many speculations but no simple explanation.
@_sungmin_cha and i decided to see if we can come up with the simplest working description of KD in this work. we ended up with a very simple explanation for any mixture distribution, starting from a mixture of gaussians. the key hypothesis was that using lower-entropy, approximate sampling from a teacher results in a higher-precision but lower-recall student.
since an autoregressive LM is nothing but an infinite cascade of mixture distributions, we confirm this with SmolLM (thanks, @huggingface!)
this is probably not the complete picture of KD, but i can definitely sleep better after writing down and confirming this minimal working explanation.
as an extra takeaway, this implies that our eval tends to be overly precision-focused. we should really think about what we lose in terms of recall, as this directly relates to what we miss out on, and for whom, when we build these large-scale, general-purpose models.
[attached images]
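For readers who have not used it, the standard KD objective under discussion is a KL term between temperature-softened teacher and student distributions. A minimal, generic sketch (not the specific setup of the paper above):

```python
# Minimal sketch of the standard knowledge-distillation loss: KL divergence
# between temperature-softened teacher and student next-token distributions.
# Generic formulation, not the exact setup of the paper discussed above.
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            temperature: float = 2.0) -> torch.Tensor:
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean reduction plus the T^2 factor keeps gradient magnitudes
    # comparable across temperatures
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature ** 2

# Toy usage with [batch, vocab]-shaped logits
student = torch.randn(4, 32000)
teacher = torch.randn(4, 32000)
print(kd_loss(student, teacher).item())
```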
8 replies · 48 reposts · 380 likes · 48.3K views
Gullal S. Cheema retweeted
DailyPapers (@HuggingPapers)
J1 just launched on Hugging Face: a reinforcement learning recipe for training Thinking-LLM-as-a-Judge models. It trains J1-Llama-8B and J1-Llama-70B, which outperform existing models.
[attached image]
3 replies · 14 reposts · 58 likes · 10.9K views
Gullal S. Cheema retweeted
Kevin Patrick Murphy (@sirbayes)
I am pleased to announce a new version of my RL tutorial. Major update to the LLM chapter (e.g. DPO, GRPO, thinking), minor updates to the MARL and MBRL chapters and various sections (e.g. offline RL, DPG, etc.). Enjoy! arxiv.org/abs/2412.05265
[attached image]
22 replies · 435 reposts · 2.4K likes · 122.1K views
Gullal S. Cheema retweeted
Yangyi Chen (@YangyiChen6666)
🐂🍺Introducing our recent preprint: Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training! We present PRIOR, a simple vision-language pre-training algorithm that addresses the challenge of irrelevant textual content in image-caption pairs. PRIOR enhances pre-training by implementing differential weighting in the next token prediction (NTP) loss function, effectively prioritizing image-related tokens during training.
[attached image]
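The core mechanism described above, per-token weighting of the next-token-prediction loss, fits in a few lines. The sketch below simply takes precomputed per-token weights as input (how those weights are derived for image-related tokens is the paper's contribution), so it is an illustrative reduction, not the exact PRIOR algorithm.

```python
# Sketch of a per-token weighted next-token-prediction loss: each target
# token's cross-entropy is scaled by a weight (e.g. higher for image-related
# tokens). The weights are an input here, so this is only an illustrative
# reduction of the PRIOR idea, not the paper's algorithm.
import torch
import torch.nn.functional as F

def weighted_ntp_loss(logits: torch.Tensor,    # [batch, seq, vocab]
                      targets: torch.Tensor,   # [batch, seq]
                      weights: torch.Tensor) -> torch.Tensor:  # [batch, seq]
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    return (per_token * weights).sum() / weights.sum().clamp(min=1e-8)

# Toy usage
logits = torch.randn(2, 5, 100)
targets = torch.randint(0, 100, (2, 5))
weights = torch.rand(2, 5)  # e.g. larger weights for image-related tokens
print(weighted_ntp_loss(logits, targets, weights).item())
```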
3 replies · 28 reposts · 147 likes · 16.5K views
Gullal S. Cheema retweeted
Rohan Paul (@rohanpaul_ai)
Adapting pretrained LLMs for vision tasks often degrades their language abilities or requires full retraining. X-Fusion introduces a dual-tower design: it freezes the LLM weights and adds a trainable vision tower. This enables multimodal tasks while preserving the original language skills.
Methods explored in this paper 🔧:
→ X-Fusion uses separate, trainable vision weights in each layer alongside frozen language layers.
→ It processes vision tokens conditioned on text for generation, and extracts visual features for understanding.
→ Outputs from the text and vision blocks combine selectively. Vision blocks can be initialized from language layers.
→ Training uses an autoregressive language loss (weight 0.2) and an image diffusion loss (weight 1.0).
📌 The frozen LLM core preserves strong language skills, vital for coherent multimodal reasoning.
📌 The decoupled vision tower allows flexible, modality-specific architectural changes without altering the LLM.
📌 A clean image-to-text data strategy directly boosts unified model performance across tasks.
Paper: arxiv.org/abs/2504.20996v1
Paper title: "X-Fusion: Introducing New Modality to Frozen LLMs"
[attached image]
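The training objective described in the summary (autoregressive text loss weighted 0.2 plus diffusion loss weighted 1.0) is a simple weighted sum. A schematic sketch, with the two loss terms assumed to be computed elsewhere:

```python
# Schematic sketch of the combined objective described above: a text (LM)
# loss and a vision (diffusion) loss mixed with the weights reported in the
# summary (0.2 for text, 1.0 for diffusion). The loss terms themselves are
# assumed to be computed elsewhere; this only shows the combination.
import torch

TEXT_LOSS_WEIGHT = 0.2
DIFFUSION_LOSS_WEIGHT = 1.0

def x_fusion_total_loss(lm_loss: torch.Tensor,
                        diffusion_loss: torch.Tensor) -> torch.Tensor:
    return TEXT_LOSS_WEIGHT * lm_loss + DIFFUSION_LOSS_WEIGHT * diffusion_loss

# Toy usage with placeholder scalar losses
print(x_fusion_total_loss(torch.tensor(2.3), torch.tensor(0.8)).item())
```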
7 replies · 33 reposts · 142 likes · 8.1K views