eddy

479 posts

@eddy_data3

Working on reasoning and synthetic data

Joined May 2022

5K Following · 438 Followers

Pinned Tweet

eddy @eddy_data3 · Feb 6

Such a rewarding experience (pun intended) collaborating with @tongyx361 @xiangyue96 @sirius_ctrl @gneubig! We hope our results are useful to the community 🙏

Xiang Yue@xiangyue96

Demystifying Long CoT Reasoning in LLMs arxiv.org/pdf/2502.03373 Reasoning models like R1 / O1 / O3 have gained massive attention, but their training dynamics remain a mystery. We're taking a first deep dive into understanding long CoT reasoning in LLMs! 11 Major Takeaways👇(long threads)

eddy retweeted

Andrej Karpathy @karpathy · Jun 27

The race for LLM "cognitive core" - a few billion param model that maximally sacrifices encyclopedic knowledge for capability. It lives always-on and by default on every computer as the kernel of LLM personal computing. Its features are slowly crystalizing:
- Natively multimodal text/vision/audio at both input and output.
- Matryoshka-style architecture allowing a dial of capability up and down at test time.
- Reasoning, also with a dial. (system 2)
- Aggressively tool-using.
- On-device finetuning LoRA slots for test-time training, personalization and customization.
- Delegates and double checks just the right parts with the oracles in the cloud if internet is available.
It doesn't know that William the Conqueror's reign ended in September 9 1087, but it vaguely recognizes the name and can look up the date. It can't recite the SHA-256 of empty string as e3b0c442..., but it can calculate it quickly should you really want it.
What LLM personal computing lacks in broad world knowledge and top tier problem-solving capability it will make up in super low interaction latency (especially as multimodal matures), direct / private access to data and state, offline continuity, sovereignty ("not your weights not your brain"). i.e. many of the same reasons we like, use and buy personal computers instead of having thin clients access a cloud via remote desktop or so.

Omar Sanseviero@osanseviero

I’m so excited to announce Gemma 3n is here! 🎉 🔊Multimodal (text/audio/image/video) understanding 🤯Runs with as little as 2GB of RAM 🏆First model under 10B with @lmarena_ai score of 1300+ Available now on @huggingface, @kaggle, llama.cpp, ai.dev, and more
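The "calculate it rather than recite it" point in the Karpathy post above can be illustrated with a few lines of Python. This is only a minimal sketch using the standard-library `hashlib` module; the variable name `digest` is chosen here for illustration and does not come from the post.

```python
import hashlib

# Compute the SHA-256 digest of the empty byte string instead of
# recalling it from memory, as the post suggests a small model could.
digest = hashlib.sha256(b"").hexdigest()

# The post quotes the first eight hex characters ("e3b0c442...").
print(digest[:8], "...", f"({len(digest)} hex characters in total)")
```

The tool call is cheap and exact, which is the trade the post describes: the model only needs to recognize that a hash is being asked for, not store the value.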

eddy @eddy_data3 · Feb 8

@LiJunnan0409 @tongyx361 @xiangyue96 @sirius_ctrl @gneubig Thanks Junnan!

Li Junnan @LiJunnan0409 · Feb 8

@eddy_data3 @tongyx361 @xiangyue96 @sirius_ctrl @gneubig great work!

eddy @eddy_data3 · Feb 7

@itsyuhao @tongyx361 @xiangyue96 @sirius_ctrl @gneubig Thanks Yuhao!

Yuhao Yang @itsyuhao · Feb 7

@eddy_data3 @tongyx361 @xiangyue96 @sirius_ctrl @gneubig Congrats, 🎉🎊 Eddy!

eddy @eddy_data3 · Feb 7

@samcwl Hi Sam! Thanks for your comments. Yes, we definitely could have added an ablation for filtered WebInstruct data in Table 2. At that time, we were more focused on assessing the impact of noise in the SFT data, so we also preserved the noise in its correctness.

Sam Ching @samcwl · Feb 7

Maybe I read it wrong, but curious why the setup differed between SFT + RL (Table 2) and RL only (Table 3). Would be interested in the impact of filtered data on SFT + RL / SFT only setups (Table 2).

Sam Ching @samcwl · Feb 7

One qn: > 5.1: In contrast, for data from WebInstruct without fully reliable supervision signals but with a much larger scale, we sample one response per prompt from the teacher model without filtration. -> why not use filtered subset and do rejection sampling for SFT?

eddy @eddy_data3 · Feb 6

@YouJiacheng @xiangyue96 Hi Jiacheng, I've attached the curves below. The reward going down is correlated with the exceed rate going up -- so when more generations exceed the context length, the model receives less reward. What is your expectation for rewards when the exceed rate goes up?

You Jiacheng @YouJiacheng · Feb 6

@xiangyue96 can you also share the average reward and average KL curves? this is very sus. if the reward goes down, then there should be some bugs in RL. if the reward goes up, it might have got hacked.

You Jiacheng @YouJiacheng · Feb 6

hmmmm why? if the model exceeds the ctx window it can't get the reward, why it will learn this behavior?

Xiang Yue@xiangyue96

Takeaway 3: CoT length does not always scale up in a stable fashion. We observed that models increased their CoT length during RL training, eventually reaching the context window limit. This led to a decline in training accuracy due to CoTs exceeding the allowable window size.
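The relationship discussed in this thread (reward falling as more generations exceed the context window) can be sketched with simple arithmetic. The sketch below assumes, as the replies describe, that a generation that exceeds the context window receives zero reward; the function name `expected_reward` and the 0.6 accuracy figure are hypothetical values chosen for illustration, not numbers from the paper.

```python
def expected_reward(accuracy_within_window: float, exceed_rate: float) -> float:
    """Mean reward per generation under a zero-reward-on-truncation assumption.

    accuracy_within_window: fraction of in-window generations that are correct
                            (hypothetical input, reward 1 if correct else 0).
    exceed_rate:            fraction of generations that exceed the context window.
    """
    if not 0.0 <= exceed_rate <= 1.0:
        raise ValueError("exceed_rate must be in [0, 1]")
    # Truncated generations contribute 0, so only the in-window share earns reward.
    return (1.0 - exceed_rate) * accuracy_within_window


# With accuracy held fixed, mean reward falls linearly as the exceed rate rises.
for rate in (0.0, 0.1, 0.3, 0.5):
    print(f"exceed_rate={rate:.1f} -> mean reward={expected_reward(0.6, rate):.2f}")
```

Under this assumption a falling reward curve alongside a rising exceed rate is the expected outcome of truncation. It does not by itself explain why the policy drifts toward longer outputs, which is the question raised in the thread.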

eddy retweeted

Jiayi Pan @jiayi_pirate · Jan 24

We reproduced DeepSeek R1-Zero in the CountDown game, and it just works. Through RL, the 3B base LM develops self-verification and search abilities all on its own. You can experience the Aha moment yourself for < $30. Code: github.com/Jiayi-Pan/Tiny… Here's what we learned 🧵

eddy retweeted

Li Junnan @LiJunnan0409 · Dec 15

Introducing 🔥Aria-Chat🔥, our latest multimodal chat model optimized for open-ended and multi-round dialogs! It outperforms Aria by 7 points on WildVision-Bench, offering enhanced reliability and stronger multilingual support. Download the model now: huggingface.co/rhymes-ai/Aria…

eddy retweeted

Xiang Yue @xiangyue96 · Dec 7

🚨 Our latest work evaluates the synthetic data generation abilities of different LLMs. Key findings: - Strong task-solvers ≠ strong synthetic data generators. - Bigger isn’t always better! E.g., Llama3.1-8B can outperform a 405B model in certain settings. - More synthetic data from weaker models > Less data from stronger models. - Intrinsic evaluation metrics + LLM-as-Judge could predict the data generation ability of LLMs. 🔗arxiv.org/pdf/2412.03679 Check out more details in @seungonekim 's thread👇

Seungone Kim@seungonekim

#NLProc Just because GPT-4o is 17 times more expensive than GPT-4o-mini, does that mean it generates synthetic data 17 times better? Introducing the AgoraBench, a benchmark for evaluating data generation capabilities of LMs.

eddy retweeted

World Labs @theworldlabs · Dec 2

We’ve been busy building an AI system to generate 3D worlds from a single image. Check out some early results on our site, where you can interact with our scenes directly in the browser! worldlabs.ai/blog 1/n

eddy retweeted

Rhymes.AI @rhymes_ai_ · Oct 22

New Model From Rhymes! ✨ We're thrilled to announce, Allegro — a small and efficient open-source text-to-video model that transforms your text into stunning 6-second videos at 15 FPS and 720p! 🚀 🩷 Explore Allegro - Gallery: rhymes.ai/allegro_gallery - @huggingface: huggingface.co/rhymes-ai/Alle… - @github: github.com/rhymes-ai/Alle… - Blog: rhymes.ai/blog-details/a… - Report: arxiv.org/abs/2410.15458 - License: Apache 2.0 📽️ Try Allegro Want to be the first to try Allegro directly on Discord? Join the waitlist and get ready to explore! forms.gle/JhA7BaKvZoeJYQ… We can't wait to see the amazing creations you'll make with Allegro! #AIVideo #TextToVideo #Allegro

eddy retweeted

Rhymes.AI @rhymes_ai_ · Oct 10

🚀 Introducing Aria from @rhymes_ai_ : The first open-source, multimodal native MoE model! Aria uses 3.9B parameters per token, excelling in multimodal & language tasks. It features a 64K token context window, captioning 256-frame videos in 10 seconds. Lightweight, fast, & efficient. Check out some cool demos! Clear advantages over Pixtral-12B and Llama3.2-11B across a range of multimodal, language, and coding tasks. Also competitive against proprietary models like GPT-4o and Gemini-1.5 on some multimodal tasks 💪 Now open on @huggingface & @github with open weights & code 🤗 (Apache 2.0)! Read more: rhymes.ai/blog-details/a…

eddy retweeted

doomslide @doomslide · Sep 30

prooftech is fundamentally anarchist

doomslide@doomslide

llms dont generalize ood => llm agents will always need babysitting => automated babysitting requires verification (formal/heuristic) => cheap proofgen + codegen catapults modularity of software => gigantic closed centralized systems fall off => incumbents are worried about it

eddy retweeted

roon @tszzl · Sep 8

the cognitive profile of humans is not generally distributed across all cognitive skills. for example we’re unreasonably good at modeling our friends theory of mind and super bad at learning ten languages or difficult math (except on the margins)

eddy retweeted

Xiang Yue @xiangyue96 · Sep 5

🚀 Introducing MMMU-Pro: A more robust version of MMMU arxiv.org/pdf/2409.02813… After launching MMMU, we received valuable feedback from the community: 1️⃣ Some questions were answerable without even seeing the images. 2️⃣ Models didn’t always "know" the answer but found shortcuts from the options provided. 3️⃣ Performance was heavily tied to LLMs, with minimal impact from the vision module. To tackle these issues, we implemented the following improvements: 🔍 1. Filtering Text-Only Answerable Questions 🔄 2. Augmenting Candidate Options up to 10 by Human Experts. 🖼️ 3. Vision-Only Input Setting: where questions are embedded directly in images, requiring the model to rely purely on visual input. ✨ Why We Added Vision-Only Input Setting? 1. From a foundational perspective, this setting forces AI to genuinely "see" and "read" at the same time—challenging a core human cognitive skill: the seamless integration of visual and textual information. 2. From an application standpoint, this approach mirrors how users naturally interact with AI systems—by sharing screenshots or photos, without meticulously separating text from images. 📊 Key Results: Performance on MMMU-Pro is notably lower compared to MMMU, ranging from 16.8% to 26.9% across various models. The ranking of models is generally similar to the original but we also observe less robust ones— for example, GPT-4o mini proved less robust than GPT-4o and other proprietary models, showing significant drops in performance on the augmented set. 🔬 More in-depth analysis can be found in the threads below! 👇

eddy retweeted

Nous Research @NousResearch · Aug 26

What if you could use all the computing power in the world to train a shared, open source AI model? Preliminary report: github.com/NousResearch/D… Nous Research is proud to release a preliminary report on DisTrO (Distributed Training Over-the-Internet) a family of architecture-agnostic and network-agnostic distributed optimizers that reduces the inter-GPU communication requirements by 1000x to 10,000x without relying on amortized analysis, and matches AdamW+All-Reduce in convergence rates. This enables low-latency training of large neural networks on slow internet bandwidths with heterogeneous networking hardware. DisTrO can increase the resilience and robustness of training LLMs by minimizing dependency on a single entity for computation. DisTrO is one step towards a more secure and equitable environment for all participants involved in building LLMs. Without relying on a single company to manage and control the training process, researchers and institutions can have more freedom to collaborate and experiment with new techniques, algorithms, and models. This increased competition fosters innovation, drives progress, and ultimately benefits society as a whole. This research is thanks to the hard work of @bloc97_ @theemozilla @apyh__ @UmerHAdil. We invite researchers interested in exploring this area to join us in our quest.

eddy retweeted

Thomas Wolf @Thom_Wolf · Aug 18

It’s Sunday morning and we have some time with our coffee, so let me tell you about our recent, surprising journey in synthetic data and small language models. This post is prompted by the coming release of an instant, in-browser model called SmolLM360 (link at the end).
The journey started as Loubna and Anton, two of the leads of the BigCode and StarCoder projects, were looking for a new topic to explore. Around that time Microsoft had released Phi1, a small model (1.7B) trained half on synthetic data, which showed very impressive code capabilities and was followed by Phi1.5, which extended this approach to natural language. The benchmark numbers were really impressive, but with the training dataset kept private, people claimed that Phi1 was gaming the benchmark and had maybe been trained on very similar examples.
Really intrigued by the performance and the secret here, Loubna and Anton decided to explore creating a large synthetic dataset. This led to the release of Cosmopedia 1 in winter 2024: 25B of synthetically created data generated by the best model at the time, Mixtral-8x7B, and an associated model. Performance was fine but still fell somewhat short of what Phi-1 and Phi-1.5 were showing, so they decided to go deeper.
A first breakthrough came in the spring when they dug into the audience of the synthetic prompts they were using. Let me explain a bit. In synthetic data generation, you ask a language model to generate educational content on a topic of your choice. But educational content can span a very large spectrum of language and complexity, from content for toddlers up to graduate content for PhDs. Just as there is no need to teach PhD-level concepts to toddlers, many of the prompts they were first using were creating educational content that was far too complex for the small model they were training, and focusing the target audience on middle school helped tremendously on many benchmarks (apart from MMLU, which is typically PhD level as well).
A second and much more bittersweet discovery came later in the month as we were staying in a hotel in Lausanne, Switzerland, doing one of these remote « get-together-to-work-in-a-nice-place » we often have at Hugging Face. As a side project, to help Guilherme, who was working on the release of a large dataset that would become FineWeb, they explored using prompt engineering techniques similar to those they’d been using, but this time to filter a large chunk of the web, asking an LLM to rate the educational content of a webpage instead of writing it from scratch. Using this heavily filtered web data made performance go up immediately, passing all the other models of similar size, including Phi-1.5, on most benchmarks.
This was bittersweet in that, while performance was higher than we’d ever seen, they had also spent so much time crafting synthetic data prompts only to discover that heavily filtering the web was still better and much more diverse, with more than 1.3 trillion tokens available even when filtering heavily (compared with the difficulty of scaling the size and diversity of synthetic data).
Extending the same approach to code data, heavily filtering The Stack, the largest code dataset in the world, using prompts and language models also proved amazingly powerful, pushing the performance of a model that was stuck around 13% on HumanEval (a Python code benchmark) to above 20% out of the box. Boom!
Are synthetic data still useful? Yes, but the web is so big and diverse that synthetic data really make more sense for domain-specific parts where the right data is lacking, say reasoning or math.
Now, right as they were excited by these new discoveries and results, they were joined by a new intern, Elie, who proved a great specialist in various training techniques, and they decided to push the experiments to the limit in terms of model size, going from 1.7B down to 360M and even 170M, aka the sizes of the old GPT1, BERT and GPT2, to see how small a model could be while still keeping good performance.
One of the recipes for this good performance proved to be simply training for longer and longer, ignoring the usual wisdom that dictated you should avoid training smaller models for too long. Right now even these very small models end up being trained on multiple trillions of tokens, just like their larger counterparts. Another element of the recipe they discovered was to anneal the data, which means keeping a special set of high-quality training data for the last part of the training.
This led last week to training a 360M model (more than 1000 times smaller than current frontier models like Llama-3.1-405B) which showed amazing performance on the benchmarks, beating all <500M models and even some larger ones.
So what’s next? Alignment and benchmarks.
Given their small size, these models still struggle to answer very complex or graduate-level math/code questions. That’s perfectly fine because you don’t really need a model to be able to solve the math olympics in your daily life. But one problem is that our evals usually contain a mix of complex and simpler questions, leading to noise in how we evaluate these simple models. Another problem is alignment: how to fine-tune these models to follow instructions.
We’ve been developing datasets and techniques which work really well for larger models (SFT, DPO, PPO, etc.), but if you try the « Instant Smol » demo you’ll see that the aligned smol models are still lacking in this respect. This likely comes from the alignment datasets for LLMs, which contain many concepts too complex for small models (math, reasoning, etc.) and lack the simpler tasks that they are well suited for (grammar correction, translation, etc.).
So what’s next for SmolLM? It’s going to be a really exciting year for them. A 360M-parameter model is basically 360 MB in size, which is tiny by today’s web standards (much smaller than many videos). It also gives basically instantaneous responses (>50-70 tok/s in browser) as it runs locally. So with the knowledge around these models being progressively uncovered, I can see them being used everywhere, more and more locally, with private data which doesn’t leave your computer, with instant response, with small size and more energy efficiency versus larger models. An exciting year for Smol LMs 🚀
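The "360M parameters is basically 360 MB" figure in the post above corresponds to roughly one byte per parameter. The post does not state which precision it has in mind, so the sketch below simply tabulates the weight size under a few common assumptions; the function name `model_size_mb` is hypothetical, and real files also carry some metadata overhead.

```python
def model_size_mb(num_params: int, bytes_per_param: float) -> float:
    """Approximate size of the weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


NUM_PARAMS = 360_000_000  # the 360M model mentioned in the post

# Bytes per parameter for a few common weight formats (illustrative choices).
for name, bytes_per_param in (("fp32", 4), ("fp16/bf16", 2), ("int8", 1)):
    size = model_size_mb(NUM_PARAMS, bytes_per_param)
    print(f"{name:>9}: {size:,.0f} MB")
```

Only the 8-bit row lands on 360 MB, so the post's figure is best read as a quantized-weights estimate; 16-bit weights would be about twice that.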

eddy retweeted

Andrej Karpathy @karpathy · Jul 18

LLM model size competition is intensifying… backwards! My bet is that we'll see models that "think" very well and reliably that are very very small. There is most likely a setting even of GPT-2 parameters for which most people will consider GPT-2 "smart". The reason current models are so large is because we're still being very wasteful during training - we're asking them to memorize the internet and, remarkably, they do and can e.g. recite SHA hashes of common numbers, or recall really esoteric facts. (Actually LLMs are really good at memorization, qualitatively a lot better than humans, sometimes needing just a single update to remember a lot of detail for a long time). But imagine if you were going to be tested, closed book, on reciting arbitrary passages of the internet given the first few words. This is the standard (pre)training objective for models today. The reason doing better is hard is because demonstrations of thinking are "entangled" with knowledge, in the training data. Therefore, the models have to first get larger before they can get smaller, because we need their (automated) help to refactor and mold the training data into ideal, synthetic formats. It's a staircase of improvement - of one model helping to generate the training data for next, until we're left with "perfect training set". When you train GPT-2 on it, it will be a really strong / smart model by today's standards. Maybe the MMLU will be a bit lower because it won't remember all of its chemistry perfectly. Maybe it needs to look something up once in a while to make sure.

Artificial Analysis@ArtificialAnlys

GPT-4o Mini, announced today, is very impressive for how cheap it is being offered 👀 With a MMLU score of 82% (reported by TechCrunch), it surpasses the quality of other smaller models including Gemini 1.5 Flash (79%) and Claude 3 Haiku (75%). What is particularly exciting is that it is also to be offered at a cheaper price than these models. The reported price is $0.15/1M input tokens and $0.6/1M output tokens. With such a cheap price for input tokens and its large 128k context window, it will be very compelling for long context use-cases (including large document RAG). @OpenAI have clearly made a very high quality model relative to its size (pricing can indicate size due to the direct relationship to compute cost). The model seems a worthy successor to GPT3.5 Turbo as OpenAI's smallest model and the model used for ChatGPT's free version.

eddy retweeted

Mira @_Mira___Mira_ · Jul 8

> Obstinacy is a reflexive resistance to changing one's ideas.
If you have a model M of the world which predicts X with confidence X', and you observe ~X, then: M -> X = ~M or X = ~M The most obstinate may reject the evidence for ~X. It's usually noisy. Maybe you made a mistake. Maybe doing it better would work. Maybe it was unlucky. Otherwise you have to change your model (~M). But if you unroll your justification into a sequence of primitive steps, there's the decision of where to apply the update. The very obstinate would do "the smallest change to M". Like an if statement checking for exactly the situation. As you decrease obstinacy, you might abstract over a property of the object. Decrease further and you might change a minor concept. The least obstinate possible would be "The original M is the least complex model for the original evidence. Given this new evidence, I will throw everything away and recalculate a new least complex model from scratch." So it seems obstinacy might be "the size of updates to your model you're willing to consider". Rejecting the evidence means a change of size 0.
> The optimal amount of obstinacy is not zero.
I think with unbounded compute it would actually be zero. But given bounded compute, obstinacy is a heuristic for scoping a necessary increase in complexity. Your model has made many predictions in the past. To reevaluate it, you'd need to "backtest" all those predictions against any change. Shallow updates are easier to do because the scope of predictions is smaller. He also says obstinacy can look like "stupidity", which matches a complexity definition (Kolmogorov complexity is a classic definition of intelligence). So I wonder if it can be reduced to "description sizes", "memory of past predictions", "working memory", "serial length of inferences", "total inference steps".

Paul Graham@paulg

The Right Kind of Stubborn: paulgraham.com/persistence.ht…

eddy retweeted

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex · Jul 10

@_xjdr Thinking about emergence, feels that we can semi-rigorously say which tasks are hard-locked by serial operation depth or by width/fundamental richness limits on LLM representations, and can only be approached asymptotically with compute in small models; and which are… just hard.
