Benjamin Bergner

60 posts

@bergbenj

PhD student at HPI

Berlin · Joined July 2014
107 Following · 166 Followers
Swaroop Mishra@Swarooprm7·
We should respect DeepSeek for the great results, but don't call it a breakthrough. I also like the low-key nature of the DeepSeek squad. They have produced good reasoning models in the past, too, so it's definitely not a one-time hit. Having said this, the tech report doesn't include detailed ablations that may need careful attention from the community; for example, is GRPO really needed?
11 replies · 5 retweets · 67 likes · 15.9K views
Benjamin Bergner@bergbenj·
For verifiable rewards, how could this be scaled beyond easily verifiable math and coding problems to arbitrary tasks? Or could it be that a few math/coding problems are sufficient to learn general reasoning across tasks?
0 replies · 0 retweets · 0 likes · 39 views
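To make "verifiable rewards" concrete for readers of this thread: the reward comes from a programmatic check rather than a learned reward model. Below is a minimal, hypothetical sketch; the function names, normalization, and sandbox-free test runner are illustrative assumptions, not DeepSeek's actual reward code.

    # Hypothetical rule-based rewards: an answer is scored by a programmatic
    # check (exact match, unit tests), so no learned reward model is needed.
    import subprocess
    import sys

    def math_reward(model_answer: str, reference_answer: str) -> float:
        """1.0 if the final answer matches the reference after light normalization."""
        normalize = lambda s: s.strip().replace(" ", "").lower()
        return 1.0 if normalize(model_answer) == normalize(reference_answer) else 0.0

    def code_reward(candidate_source: str, tests: list[tuple[str, str]]) -> float:
        """Fraction of (stdin, expected_stdout) tests passed by the candidate program.
        A real setup would sandbox execution; this toy version does not."""
        passed = 0
        for stdin, expected in tests:
            try:
                out = subprocess.run([sys.executable, "-c", candidate_source],
                                     input=stdin, capture_output=True,
                                     text=True, timeout=5).stdout
                passed += out.strip() == expected.strip()
            except subprocess.TimeoutExpired:
                pass
        return passed / max(len(tests), 1)

The open question raised in the tweet is whether rewards that only exist for checkable domains like these transfer to tasks with no programmatic verifier.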
Benjamin Bergner@bergbenj·
Is it the quality of the base model? Is it the training process (RL vs. SFT)? Is it PPO vs. GRPO for RL? Is it verifiable rewards vs. using a reward model?
1 reply · 0 retweets · 0 likes · 59 views
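For the PPO-vs-GRPO part of the question: GRPO replaces PPO's learned value-function baseline with rewards normalized within a group of samples for the same prompt. A minimal sketch of that group-relative advantage, assuming scalar per-completion rewards; the full algorithm additionally uses a clipped policy-ratio objective and a KL penalty, omitted here.

    import torch

    def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        """Group-relative advantages for one prompt.

        group_rewards: shape (G,), one scalar reward per sampled completion.
        Each completion's advantage is its reward standardized against the
        group mean/std, in place of PPO's value-network baseline.
        """
        return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

    # Example: 4 completions for the same prompt with verifiable 0/1 rewards.
    rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
    print(grpo_advantages(rewards))  # correct completions get positive advantage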
Benjamin Bergner@bergbenj·
What are the main reasons why DeepSeek-R1, even the Zero version, works so well?
1 reply · 0 retweets · 0 likes · 193 views
Lucas Beyer (bl16)@giffmana·
I wrote a blog post, "On the speed of ViTs and CNNs". It addresses the following concerns I often hear:
- worry about ViT speed at high resolution
- how high a resolution do I need?
- is it super important to keep the aspect ratio?
I think @ylecun might like it too! Link below
[image attached]
24 replies · 91 retweets · 702 likes · 101.1K views
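A back-of-the-envelope illustration of the high-resolution concern (my numbers, not taken from the blog post): with 16x16 patches, token count grows quadratically with image side length, and self-attention cost grows quadratically with token count.

    def vit_attention_cost(image_size: int, patch_size: int = 16) -> tuple[int, int]:
        """Return (num_tokens, relative self-attention cost ~ tokens**2) for a square image."""
        tokens = (image_size // patch_size) ** 2
        return tokens, tokens ** 2

    base = vit_attention_cost(224)[1]
    for size in (224, 448, 896):
        tokens, cost = vit_attention_cost(size)
        print(f"{size}px: {tokens} tokens, attention cost x{cost / base:.0f}")
    # 224px: 196 tokens (x1), 448px: 784 tokens (x16), 896px: 3136 tokens (x256)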
dr. jack morris@jxmnop·
have been thinking about this a lot, and I think the most impactful topic to work on right now in AI is 𝗔𝗴𝗲𝗻𝘁𝘀

we can't just fake agency using prompting, though. we need to build it into the next generation of language models: agent-native LLMs

here's my logic:
1. it seems obvious that language models can't ever do everything in one forward pass, simply because all the information required will never be initially available. imagine trying to write a history textbook, or implement a new feature in an app, without consulting any APIs or looking anything up, storing intermediate work, etc. real-life problems require systems that don't just immediately output an answer
2. current language models are only trained to do next-token prediction, with a trivial RLHF layer on top. we can make much better agents by training them on more generic tasks from the get-go
3. this goal requires new datasets, task definitions, and training techniques, all of which we don't have yet (and I'd guess will take *years* of research)
27 replies · 20 retweets · 317 likes · 53.4K views
Benjamin Bergner@bergbenj·
@scottgeng00 Is the answer also NO if you train on much more synthetic data than you could retrieve from the original dataset?
0 replies · 0 retweets · 0 likes · 69 views
Scott Geng@scottgeng00·
Will training on AI-generated synthetic data lead to the next frontier of vision models?🤔 Our new paper suggests NO—for now. Synthetic data doesn't magically enable generalization beyond the generator's original training set. 📜: arxiv.org/abs/2406.05184 Details below🧵(1/n)
18 replies · 91 retweets · 471 likes · 111.4K views
Benjamin Bergner@bergbenj·
@ylecun Why not window attention in the first layers, followed by global attention?
0 replies · 0 retweets · 0 likes · 98 views
Yann LeCun@ylecun·
A short post on the best architectures for real-time image and video processing. TL;DR: use convolutions with stride or pooling at the low levels, and stick self-attention circuits at higher levels, where feature vectors represent objects. PS: ready to bet that Tesla FSD uses convolutions (or perhaps more complex *local* operators) at the low levels, combined with more global circuits at higher levels (perhaps using self-attention). Transformers on low-level patch embeddings are a complete waste of electrons.
Yann LeCun@ylecun

I'm not saying ViTs are not practical (we use them). I'm saying they are way too slow and inefficient to be practical for real-time processing of high-resolution images and video. [Also, @sainingxie's work on ConvNeXt has shown that they are just as good as ViTs if you do it right. But whatever.]

You need at least a few conv layers with pooling and stride before you stick self-attention circuits. Self-attention is equivariant to permutations, which is completely nonsensical for low-level image/video processing (having a single strided conv at the front end to 'patchify' also doesn't make sense). Global attention is also nonsensical (and not scalable), since correlations are highly local in images and video.

At the high levels, once features represent objects, it makes sense to use self-attention circuits: what matters is the relationships and interactions between objects, not their positions. This type of hybrid architecture was inaugurated by the DETR system by @alcinos26 and collaborators. As I've said since the DETR work, my favorite family of architectures is conv/stride/pooling at the lower levels, and self-attention circuits at the higher levels.

61 replies · 110 retweets · 1.4K likes · 748.4K views
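A minimal PyTorch sketch of the hybrid layout described in the thread above: strided convolutions downsample at the low levels, and self-attention only runs on the resulting coarse grid of high-level feature vectors. Layer widths and depths are arbitrary placeholders, not DETR's or any production architecture.

    import torch
    import torch.nn as nn

    class HybridBackbone(nn.Module):
        """Conv stem with stride at low levels, global self-attention at high levels."""
        def __init__(self, dim: int = 256, num_heads: int = 8, depth: int = 4):
            super().__init__()
            self.stem = nn.Sequential(                 # 3 x H x W -> dim x H/16 x W/16
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(256, dim, 3, stride=2, padding=1),
            )
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        def forward(self, images: torch.Tensor) -> torch.Tensor:
            feats = self.stem(images)                  # (B, dim, H/16, W/16)
            tokens = feats.flatten(2).transpose(1, 2)  # (B, N, dim), N = H/16 * W/16
            return self.encoder(tokens)                # attention over object-level features only

    x = torch.randn(1, 3, 512, 512)
    print(HybridBackbone()(x).shape)                   # torch.Size([1, 1024, 256])

Windowed attention in the early layers, as asked above, would be another way to keep early-stage cost local; this sketch simply uses convolutions for that role.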
Benjamin Bergner@bergbenj·
@mkwng cool idea. just wanted to try on your linked website but got an error when inserting a random location+distance: "An error occurred while generating the trail. Please try again."
1 reply · 0 retweets · 3 likes · 502 views
michael@mkwng·
A friend was asking our group chat for any apps that can take a starting location and generate random running trails. No one had a good answer. So, I fired up claude.ai, google colab, and repl.it and screen-recorded myself whipping together a UI to do exactly that in < 30 minutes. It's live here: trail-generator.replit.app
36 replies · 77 retweets · 946 likes · 315.1K views
dr. jack morris@jxmnop·
is anyone still using encoder-decoder models (T5, BART, etc.)? if so -- for what?
87 replies · 15 retweets · 390 likes · 198.7K views
Benjamin Bergner retweeted
Andrii Skliar 🇺🇦@avskliar·
🚀 Excited to share our latest work "Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding" now on arXiv! We're taking strides in making language models faster & more efficient on text generation tasks like translation & summarization.🔍 [arxiv.org/abs/2402.16844]
2 replies · 14 retweets · 52 likes · 10.9K views
Lucas Beyer (bl16)@giffmana·
🧶PaLI-3 achieves SOTA across many vision-language (and video!) tasks while being 10x smaller than its predecessor PaLI-X. At only 5B parameters, it's also smaller (and stronger) than the concurrent Fuyu-8B model, though sadly we cannot release the model (props to @AdeptAILabs)
[image attached]
8 replies · 59 retweets · 430 likes · 122.8K views
Benjamin Bergner@bergbenj·
Regarding point 8: Doesn't memory usage/training time depend on sequence length and training dataset? What have you used for your reported numbers?
Sebastian Raschka@rasbt

I ran hundreds if not thousands of LoRA & QLoRA experiments to finetune open-source LLMs, and here's what I learned:
1. Despite the inherent randomness of LLM training (or when training models on GPUs in general), the outcomes remain remarkably consistent across multiple runs.
2. QLoRA presents a trade-off that might be worthwhile if you're constrained by GPU memory. It offers 33% memory savings at the cost of a 33% increase in runtime.
3. When finetuning LLMs, the choice of optimizer shouldn't be a major concern. While SGD on its own is suboptimal, there's minimal variation in outcomes whether you employ AdamW, SGD with a scheduler, or AdamW with a scheduler.
4. While Adam is often labeled a memory-intensive optimizer due to its introduction of two new parameters for every model parameter, this doesn't significantly affect the peak memory demands of the LLM. This is because the majority of the memory is allocated for large matrix multiplications rather than retaining extra parameters.
5. For static datasets, iterating multiple times as done in multi-epoch training might not be beneficial. It often deteriorates the results, probably due to overfitting.
6. If you're incorporating LoRA, ensure it's applied across all layers, not just to the Key and Value matrices, to maximize model performance.
7. Adjusting the LoRA rank is essential, and so is selecting an apt alpha value. A good heuristic is setting alpha at twice the rank's value.
8. 7B models can be finetuned efficiently within a few hours on a single GPU possessing 14 GB of RAM.
With a static dataset, optimizing an LLM to excel across all benchmark tasks is unattainable. Addressing this requires diverse data sources, or perhaps LoRA might not be the ideal tool.

1 reply · 0 retweets · 2 likes · 3.7K views
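Points 6 and 7 of the quoted thread translate directly into a PEFT LoRA configuration. A sketch under the assumption of a Llama-style model whose projection modules carry these names; rank 16 with alpha 32 is just one instance of the alpha = 2 x rank heuristic, and the checkpoint name is a placeholder.

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder checkpoint

    lora_config = LoraConfig(
        r=16,                        # LoRA rank (point 7: worth tuning)
        lora_alpha=32,               # point 7 heuristic: alpha = 2 * rank
        target_modules=[             # point 6: all projections, not just K and V
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()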
Benjamin Bergner@bergbenj·
Visit the #ICLR poster session for a chat! Poster session 2, Mon 1 May, 16:30 - 18:30 CEST, MH1-2-3-4 #25, Kigali, Rwanda
0 replies · 0 retweets · 0 likes · 115 views