Benjamin Bergner

60 posts

@bergbenj

PhD student at HPI

Berlin · Joined July 2014
107 Following · 166 Followers
Swaroop Mishra@Swarooprm7·
We should respect DeepSeek for the great results, but don't call it a breakthrough. I also like the low-key nature of the DeepSeek squad. They have produced good reasoning models in the past, too, so it's definitely not a one-time hit. Having said this, the tech report doesn't include detailed ablations that may need careful attention from the community; for example, is GRPO really needed?
11 replies · 5 retweets · 67 likes · 15.9K views
Benjamin Bergner@bergbenj·
For verifiable rewards, how could this be scaled beyond easily verifiable math and coding problems to arbitrary tasks? Or could it be that a few math/coding problems are sufficient to learn general reasoning across tasks?
0 replies · 0 retweets · 0 likes · 39 views
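To make "verifiable rewards" concrete for readers of this thread: the reward comes from a programmatic check rather than a learned reward model. Below is a minimal, hypothetical sketch; the function names, normalization, and sandbox-free test runner are illustrative assumptions, not DeepSeek's actual reward code.

    # Hypothetical rule-based rewards: an answer is scored by a programmatic
    # check (exact match, unit tests), so no learned reward model is needed.
    import subprocess
    import sys

    def math_reward(model_answer: str, reference_answer: str) -> float:
        """1.0 if the final answer matches the reference after light normalization."""
        normalize = lambda s: s.strip().replace(" ", "").lower()
        return 1.0 if normalize(model_answer) == normalize(reference_answer) else 0.0

    def code_reward(candidate_source: str, tests: list[tuple[str, str]]) -> float:
        """Fraction of (stdin, expected_stdout) tests passed by the candidate program.
        A real setup would sandbox execution; this toy version does not."""
        passed = 0
        for stdin, expected in tests:
            try:
                out = subprocess.run([sys.executable, "-c", candidate_source],
                                     input=stdin, capture_output=True,
                                     text=True, timeout=5).stdout
                passed += out.strip() == expected.strip()
            except subprocess.TimeoutExpired:
                pass
        return passed / max(len(tests), 1)

The open question raised in the tweet is whether rewards that only exist for checkable domains like these transfer to tasks with no programmatic verifier.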
Benjamin Bergner@bergbenj·
Is it the quality of the base model? Is it the training process (RL vs. SFT)? Is it PPO vs. GRPO for RL? Is it verifiable rewards vs. using a reward model?
1 reply · 0 retweets · 0 likes · 59 views
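For the PPO-vs-GRPO part of the question: GRPO replaces PPO's learned value-function baseline with rewards normalized within a group of samples for the same prompt. A minimal sketch of that group-relative advantage, assuming scalar per-completion rewards; the full algorithm additionally uses a clipped policy-ratio objective and a KL penalty, omitted here.

    import torch

    def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        """Group-relative advantages for one prompt.

        group_rewards: shape (G,), one scalar reward per sampled completion.
        Each completion's advantage is its reward standardized against the
        group mean/std, in place of PPO's value-network baseline.
        """
        return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

    # Example: 4 completions for the same prompt with verifiable 0/1 rewards.
    rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
    print(grpo_advantages(rewards))  # correct completions get positive advantage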
Benjamin Bergner@bergbenj·
What are the main reasons why DeepSeek-R1, even the Zero version, works so well?
1 reply · 0 retweets · 0 likes · 193 views
Lucas Beyer (bl16)@giffmana·
I wrote a blog post, "On the speed of ViTs and CNNs". It addresses the following concerns I often hear:
- worry about ViT speed at high resolution
- how high a resolution do I need?
- is it super important to keep the aspect ratio?
I think @ylecun might like it too! Link below
[image attached]
24 replies · 91 retweets · 702 likes · 101.1K views
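A back-of-the-envelope illustration of the high-resolution concern (my numbers, not taken from the blog post): with 16x16 patches, token count grows quadratically with image side length, and self-attention cost grows quadratically with token count.

    def vit_attention_cost(image_size: int, patch_size: int = 16) -> tuple[int, int]:
        """Return (num_tokens, relative self-attention cost ~ tokens**2) for a square image."""
        tokens = (image_size // patch_size) ** 2
        return tokens, tokens ** 2

    base = vit_attention_cost(224)[1]
    for size in (224, 448, 896):
        tokens, cost = vit_attention_cost(size)
        print(f"{size}px: {tokens} tokens, attention cost x{cost / base:.0f}")
    # 224px: 196 tokens (x1), 448px: 784 tokens (x16), 896px: 3136 tokens (x256)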
dr. jack morris@jxmnop·
have been thinking about this a lot, and I think the most impactful topic to work on right now in AI is 𝗔𝗴𝗲𝗻𝘁𝘀

we can't just fake agency using prompting, though. we need to build it into the next generation of language models: agent-native LLMs

here's my logic:
1. it seems obvious that language models can't ever do everything in one forward pass, simply because all the information required will never be initially available. imagine trying to write a history textbook, or implement a new feature in an app, without consulting any APIs or looking anything up, storing intermediate work, etc. real-life problems require systems that don't just immediately output an answer
2. current language models are only trained to do next-token prediction, with a trivial RLHF layer on top. we can make much better agents by training them on more generic tasks from the get-go
3. this goal requires new datasets, task definitions, and training techniques, all of which we don't have yet (and I'd guess will take *years* of research)
27 replies · 20 retweets · 317 likes · 53.4K views
Benjamin Bergner@bergbenj·
@scottgeng00 Is the answer also NO if you train on much more synthetic data than you could retrieve from the original dataset?
0 replies · 0 retweets · 0 likes · 69 views
Scott Geng@scottgeng00·
Will training on AI-generated synthetic data lead to the next frontier of vision models?🤔 Our new paper suggests NO—for now. Synthetic data doesn't magically enable generalization beyond the generator's original training set. 📜: arxiv.org/abs/2406.05184 Details below🧵(1/n)
18 replies · 91 retweets · 471 likes · 111.4K views
Benjamin Bergner@bergbenj·
@ylecun Why not window attention in the first layers, followed by global attention?
0 replies · 0 retweets · 0 likes · 98 views
Yann LeCun@ylecun·
A short post on the best architectures for real-time image and video processing. TL;DR: use convolutions with stride or pooling at the low levels, and stick self-attention circuits at higher levels, where feature vectors represent objects. PS: ready to bet that Tesla FSD uses convolutions (or perhaps more complex *local* operators) at the low levels, combined with more global circuits at higher levels (perhaps using self-attention). Transformers on low-level patch embeddings are a complete waste of electrons.
Yann LeCun@ylecun

I'm not saying ViTs are not practical (we use them). I'm saying they are way too slow and inefficient to be practical for real-time processing of high-resolution images and video. [Also, @sainingxie's work on ConvNeXt has shown that they are just as good as ViTs if you do it right. But whatever.]

You need at least a few conv layers with pooling and stride before you stick self-attention circuits. Self-attention is equivariant to permutations, which is completely nonsensical for low-level image/video processing (having a single strided conv at the front end to 'patchify' also doesn't make sense). Global attention is also nonsensical (and not scalable), since correlations are highly local in images and video.

At the high levels, once features represent objects, it makes sense to use self-attention circuits: what matters is the relationships and interactions between objects, not their positions. This type of hybrid architecture was inaugurated by the DETR system by @alcinos26 and collaborators. As I've said since the DETR work, my favorite family of architectures is conv/stride/pooling at the lower levels, and self-attention circuits at the higher levels.

61 replies · 110 retweets · 1.4K likes · 748.4K views
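A minimal PyTorch sketch of the hybrid layout described in the thread above: strided convolutions downsample at the low levels, and self-attention only runs on the resulting coarse grid of high-level feature vectors. Layer widths and depths are arbitrary placeholders, not DETR's or any production architecture.

    import torch
    import torch.nn as nn

    class HybridBackbone(nn.Module):
        """Conv stem with stride at low levels, global self-attention at high levels."""
        def __init__(self, dim: int = 256, num_heads: int = 8, depth: int = 4):
            super().__init__()
            self.stem = nn.Sequential(                 # 3 x H x W -> dim x H/16 x W/16
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(256, dim, 3, stride=2, padding=1),
            )
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        def forward(self, images: torch.Tensor) -> torch.Tensor:
            feats = self.stem(images)                  # (B, dim, H/16, W/16)
            tokens = feats.flatten(2).transpose(1, 2)  # (B, N, dim), N = H/16 * W/16
            return self.encoder(tokens)                # attention over object-level features only

    x = torch.randn(1, 3, 512, 512)
    print(HybridBackbone()(x).shape)                   # torch.Size([1, 1024, 256])

Windowed attention in the early layers, as asked above, would be another way to keep early-stage cost local; this sketch simply uses convolutions for that role.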
Benjamin Bergner@bergbenj·
@mkwng cool idea. just wanted to try on your linked website but got an error when inserting a random location+distance: "An error occurred while generating the trail. Please try again."
1 reply · 0 retweets · 3 likes · 502 views
michael@mkwng·
A friend was asking our group chat for any apps that can take a starting location and generate random running trails. No one had a good answer. So, I fired up claude.ai, google colab, and repl.it and screen-recorded myself whipping together a UI to do exactly that in < 30 minutes. It's live here: trail-generator.replit.app
36 replies · 77 retweets · 946 likes · 315.1K views
dr. jack morris@jxmnop·
is anyone still using encoder-decoder models (T5, BART, etc.)? if so -- for what?
87 replies · 15 retweets · 390 likes · 198.7K views
Benjamin Bergner retweeted
Andrii Skliar 🇺🇦@avskliar·
🚀 Excited to share our latest work "Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding" now on arXiv! We're taking strides in making language models faster & more efficient on text generation tasks like translation & summarization.🔍 [arxiv.org/abs/2402.16844]
2 replies · 14 retweets · 52 likes · 10.9K views
Lucas Beyer (bl16)@giffmana·
🧶PaLI-3 achieves SOTA across many vision-language (and video!) tasks while being 10x smaller than its predecessor PaLI-X. At only 5B parameters, it's also smaller (and stronger) than the concurrent Fuyu-8B model, though sadly we cannot release the model (props to @AdeptAILabs)
[image attached]
8 replies · 59 retweets · 430 likes · 122.8K views
Benjamin Bergner@bergbenj·
Regarding point 8: Doesn't memory usage/training time depend on sequence length and training dataset? What have you used for your reported numbers?
Sebastian Raschka@rasbt

I ran hundreds if not thousands of LoRA & QLoRA experiments to finetune open-source LLMs, and here's what I learned:
1. Despite the inherent randomness of LLM training (or when training models on GPUs in general), the outcomes remain remarkably consistent across multiple runs.
2. QLoRA presents a trade-off that might be worthwhile if you're constrained by GPU memory. It offers 33% memory savings at the cost of a 33% increase in runtime.
3. When finetuning LLMs, the choice of optimizer shouldn't be a major concern. While SGD on its own is suboptimal, there's minimal variation in outcomes whether you employ AdamW, SGD with a scheduler, or AdamW with a scheduler.
4. While Adam is often labeled a memory-intensive optimizer due to its introduction of two new parameters for every model parameter, this doesn't significantly affect the peak memory demands of the LLM. This is because the majority of the memory is allocated for large matrix multiplications rather than retaining extra parameters.
5. For static datasets, iterating multiple times as done in multi-epoch training might not be beneficial. It often deteriorates the results, probably due to overfitting.
6. If you're incorporating LoRA, ensure it's applied across all layers, not just to the Key and Value matrices, to maximize model performance.
7. Adjusting the LoRA rank is essential, and so is selecting an apt alpha value. A good heuristic is setting alpha at twice the rank's value.
8. 7B models can be finetuned efficiently within a few hours on a single GPU possessing 14 GB of RAM.
With a static dataset, optimizing an LLM to excel across all benchmark tasks is unattainable. Addressing this requires diverse data sources, or perhaps LoRA might not be the ideal tool.

1 reply · 0 retweets · 2 likes · 3.7K views
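Points 6 and 7 of the quoted thread translate directly into a PEFT LoRA configuration. A sketch under the assumption of a Llama-style model whose projection modules carry these names; rank 16 with alpha 32 is just one instance of the alpha = 2 x rank heuristic, and the checkpoint name is a placeholder.

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder checkpoint

    lora_config = LoraConfig(
        r=16,                        # LoRA rank (point 7: worth tuning)
        lora_alpha=32,               # point 7 heuristic: alpha = 2 * rank
        target_modules=[             # point 6: all projections, not just K and V
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()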
Benjamin Bergner@bergbenj·
Visit the #ICLR poster session for a chat! Poster session 2, Mon 1 May, 16:30 - 18:30 CEST, MH1-2-3-4 #25, Kigali, Rwanda
0 replies · 0 retweets · 0 likes · 115 views