gautham

397 posts

gautham

@capemox

distributed search @couchbase | @blevesearch member, @huggingface contributor | contrastive loss enjoyer

Bengaluru, India · Joined July 2023

434 Following · 85 Followers

Pinned Tweet

gautham@capemox·1 May

Pretty big release of bleve out! GPU accelerated vector search (I helped on this hehe), binary quantization and a lot more! github.com/blevesearch/bl…

gautham@capemox·4h

Qwen2.5-7B and GPT-3.5-turbo in 2026?

Garry Tan@garrytan

Qwen2.5‑7B Instruct apparently is GPT-3.5-turbo level

gautham@capemox·5h

@lateinteraction fill in the slop

Omar Khattab@lateinteraction·5h

your novel idea, when you ask an llm to fill in the details

Drew Breunig@dbreunig

We need a name for this, because Armin is putting his finger on a problem that’s everywhere: people running their writing through an LLM because they think it makes it clearer, when in actuality it sands off all the detail.

gautham@capemox·7h

@drexalt @mattjustram @yjoonjang DupMAE has a technique to train both the CLS as well as the other regular tokens. I agree that the CLS training may not be useful at all, but the others might be?

jonah@drexalt·9h

@capemox @mattjustram @yjoonjang I am pretty sure it would not be, since ColBERT isn’t using the MLM head anymore and the CLS token-maxxing doesn’t seem super aligned. I think the LateOn contrastive pretrain is probably stronger

gautham@capemox·1d

Wow ettin does NOT like being fine-tuned for splade. Distilling didn't work at all: it's either representation collapse or not sparse enough. Trying a few different loss functions now, I've tried MarginMSE so far
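
For readers unfamiliar with the loss named in the post above: MarginMSE distillation trains a student so that its score margin between a positive and a negative passage matches a teacher's margin. The following is a minimal pure-Python sketch of that idea, not the code used in the run described here; the function name `margin_mse` and the plain-list inputs are assumptions made for illustration.

```python
def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared error between student and teacher score margins.

    Each argument is a list of relevance scores, one per
    (query, positive, negative) triple. Hypothetical helper for illustration.
    """
    lengths = {len(student_pos), len(student_neg), len(teacher_pos), len(teacher_neg)}
    if len(lengths) != 1:
        raise ValueError("all score lists must have the same length")
    if not student_pos:
        raise ValueError("need at least one (query, positive, negative) triple")

    total = 0.0
    for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg):
        student_margin = sp - sn  # how much the student prefers the positive
        teacher_margin = tp - tn  # how much the teacher prefers the positive
        total += (student_margin - teacher_margin) ** 2
    return total / len(student_pos)


if __name__ == "__main__":
    # Student and teacher agree on the margin, so the loss is zero.
    print(margin_mse([2.0], [1.0], [5.0], [4.0]))
```

Note that this loss only constrains margins. It says nothing about how sparse the student's representations are, which is why SPLADE-style training usually pairs it with a separate sparsity regularizer that this sketch leaves out.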

gautham@capemox·13h

Would have loved to be more aggressive in terms of batch size and amount of data but I'm kinda GPU poor (my poor RTX 4060m)

gautham@capemox·14h

huggingface.co/capemox/ettin-… Trained on subsets of GooAQ, Wikipedia Sections and NQ for 2.5 million total data points

gautham@capemox·14h

I've released an ettin 68m checkpoint with contrastive pretraining! It's a much better starting point for downstream embedding tasks. Will be releasing similar checkpoints for the other smaller encoders in ettin. Check out the pretrain in the link in the comments

gautham@capemox·14h

Can't believe they're following the Samsung Galaxy naming scheme lmao

gautham@capemox·14h

@antoine_chaffin @mattjustram Got confused by the language in the paper :/ I think I was overeager to assign blame to MNTP

Antoine Chaffin@antoine_chaffin·15h

@capemox @mattjustram This is false though. We trained both « pure » encoders and « pure » decoders. We also released decoders turned into encoders through MNTP continued pre-training, but those aren’t the base encoders

gautham@capemox·18h

@drexalt @mattjustram @yjoonjang btw @drexalt have you tried dupmae with colbert? curious if it's a good conditioning for multivector

jonah@drexalt·1d

@capemox @mattjustram this might serve as a jumping off point from @yjoonjang, I tried DupMAE on NeoBERT and had good results github.com/yjoonjang/Mode…

gautham@capemox·1d

@antoine_chaffin @bclavie Thanks for the share!

Antoine Chaffin@antoine_chaffin·1d

@bclavie @capemox btw @capemox I think this is relevant to make ModernBERT work with sparse: arxiv.org/abs/2601.17500

gautham@capemox·1d

@TheAdamEvans @yjoonjang Thanks a lot!

adamame 🌾@TheAdamEvans·1d

@capemox creates a much smaller, fixed, (sparse) dim to work with downstream, basically turns it into an encoder for you. might have better isotropic properties and behave better for your use case 🤷‍♂️ gonna read those papers @yjoonjang linked, seems like that's pretty much it

gautham@capemox·1d

@yjoonjang @TheAdamEvans Yeah for sure. I'll start with ettin models first

Youngjoon Jang@yjoonjang·1d

@TheAdamEvans @capemox Yeah that's what the paper says. But if you start with qwen, I think you'll need MNTP training for bidirectional attention (that's what the authors did)

gautham@capemox·1d

@drexalt @mattjustram @yjoonjang I already have a "pretraining" library that I was planning to open source lmao, didn't know this existed

gautham@capemox·1d

@TheAdamEvans zamn, tell me more. basically, layer pruning qwen3.5-2b right? but why would qwen scope be useful, that's for interpretability right?

adamame 🌾@TheAdamEvans·1d

@capemox what if you used Qwen3.5-2B W32K-L0_100 at layer 11 or 15 as a starting point? not exactly tiny but not huge either, and qwen scope outputs are oh-so-smooth semantically

gautham@capemox·1d

@mattjustram Wow I totally missed this lmao. You're right, this could defo be the issue. I might try some dupmae style or contrastive pretraining to see if it helps

Jheng-Hong Yang@mattjustram·1d

@capemox i just realized they further trained the encoders with MNTP. i think it ruined the MLM init for SPLADE

gautham@capemox·1d

@mattjustram encoders. I've tried the smaller ones, up to 150m params

Jheng-Hong Yang@mattjustram·1d

@capemox encoder? decoder?

gautham@capemox·1d

With CC now having auto mode on pro, I can just let it loop every 5 mins to check if the run is going well. It just wished me good night.

gautham@capemox·1d

Workshop I wanted to submit to doesn't have remote attendance
