Minish

57 posts

@minishlab

Building Model2Vec, SemHash, and Vicinity. Check out our GitHub here: https://t.co/WAoFTUEQ6O. We are also on HuggingFace: https://t.co/1khDD6Y4YB

Joined January 2025

14 Following · 100 Followers

Minish@minishlab·28 Apr

Semble: github.com/MinishLab/semb… Benchmarks: github.com/MinishLab/semb… How it works: github.com/MinishLab/semb… Model: huggingface.co/minishlab/poti…

Minish@minishlab·28 Apr

Main features:
- Fast: indexes a full codebase in ~250 ms and answers queries in ~1.5 ms, all on CPU
- Accurate: on par with transformers
- MCP server: drop-in tool for Claude Code, Cursor, Codex, OpenCode, and any other MCP-compatible agent
- Zero setup: no API keys, no GPU 🧵

Minish@minishlab·28 Apr

Today we're releasing Semble, a fast and accurate code search library built for agents 🤖! We're also releasing potion-code-16M, a small code-specialized static embedding model that powers Semble. 🧵

Minish@minishlab·6 Oct

We also have a new blogpost on model size reduction where we showcase how to reduce model size by a factor of 15, creating a 6MB model (!) without impacting performance. Links: Release notes: github.com/MinishLab/mode… Blogpost: minish.ai/blog/2025-10-0…

Minish@minishlab·6 Oct

Model2Vec 0.7.0 is out now, as well as a blogpost on model size reduction techniques! This release features a large number of ways to improve the distillation process.
- Vocabulary quantization
- Configurable pooling
- A number of small improvements and bugfixes 🧵

Minish@minishlab·24 Jul

@casper_hansen_ Thanks for the feedback (and for using SemHash)! That's a good idea, we can add something to our readme. There's also a HF space where you can use it directly on the hub: huggingface.co/spaces/minishl…

Casper Hansen@casper_hansen_·24 Jul

btw @minishlab, would recommend adding an example like the one in my screenshot that shows how to use semhash with a Huggingface dataset

Minish reposted

Casper Hansen@casper_hansen_·23 Jul

semhash is so so so convenient and fast

Minish@minishlab·12 Jun

We have a new website (and name): minish.ai We’ve been working on an improved website for a while, and it’s finally here. It has documentation for all our packages as well as our blog. More things coming soon! 🚀

Minish reposted

slm tokens@tulkenss·4 Jun

Some guy forked our "model2vec-rs" crate, and put it under the "model2vec" name on crates io and then didn't tell us about it. See here: crates.io/crates/model2v… Like what's the goal here except name squatting.

Minish@minishlab·4 Jun

- Smaller tokenizers: all tokenizers are now 40% smaller, at no cost to anyone. A blog post with experimental results is coming in the next couple of days.

Minish@minishlab·4 Jun

- Model improvements: nearly all distilled models will perform better, especially in STS and clustering tasks. A while ago we published a blog post on ModernBERT not working, but we have now found out why, and fixed it! 🧵

Minish@minishlab·4 Jun

We just released model2vec 0.6.0! This is a big release, containing many big improvements 🔥 GitHub release: github.com/MinishLab/mode… PyPI release: pypi.org/project/model2… 🧵

Minish@minishlab·3 Jun

@tomaarsen Thanks for sharing our deduplication space! For those who are interested in applying this in their own workflows, this is powered by SemHash: github.com/MinishLab/semh…

Minish reposted

tomaarsen@tomaarsen·3 Jun

The deduplication Space by @minishlab just got a fresh update, allowing you to remove near duplicates in (training) datasets. Details in 🧵

Minish@minishlab·3 Jun

@ben_burtenshaw Thanks for sharing our deduplication space and adding some shiny new features! For those who are interested in applying this in their own workflows, this is powered by SemHash: github.com/MinishLab/semh…

Minish reposted

Ben Burtenshaw@ben_burtenshaw·3 Jun

Do not sleep on deduplication! Use this FREE app for semantic deduplication of multiple massive datasets. This is how it works:
- You pick one or more datasets from the Hub
- It makes a semantic embedding of each row
- It removes near duplicates based on a threshold like 0.9
- You can push the deduplicated dataset back to a new repo, and get to work.
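
For readers who want to see the threshold step spelled out, here is a rough sketch of the idea in plain Python. It is not how the Space or SemHash is implemented: the `deduplicate` helper, the use of cosine similarity as the measure, the greedy pairwise scan, and the toy 3-dimensional vectors are all assumptions made for illustration. In practice the embeddings would come from an embedding model, and pairwise comparison would be too slow for massive datasets.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deduplicate(rows, embeddings, threshold=0.9):
    """Hypothetical helper: keep a row only if its embedding is below the
    similarity threshold against every row that has already been kept."""
    kept_rows, kept_vecs = [], []
    for row, vec in zip(rows, embeddings):
        if all(cosine(vec, kept) < threshold for kept in kept_vecs):
            kept_rows.append(row)
            kept_vecs.append(vec)
    return kept_rows


# Toy data: these vectors are invented, not produced by any real model.
rows = ["the cat sat", "the cat sat down", "stock prices fell"]
embeddings = [
    [1.0, 0.0, 0.1],
    [0.98, 0.05, 0.12],  # nearly parallel to the first vector
    [0.0, 1.0, 0.0],     # orthogonal to the first vector
]

unique_rows = deduplicate(rows, embeddings, threshold=0.9)
print(unique_rows)
```

With a threshold of 0.9, the second row is dropped as a near duplicate of the first, while the third row is kept. Raising the threshold keeps more rows; lowering it removes more aggressively.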

Minish@minishlab·30 May

Release notes: github.com/MinishLab/toke…

Minish@minishlab·30 May

All you need is ~100k-1M documents, a sentence transformer, and some time, and your static model will be a lot better than before. See the snippet below for the recipe that will reproduce potion-base-8m (you should use more than 500 docs though). 🧵

Minish@minishlab·30 May

We just released tokenlearn 0.2.0! Tokenlearn is our post-distill distillation framework, and is what we use to train our potion models. 🧵
