Minish

57 posts

@minishlab

Building Model2Vec, SemHash, and Vicinity. Check out our GitHub here: https://t.co/WAoFTUEQ6O. We are also on HuggingFace: https://t.co/1khDD6Y4YB

Joined January 2025
14 Following, 100 Followers

Minish @minishlab
Main features:
- Fast: indexes a full codebase in ~250 ms and answers queries in ~1.5 ms, all on CPU
- Accurate: on par with transformers
- MCP server: drop-in tool for Claude Code, Cursor, Codex, OpenCode, and any other MCP-compatible agent
- Zero setup: no API keys, no GPU 🧵

Minish @minishlab
Today we're releasing Semble, a fast and accurate code search library built for agents 🤖! We're also releasing potion-code-16M, a small code-specialized static embedding model that powers Semble. 🧵

Minish @minishlab
Model2Vec 0.7.0 is out now, along with a blog post on model size reduction techniques! This release adds several ways to improve the distillation process:
- Vocabulary quantization
- Configurable pooling
- A number of small improvements and bugfixes 🧵

Casper Hansen @casper_hansen_
btw @minishlab, would recommend adding an example like the one in my screenshot that shows how to use semhash with a Huggingface dataset

Minish retweeted
Casper Hansen @casper_hansen_
semhash is so so so convenient and fast

Minish @minishlab
We have a new website (and name): minish.ai. We’ve been working on an improved website for a while, and it’s finally here. It has documentation for all our packages as well as our blog. More things coming soon! 🚀

Minish retweeted
slm tokens @tulkenss
Some guy forked our "model2vec-rs" crate, put it under the "model2vec" name on crates.io, and then didn't tell us about it. See here: crates.io/crates/model2v… Like, what's the goal here except name squatting?

Minish @minishlab
- Smaller tokenizers: all tokenizers are now 40% smaller, at no cost to anyone. A blog post with experimental results is coming in the next couple of days.

Minish @minishlab
- Model improvements: nearly all distilled models will perform better, especially on STS and clustering tasks. A while ago we published a blog post on ModernBERT not working; we have now found out why and fixed it! 🧵

Minish @minishlab
@tomaarsen Thanks for sharing our deduplication space! For those who are interested in applying this in their own workflows, this is powered by SemHash: github.com/MinishLab/semh…

Minish retweeted
tomaarsen @tomaarsen
The deduplication Space by @minishlab just got a fresh update, allowing you to remove near duplicates in (training) datasets. Details in 🧵

Minish @minishlab
@ben_burtenshaw Thanks for sharing our deduplication space and adding some shiny new features! For those who are interested in applying this in their own workflows, this is powered by SemHash: github.com/MinishLab/semh…

Minish retweeted
Ben Burtenshaw @ben_burtenshaw
Do not sleep on deduplication! Use this FREE app for semantic deduplication of multiple massive datasets. This is how it works:
- You pick one or more datasets from the Hub
- It makes a semantic embedding of each row
- It removes near duplicates based on a threshold like 0.9
- You can push the deduplicated dataset back to a new repo and get to work.
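
The embed-then-threshold steps described in this tweet can be sketched roughly as follows. This is a minimal illustration only, not the Space's or SemHash's actual implementation: the `embed`, `cosine`, and `deduplicate` helpers are hypothetical names, and a toy bag-of-words vector stands in for a real semantic embedding model. The greedy pairwise comparison is also an assumption chosen for clarity; a real tool would likely use a vector index rather than comparing every pair.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a semantic embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(rows: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a row only if it is below the threshold against every kept row."""
    kept: list[str] = []
    kept_vecs: list[Counter] = []
    for row in rows:
        vec = embed(row)
        if all(cosine(vec, other) < threshold for other in kept_vecs):
            kept.append(row)
            kept_vecs.append(vec)
    return kept


rows = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog today",
    "static embeddings are fast on CPU",
]
deduped = deduplicate(rows, threshold=0.9)
```

With this toy data the second row is dropped as a near duplicate of the first, while raising the threshold (for example to 0.99) keeps all three rows. The threshold trades recall of duplicates against the risk of removing rows that are merely similar.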

Minish @minishlab
All you need is ~100k-1M documents, a sentence transformer, and some time, and your static model will be a lot better than before. See the snippet below for the recipe that will reproduce potion-base-8m (you should use more than 500 docs though). 🧵

Minish @minishlab
We just released tokenlearn 0.2.0! Tokenlearn is our post-distill distillation framework, and is what we use to train our potion models. 🧵