Superlinked

651 posts


Superlinked

@superlinked

Self-hosted inference for search & document processing.

Joined September 2019
309 Following · 673 Followers
Superlinked
Superlinked@superlinked·
We just launched native @Weaviate support for the Superlinked Inference Engine. The interesting one here is SIEDocumentEnricher. It combines embedding with entity extraction and classification at index time, which means Weaviate's Query Agent gets a rich metadata surface to work with. So a natural language query like "show me legal documents mentioning Google" resolves into the right vector search plus filters automatically. pip install sie-weaviate Check it out in our docs: buff.ly/hAjuC1M
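For context on what that resolved query can look like on the Weaviate side, here is a rough sketch using the Weaviate v4 Python client. The collection name and the "doc_type"/"entities" properties are assumptions for illustration; the SIEDocumentEnricher API itself is documented at the link above and not shown here.

```python
# Hypothetical illustration: once documents are indexed with entity/classification
# metadata, "legal documents mentioning Google" can resolve to vector search + filters.
import weaviate
from weaviate.classes.query import Filter

client = weaviate.connect_to_local()
docs = client.collections.get("Documents")           # collection name is made up

results = docs.query.near_text(
    query="legal documents mentioning Google",        # semantic part of the query
    filters=(
        Filter.by_property("doc_type").equal("legal")               # from classification
        & Filter.by_property("entities").contains_any(["Google"])   # from entity extraction
    ),
    limit=10,
)
client.close()
```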
Superlinked reposted
Antoine Chaffin
Antoine Chaffin@antoine_chaffin·
Just watched this talk from @f_makraduli. It is very interesting because it highlights a key takeaway: specialised models outperform LLMs for specific tasks (routing, retrieval, reranking), but serving them is challenging because it is less explored and there are a lot of different models/architectures/inputs/outputs. As someone who worked on serving my two loved ones (ModernBERT and ColBERT), this resonates!
Superlinked@superlinked

Most embedding infrastructure assumes you know exactly which model you want ahead of time. This talk starts where that assumption breaks. @f_makraduli walks through the real profiling mistakes, infrastructure gaps, and production constraints that led to building an embedding inference engine designed for dynamic model loading, hot-swapping, and memory-aware eviction instead of brittle one-model-per-container deployments. If you're working on small-model inference, embeddings, or GPU infrastructure, this is a practical look at what breaks in the real world and how to design around it. Check it out here: buff.ly/S1HZCZB Dive into the SIE repo here: buff.ly/EBnNglg

Superlinked
Superlinked@superlinked·
Most embedding infrastructure assumes you know exactly which model you want ahead of time. This talk starts where that assumption breaks. @f_makraduli walks through the real profiling mistakes, infrastructure gaps, and production constraints that led to building an embedding inference engine designed for dynamic model loading, hot-swapping, and memory-aware eviction instead of brittle one-model-per-container deployments. If you're working on small-model inference, embeddings, or GPU infrastructure, this is a practical look at what breaks in the real world and how to design around it. Check it out here: buff.ly/S1HZCZB Dive into the SIE repo here: buff.ly/EBnNglg
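To make "memory-aware eviction" concrete, here is a minimal sketch of the general pattern (not SIE's actual implementation): models are loaded on demand, and the least recently used ones are evicted whenever a memory budget would be exceeded.

```python
from collections import OrderedDict

class ModelCache:
    """Toy memory-aware model cache: load on demand, evict the least recently
    used models whenever the configured memory budget would be exceeded."""

    def __init__(self, budget_mb: int, loader):
        self.budget_mb = budget_mb
        self.loader = loader            # callable: name -> (model, size_mb)
        self.models = OrderedDict()     # name -> (model, size_mb), kept in LRU order
        self.used_mb = 0

    def get(self, name: str):
        if name in self.models:
            self.models.move_to_end(name)          # mark as most recently used
            return self.models[name][0]
        model, size_mb = self.loader(name)         # e.g. pull weights from HuggingFace
        while self.models and self.used_mb + size_mb > self.budget_mb:
            _, (_, freed_mb) = self.models.popitem(last=False)  # evict LRU model
            self.used_mb -= freed_mb
        self.models[name] = (model, size_mb)
        self.used_mb += size_mb
        return model
```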
Superlinked
Superlinked@superlinked·
We're now a native Haystack integration. The sie-haystack package gives you SIE embedders (dense, sparse, ColBERT, image), cross-encoder rerankers, and zero-shot extractors as first-class Haystack 2.0 components. Everything routes through one endpoint, so you can build a full RAG pipeline, swap models with a config change, and not spin up new infrastructure for each one. pip install sie-haystack Check it out in our docs: buff.ly/be2Crv7
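As a sketch of what "first-class Haystack 2.0 components" means in practice: the Pipeline wiring below is standard Haystack 2.x, while the SIE component names and constructor arguments are placeholders, so check the linked docs for the real ones.

```python
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

# Placeholder component names/args -- consult the sie-haystack docs for the real API.
from sie_haystack import SIETextEmbedder, SIERanker

store = InMemoryDocumentStore()

pipe = Pipeline()
pipe.add_component("embedder", SIETextEmbedder(url="http://localhost:8080",
                                               model="BAAI/bge-small-en-v1.5"))
pipe.add_component("retriever", InMemoryEmbeddingRetriever(document_store=store))
pipe.add_component("reranker", SIERanker(url="http://localhost:8080",
                                         model="cross-encoder/ms-marco-MiniLM-L-6-v2"))

pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever.documents", "reranker.documents")

query = "how do I rotate my API keys?"
result = pipe.run({"embedder": {"text": query}, "reranker": {"query": query}})
```

Under this sketch, switching the dense embedder for a sparse or ColBERT one is a one-line model change rather than new infrastructure.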
Superlinked
Superlinked@superlinked·
We sometimes hear clients talking about running 700B parameter models, but most AI tasks don’t actually need them! A huge amount of real-world work can be done with small, task-specific models. Instead of forcing one giant model to do everything, you combine a few specialized models together to solve the problem. Because these models are only a few billion parameters, they fit comfortably on standard 16–24GB GPUs. That means lower latency, dramatically lower cost, and infrastructure that is much easier to run in your own cloud. This shift toward Small Language Models is a big part of what we discuss in our latest guest appearance on AI Powered Search, where @svonava gives a preview of the Superlinked Inference Engine and how we think about running many models in production. If you are building AI systems today, it is worth asking whether the biggest model is really the right tool for the job. Thanks to @treygrainer and @softwaredoug for having us!
Superlinked
Superlinked@superlinked·
How long does it take your team to get a new model into production? If the answer is anything more than a config change, @f_makraduli 's talk at AI Engineer Europe is worth your time. On April 10th, Filip will walk through the small-model infrastructure problem we kept finding one layer deeper than expected, and what we built in response. The short version: five small models should not require five GPUs running at single-digit utilisation. A new model on HuggingFace should not require days of Docker builds and infra tickets. And the gap between a working model server and a production system that scales, monitors itself and costs nothing when idle should not require months of in-house work. We fixed all three. Come find us in London to find out how. buff.ly/nnWlpYU #AIEngineer #Embeddings #MLOps #OpenSource #Superlinked
Superlinked
Superlinked@superlinked·
We would like to announce that our co-founders Daniel and Ben have launched a side hustle. SUPER INKED Tattoo Studio will be opening its books to paying customers April 1st at 12pm PST. That’s right, they dropped the L, because in this business *we don’t take no Ls.* Ben has 1 month of experience with a tattoo gun and has been using Daniel as a test-dummy on a daily basis. We call his technique “vibe tatting” and so far the results have been great, enabling Daniel to fulfill his dream of having an entire arm sleeve of vague illegible scribbles. SUPER INKED can be trusted to produce the best quality, definitely not AI-generated flash sheets, with designs that will be professionally embedded, just like a vector (but more permanent) into your skin by one of our tech-team-turned-tattooists. Like this post for 50% off face tattoos!
Superlinked
Superlinked@superlinked·
Right now SO many companies are paying per token for LLM APIs. At scale, that gets expensive very quickly. What’s interesting is that in many cases there are open models with similar capabilities that you can run yourself. The difference is that instead of paying per token, you are paying for GPU infrastructure. The gap between those two pricing models can easily be one or two orders of magnitude. That is why more teams are starting to look seriously at self-hosting. If you can run the models reliably in your own environment, the cost savings become hard to ignore. @Svonava talks about this shift and why infrastructure for running many specialized models efficiently is becoming an important part of modern AI systems.
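A back-of-envelope comparison of the two pricing models; every number below is an assumption, so plug in your own workload and prices.

```python
# Illustrative cost comparison: hosted embedding API (per token) vs. a self-hosted
# small model on a rented GPU. All numbers are assumptions, not quoted prices.
api_price_per_m_tokens = 0.10      # $ per 1M tokens for a hosted API (assumed)
gpu_hourly_cost = 1.20             # $ per hour for a 24GB cloud GPU (assumed)
gpu_tokens_per_second = 50_000     # sustained throughput of a small model (assumed)

monthly_tokens = 10_000_000_000    # 10B tokens per month (assumed workload)

api_cost = monthly_tokens / 1e6 * api_price_per_m_tokens
gpu_hours = monthly_tokens / gpu_tokens_per_second / 3600
gpu_cost = gpu_hours * gpu_hourly_cost

print(f"Hosted API:  ${api_cost:,.0f} / month")
print(f"Self-hosted: ${gpu_cost:,.0f} / month  (~{api_cost / gpu_cost:.0f}x cheaper)")
```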
Superlinked
Superlinked@superlinked·
Self-hosting small models is an increasingly hot topic of late, but where's the evidence? The team was in Belgrade last week, presenting alongside @TopK and @Perplexity, answering that exact question. @f_makraduli presented "The Case for Self-Hosting Small Models". *TLDR: Small models are quietly winning in production AI.* Open source has exploded to over 2.6M models, and open-weight systems are now only about 1 to 3 months behind proprietary frontier models. In some cases, they already match top-tier performance at a fraction of the cost. At the same time, task-specific models consistently outperform general LLMs where it matters. They are faster, cheaper, easier to run, and trained on more relevant data. That is why they power things like search, ranking, and extraction in real systems today. It appears the future is not one giant model, but many smaller models doing specific jobs to a better standard. Thanks to @KayaVC for the invite!
Superlinked
Superlinked@superlinked·
GPUs can deliver hundreds of TFLOPS, so why are they often underutilised during inference? Because the real constraint is often memory bandwidth, not compute. With small batches, GPUs spend much of their time waiting for data to move through memory. The compute cores sit idle because weights and activations cannot be fetched fast enough. Increase the batch size and things start to change. Memory access becomes more efficient, the GPU stays busy doing matrix multiplications, and the bottleneck shifts from memory bandwidth to raw compute. That transition is key to understanding why batching matters so much for inference performance. Filip's article breaks down this shift clearly and explains how it shapes real world GPU utilization. Check it out here: buff.ly/E0dbSHD
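A rough way to see the shift: estimate arithmetic intensity (FLOPs per byte moved) for a single linear layer at different batch sizes. The layer size and hardware threshold below are illustrative assumptions.

```python
# Arithmetic intensity of one FP16 linear layer (d x d weights) at varying batch sizes.
def arithmetic_intensity(batch: int, d: int = 4096, bytes_per_el: int = 2) -> float:
    flops = 2 * batch * d * d                              # multiply-accumulate count
    bytes_moved = bytes_per_el * (d * d + 2 * batch * d)   # weights + activations in/out
    return flops / bytes_moved

for b in (1, 8, 64, 512):
    print(f"batch {b:>3}: {arithmetic_intensity(b):7.1f} FLOPs/byte")

# A modern datacenter GPU needs very roughly ~100+ FLOPs/byte before compute, rather
# than memory bandwidth, becomes the bottleneck -- batch 1 is nowhere close.
```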
Superlinked
Superlinked@superlinked·
If you are running search or large scale data processing, you have probably experienced:
- Rising API costs.
- Experimenting until something breaks in production.
- Memory constraints and throughput ceilings that block real workloads.
We're working on an alternative... On Feb 27 at 4 PM GMT, @Svonava will preview the Superlinked Inference Engine, our open source software for running Small Language Models in your own cloud. Join us on Maven Live for Optimizing Search & Data Processing with Self-hosted SLMs. We’ll cover:
• When SLMs beat LLMs for search and data tasks
• How to support 35+ model architectures and LoRAs in production
• Designing a multi model cluster pushing 1M tokens per second
• How teams cut 95%+ of managed API costs
Daniel will be joining AI-Search masterminds @treygrainger and @softwaredoug for this free lightning lesson! Join us here: buff.ly/amM5yBI
Superlinked
Superlinked@superlinked·
“System X is fast because it’s written in Rust.” Is this true 100% of the time? Most people assume embedding inference speed comes down to the code they write. Python versus Rust, frameworks, etc. In practice, almost none of that is decisive. What really affects embedding latency is memory. GPUs are extremely fast at calculations but comparatively slow at moving data. Generating an embedding is mostly about reading and writing large model weights and intermediate tensors instead of crunching numbers. That is why techniques like Flash Attention (used by the popular inference server TEI) matter. They reorganise computation so more work stays in fast on-chip cache instead of repeatedly hitting slower GPU memory. Quantisation helps for the same reason. Smaller weights mean less data to move. If you want faster embeddings, start thinking about memory, cache locality, and data movement to realise some actual gains. Or better yet, read Filip’s full deep-dive on the matter here: buff.ly/Kq1y8kZ
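One way to feel how memory-bound this is: a lower bound on latency just from streaming the weights once per forward pass, at different precisions. The model size and bandwidth figures are assumptions.

```python
# Lower bound from weight movement alone (ignores activations, attention, kernels).
params = 110e6        # ~110M-parameter embedding model, BERT-base scale (assumed)
bandwidth = 2e12      # ~2 TB/s of GPU memory bandwidth, A100-class (assumed)

for precision, bytes_per_weight in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    weight_bytes = params * bytes_per_weight
    print(f"{precision}: {weight_bytes / 1e6:5.0f} MB of weights -> "
          f">= {weight_bytes / bandwidth * 1e3:.2f} ms just to read them")
```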
Superlinked
Superlinked@superlinked·
Using open-source solutions to productionise your embeddings can get you a long way, but the efficiency problem that faces ML and AI Engineers still needs solving…
* Some models can generate dense, sparse, and multi vector embeddings in one pass, but today you usually need multiple API calls because these outputs are handled separately.
* Running and testing multiple models in production is costly and complex, with limited support for serving many models efficiently when VRAM is constrained.
* Differences in embeddings, pooling strategies, and model quirks require careful handling by users, and current systems lack flexible ways to support new model types without code changes.
@f_makraduli takes a deep dive into the existing open source inference solutions, what they do well, and what they’re ultimately missing to make everyone’s jobs easier (and to get the most out of your GPUs). Check out the article here: buff.ly/U4bsFOB
Superlinked
Superlinked@superlinked·
Problems with your text-embedding models? Filip explains the common issues with the traditional approach to search + embeddings. Superlinked has a smarter approach, using a MIXTURE of embeddings instead. Check out the video to find out more.
Superlinked
Superlinked@superlinked·
Think you know the vector embeddings space well? Think again! Your embeddings are wrong! @Svonava will open the hood on today’s “state-of-the-art” text and image embeddings at GenAI Week 2025, Silicon Valley on Thursday 17 July, 2:00 – 2:40 PM (PT).
Why attend?
• See the breaking point: examples of pre-trained embeddings failing on tasks that look trivial on paper.
• Learn how the big players fix it: a peek into FAANG-style models that fuse dozens of real-world signals (price, location, co-purchase graphs, margins and more).
• Walk away with a blueprint: a Mixture-of-Encoders strategy you can replicate without a research lab.
• Two case studies: a fashion retailer that unlocked seven-figure incremental revenue, and a jobs marketplace that boosted matching quality while cutting infra costs.
If you build search, recommendations or retrieval pipelines, this session will save you months of trial and error. 👉 Register here buff.ly/Q6wTVyG and add our keynote “Your Embeddings Are Wrong” to your schedule. Follow Daniel for the chance to get your hands on free tickets to the conference. See you in Santa Clara! 🎟️ #AI #GenAI #VectorSearch #RecommenderSystems #MachineLearning
Superlinked reposted
Qdrant
Qdrant@qdrant_engine·
Smarter search ≠ more keywords. It means understanding meaning, filtering fast, and ranking by real intent. 🚀 On May 15th we’re live with @superlinked showing how it all works in production with @qdrant_engine. 👉 RSVP: lu.ma/p30sy66f
Superlinked
Superlinked@superlinked·
Multi-vector is a loaded term these days. It can mean late-interaction-compatible representations, where one model's output is not pooled into a single vector and we instead use a set of vectors to represent the given (usually string) input. This is done to increase the accuracy of representing that one aspect of our data. It can also mean, as in our case, the ability to run different models to capture *different* aspects of the data, each producing one or multiple vectors. We do this to encode numerical/geospatial and other properties alongside the textual/image properties of our data objects. Obviously both approaches can be combined for maximum retrieval accuracy and control :-)
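A toy sketch of the second sense (different models for different aspects, combined at query time); the encoders, dimensions and weights here are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    return v / np.linalg.norm(v)

# Each aspect of an item gets its own space: text semantics, recency, price.
doc = {
    "text":    normalize(rng.normal(size=384)),  # stand-in for a text embedding
    "recency": np.array([0.9]),                  # scaled freshness signal
    "price":   np.array([0.2]),                  # scaled price signal
}
query = {
    "text":    normalize(rng.normal(size=384)),
    "recency": np.array([1.0]),                  # prefer fresh items
    "price":   np.array([0.1]),                  # prefer cheap items
}
weights = {"text": 0.7, "recency": 0.2, "price": 0.1}

# Per-space similarities are weighted and summed into one retrieval score.
score = sum(w * float(doc[k] @ query[k]) for k, w in weights.items())
print(f"combined score: {score:.3f}")
```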
GeraDeluxer
GeraDeluxer@GeraDeluxer·
A very good example of this is @superlinked
Weaviate AI Database@weaviate_io

Traditional vector embeddings represent entire documents as single vectors. But what if we could capture more nuanced relationships? Enter 𝗺𝘂𝗹𝘁𝗶-𝘃𝗲𝗰𝘁𝗼𝗿 𝗲𝗺𝗯𝗲𝗱𝗱𝗶𝗻𝗴𝘀.
𝗪𝗵𝗮𝘁 𝗮𝗿𝗲 𝘁𝗵𝗲𝘆? Instead of one vector per document, multi-vector embeddings (like ColBERT) represent each document with multiple vectors. For example:
• Single vector: [0.0412, 0.1056, 0.5021,...]
• Multi-vector: [[0.0543,...], [0.0123,...], [0.4299,...]]
𝗪𝗵𝘆 𝗮𝗿𝗲 𝘁𝗵𝗲𝘆 𝗽𝗼𝘄𝗲𝗿𝗳𝘂𝗹? Multi-vector embeddings enable "late interaction" - a technique that matches individual parts of texts rather than comparing them as whole units. This preserves fine-grained meaning and enables more precise matching.
𝗛𝗼𝘄 𝗶𝘁 𝘄𝗼𝗿𝗸𝘀:
1. Each token/part of text gets its own vector
2. During a search, each query vector finds its best match in the document
3. Individual matches are combined for a final similarity score
𝗞𝗲𝘆 𝗕𝗲𝗻𝗲𝗳𝗶𝘁𝘀:
• Better handling of word order
• More precise phrase matching
• Improved search accuracy for longer texts
𝗧𝗿𝗮𝗱𝗲-𝗼𝗳𝗳𝘀 𝘁𝗼 𝗖𝗼𝗻𝘀𝗶𝗱𝗲𝗿:
- Generally larger sizes (longer text ➡️ larger vectors)
- Higher memory & storage costs
- Increased inference & search time
𝗜𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻: Weaviate v1.30 now supports multi-vector embeddings for production environments through:
1. ColBERT model integration (via @JinaAI_ )
2. Custom multi-vector embeddings
3. Quantization techniques for multi-vector embeddings
Want to learn more? Join our upcoming technical session: lu.ma/weaviate-relea…
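For the "How it works" part, a minimal late-interaction (MaxSim-style) scoring sketch with random stand-in token vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
query_vecs = rng.normal(size=(5, 128))    # 5 query token vectors, 128 dims each
doc_vecs = rng.normal(size=(40, 128))     # 40 document token vectors

# Cosine similarity via L2-normalised dot products.
q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)

sim = q @ d.T                  # (5, 40) token-to-token similarity matrix
score = sim.max(axis=1).sum()  # each query token takes its best document match; sum them
print(f"MaxSim score: {score:.3f}")
```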

Superlinked
Superlinked@superlinked·
🌟 We just hit 1K GitHub Stars! Huge thanks to everyone who starred, shared, or contributed 💖 If you're building smart search or recommendations with natural language, check out our OSS framework 👇 🔗 buff.ly/HbePxh5
Superlinked
Superlinked@superlinked·
🔍 Personalised search is evolving — semantic relevance isn’t enough. With Superlinked, blend meaning and time using recency_space + negative_filter to surface fresh, relevant results. Build an agentic AI research agent 👇 📘 buff.ly/M3lfnQP buff.ly/l1RhkMd