Daniel Lopes
@danielvlopes

Split Smarter, Not Random: The Semantic Chunking Guide. 📚💡

Most RAG systems fail before they begin. They use outdated chunking methods that:
✂️ Slice text by character count
🚸 Break paragraphs without regard for meaning

Imagine reading a book where someone randomly tore pages in half. That's what traditional chunking does to your data. Semantic chunking is a smarter approach that follows meaning.

In this VectorHub deep-dive, Ashish Abraham breaks down three approaches:

1️⃣ Embedding-Similarity Based Chunking
▪️ The system decides where to break text by comparing the similarity between consecutive sentences.
▪️ Using a sliding window, it computes the cosine similarity of sentence embeddings.
▪️ When the similarity drops below a set threshold, the system flags a semantic shift and splits the chunk at that point.

Like listening to a playlist: you can tell when one song ends and another begins. Embedding-similarity chunking spots those natural transitions between ideas.

2️⃣ Hierarchical-Clustering Based Chunking
▪️ The system analyzes relationships between all sentences at once, not just neighbors. It starts by measuring how similar each sentence is to every other sentence in the text.
▪️ These similarities build a hierarchy, like a family tree of ideas. Sentences with strong similarity cluster together into small groups.
▪️ Those small groups then merge into larger ones based on how closely they relate.

Like organizing a library: books get grouped by topic, then broader categories, until you have a natural organization that makes sense.

3️⃣ LLM-Based Chunking
This newest approach uses LLMs to chunk text based on semantic understanding.
▪️ The first step is to feed the text to an LLM with specific chunking instructions.
▪️ The LLM then identifies key ideas and how they connect, rather than just measuring similarity.
▪️ When it spots a complete thought or concept, it groups those propositions into coherent chunks.
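The embedding-similarity idea can be sketched in a few lines. A minimal, illustrative version — using a toy bag-of-words "embedding" in place of a real embedding model, with `embed`, `cosine`, and `semantic_chunks` as hypothetical names, not the article's API:

```python
import math
import re
from collections import Counter

def embed(sentence):
    # Toy stand-in for a real embedding model: word counts as a sparse vector.
    return Counter(re.findall(r"\w+", sentence.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.15):
    # Slide over consecutive sentence pairs; a similarity drop below the
    # threshold marks a semantic shift, so we start a new chunk there.
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            chunks.append(current)
            current = []
        current.append(sent)
    chunks.append(current)
    return chunks
```

In practice you'd swap `embed` for a real sentence-embedding model and tune the threshold on your own corpus.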
Imagine having a skilled editor who knows exactly where to break your text for maximum clarity.

⚙️ Which method will produce optimal outcomes depends on your use case:
▪️ Want precision? Go with LLM-Chunking
▪️ Want speed? Go with Embedding-Similarity
▪️ Need to preserve relationships? Go with Hierarchical-Clustering

Ready to implement? Get the full technical breakdown👇
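For a taste of the LLM-based approach, the wiring is mostly prompt and parsing. A minimal sketch — `call_llm` stands in for whatever chat-completion client you use, and the prompt wording and `---` delimiter are assumptions of this example, not the article's exact method:

```python
# Hypothetical prompt template: ask the model to mark chunk boundaries
# with a delimiter we can split on afterwards.
CHUNK_PROMPT = (
    "Split the following text into self-contained semantic chunks. "
    "Keep the text verbatim and separate chunks with a line containing "
    "only '---'.\n\nText:\n{text}"
)

def llm_chunks(text, call_llm):
    # call_llm: any function that takes a prompt string and returns the
    # model's text response (e.g. a wrapper around a chat-completion API).
    response = call_llm(CHUNK_PROMPT.format(text=text))
    return [chunk.strip() for chunk in response.split("\n---\n") if chunk.strip()]
```

Because the model reads for meaning rather than measuring similarity, this tends to be the most precise option, at the cost of latency and per-token spend.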