jasmine wang

80 posts

@jasminechenwang

Yoga, Cognitive Science, Open Source technology, Startups, Sunset at the Beach

Joined June 2022
43 Following · 48 Followers
jasmine wang reposted
LanceDB @lancedb
Only 2 weeks away from Data Engineering Open Forum 2026 in SF on April 16! Join us for "Powering Netflix's Multimodal Feature Engineering at Scale" and dive into how @netflix curates multimodal features across large video & image corpora, with LanceDB serving as the core storage and query layer for multimodal data.
[image]
1 reply · 2 reposts · 5 likes · 372 views
jasmine wang reposted
Xuanwo @OnlyXuanwo
Working at @lancedb is kinda interesting, because we are forced to re-think everything at 100B scale. How big is 100B? 100B rows of 768-dim embeddings ≈ 300TB of raw vectors alone, before text, images, or indexes. At 1M writes/sec, it still takes 28 hours non-stop to fill 100B rows. The Milky Way has ~100-400B stars. We're basically building a database at galaxy scale. If you've read to the end and think this is cool, DM me your resume. We're looking for sharp minds to join us.
9 replies · 10 reposts · 169 likes · 15.1K views
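The back-of-envelope figures above are easy to verify. A quick check in Python, assuming float32 vectors (4 bytes per dimension), which the ~300TB claim implies:

```python
# Verify the scale claims: 100B rows of 768-dim embeddings, 1M writes/sec.
ROWS = 100_000_000_000      # 100B rows
DIM = 768                   # embedding dimensionality
BYTES_PER_FLOAT = 4         # float32 (assumption)

raw_bytes = ROWS * DIM * BYTES_PER_FLOAT
raw_tb = raw_bytes / 1e12   # decimal terabytes
print(f"raw vectors: {raw_tb:.1f} TB")        # ≈ 307.2 TB, i.e. "~300TB"

WRITES_PER_SEC = 1_000_000
hours_to_fill = ROWS / WRITES_PER_SEC / 3600
print(f"time to fill: {hours_to_fill:.1f} h")  # ≈ 27.8 h, i.e. "~28 hours"
```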
jasmine wang reposted
LanceDB @lancedb
1/3 Geospatial support just landed in Lance. And no new storage format work was required. Because Lance is Arrow-native, GeoArrow extension types work out of the box. Geometry columns are preserved end-to-end with zero special casing.
[image]
1 reply · 5 reposts · 18 likes · 738 views
jasmine wang reposted
LanceDB @lancedb
1/4 Branching for ML data shouldn’t slow down production. Iceberg branching → shared metadata bottlenecks. Delta shallow clone → isolation, but loses Git-like UX. We want both. Here’s how Lance unifies branching, tagging, and shallow clone for AI workloads 🧵
[image]
1 reply · 2 reposts · 7 likes · 536 views
jasmine wang reposted
Julien Chaumond @julien_c
in case you missed it @lancedb and HF are partnering up to unlock the next generation of large dataset storage on the Hub 🔥 And it's fire!
- Supports storing embeddings (and their indexes) directly alongside the data
- Vector search / similarity search is built-in
- Large multimodal datasets (text, images, video) just use the hf:// prefix: db = lancedb.connect("hf://datasets/julien-c/hub-stats-lance") 🔥🔥
[image]
0 replies · 16 reposts · 79 likes · 8.2K views
jasmine wang reposted
LanceDB @lancedb
1/6 Here’s a quick example of how to read @huggingface datasets via LanceDB. Start with opening a LanceDB connection to a dataset on the Hub using the hf:// prefix path.
[image]
1 reply · 2 reposts · 5 likes · 269 views
jasmine wang reposted
LanceDB @lancedb
1/5 Large multimodal blobs don’t have to break dataset workflows. Images and videos are often treated as external files, separate from metadata and indexes. Once datasets get large, that split makes exploration, curation, and training painful. Lance changes that on the 🤗 @huggingface Hub. 🧵👇
[GIF]
2 replies · 9 reposts · 21 likes · 2.5K views
jasmine wang reposted
LanceDB @lancedb
1/5 @lancedb 🫶🏻 @duckdb We’re happy to announce a new Lance extension for DuckDB! You can simply install this extension in DuckDB and point at your Lance datasets from within a DuckDB CLI or a Python script, while getting 𝗳𝘂𝗹𝗹 𝗦𝗤𝗟 𝗰𝗮𝗽𝗮𝗯𝗶𝗹𝗶𝘁𝗶𝗲𝘀 𝗼𝗻 𝘁𝗼𝗽 𝗼𝗳 𝗟𝗮𝗻𝗰𝗲 without copying your data!
[image]
1 reply · 8 reposts · 17 likes · 2.1K views
Brian Zhan @brianzhan1
After three years at CRV, I am stepping onto Striker Venture Partners' founding team, leading the firm's AI investments. Thanks to @BusinessInsider for covering the move.
[image]
79 replies · 27 reposts · 371 likes · 468.3K views
jasmine wang reposted
LanceDB @lancedb
1/7 🎨 In a world of infinite scroll, discovering art still feels like searching for a needle in a haystack. With SemanticDotArt, we flipped the question: What if you searched by mood, not just metadata? See how we did this in @lancedb 👇🏽
3 replies · 10 reposts · 18 likes · 3K views
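Searching "by mood, not just metadata" boils down to nearest-neighbor lookup over embeddings. A toy sketch with made-up 3-d "mood vectors" (the artworks, vectors, and mood labels here are illustrative only; a real system would embed artworks and queries with a model and store them in LanceDB):

```python
import math

# Made-up "mood embeddings" for artworks (3-d toy vectors, not real model output).
artworks = {
    "Starry Night": [0.9, 0.1, 0.3],   # dreamy, swirling
    "The Scream":   [0.1, 0.95, 0.2],  # anxious
    "Water Lilies": [0.8, 0.05, 0.6],  # serene
}

def cosine(a, b):
    # Cosine similarity: dot product normalized by vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search_by_mood(query_vec, k=2):
    # Rank artworks by similarity to the query's mood vector, return top-k titles.
    ranked = sorted(artworks.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [title for title, _ in ranked[:k]]

calm_query = [0.85, 0.0, 0.5]  # pretend this is an embedded "calm, peaceful" query
print(search_by_mood(calm_query))  # → ['Water Lilies', 'Starry Night']
```

The same ranking is what a vector database performs at scale, with an index instead of a full sort.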
jasmine wang reposted
changhiskhan @changhiskhan
This is a big milestone for Lance format. The F3 paper (dl.acm.org/doi/10.1145/37…) verified that Lance has THE fastest random access, essential for search, shuffle, and many other AI workloads. But it incorrectly assumed it was because of lack of compression. With 2.1, we show that, yes indeed, you can have your cake and eat it too. Not only does this release come with major improvements on compression without sacrificing performance, it also includes goodies like JSON and better nested data support. It's also a proof point of how extensible the encodings are in Lance. You can read our blog post for all the fun details.
LanceDB @lancedb:

💾 Lance File 2.1 Is Now Stable 🥳 Big news from the LanceDB team — Lance File Format 2.1 is officially stable❗️ This release solves one of the biggest challenges from 2.0: 👉 adding compression without sacrificing random access performance.

4 replies · 11 reposts · 45 likes · 6K views
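One standard way to get compression and random access at the same time is to compress fixed-size chunks independently and keep an offset index per chunk, so a single value can be fetched by decoding only its chunk. The sketch below illustrates that general technique only; the chunk size and function names are made up, and this is not Lance 2.1's actual encoding:

```python
import zlib

CHUNK = 1024  # fixed chunk size in bytes (illustrative choice)

def compress_chunked(data: bytes):
    # Compress each fixed-size chunk independently and record where each
    # compressed chunk starts, so any one chunk can be decoded alone.
    blobs, offsets, pos = [], [], 0
    for i in range(0, len(data), CHUNK):
        blob = zlib.compress(data[i:i + CHUNK])
        offsets.append(pos)
        blobs.append(blob)
        pos += len(blob)
    return b"".join(blobs), offsets

def read_byte(compressed: bytes, offsets, index: int) -> int:
    # Random access: locate the chunk containing `index`, decompress only it.
    chunk_no = index // CHUNK
    start = offsets[chunk_no]
    end = offsets[chunk_no + 1] if chunk_no + 1 < len(offsets) else len(compressed)
    chunk = zlib.decompress(compressed[start:end])
    return chunk[index % CHUNK]

data = bytes(i % 251 for i in range(10_000))
comp, offs = compress_chunked(data)
assert read_byte(comp, offs, 4321) == data[4321]
```

The tradeoff knob is the chunk size: smaller chunks mean cheaper random reads but worse compression ratios.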
jasmine wang reposted
Apache Spark @ApacheSpark
Join us for our webinar on Apache Spark™ and Lance Spark Connector with Jack Ye (@lancedb) on September 25! 👏
Learn how the Lance Spark Connector enables Apache Spark™ to work with Lance's AI-native multimodal storage.
✅ We'll look at how Spark can handle embeddings, images, videos, and documents with random access, indexing, and vector/blob support. We'll also cover integration with Hive Metastore, @unitycatalog_io, and examples of workflows for ingestion, analytics, feature engineering, and retrieval-augmented generation, using one dataset, without format conversions.
🔗 REGISTER: luma.com/76o36xuk
📅 September 25, 2025
⏰ 9:30 – 10:30 AM PST
📍 Online
#apachespark #spark #oss #opensource #lancedb #lance #sparkconnector @2twitme
[image]
1 reply · 1 repost · 4 likes · 1.1K views
jasmine wang reposted
LanceDB @lancedb
When building a columnar file reader, it becomes clear that 𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲 𝗶𝘀 𝗻𝗼𝘁 𝗷𝘂𝘀𝘁 𝗮𝗻 𝗮𝗯𝘀𝘁𝗿𝗮𝗰𝘁 𝗰𝗼𝗻𝗰𝗲𝗽𝘁. (t.ly/3AyJh) It is the set of rules that determines how every byte of data is stored and accessed on disk. A few months ago, Weston Pace set out to 𝘀𝗼𝗹𝘃𝗲 𝗮 𝘁𝗿𝗶𝗰𝗸𝘆 𝗽𝗿𝗼𝗯𝗹𝗲𝗺 𝗶𝗻 𝗟𝗮𝗻𝗰𝗲. Small values like integers and booleans benefit from maximum compression, even if that means a bit of read amplification. Large values like vector embeddings, images, and documents need lightning fast random access without excessive RAM overhead.
1 reply · 2 reposts · 18 likes · 2.1K views
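The two regimes described above can be sketched as a size-based encoding dispatch: small values get packed and compressed hard (accepting some read amplification), large values get per-value addressing for fast random access. Everything in this sketch, including the threshold, the strategy names, and the function, is hypothetical and only illustrates the tradeoff; it is not Lance's actual encoder:

```python
# Hypothetical per-column encoding dispatcher (illustrative only).
SMALL_VALUE_THRESHOLD = 128  # bytes; made-up cutoff between the two regimes

def choose_encoding(avg_value_size: int) -> str:
    if avg_value_size <= SMALL_VALUE_THRESHOLD:
        # Small values (ints, bools): pack many per block and compress hard.
        # Reading one value may decode a whole block (read amplification),
        # but the storage savings dominate.
        return "block-compressed"
    # Large values (embeddings, images, documents): store with per-value
    # offsets so one value can be fetched without decoding its neighbors
    # or buffering excessive data in RAM.
    return "offset-indexed"

print(choose_encoding(4))     # e.g. an int32 column  → "block-compressed"
print(choose_encoding(3072))  # e.g. a 768-dim float32 embedding → "offset-indexed"
```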
jasmine wang reposted
LanceDB @lancedb
The data prep bottleneck for fine-tuning LLMs is a common challenge. 𝗢𝘂𝗿 𝗻𝗲𝘄 𝗶𝗻𝘁𝗲𝗴𝗿𝗮𝘁𝗶𝗼𝗻 𝘄𝗶𝘁𝗵 𝗠𝗲𝘁𝗮'𝘀 𝗦𝘆𝗻𝘁𝗵𝗲𝘁𝗶𝗰 𝗗𝗮𝘁𝗮 𝗞𝗶𝘁 is 𝗵𝗲𝗿𝗲 𝘁𝗼 𝗳𝗶𝘅 𝘁𝗵𝗮𝘁! It simplifies the entire workflow with a 𝘀𝘁𝗿𝗮𝗶𝗴𝗵𝘁𝗳𝗼𝗿𝘄𝗮𝗿𝗱 𝗖𝗟𝗜 for generating high-quality, synthetic datasets. The package 𝘂𝘀𝗲𝘀 𝘁𝗵𝗲 𝗟𝗮𝗻𝗰𝗲 𝗳𝗼𝗿𝗺𝗮𝘁, so you can store and retrieve massive multimodal datasets. 𝗗𝗼𝗰𝘀: lancedb.com/docs/integrati…
0 replies · 1 repost · 3 likes · 577 views
jasmine wang reposted
Andriy Mulyar @andriy_mulyar
- built by solid db people and hackable (we have a contributor at nomic to it)
- used by top ai companies / labs / products for its nice properties when used in training loops (e.g. midjourney has been using it since 2023) so probably not going anywhere
- feels like the right way to work with lots of embeddings from the devex perspective in both their low level API (lance) and db wrapper
1 reply · 2 reposts · 11 likes · 716 views
jasmine wang reposted
LanceDB @lancedb
We just published a 𝗻𝗲𝘄 𝗯𝗹𝗼𝗴 (lancedb.com/blog/multimoda…) on what the 𝗠𝘂𝗹𝘁𝗶𝗺𝗼𝗱𝗮𝗹 𝗟𝗮𝗸𝗲𝗵𝗼𝘂𝘀𝗲 actually does. The Lakehouse is 𝗳𝗼𝗿 𝘁𝗵𝗼𝘀𝗲 working with a mix of text, images, audio, and structured data - 𝘄𝗵𝗼 𝘄𝗶𝘀𝗵 𝘁𝗼 𝗮𝘃𝗼𝗶𝗱 𝘁𝗵𝗲 𝗽𝗮𝗶𝗻 of manual configuration. You can use it to build real AI systems without dealing with orchestration, DAGs, or custom infrastructure.
[image]
1 reply · 3 reposts · 11 likes · 1.3K views