Weston Pace

41 posts

@westoncpace

Data plumber. Note: This account is only for promoting my work. More personable posting happens on my account on the "butterfly shaped alternative".

Joined July 2022

15 Following · 98 Followers

Weston Pace @westoncpace · Apr 28

@BEBischof Every time I get new glasses I think everything is a little clearer and my vision must've gotten worse but it's been decades and it hasn't really changed. So I think my glasses are just always dirty except for that 2 week window when they are new.

Bryan Bischof fka Dr. Donut @BEBischof · Apr 28

i got new glasses which means everything looks weird and i have a headache

Weston Pace @westoncpace · Apr 27

@HoytEmerson @tech_optimist @jezell So if my query pattern changes (for example, I'm doing a lot of find-by-id queries) and suddenly I want fine-grained zone maps, then I don't want to have to rewrite all my data files. I just want to retrain my index with a smaller zone size.
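The trade-off in this reply can be sketched in a few lines of Python. This is a minimal illustration, not Lance's actual index code: the `Zone` class, `build_zone_map`, and `rows_to_scan` are hypothetical names, and the sorted integer column is made-up data. It assumes the zone map is stored apart from the data file, so changing the zone size means rebuilding only the index.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Zone:
    start: int  # first row offset covered by the zone
    end: int    # one past the last row offset covered by the zone
    lo: int     # minimum value seen in the zone
    hi: int     # maximum value seen in the zone


def build_zone_map(values: Sequence[int], zone_size: int) -> List[Zone]:
    """Summarise a column as one (min, max) pair per fixed-size run of rows."""
    zones = []
    for start in range(0, len(values), zone_size):
        chunk = values[start:start + zone_size]
        zones.append(Zone(start, start + len(chunk), min(chunk), max(chunk)))
    return zones


def rows_to_scan(zones: List[Zone], target: int) -> int:
    """Count the rows in zones whose [lo, hi] range could contain target."""
    return sum(z.end - z.start for z in zones if z.lo <= target <= z.hi)


# Hypothetical column standing in for a data file that is never rewritten.
ids = list(range(1_000))

coarse = build_zone_map(ids, zone_size=250)
fine = build_zone_map(ids, zone_size=10)  # "retrain" with a smaller zone size
```

With the column left untouched, a lookup for id 42 has to scan 250 rows under the coarse map and 10 rows under the fine one.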

Weston Pace @westoncpace · Apr 27

@HoytEmerson @tech_optimist @jezell We don't index in the file. We index at the table level. Here is a pretty good example of some scholarly reasoning (db.cs.cmu.edu/papers/2025/p1…). For us it's because we really want to avoid rewrites to data files unless data actually changes.

Hoyt Emerson @HoytEmerson · Apr 26

You can build a production data service with Arrow Flight + DuckDB in under 100 lines of Python. Server reads Parquet from S3, client gets results as Arrow batches. That's a real query engine and no warehouse bills.

Weston Pace retweeted

LanceDB @lancedb · Jan 29

𝗖𝗼𝗺𝗯𝗶𝗻𝗶𝗻𝗴 𝗮 𝘀𝗰𝗵𝗲𝗱𝘂𝗹𝗲𝗿 𝗿𝗲𝘄𝗼𝗿𝗸 𝘄𝗶𝘁𝗵 𝗶𝗼_𝘂𝗿𝗶𝗻𝗴, 𝘄𝗲 𝘄𝗲𝗿𝗲 𝗮𝗯𝗹𝗲 𝘁𝗼 𝗮𝗰𝗵𝗶𝗲𝘃𝗲 𝟭,𝟱𝟬𝟬,𝟬𝟬𝟬 𝗜𝗢𝗣𝗦 𝘄𝗶𝘁𝗵 𝗟𝗮𝗻𝗰𝗲 In this blog, Weston Pace deep dives into benchmarking improvements we’ve made in Lance, and how we aim to achieve 1 million IOPS in real systems.

Weston Pace @westoncpace · Oct 27

@techalexpr Maybe a boring answer but we focus on what our users & customers need. We want to know what's out there but don't spend too much time integrating it until there is a user asking for it with a legitimate use case.

Tech Alex @techalexpr · Oct 10

@westoncpace All these new formats keep things interesting. At LanceDB, how do you decide which innovations are worth integrating for real impact? Would love to hear which one feels like a future game-changer!

Weston Pace @westoncpace · Oct 3

Lots of work being done on file formats lately. I think I count 5 new formats (Lance, Nimble, Vortex, FastLanes, F3) now. It's definitely something we follow at LanceDB and it can be confusing to track. So here is my very biased head-canon (positivity edition!)

Weston Pace @westoncpace · Oct 3

Hope this helps, it's fun to see so much exciting innovation in a space that's been relatively quiet for many years!

Weston Pace @westoncpace · Oct 3

F3 is from a joint project between CMU and Tsinghua University. They have tackled the "forwards compatibility" problem by storing WASM decoders with the data so that old readers can read data written by futuristic writers.

Weston Pace retweeted

LanceDB @lancedb · Sep 10

LanceDB's own @OnlyXuanwo will be speaking at Rust China Conference 2025 on September 14, 2:00–2:30 PM: rustcc.cn/2025conf/

His session will dive into one of the biggest challenges in the deep learning era: making sense of multimodal data and embedding vectors at scale. @OnlyXuanwo will highlight:

✅ How AI systems need both lakehouse-level governance and millisecond-level semantic retrieval
✅ Lance’s approach to unified heterogeneous data storage, efficient vector indexing, and incremental write consistency
✅ How its Rust kernel enables millisecond-level random reads and writes on object storage through column clustering + embedded ANN indexing, MVCC, and adaptive compaction

If you’re building next-generation AI systems, this is a session you won’t want to miss. 🚀

Weston Pace @westoncpace · Jul 29

@baggiponte @YingjunWu @lancedb github.com/lancedb/lance/… 😉

Weston Pace @westoncpace · Jul 29

@baggiponte @YingjunWu @lancedb Lance is columnar atm but packed struct lets you group one or more columns to pack as rows. Still need someone to add support for packed structs of variable width columns and then we can go full row mode.
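As a rough picture of what packing fixed-width columns as rows means, here is a minimal Python sketch using the standard `struct` module. It is not Lance's on-disk layout: the three-column schema, `ROW`, `pack_rows`, and `read_row` are all hypothetical and chosen only for illustration.

```python
import struct
from typing import Sequence, Tuple

# Hypothetical schema: int64 id, float32 score, uint8 flag.
# "<" selects little-endian with no padding, so every row is exactly 13 bytes.
ROW = struct.Struct("<qfB")


def pack_rows(ids: Sequence[int], scores: Sequence[float], flags: Sequence[int]) -> bytes:
    """Interleave three fixed-width columns into one row-major buffer."""
    return b"".join(ROW.pack(i, s, f) for i, s, f in zip(ids, scores, flags))


def read_row(buf: bytes, row: int) -> Tuple[int, float, int]:
    """Fetch one whole row with a single offset computation."""
    return ROW.unpack_from(buf, row * ROW.size)


buf = pack_rows([1, 2, 3], [0.5, 1.5, 2.5], [0, 1, 0])
second = read_row(buf, 1)
```

Because every row has the same size, row `n` starts at byte `n * ROW.size`. Variable-width columns break that arithmetic, since a reader would also need per-row offsets, which is one way to see why they are the harder case.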

Yingjun Wu 🤘 @YingjunWu · Jul 28

Do we need an open data format designed for row-oriented storage? Parquet is great but not suitable for point access or short range scans. Should we design a new format?

Weston Pace @westoncpace · Jul 28

@mgill25 A new file format will replace Parquet one day after Parquet replaces CSV. 80% of Parquet use cases work just fine with Parquet. The spots that don't (e.g. db storage format) have already replaced Parquet (DuckDB, Databricks, Snowflake, BigQuery etc.)

Manish Gill @mgill25 · Jul 27

Why hasn’t any other file format seen the widespread adoption that Parquet has? Many new research papers come out about improvements in file formats but none gain momentum.

Weston Pace retweeted

LanceDB @lancedb · Jun 2

𝗟𝗮𝗻𝗰𝗲 𝗗𝗲𝗲𝗽 𝗗𝗶𝘃𝗲: 𝗥𝗲𝗽𝗲𝘁𝗶𝘁𝗶𝗼𝗻 𝗮𝗻𝗱 𝗗𝗲𝗳𝗶𝗻𝗶𝘁𝗶𝗼𝗻 𝗟𝗲𝘃𝗲𝗹𝘀 When you’re scanning large datasets, especially if you’re training ML models or running analytics, you need smaller file sizes and faster reads. When your data includes things like lists, optional fields, or nested objects, columnar formats need a way to store that complexity in a clean and efficient way.

Weston Pace retweeted

LanceDB @lancedb · Feb 25

Why build a new table format for ML? 🤔 Here are the thoughts and design behind Lance's format in the new blog by @westoncpace. Give it a read and see how Lance Table Format solves challenges that existing formats miss, from wide data to efficient indexing. blog.lancedb.com/designing-a-ta…

Weston Pace @westoncpace · Dec 2

@iavins Columnar storage makes fewer, larger requests. Syscall overhead is trivial. We're getting close to needing it at LanceDB though where we're trying to marry columnar storage with random access. Syscall overhead still isn't the bottleneck yet.

v @iavins · Nov 30

What are some databases that use io_uring for disk i/o? Are there any modern databases which avoid using io_uring for whatever reason? some I know which do use: Scylla (probably the first?), sled, TigerBeetle, CedarDB.

Weston Pace retweeted

LanceDB @lancedb · Oct 25

Process terabytes of data without needing terabytes of memory: blog.lancedb.com/columnar-file-… We continue our deep dive into the #Lance file format and explain how Lance uses #backpressure to balance parallelism, speedy I/O, streaming computation, and limited RAM usage. In this blog, @westoncpace explains how our file reader's unique scheduling makes it so easy to configure backpressure settings that you'll hopefully never have to.
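The general idea of backpressure can be shown with a bounded queue between a reader and a consumer. This is a generic Python sketch, not the Lance file reader's scheduler: `run_pipeline` and its parameters are hypothetical, and integers stand in for batches of data.

```python
import queue
import threading
from typing import List, Tuple


def run_pipeline(num_batches: int, max_buffered: int) -> Tuple[List[int], int]:
    """Stream batches through a bounded buffer and report the peak buffer depth."""
    buf: queue.Queue = queue.Queue(maxsize=max_buffered)

    def reader() -> None:
        for batch in range(num_batches):
            buf.put(batch)  # blocks while the buffer is full: this is the backpressure
        buf.put(None)       # sentinel marking the end of the stream

    producer = threading.Thread(target=reader)
    producer.start()

    results, peak = [], 0
    while True:
        peak = max(peak, buf.qsize())
        batch = buf.get()
        if batch is None:
            break
        results.append(batch * 2)  # stand-in for the compute step

    producer.join()
    return results, peak


results, peak = run_pipeline(num_batches=1_000, max_buffered=8)
```

However many batches flow through, the reader can never get more than `max_buffered` items ahead of the consumer, so memory use is bounded by the buffer size and not by the size of the dataset.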
