Weston Pace

41 posts

@westoncpace

Data plumber. Note: This account is only for promoting my work. More personable posting happens on my account on the "butterfly-shaped alternative".

Joined July 2022
15 Following · 98 Followers
Weston Pace @westoncpace
@BEBischof Every time I get new glasses I think everything is a little clearer and my vision must've gotten worse but it's been decades and it hasn't really changed. So I think my glasses are just always dirty except for that 2 week window when they are new.
Weston Pace @westoncpace
@HoytEmerson @tech_optimist @jezell So if my query pattern changes (for example, I'm doing a lot of find-by-id queries) and suddenly I want fine-grained zone maps, then I don't want to have to rewrite all my data files. I just want to retrain my index with a smaller zone size.
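The trade-off in this reply can be illustrated with a small sketch. It assumes a zone map is nothing more than per-zone min/max statistics stored apart from the data; `build_zone_map` and `candidate_zones` are hypothetical names for illustration, not Lance or LanceDB APIs.

```python
from typing import List, Tuple

ZoneMap = List[Tuple[int, int]]


def build_zone_map(values: List[int], zone_size: int) -> ZoneMap:
    """Return (min, max) for each consecutive run of `zone_size` values."""
    zones = []
    for start in range(0, len(values), zone_size):
        chunk = values[start:start + zone_size]
        zones.append((min(chunk), max(chunk)))
    return zones


def candidate_zones(zone_map: ZoneMap, target: int) -> List[int]:
    """Indices of zones whose [min, max] range could contain `target`."""
    return [i for i, (lo, hi) in enumerate(zone_map) if lo <= target <= hi]


# The data is written once; only the index is rebuilt with a new zone size.
ids = list(range(1000))
coarse = build_zone_map(ids, zone_size=500)  # cheap index, large zones
fine = build_zone_map(ids, zone_size=10)     # larger index, small zones
```

With the coarse map, a find-by-id lookup still has to scan a 500-row zone; with the fine map it scans a 10-row zone, and `ids` itself is never rewritten.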
Hoyt Emerson @HoytEmerson
You can build a production data service with Arrow Flight + DuckDB in under 100 lines of Python. Server reads Parquet from S3, client gets results as Arrow batches. That's a real query engine and no warehouse bills.
Weston Pace reposted
LanceDB @lancedb
𝗖𝗼𝗺𝗯𝗶𝗻𝗶𝗻𝗴 𝗮 𝘀𝗰𝗵𝗲𝗱𝘂𝗹𝗲𝗿 𝗿𝗲𝘄𝗼𝗿𝗸 𝘄𝗶𝘁𝗵 𝗶𝗼_𝘂𝗿𝗶𝗻𝗴, 𝘄𝗲 𝘄𝗲𝗿𝗲 𝗮𝗯𝗹𝗲 𝘁𝗼 𝗮𝗰𝗵𝗶𝗲𝘃𝗲 𝟭,𝟱𝟬𝟬,𝟬𝟬𝟬 𝗜𝗢𝗣𝗦 𝘄𝗶𝘁𝗵 𝗟𝗮𝗻𝗰𝗲 In this blog, Weston Pace deep dives into benchmarking improvements we’ve made in Lance, and how we aim to achieve 1 million IOPS in real systems.
Weston Pace @westoncpace
@techalexpr Maybe a boring answer but we focus on what our users & customers need. We want to know what's out there but don't spend too much time integrating it until there is a user asking for it with a legitimate use case.
Tech Alex @techalexpr
@westoncpace All these new formats keep things interesting. At LanceDB, how do you decide which innovations are worth integrating for real impact? Would love to hear which one feels like a future game-changer!
Weston Pace @westoncpace
Lots of work being done on file formats lately. I think I count 5 new formats (Lance, Nimble, Vortex, FastLanes, F3) now. It's definitely something we follow at LanceDB and it can be confusing to track. So here is my very biased head-canon (positivity edition!)
Weston Pace @westoncpace
Hope this helps, it's fun to see so much exciting innovation in a space that's been relatively quiet for many years!
Weston Pace @westoncpace
F3 is from a joint project between CMU and Tsinghua University. They have tackled the "forwards compatibility" problem by storing WASM decoders with the data so that old readers can read data written by futuristic writers.
Weston Pace reposted
LanceDB @lancedb
LanceDB's own @OnlyXuanwo will be speaking at Rust China Conference 2025 on September 14, 2:00–2:30 PM: rustcc.cn/2025conf/

His session will dive into one of the biggest challenges in the deep learning era: making sense of multimodal data and embedding vectors at scale.

@OnlyXuanwo will highlight:
✅ How AI systems need both lakehouse-level governance and millisecond-level semantic retrieval
✅ Lance’s approach to unified heterogeneous data storage, efficient vector indexing, and incremental write consistency
✅ How its Rust kernel enables millisecond-level random reads and writes on object storage through column clustering + embedded ANN indexing, MVCC, and adaptive compaction

If you’re building next-generation AI systems, this is a session you won’t want to miss. 🚀
Weston Pace @westoncpace
@baggiponte @YingjunWu @lancedb Lance is columnar atm but packed struct lets you group one or more columns to pack as rows. Still need someone to add support for packed structs of variable width columns and then we can go full row mode.
Yingjun Wu 🤘 @YingjunWu
Do we need an open data format designed for row-oriented storage? Parquet is great but not suitable for point access or short range scans. Should we design a new format?
Weston Pace @westoncpace
@mgill25 A new file format will replace Parquet one day after Parquet replaces CSV. 80% of Parquet use cases work just fine with Parquet. The spots that don't (e.g. db storage format) have already replaced Parquet (DuckDB, Databricks, Snowflake, BigQuery etc.)
Manish Gill @mgill25
Why hasn’t any other file format seen the widespread adoption that Parquet has? Many new research papers come out about improvements in file formats but none gain momentum.
Weston Pace reposted
LanceDB @lancedb
𝗟𝗮𝗻𝗰𝗲 𝗗𝗲𝗲𝗽 𝗗𝗶𝘃𝗲: 𝗥𝗲𝗽𝗲𝘁𝗶𝘁𝗶𝗼𝗻 𝗮𝗻𝗱 𝗗𝗲𝗳𝗶𝗻𝗶𝘁𝗶𝗼𝗻 𝗟𝗲𝘃𝗲𝗹𝘀 When you’re scanning large datasets, especially if you’re training ML models or running analytics, you need smaller file sizes and faster reads. When your data includes things like lists, optional fields, or nested objects, columnar formats need a way to store that complexity in a clean and efficient way.
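For readers unfamiliar with the terms, here is a minimal sketch of repetition and definition levels for a single nullable list-of-nullable-integers column. It follows the general Dremel/Parquet-style scheme; the level numbering and the `encode`/`decode` helpers are assumptions for illustration, and Lance's actual encoding may differ.

```python
from typing import List, Optional, Tuple

Row = Optional[List[Optional[int]]]


def encode(rows: List[Row]) -> Tuple[List[int], List[int], List[int]]:
    """Flatten rows into (repetition levels, definition levels, values).

    Repetition: 0 starts a new row, 1 continues the current list.
    Definition: 0 null list, 1 empty list, 2 null item, 3 present item.
    """
    rep, dfn, values = [], [], []
    for row in rows:
        if row is None:
            rep.append(0)
            dfn.append(0)
        elif not row:
            rep.append(0)
            dfn.append(1)
        else:
            for i, item in enumerate(row):
                rep.append(0 if i == 0 else 1)
                if item is None:
                    dfn.append(2)
                else:
                    dfn.append(3)
                    values.append(item)  # only non-null items are stored
    return rep, dfn, values


def decode(rep: List[int], dfn: List[int], values: List[int]) -> List[Row]:
    """Rebuild the nested rows from the three flat arrays."""
    rows: List[Row] = []
    remaining = iter(values)
    for r, d in zip(rep, dfn):
        if r == 0:
            if d == 0:
                rows.append(None)
                continue
            rows.append([])
            if d == 1:
                continue
        rows[-1].append(next(remaining) if d == 3 else None)
    return rows


rows = [[1, 2], None, [], [None, 3]]
rep, dfn, values = encode(rows)
```

The nested structure lives entirely in two small integer arrays, while the values array stays dense, which is what makes the column compact and fast to scan.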
Weston Pace reposted
LanceDB @lancedb
Why build a new table format for ML? 🤔 Here are the thoughts and design behind Lance's format in the new blog by @westoncpace. Give it a read and see how Lance Table Format solves challenges that existing formats miss – from wide data to efficient indexing. blog.lancedb.com/designing-a-ta…
Weston Pace @westoncpace
@iavins Columnar storage makes fewer, larger requests. Syscall overhead is trivial. We're getting close to needing it at LanceDB, though, where we're trying to marry columnar storage with random access. Syscall overhead isn't the bottleneck yet.
v @iavins
What are some databases that use io_uring for disk i/o? Are there any modern databases which avoid using io_uring for whatever reason? some I know which do use: Scylla (probably the first?), sled, TigerBeetle, CedarDB.
Weston Pace reposted
LanceDB @lancedb
Process terabytes of data without needing terabytes of memory: blog.lancedb.com/columnar-file-… We continue our deep dive into the #Lance file format and explain how Lance uses #backpressure to balance parallelism, speedy I/O, streaming computation, and limited RAM usage. In this blog, @westoncpace explains how our file reader's unique scheduling makes it so easy to configure backpressure settings that you'll hopefully never have to.
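As a rough illustration of backpressure in general, the sketch below uses a bounded queue between a reader thread and a consumer. This is the generic bounded-buffer pattern, not the Lance file reader's scheduler (which the post does not describe); `run_pipeline` and `max_buffered` are hypothetical names.

```python
import queue
import threading
from typing import Tuple


def run_pipeline(num_batches: int, max_buffered: int) -> Tuple[int, int]:
    """Stream batches through a bounded buffer.

    Returns (sum of all batches, peak number of buffered items observed).
    """
    buf: queue.Queue = queue.Queue(maxsize=max_buffered)

    def reader() -> None:
        for batch in range(num_batches):
            buf.put(batch)  # blocks while the buffer is full: backpressure
        buf.put(None)       # sentinel: no more batches

    worker = threading.Thread(target=reader)
    worker.start()

    total = 0
    peak = 0
    while True:
        peak = max(peak, buf.qsize())
        batch = buf.get()
        if batch is None:
            break
        total += batch  # stand-in for the streaming computation
    worker.join()
    return total, peak


total, peak = run_pipeline(num_batches=100, max_buffered=4)
```

However many batches flow through, at most `max_buffered` of them sit in memory at once, because the reader pauses whenever the consumer falls behind.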