Marco Slot

616 posts

Marco Slot @marcoslot

Mostly tweets about Postgres, Snowflake Postgres, and Postgres extensions. Formerly Crunchy Data, Microsoft, Citus Data, AWS, TCD, VU

Haarlem, Netherlands · Joined May 2009
662 Following · 1.4K Followers

Pinned Tweet
Marco Slot @marcoslot ·
The modern data warehouse for PostgreSQL has arrived! Crunchy Data Warehouse is PostgreSQL enhanced for fast analytics and data pipelines, powered by Iceberg and DuckDB, with easy data lake query/import/export, fast local disk cache, managed for you by the team at Crunchy Data.
Crunchy Data@crunchydata

Announcing Crunchy Data Warehouse! A next-generation Postgres-native data warehouse. Full Iceberg support for fast analytical queries and transactions, built on unmodified Postgres to support the features and ecosystem you love. crunchydata.com/blog/crunchy-d…

1 reply · 10 reposts · 68 likes · 6K views
Marco Slot @marcoslot ·
@tzkb Yes, it can read Iceberg tables managed by a REST catalog, though support is still somewhat limited (currently only tested with Polaris, can only configure 1 catalog) github.com/Snowflake-Labs…
1 reply · 0 reposts · 4 likes · 158 views
こば -Koba as a DB engineer-
pg_lake — I had misunderstood it. So it's a setup where PostgreSQL itself holds the Iceberg catalog. I thought it was something that went out to an external catalog and queried Iceberg from there. I wonder if it can do that too.
2 replies · 0 reposts · 5 likes · 1.5K views
Marco Slot @marcoslot ·
A deep dive on why Postgres and Iceberg belong together 🙂
Stanislav Kozlovski@kozlovski

Your Postgres is 100x slower than traditional OLAP engines. A deceptively simple OSS extension fixes this. Here's an interview where we dive into the deep engineering around how this is achieved.

Joining me (and leading the conversation) is Marco Slot: an engineer with an EXTENSIVE and impressive career history around PostgreSQL:
👉 Created pg_cron in 2017 (3.7k stars) - a tool to run cron jobs in Postgres
👉 Built pg_incremental - fast, reliable, incremental batch processing inside PostgreSQL itself
👉 Co-created pg_lake (after working on Crunchy Data's Warehouse, and getting acquired into Snowflake)
👉 Helped get pg_documentdb (MongoDB-on-Postgres) off the ground

@marcoslot is a world-class expert in Postgres extensions. He seriously impressed me with his knowledge over the course of a private LinkedIn conversation, and now that I type out his resume - I understand where it came from. He should be on everyone's radar. So I brought him on the pod.

In our full 2-hour deep-dive, we went over:
• 🔥 how pg_lake makes analytics 100x faster (literally)
• 🔥 perf internals like vectorized execution & CPU branching
• 🤔 practical differences between OLTP and OLAP database development (and the age-old mission of uniting both)
• 🤔 how (and why) pg_lake intercepts query plans and delegates parts of the query tree to DuckDB
• 💡 why Postgres is architecturally terrible at analytical queries (and how vectorized execution fixes this)
• 💡 Marco's hard-won experience through a decade+ career in Postgres
• 🏆 Iceberg's role as the TCP/IP for tables
• 🏆 what the real moat of PostgreSQL is

Developments like pg_lake are a real reason why "Just Use Postgres" is much more than a meme, and it'll continue to dominate discourse. I promise you will learn a lot from this episode.

Timestamps:
(0:02) What is pg_lake?
(2:23) Postgres' 100x-slower problem and the columnar storage experiments they made to speed Postgres up for analytics
(6:00) practical examples and internals
(16:20) perf internals - vectorized execution & CPU optimization
(23:00) pg_lake architecture (why DuckDB isn't embedded) and the connection-per-process issue
(29:16) how pg_lake intercepts the query plan tree and delegates parts to DuckDB
(41:09) Iceberg catalogs
(48:24) Postgres-to-Iceberg ingestion patterns (and pg_incremental)
(53:40) Marco's (long) career: early AWS, Citus, Microsoft, Crunchy Data & Snowflake
(1:04:20) Marco's observations around the merging of OLTP and OLAP (and the subtle dev differences there)
(1:15:30) reverse ETL
(1:33:08) Iceberg as the TCP/IP for tables
(1:35:00) Marco's thoughts on the "Just Use Postgres" fever

0 replies · 0 reposts · 8 likes · 780 views
Marco Slot reposted
Stanislav Kozlovski @kozlovski ·
[Same interview announcement as quoted in full above.]
2 replies · 24 reposts · 232 likes · 16.4K views
Marco Slot reposted
Craig Kerstiens @craigkerstiens ·
Your fully open source time series stack:
Postgres: because duh
pg_partman: time partitioning
pg_lake: Iceberg for archival natively in Postgres
pg_incremental: incremental data processing
snowflake.com/en/engineering…
1 reply · 11 reposts · 76 likes · 4.1K views
Marco Slot @marcoslot ·
@denismagda @tobias_petry We've definitely considered it, but two main challenges are the space reclamation in DuckDB files, and unioning not-cached Parquet with a large number of DuckDB tables without repercussions. So far, we've punted on those due to high complexity for incremental improvement.
0 replies · 0 reposts · 1 like · 33 views
Denis Magda @denismagda ·
Let Postgres own the Iceberg catalog and delegate analytics to DuckDB. The result => transactional lakehouse updates with fast analytical queries. This isn’t a concept. It’s exactly what pg_lake delivers today.

pg_lake combines a set of extensions and components that let you query and modify Iceberg tables (and other lakehouse formats) directly from Postgres. DuckDB is used to accelerate analytical queries and runs in a sidecar process called pgduck_server, which communicates with Postgres during query execution.

How it works (diagram below):
1. An application sends a query to Postgres to calculate unrealized PnL (Profit and Loss) for the Disney ticker.
2. Postgres parses the query and identifies the part that computes the average price from historical lakehouse data.
3. That part is forwarded to pgduck_server for accelerated execution.
4. pgduck_server delegates execution to DuckDB, which queries the lakehouse (reusing cached data if available).
5. DuckDB computes the average price and returns it to Postgres.
6. Postgres joins the result with local portfolio data and computes the unrealized PnL.
7. The final result is returned to the application.
7 replies · 33 reposts · 375 likes · 28.4K views
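A hedged sketch of what the query in step 1 of the walkthrough above might look like. The table names (`portfolio` as a regular Postgres heap table, `trades_history` as a pg_lake-managed Iceberg table) and all columns are hypothetical; the point is only that the aggregate over lakehouse data is the part Postgres can delegate to DuckDB:

```sql
-- Hypothetical schema: portfolio is a local Postgres table,
-- trades_history is an Iceberg table managed by pg_lake.
SELECT
    p.ticker,
    -- Unrealized PnL: current price vs. historical average, times position size.
    (p.last_price - h.avg_price) * p.quantity AS unrealized_pnl
FROM portfolio p
JOIN (
    -- This aggregate over lakehouse data is the subtree that pg_lake
    -- can forward to pgduck_server / DuckDB for accelerated execution.
    SELECT ticker, avg(price) AS avg_price
    FROM trades_history
    GROUP BY ticker
) h USING (ticker)
WHERE p.ticker = 'DIS';
```

The join with `portfolio` and the final arithmetic then run in Postgres itself, matching steps 6 and 7.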
fictiousxxl @Mr_D_V_D ·
@denismagda can we see the implementation of this so we understand the actual setup
1 reply · 0 reposts · 0 likes · 372 views
Marco Slot @marcoslot ·
@tobias_petry @denismagda We implemented an LRU file caching layer (write-through and background fetches) on top of DuckDB's file system abstraction which can be activated by running pgduck_server with --cache_dir .. In addition Iceberg metadata is cached in Postgres tables.
1 reply · 0 reposts · 2 likes · 94 views
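A minimal sketch of the cache setup described above. The `--cache_dir` flag comes from the tweet; the directory path is an assumption, and any other flags pgduck_server may need are omitted:

```shell
# Start the DuckDB sidecar with a local LRU file cache directory
# (write-through caching and background fetches for files read from the lake).
pgduck_server --cache_dir /var/cache/pgduck
```

Iceberg metadata caching, per the tweet, happens separately in Postgres tables rather than in this file cache.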
Tobias_Petry.sql @tobias_petry ·
@denismagda > 4. pgduck_server delegates execution to DuckDB, which queries the lakehouse (reusing cached data if available). How does it reuse cached data? Did you setup a file-based cache for the lakehouse data?
1 reply · 0 reposts · 1 like · 711 views
Marco Slot reposted
Mim @mim_djo ·
First look at pg_lake, and how #duckdb gave #postgresql a boost :) Apache Iceberg + Postgres + DuckDB compute pushdown is a very interesting combo. youtu.be/qlzIY6hjjLw
1 reply · 11 reposts · 54 likes · 6.2K views
さくらもち太郎🍡
さくらもち太郎🍡@Korosuke512tr·
Work has wrapped up for the year, but how best to compact Iceberg tables created with pg_lake is still a mystery, so the investigation continues 📈
1 reply · 0 reposts · 2 likes · 140 views
Marco Slot @marcoslot ·
@tzkb It uses text to store custom types and leaves the parsing and filtering to Postgres. It's definitely not very efficient, but always works. Note that pg_lake can delegate whole complex queries into DuckDB, just not when it needs to filter in Postgres.
0 replies · 0 reposts · 0 likes · 68 views
こば -Koba as a DB engineer-
With pg_lake, create table succeeds even when I specify data types that DuckDB supposedly can't handle. And when I send a query, the filter seems to be applied in both DuckDB and Postgres. I wonder if pg_lake (pgduck_server?) is doing some kind of conversion. Intuitively, this behavior looks even stranger to me.
1 reply · 0 reposts · 1 like · 712 views
こば -Koba as a DB engineer-
Found a good summary of the current state of Postgres + DuckDB/Iceberg 👀 I hadn't really understood how pg_mooncake behaves, so this is much appreciated. And since Iceberg support keeps landing from DuckDB v1.4 onward, the landscape here seems likely to change soon. zenn.dev/forcia_tech/ar…
1 reply · 5 reposts · 38 likes · 4.1K views
Marco Slot @marcoslot ·
@mim_djo with del as (delete from heap returning *) insert into iceberg select * from del; Put it in a pg_cron job and problem solved 😉
2 replies · 0 reposts · 2 likes · 328 views
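Marco's one-liner above, wrapped in a pg_cron schedule as he suggests. `cron.schedule` is pg_cron's real scheduling function; the job name, the five-minute interval, and the `heap` and `iceberg` table names are placeholders:

```sql
-- Periodically drain a regular heap table into an Iceberg table.
-- DELETE ... RETURNING feeds the INSERT atomically in a single statement.
SELECT cron.schedule(
    'move-to-iceberg',   -- job name (placeholder)
    '*/5 * * * *',       -- every five minutes (placeholder interval)
    $$
    WITH del AS (
        DELETE FROM heap RETURNING *
    )
    INSERT INTO iceberg SELECT * FROM del
    $$
);
```

Because both the DELETE and the INSERT run in one transaction, rows either move or stay put; a failed run leaves the heap table untouched.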
Mim @mim_djo ·
Iceberg or Delta for that matter are not great for streaming; I got like 1.3 transactions per second 🙂
2 replies · 1 repost · 14 likes · 1.6K views
Marco Slot reposted
Mim @mim_djo ·
Playing with pg_lake in PostgreSQL, suddenly everything makes sense: same database, two workloads, OLTP and OLAP, using 🦆
1 reply · 1 repost · 16 likes · 1.5K views
Marco Slot @marcoslot ·
pg_lake just went open source! pg_lake is a set of extensions (from Crunchy Data Warehouse) that add comprehensive Iceberg support and data lake access to Postgres, with @duckdb transparently integrated into the query engine. GitHub: snowflake-labs/pg_lake Blog link below
1 reply · 5 reposts · 28 likes · 1.4K views
Marco Slot reposted
Andy Pavlo (@andypavlo.bsky.social)
.@abigale_kim's paper is unleashed! It's the most complete eval of DB extensions/plugins. We analyze @PostgreSQL, @MySQL, @mariadb , SQLite, @duckdb, @Redis. TLDR: Postgres ecosystem is fraught w/ footguns. Other DBMSs have fewer extns but less problems. DuckDB has cleanest API.
PVLDB@pvldb

Vol:18 No:6 → Anarchy in the Database: A Survey and Evaluation of Database Management System Extensibility vldb.org/pvldb/vol18/p1…

1 reply · 32 reposts · 182 likes · 14.5K views
Marco Slot reposted
Andy Pavlo (@andypavlo.bsky.social)
No system hits the sweet spot of allowing for extensibility while maintaining systems safety. It would be nice if there was a standard plugin API (think POSIX) that allows compatibility across systems. Thanks to @marcoslot + @dave_andersen for their collaboration on this project
0 replies · 3 reposts · 22 likes · 3.4K views