Apache Iceberg Data Lakehouse Tips

149 posts

@IcebergDataLake

Unofficial account tweeting content on working with Apache Iceberg Data Lakehouses

Joined May 2023

62 Following · 185 Followers
Apache Iceberg Data Lakehouse Tips retweeted
Alex Merced | Open Data Lakehouse Advocate
RECENT DATA ARCHITECTURE/ENGINEERING/ANALYTICS CONTENT

Apache Iceberg
> What is Data Lakehouse Table Format? dremio.com/blog/apache-ic…
> Comparing Iceberg to Other Lakehouse Solutions dremio.com/blog/comparing…
> Iceberg Migration Guide dremio.com/blog/migration…
> Hands-on with Managed Polaris Catalog dremio.com/blog/getting-h…
> Hands-on with Self-Managed Polaris dremio.com/blog/getting-h…

Hybrid Lakehouse
> 3 Dremio Use Cases for On-Prem Data Lakes dremio.com/blog/3-dremio-…
> Hybrid Lakehouse Solution: NetApp dremio.com/blog/hybrid-la…
> Hybrid Lakehouse Solution: Minio dremio.com/blog/hybrid-la…
> Hybrid Lakehouse Solution: Vast Data dremio.com/blog/hybrid-la…
> Hybrid Lakehouse Solution: Pure Storage dremio.com/blog/hybrid-la…

Unified Analytics
> Analysts Guide to JDBC/ODBC, REST, and Arrow Flight dremio.com/blog/a-data-an…
> Unified Lakehouse dremio.com/blog/the-unifi…

#DataEngineering #DataLakehouse #DataScience #DataAnalytics #DataArchitecture
Apache Iceberg Data Lakehouse Tips retweeted
Alex Merced | Open Data Lakehouse Advocate
HOW ICEBERG CATALOGS WORK

An Iceberg table is one part data, stored in several Parquet files, and one part metadata files that provide the context for reading that data as a single table. The metadata entry point is a file called metadata.json, which tracks the table's schemas, partition schemes, and snapshots. Every time the table changes, a new metadata.json is created.

So when there are possibly dozens or hundreds of these metadata.json files, how does an engine like Dremio, Snowflake, or Apache Spark know which one to use to query the table accurately? This is where a catalog like Nessie or Polaris comes in. A catalog acts like a traffic controller, maintaining a list of tables along with the file address where each table's current metadata.json is stored. These references are updated at the end of a transaction, after the new metadata.json is created, enabling atomicity guarantees.

In short: a catalog directs queries to the right metadata.json and updates that pointer when writes complete.

If you enjoyed this post, give it a like and a share! Also check out Dremio.com/blog for many more Apache Iceberg education resources.

#ApacheIceberg #DataLakehouse #DataEngineering
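The pointer-swap idea above can be sketched as a toy in-memory catalog. This is only an illustration of the concept, not real catalog code; the class and method names are hypothetical, and real catalogs like Nessie and Polaris back this pointer with a database or REST service rather than a dict.

```python
# Toy illustration of an Iceberg catalog's role: map each table name to the
# path of its current metadata.json, and swap that pointer atomically.
# All names here are hypothetical; this is a concept sketch, not an API.

class TinyCatalog:
    def __init__(self):
        self._pointers = {}  # table name -> path of current metadata.json

    def current_metadata(self, table):
        # Engines ask the catalog which metadata.json is current for a table.
        return self._pointers.get(table)

    def commit(self, table, expected, new_metadata_path):
        # Compare-and-swap: the commit succeeds only if no other writer moved
        # the pointer since we read it -- the source of atomicity guarantees.
        if self._pointers.get(table) != expected:
            raise RuntimeError("concurrent update detected; retry the commit")
        self._pointers[table] = new_metadata_path

catalog = TinyCatalog()
# First commit: table did not exist, so the expected pointer is None.
catalog.commit("sales", None, "s3://warehouse/sales/metadata/v1.metadata.json")
# A write creates v2.metadata.json, then swings the pointer from v1 to v2.
catalog.commit("sales",
               "s3://warehouse/sales/metadata/v1.metadata.json",
               "s3://warehouse/sales/metadata/v2.metadata.json")
print(catalog.current_metadata("sales"))  # the v2 path: queries now see v2
```

A stale writer (one still holding the v1 pointer as `expected`) would get the retry error instead of silently clobbering v2, which is exactly the behavior the tweet describes.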
Apache Iceberg Data Lakehouse Tips retweeted
Alex Merced | Open Data Lakehouse Advocate
OPTIMIZING ICEBERG TABLES

One of the things that makes Iceberg queries fast is that the metadata can be used to eliminate files that don't need scanning from the scan plan. This is great, but if the data is not clustered properly, or is spread across many small files, you can still see less-than-ideal performance.

** Compaction **

When you have more manifests and data files than you need, you are doing more file operations and slowing down performance. Rewriting these files to collapse the data into fewer, larger files has the opposite effect. This can be done with the REWRITE_DATA_FILES or REWRITE_MANIFESTS procedures in Spark, or the OPTIMIZE TABLE command in Dremio.

** Clustering **

If I am only searching for agents in the northwest region, it'd be nice if all those records were in the same few files; this is known as clustering. When rewriting data files with Spark, there is a "sort" parameter you can pass so it clusters the data as it rewrites the files.

By compacting and clustering your data, the Apache Iceberg metadata becomes even more powerful at skipping data files when executing queries.

Read more in my new article on maintaining Apache Iceberg lakehouses here: dremio.com/blog/guide-to-…

#DataLakehouse #ApacheIceberg #DataEngineering
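The compaction idea above can be sketched as a toy bin-packing pass: collapse many small files into groups near a target file size. This is a concept sketch only; the function name and target size are made up, and real compaction (Spark's rewrite procedures, Dremio's OPTIMIZE TABLE) actually rewrites Parquet data, not just a plan.

```python
# Toy model of compaction planning: group many small data files into
# fewer rewrite groups of roughly target_mb each. Hypothetical helper,
# not an Iceberg API -- it only illustrates why fewer, larger files
# mean fewer file operations per scan.

def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily pack small files into groups of about target_mb."""
    groups, current, total = [], [], 0
    for size in sorted(file_sizes_mb):
        # Close the current group once adding this file would exceed target.
        if current and total + size > target_mb:
            groups.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups

# Ten small files collapse into two rewrite groups: a scan now opens
# 2 files instead of 10.
plan = plan_compaction([12, 8, 25, 30, 15, 10, 22, 18, 9, 11], target_mb=128)
print(len(plan))  # 2
```

Clustering is the complementary step: sorting rows by a column (like region) before the rewrite, so each output file covers a narrow value range and the metadata can skip whole files for selective queries.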
Apache Iceberg Data Lakehouse Tips retweeted
MinIO @Minio
Join us on September 5th at 10am PT for a MinIO x @dremio x @Carahsoft webinar about how modern #datalakes can help government customers advance their modernization initiatives. Register here: hubs.li/Q02Lc2rV0
Apache Iceberg Data Lakehouse Tips retweeted
Dremio @dremio
Join us for "An Apache Iceberg Lakehouse Crash Course," an in-depth series designed to provide a comprehensive understanding of Apache Iceberg, taught by Iceberg expert Alex Merced. hello.dremio.com/webcast-an-apa…
Apache Iceberg Data Lakehouse Tips retweeted
Dremio @dremio
🎙️ Dive into the minds of data disruptors! 🚀 Join us on the #DataDisruptors podcast as we unravel the strategies and insights shaping the future of data leadership. Tune in for exclusive conversations that redefine the data landscape. Listen now! 🔗 dremio.com/data-disruptor…