Apache Hudi

627 posts

Apache Hudi

@apachehudi

Official twitter handle of Apache Hudi, an open data lakehouse platform. https://t.co/SXay7oHNah

Joined January 2019
126 Following · 3.7K Followers

Apache Hudi @apachehudi
At scale, table metadata explodes, as Google's VLDB paper "Big Metadata" shows. Hudi's metadata table treats it like big data 📊: self-managed, indexed Key bits: - Partition/file listings 📁 - Column stats 📈 - Record indexes 🔍 This ensures efficient reads/writes as things grow 🚀.
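One rough way to picture the metadata table idea from the post (a toy sketch, not Hudi's actual layout; all structure and names here are illustrative): partition file listings, column stats, and a record index are kept as indexed data, so planning a query never requires a slow storage listing.

```python
# Illustrative sketch of a "metadata table": table metadata stored and
# indexed like data, instead of listing cloud storage on every query.
# Structures and names are hypothetical, not Hudi's real layout.

metadata_table = {
    "files": {                      # partition -> file listing
        "city=SF": ["f1.parquet", "f2.parquet"],
        "city=NY": ["f3.parquet"],
    },
    "column_stats": {               # (file, column) -> (min, max)
        ("f1.parquet", "ts"): (100, 200),
        ("f2.parquet", "ts"): (201, 300),
        ("f3.parquet", "ts"): (150, 250),
    },
    "record_index": {"user42": "f2.parquet"},  # record key -> file
}

def files_for_predicate(partition, column, lo, hi):
    """List candidate files via indexed metadata, with no storage listing."""
    out = []
    for f in metadata_table["files"][partition]:
        fmin, fmax = metadata_table["column_stats"][(f, column)]
        if fmax >= lo and fmin <= hi:   # ranges overlap -> must read file
            out.append(f)
    return out

print(files_for_predicate("city=SF", "ts", 250, 400))  # only f2 qualifies
print(metadata_table["record_index"]["user42"])        # point lookup by key
```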

Apache Hudi @apachehudi
Data skipping often assumes simple filters: column = value 💡 But real queries are messy 😩 with transforms like: • from_unixtime(ts) ⏰ • substring(id, …) ✂️ • lower(name) 🔡 Hudi's expression indexes reframe Hive partitioning as indexing ✨ - Logical partitioning: Index by 'ts' despite physical 'city' partitions - Build indexes on transformed expression outputs Perfect for complex lakehouse queries!
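A minimal model of the expression-index idea (illustrative only, not Hudi's internals): per-file min/max stats are computed over the *output* of a transform, here the hour of an epoch timestamp, so a filter on the transformed value can prune files even though the physical layout is partitioned by something else.

```python
# Sketch of an expression index: per-file min/max stats over a transformed
# expression (the hour extracted from an epoch-seconds timestamp), so a
# filter like hour(ts) BETWEEN 9 AND 11 prunes files regardless of the
# physical partitioning. Names and layout are illustrative.
from datetime import datetime, timezone

def hour_of(ts):
    return datetime.fromtimestamp(ts, tz=timezone.utc).hour

# file -> raw epoch-second timestamps it contains
files = {
    "a.parquet": [3600 * 1, 3600 * 2],    # hours 1-2
    "b.parquet": [3600 * 9, 3600 * 10],   # hours 9-10
}

# Build the index on the *expression output*, not the raw column.
expr_index = {
    f: (min(map(hour_of, rows)), max(map(hour_of, rows)))
    for f, rows in files.items()
}

def prune(lo, hi):
    """Files that may contain rows with lo <= hour(ts) <= hi."""
    return [f for f, (mn, mx) in expr_index.items() if mx >= lo and mn <= hi]

print(prune(9, 11))  # only b.parquet can match
```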

Apache Hudi @apachehudi
Blog introducing expression indexes: hudi.apache.org/blog/2024/12/1…

Apache Hudi @apachehudi
Merge-on-Read is more than just fast writes 🚀. It also needs ongoing file compaction during ingestion 🔄. Key: this compaction can't block writers ⚠️. Hudi's async compaction ⚙️ runs as a background service, keeping ingestion smooth while merging logs and base files. No freshness lags or livelocks 🔒—unlike others that halt writers.
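A toy model of the non-blocking property (purely illustrative; Hudi's actual compactor is a distributed service, not this): the writer appends to a log, and the compactor grabs the pending log with a brief O(1) swap under a lock, then does the heavy merge outside the lock, so ingestion never waits on a merge.

```python
# Toy model of async compaction: a writer appends row-level updates to a
# log while a compactor folds the log into the base file WITHOUT blocking
# the writer. A short lock only swaps the log reference; the merge itself
# runs outside the lock. Illustrative only, not Hudi's implementation.
import threading

lock = threading.Lock()
base = {"k1": 1}     # compacted "base file" contents: key -> value
log = []             # pending "log file" entries: (key, value)

def write(key, value):
    with lock:
        log.append((key, value))   # writers never wait on a merge

def compact():
    global log
    with lock:
        pending, log = log, []     # O(1) swap under the lock
    for key, value in pending:     # heavy merge happens outside the lock
        base[key] = value

def read():
    with lock:
        merged = dict(base)
        merged.update(dict(log))   # readers see base + un-compacted log
    return merged

write("k1", 2)
write("k2", 7)
t = threading.Thread(target=compact); t.start(); t.join()
write("k3", 9)                     # ingestion continues around compaction
print(read())
```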

Apache Hudi @apachehudi
Explore concurrency pitfalls in this blog: hudi.apache.org/blog/2021/12/1…

Apache Hudi @apachehudi
Many pipelines claim “incremental” but rescan full tables for changes 🔍. That's just scheduled batch scans, not true incrementals. Apache Hudi delivers real incremental processing 🔄: Fetch only changes since a commit. Slashes costs 💸: Lower I/O 📉, shuffle 🔀, latency ⏱️, simpler DAGs 🗺️. Reimagine derived tables—update from upstream streams, skip full rebuilds. Hudi builds this right into storage.
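The contrast above can be sketched in a few lines (a conceptual model, not Hudi's API; commit IDs and structures are illustrative): if each commit records which rows changed, a downstream job checkpointed at a commit fetches only the changes after it, instead of rescanning the full table.

```python
# Sketch of incremental pulls: each commit records the rows it changed,
# so a downstream job fetches only changes after its checkpoint instead
# of rescanning the table. Commit IDs and structures are illustrative.
commits = {
    1: [("u1", {"spend": 10})],
    2: [("u2", {"spend": 5})],
    3: [("u1", {"spend": 25}), ("u3", {"spend": 7})],
}

def incremental_read(since_commit):
    """All row changes strictly after `since_commit`, in commit order."""
    return [
        change
        for c in sorted(commits)
        if c > since_commit
        for change in commits[c]
    ]

# Downstream job checkpointed at commit 2: touch 2 rows, not the whole table.
changes = incremental_read(since_commit=2)
print(changes)   # [('u1', {'spend': 25}), ('u3', {'spend': 7})]
```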

Apache Hudi @apachehudi
Operational data changes row by row. Lake storage is immutable. That mismatch is why old data lake pipelines rewrite entire partitions for small fixes, late events, or CDC updates/deletes. Apache Hudi matters because it adds two missing primitives: •row-level upserts/deletes •incremental reads of what changed That turns the lake into something you can actually operate.
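A tiny sketch of the first primitive (illustrative structures only, not Hudi's storage code): with an index from record key to file group, an upsert rewrites just the one small file that holds the record, rather than the whole partition.

```python
# Sketch of row-level upserts over immutable files: an index maps each
# record key to its file group, so an upsert copy-on-writes ONE file
# instead of rewriting the partition. Entirely illustrative.
partition = {                       # file -> {key: row}
    "fg-0.parquet": {"u1": {"city": "SF"}, "u2": {"city": "NY"}},
    "fg-1.parquet": {"u3": {"city": "LA"}},
}
record_index = {"u1": "fg-0.parquet", "u2": "fg-0.parquet", "u3": "fg-1.parquet"}

def upsert(key, row):
    """Rewrite only the file group that holds `key`."""
    f = record_index[key]           # index lookup: key -> file group
    new_contents = dict(partition[f])   # copy-on-write of ONE small file
    new_contents[key] = row
    partition[f] = new_contents
    return f                        # the single file rewritten

touched = upsert("u3", {"city": "SEA"})
print(touched)                      # fg-1.parquet: 1 of 2 files rewritten
```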

Apache Hudi @apachehudi
🎓 Vinoth at CMU on a key lakehouse problem: read/write amplification in MERGE operations. MERGE INTO is everywhere in ETL. Most merges only change a few columns per record. Yet full records get rewritten. @apachehudi's answer: partial update encoding on MOR tables. Write only the changed columns to log files. Defer full merges to compaction. Benchmark (1TB table, 100 fields, updates on 3): ⚡ Update latency: 1.4x faster 💾 Bytes written: 70.2x less I/O 🔍 Query latency: 5.7x faster 70x less write I/O. That's the power of writing only what changed. Watch: youtu.be/AYaw06_Xazo?si… #ApacheHudi #DataLakehouse #DataEngineering
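The mechanism in the talk can be modeled in miniature (a sketch, not Hudi's actual encoding): the log stores only the columns an update touched, and compaction later folds those partials onto the full base record.

```python
# Sketch of partial-update encoding on a MOR table: the log stores only
# the changed columns; compaction merges partials onto the full base
# record. Structures are illustrative, not Hudi's wire format.
base = {"id7": {"name": "Ann", "city": "SF", "spend": 10}}  # ~100 fields in practice
log = [("id7", {"spend": 25})]          # write ONLY the changed column

def compact(base, log):
    """Fold partial-column log entries onto full base records."""
    merged = {k: dict(v) for k, v in base.items()}
    for key, partial in log:
        merged.setdefault(key, {}).update(partial)
    return merged

print(compact(base, log))  # {'id7': {'name': 'Ann', 'city': 'SF', 'spend': 25}}
```

With wide tables (100 fields, 3 updated, as in the benchmark), writing one column instead of the full record is where the ~70x write-I/O reduction comes from.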

Apache Hudi @apachehudi
What if your query engine could skip entire partitions before looking at files? Hudi 1.0's dual-layer data skipping: 1️⃣ Partition stats skip partitions using partition-level min/max 2️⃣ Column stats then prune files within surviving partitions In a 1TB benchmark, files read dropped from 393,360 → 19,304. Gains vary by data and query, but the mechanism is clear: fewer partitions = fewer files scanned. ⚡ Both indexes enabled by default. No extra config. 👇 Deep dive with benchmarks: hudi.apache.org/blog/2025/10/2… #ApacheHudi #DataLakehouse #DataEngineering
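The two layers compose like this (stats and names are illustrative, not real Hudi metadata): partition-level min/max eliminates whole partitions first, then file-level column stats prune files only within the survivors.

```python
# Sketch of dual-layer data skipping: partition-level min/max prunes
# whole partitions, then file-level column stats prune files within the
# surviving partitions. Stats are illustrative.
partition_stats = {"p1": (0, 50), "p2": (51, 120)}  # partition -> (min, max) of ts
file_stats = {
    "p1": {"f1": (0, 20), "f2": (21, 50)},
    "p2": {"f3": (51, 90), "f4": (91, 120)},
}

def overlaps(stats, lo, hi):
    mn, mx = stats
    return mx >= lo and mn <= hi

def files_to_scan(lo, hi):
    survivors = [p for p, s in partition_stats.items() if overlaps(s, lo, hi)]
    return [
        f
        for p in survivors                 # layer 1: prune partitions
        for f, s in file_stats[p].items()
        if overlaps(s, lo, hi)             # layer 2: prune files
    ]

print(files_to_scan(95, 110))  # p1 pruned entirely; only f4 survives
```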

Apache Hudi @apachehudi
90% less data scanned. 58% faster queries. 🚀 Apache Hudi's secondary indexes bring database-style indexing to the lakehouse. CREATE INDEX idx_city ON hudi_table(city); That's it. Now queries on non-key fields skip irrelevant files instead of scanning everything. ✂️ 📊 Benchmark on 1TB TPCDS: 📉 67 GB scanned → 7 GB 📁 5000 files → 521 files ⚡ 14s → 6s For Athena users: less data scanned = lower costs 💰 👇 Deep dive with examples: hudi.apache.org/blog/2025/04/0… #ApacheHudi #DataLakehouse #DataEngineering
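The `CREATE INDEX` statement in the post is real Hudi SQL; what it buys can be modeled with a toy value-to-files map (the Python below is purely illustrative): an equality filter on a non-key column reads only files the index points at.

```python
# Toy model of a secondary index on a non-key column: value -> set of
# files containing it, so a filter on `city` scans only matching files.
# Illustrative only; Hudi's index lives in its metadata table.
files = {
    "f1": [{"id": 1, "city": "SF"}, {"id": 2, "city": "NY"}],
    "f2": [{"id": 3, "city": "LA"}],
    "f3": [{"id": 4, "city": "SF"}],
}

# CREATE INDEX idx_city ON hudi_table(city);  -- modeled as:
idx_city = {}
for f, rows in files.items():
    for row in rows:
        idx_city.setdefault(row["city"], set()).add(f)

def query_city(city):
    """Scan only files the index says may contain `city`."""
    return [
        r
        for f in sorted(idx_city.get(city, ()))
        for r in files[f]
        if r["city"] == city
    ]

print(query_city("SF"))   # 2 of 3 files scanned
```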

Apache Hudi @apachehudi
One Hudi job. Multiple Kafka topics. Multiple Hudi tables. 🚀 New AWS tutorial: CDC pipelines with the MultiTable Hudi Streamer (a.k.a. DeltaStreamer) ⚡ Process multiple MSK topics in parallel 🕐 15-min sync intervals 🔒 ACID guarantees 📐 Schema evolution support Common config + table-specific overrides = elegant simplicity. Full walkthrough: hudi.apache.org/blog/2026/01/1… #ApacheHudi #AWS #DataEngineering #CDC

Apache Hudi @apachehudi
@datahubhouse Thanks for the great content! re-shared x.com/apachehudi/sta…
Apache Hudi @apachehudi

🎬 Demystifying @apachehudi - solid intro to lakehouse fundamentals by @datahubhouse ! Covers: 📦 Copy-on-Write vs Merge-on-Read table types ⚡ Hudi 1.0 LSM timeline + secondary indexing 🔄 ACID, schema evolution, CDC, time travel Great starting point for data engineers evaluating lakehouse platforms. 🔗 youtube.com/watch?v=P0cfrp… #ApacheHudi #DataEngineering #DataLakehouse


DataHubHouse @datahubhouse
Demystifying Apache Hudi Apache Hudi is a sophisticated lakehouse platform designed to manage large-scale, mutable datasets through transactional table formats. The provided documentation highlights two primary storage strategies: Copy-on-Write, which is optimised for heavy read workloads by creating new base files, and Merge-on-Read, which balances performance via delta logs and background compaction. These sources detail the Hudi 1.0 release, introducing an enhanced LSM-based timeline for high-frequency writes and advanced secondary indexing to accelerate query speeds. The technical specifications explain how the system ensures ACID transactions and schema evolution across diverse engines like Spark and Flink. Furthermore, the texts explore Change Data Capture and incremental processing, allowing users to efficiently track record updates and perform time-travel queries. Ultimately, the materials demonstrate how Hudi transforms immutable cloud storage into a high-performance, stream-processing-friendly data environment. youtu.be/P0cfrpcM55Y?si… via @YouTube