RisingWave
1.1K posts

RisingWave
@RisingWaveLabs
Best-in-class stream processing, analytics, and management. 🚀 10x more productive. 🚀 10x more cost-efficient. Chat: https://t.co/MpFCvvNxz1
San Francisco Bay Area · Joined July 2021
896 Following · 3K Followers

Most data stacks still force a trade-off between fast ingestion and fast analytics.
Streaming systems optimize for writes.
Warehouses optimize for reads.
But rarely do you get both without adding complexity.
That’s where this architecture shines:
Build a Flexible Iceberg Lakehouse with RisingWave (CoW + MoR)
Why this matters
Traditional lakehouse pipelines often struggle with:
Write-heavy CDC workloads slowing down analytics
Read-optimized tables breaking under frequent updates
Complex tuning across ingestion and query layers
Extra pipelines just to balance performance trade-offs
By introducing configurable Iceberg write modes, you eliminate this friction.
What this architecture enables
Choose write-optimized (MoR) or read-optimized (CoW) per table
Handle high-throughput CDC without falling behind
Serve fast BI queries on clean, optimized data
Use a single system for both streaming + analytics
Native Iceberg integration with open table formats
Fully compatible with Spark, Trino, DuckDB
How the flow works
Streaming data (e.g. CDC from PostgreSQL) is ingested into RisingWave
RisingWave writes to Iceberg tables using configurable write modes:
Merge-on-Read (MoR) → fast streaming ingestion via delta files
Copy-on-Write (CoW) → optimized query performance via file rewrites
Iceberg tables are stored on S3 and managed via a catalog (e.g. Lakekeeper)
Compaction continuously optimizes storage and performance
Query engines like Spark, Trino, and DuckDB read the same tables directly
What you get
Streaming ingestion:
PostgreSQL → RisingWave (CDC, real-time updates)
Flexible storage layer:
Iceberg tables with CoW or MoR on S3
Multi-engine access:
Spark, Trino, DuckDB querying the same data
Unified pipeline:
Streaming + batch without duplication
Choosing the right write mode
Merge-on-Read (MoR) → for streaming workloads
Low write latency
Handles frequent updates efficiently
Slightly slower reads (merge required)
Copy-on-Write (CoW) → for analytics workloads
Fast, predictable queries
Clean data files
Higher write cost (file rewrites)
Common pattern
MoR → raw streaming / CDC tables
CoW → curated analytics / BI tables
This gives you the best of both worlds without extra pipelines.
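The trade-off between the two write modes can be sketched in plain Python. This is an illustrative toy model, not Iceberg's actual file layout: `MorTable` pays the merge cost at read time, `CowTable` pays the rewrite cost at write time, and both converge on the same data.

```python
# Toy sketch (not Iceberg internals): Merge-on-Read vs Copy-on-Write
# update strategies on a simple key/value "table".

class MorTable:
    """Merge-on-Read: writes append cheap deltas; reads merge them."""
    def __init__(self, base):
        self.base = dict(base)   # stand-in for immutable data files
        self.deltas = []         # stand-in for delta files

    def update(self, key, value):
        self.deltas.append((key, value))   # O(1) write, no rewrite

    def read(self):
        merged = dict(self.base)
        for key, value in self.deltas:     # merge cost paid at read time
            merged[key] = value
        return merged

class CowTable:
    """Copy-on-Write: writes rewrite affected files; reads are direct."""
    def __init__(self, base):
        self.base = dict(base)

    def update(self, key, value):
        rewritten = dict(self.base)        # rewrite cost paid at write time
        rewritten[key] = value
        self.base = rewritten

    def read(self):
        return self.base                   # clean files, no merge needed

mor = MorTable({"a": 1})
mor.update("a", 2)
cow = CowTable({"a": 1})
cow.update("a", 2)
assert mor.read() == cow.read() == {"a": 2}
```

Both tables end up with identical contents; the difference is only where the work happens, which is exactly why MoR suits hot CDC tables and CoW suits curated BI tables.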
The shift
This isn’t just about tuning performance.
It’s a shift from:
Rigid, one-size-fits-all storage
Fragmented ingestion vs analytics systems
To:
A flexible streaming lakehouse where each table is optimized for its purpose
A system where write and read performance are no longer at odds
A fully open, engine-agnostic architecture
Use RisingWave to stream data, choose the right Iceberg write mode per workload, and let engines like Spark, Trino, and DuckDB query the same data, optimized for both ingestion and analytics.


Build Real-time Apps Without Polling or Caches
Most frontend data stacks bolt on real-time as an afterthought.
REST APIs for reads
Caches for speed
WebSockets and polling for “live” updates
Result?
Complex, fragile systems.
The shift: make data live by design.
Build reactive apps with Surf + RisingWave.
Why this matters:
Traditional real-time apps suffer from:
Re-running queries on timers
Manual diffing and cache invalidation
Over-fetching even when nothing changes
What this enables:
Queries to materialized views that stay up to date
No polling, updates pushed via SUBSCRIBE
Transactions for safe writes
Live React UI with useQuery()
One WebSocket replaces REST and polling
How it works:
Write → RisingWave updates view →
SUBSCRIBE pushes changes →
UI re-renders automatically
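The write → view → push → re-render loop can be sketched in a few lines of Python. This is an illustrative sketch of the push model, not the actual Surf or RisingWave API; `LiveView` and its method names are hypothetical:

```python
# Toy sketch of push-based reactivity: a "materialized view" that
# notifies subscribers on every write instead of being polled.

class LiveView:
    def __init__(self):
        self.value = None
        self.subscribers = []

    def subscribe(self, callback):
        # In spirit: SUBSCRIBE over a single WebSocket
        self.subscribers.append(callback)

    def write(self, new_value):
        self.value = new_value             # the write commits...
        for notify in self.subscribers:
            notify(new_value)              # ...and the change is pushed

seen = []
view = LiveView()
view.subscribe(seen.append)                # stand-in for a UI re-render hook
view.write("order_count=42")
assert seen == ["order_count=42"]          # no timers, no cache invalidation
```

The subscriber never asks for data; it only reacts when the view changes, which is the whole shift from polling to push.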
The shift:
From: keep asking for data.
To: let the database push updates.
Architecture:
Before: Frontend → API → Cache → DB → Queue → WebSocket
After: Frontend ↔ Surf ↔ RisingWave
Result:
Simpler systems
Faster queries
Truly real-time apps
Declare your data once and let the database keep your UI in sync.
Read the blog by @kwannoel, the creator of Surf:
noelkwan.xyz/surf/developer…

Most data stacks still split streaming and analytics into separate worlds.
PostgreSQL handles transactions.
Streaming systems handle ingestion.
Warehouses handle analytics.
But the real power comes when all of them work together seamlessly.
That’s where this architecture shines:
Build a Streaming Lakehouse with PostgreSQL + RisingWave + Apache Iceberg (Glue Catalog) + Spark
Why this matters
Traditional pipelines often introduce unnecessary complexity:
Data duplication across systems
Fragile batch ETL jobs
Delays between ingestion and analytics
Tight coupling to specific engines
By combining CDC, streaming, and open table formats, you eliminate these gaps.
What this architecture enables
Real-time PostgreSQL CDC ingestion into RisingWave
Native Iceberg table management backed by AWS Glue
Continuous streaming writes with transactional consistency
Query the same data directly from Spark
Zero data duplication across systems
Fully open, engine-agnostic architecture
How the flow works
PostgreSQL changes are captured via CDC into RisingWave
RisingWave connects to an Iceberg catalog (AWS Glue) and S3 storage
An internal table is created using ENGINE = iceberg
Streaming data is continuously written into Iceberg tables
RisingWave supports real-time queries and materialized views
Spark queries the same Iceberg table directly
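The flow above can be sketched as a toy simulation: CDC events are applied to table state, and each commit publishes a consistent snapshot. This is an illustrative model, not RisingWave or Iceberg code; the Debezium-style op codes (`c`/`u`/`d`) are an assumption for the example:

```python
# Toy sketch of the CDC -> table -> snapshot flow (not real Iceberg commits).

def apply_cdc(state, events):
    """Apply a stream of (op, key, row) change events to table state."""
    for op, key, row in events:
        if op in ("c", "u"):               # create / update
            state[key] = row
        elif op == "d":                    # delete
            state.pop(key, None)
    return state

snapshots = []                             # stand-in for Iceberg snapshots

def commit(state):
    snapshots.append(dict(state))          # each commit is a consistent view

table = {}
apply_cdc(table, [("c", 1, {"name": "ada"}), ("c", 2, {"name": "bob"})])
commit(table)
apply_cdc(table, [("u", 1, {"name": "ada l."}), ("d", 2, None)])
commit(table)

assert snapshots[0] == {1: {"name": "ada"}, 2: {"name": "bob"}}
assert snapshots[1] == {1: {"name": "ada l."}}
```

Every engine reading a given snapshot sees the same committed state, which is what makes the same table safely queryable from Spark.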
What you get
Streaming ingestion: PostgreSQL → RisingWave
Open lakehouse storage: Iceberg on S3 with Glue catalog
Multi-engine access: Spark, Trino, DuckDB
Unified pipeline: Real-time + batch in one system
This isn’t just another pipeline.
It’s a shift from fragmented systems
to a unified streaming lakehouse
where data is fresh, open, and accessible from anywhere.
Use RisingWave to connect to PostgreSQL, stream CDC, and write directly into Iceberg, while engines like Trino, Spark, and DuckDB read from the same S3 tables.


Which Apache Iceberg Catalog Should You Choose?
Iceberg doesn’t have just one catalog.
And that choice matters more than most teams realize.
Your catalog determines how well your lakehouse handles:
metadata coordination
multi-engine access
governance
scaling
cloud portability
In a modern lakehouse:
Data = object storage
Format = Iceberg
Engines = Spark, Trino, Snowflake, RisingWave, etc.
Catalog = metadata control plane
Iceberg catalogs act as the single source of truth for table metadata, tracking schemas, snapshots, partitions, and file locations so engines know how to read and write tables safely.
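At its core, a catalog is a mapping from table name to the current metadata pointer, committed with compare-and-swap so two writers cannot both publish on top of the same snapshot. A minimal sketch (illustrative only, not any real catalog's API; the class and file names are hypothetical):

```python
# Toy sketch of a catalog as a metadata control plane with
# compare-and-swap commits.

class Catalog:
    def __init__(self):
        self.pointers = {}                 # table -> current metadata file

    def current(self, table):
        return self.pointers.get(table)

    def commit(self, table, expected, new_pointer):
        """Flip the pointer only if no one committed in between."""
        if self.pointers.get(table) != expected:
            return False                   # another writer won the race
        self.pointers[table] = new_pointer # atomic pointer flip
        return True

catalog = Catalog()
assert catalog.commit("events", None, "metadata-v1.json")
assert catalog.current("events") == "metadata-v1.json"
# A stale writer that still thinks the table is empty gets rejected:
assert not catalog.commit("events", None, "metadata-v2.json")
```

Everything else in the lakehouse (snapshots, schemas, file locations) hangs off whatever that pointer currently references.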
Think of it like an airport (an analogy from Arsham Eslami):
Data = cargo
Engines = planes
Catalog = air traffic control
Common catalog options
Hadoop – simple but not cloud-scale
Hive Metastore – widely supported but operationally heavy
AWS Glue – great for AWS ecosystems
JDBC – good for testing and PoCs
REST catalogs – open, multi-engine, multi-cloud
Examples include Polaris, Nessie, Gravitino, and Lakekeeper.
Why REST catalogs are gaining traction
They enable:
cross-engine sharing
standardized APIs
cloud portability
centralized governance
One API.
Any engine.
Any cloud.
RisingWave supports multiple Iceberg catalogs (Glue, Hive, JDBC, REST, Snowflake, Unity Catalog, Lakekeeper) and even provides a self-hosted Iceberg REST catalog for creating Iceberg-native tables directly from streaming pipelines.
Want a deeper comparison, with the pros and cons of each catalog? Read this blog by Fahad Shah:
risingwave.com/blog/apache-ic…

Build a Streaming Lakehouse with Kafka + RisingWave + Iceberg + DuckDB
Most streaming pipelines stop at ingestion.
Most lakehouses stop at storage.
The real value comes when streaming ingestion, open table storage, and external query engines all work together as one system.
That is exactly what this architecture enables:
Kafka → RisingWave → Iceberg → DuckDB
Why this matters:
Traditional pipelines often create silos.
Streaming systems handle ingestion.
Warehouse systems handle analytics.
This often leads to:
duplicate storage
fragile ETL pipelines
engine lock-in
slow access to fresh data
This setup solves that by combining real-time ingestion with open lakehouse storage, with Lakekeeper acting as the catalog and control plane that connects everything.
Real-time ingest from Kafka into RisingWave
RisingWave-managed Iceberg tables stored in object storage via Lakekeeper catalog
Continuous streaming writes with transactional table commits
Query the same table directly from DuckDB
No data copies and no proprietary lock-in
Open architecture compatible with other engines
It transforms your stack from
a collection of disconnected systems
into
a unified streaming lakehouse.
With this pattern, you get:
Streaming ingestion
Open Apache Iceberg storage
External query access from multiple engines
A clean path to unify real-time and batch workloads
In this flow:
Kafka streams user events into RisingWave
RisingWave connects to an Iceberg catalog (Lakekeeper) and object storage
A RisingWave-managed internal table is created
Streaming data is continuously written into Iceberg
RisingWave queries and materializes results in real time
DuckDB queries the same Iceberg table directly
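The key property of the flow above is that every engine resolves the table through the same catalog entry rather than a private copy. A toy sketch (illustrative only, not Kafka, RisingWave, or DuckDB code):

```python
# Toy sketch: one writer commits snapshots; any engine reads the same
# table state by resolving it through the shared catalog.

catalog = {"user_events": None}            # table -> latest snapshot id
store = {}                                 # stand-in for object storage

def commit(snapshot_id, rows):
    store[snapshot_id] = rows              # write the files first...
    catalog["user_events"] = snapshot_id   # ...then publish the pointer

def read(_engine):
    snapshot_id = catalog["user_events"]   # every engine starts here
    return store[snapshot_id]

commit("s1", [{"user": "u1", "event": "click"}])
assert read("risingwave") == read("duckdb")   # same data, zero copies
```

Because the catalog is the only handoff point, adding another reader (Spark, Trino) means pointing it at the catalog, not building another pipeline.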
What you get
Streaming ingest: Kafka → RisingWave
Open lakehouse storage: Iceberg on object storage
Multi-engine access: DuckDB, Spark, Trino
Managed table lifecycle through RisingWave
This is what the streaming lakehouse should look like.
Real-time ingestion and analysis
Open storage formats
Query from any Iceberg-compatible engine
And with RisingWave supporting Iceberg natively, the streaming lakehouse is no longer theoretical: it's here!


🔥Excited to partner with our friends at @platformatory to help bring another edition of Bangalore Streams to life on March 7!
Expect great conversations, plenty of networking, and good food. 🍕🧉
Save your spot here: meetup.com/bengaluru-stre…


Join us TOMORROW for the debut of our Customer Spotlight series, featuring a live discussion with GDU Labs: How GDU Labs Uses RisingWave to Turn Fragmented Data into Verified Profiles with @ProductPasha and Alex Robbin.
Save your spot: luma.com/2h8bla03

🔥Happens TOMORROW! NYC Open Source Data Happy Hour co-hosted by @aiven_io and @RisingWaveLabs.
We’ll talk Apache Kafka and Apache Iceberg, focusing on practical lessons, real-world use cases, and the latest technical insights.
Save your spot: luma.com/84ihoxyb


We are excited to be part of Developer Discovery Day, organized by @IMDAsg, on March 10. Alongside @h2oai, @kong, and @Redisinc, our Bohan Zhang will present Brewing Real-Time Intelligence with RisingWave.
Save your spot: developer-discovery.app.entry.gov.sg/form/30c7d0e9-…


🥁 Join us on March 5 for our very first Customer Spotlight webinar, a live convo with GDU Labs: How GDU Labs Uses RisingWave to Turn Fragmented Data into Verified Profiles with Alex Robbin (GDU Labs) and @ProductPasha (RisingWave).
🎟️Sign up here: luma.com/2h8bla03


Join us on March 4 in NYC for the Open Source Data Happy Hour by @aiven_io and RisingWave! Wrap up your day with some great convos about building modern data platforms with Apache Kafka and Apache Iceberg.
Save your spot: luma.com/84ihoxyb


Iceberg writes look simple… until you see what’s happening under the hood.
Iceberg write path in one line:
Data files → Manifest files → Manifest list → Metadata file → Catalog (pointer update)
Apache Iceberg Write Path (Engine → Table)
When an engine writes to an Iceberg table, it builds a new snapshot bottom-up, then publishes it atomically:
1) Data Files
The engine writes new data files to object storage (new files for inserts, and new/rewritten files for updates/merges/compaction).
2) Manifest Files
Iceberg records which data files were added and removed in manifest files, along with partition info and file-level stats.
3) Manifest List
Those manifest files are grouped into a manifest list for the snapshot being committed.
4) Metadata File
Iceberg writes a new table metadata file that describes the updated table state (snapshot, schema, history) and references the new manifest list.
5) Catalog (Current Metadata Pointer)
Finally, the catalog is updated to point to the new metadata file, publishing the new snapshot so every engine sees the latest table state.
Why this matters
Iceberg writes are atomic: engines first create the new snapshot’s files + metadata, then flip a single catalog pointer.
That means readers either see the old snapshot or the new snapshot, never a half-written state.
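The five steps above can be sketched as a toy simulation: build the snapshot bottom-up in storage, then flip the catalog pointer last. This is an illustrative model only; the file names and dict layout are hypothetical, not Iceberg's actual formats:

```python
# Toy sketch of the Iceberg write path: data files -> manifest ->
# manifest list -> metadata file -> catalog pointer flip.

storage = {}                                   # object store: path -> content
catalog = {"current_metadata": None}           # one pointer per table

def write_snapshot(version, data_rows):
    data_path = f"data-{version}.parquet"
    storage[data_path] = data_rows                        # 1) data files
    manifest = f"manifest-{version}.avro"
    storage[manifest] = [data_path]                       # 2) manifest file
    manifest_list = f"manifest-list-{version}.avro"
    storage[manifest_list] = [manifest]                   # 3) manifest list
    metadata = f"metadata-v{version}.json"
    storage[metadata] = {"manifest_list": manifest_list}  # 4) metadata file
    catalog["current_metadata"] = metadata                # 5) pointer flip (last!)

write_snapshot(1, [("a", 1)])
before = catalog["current_metadata"]
write_snapshot(2, [("a", 1), ("b", 2)])
# The pointer flips only after every file exists, so readers see either
# the old snapshot or the new one, never a half-written state.
assert before == "metadata-v1.json"
assert catalog["current_metadata"] == "metadata-v2.json"
```

Note that the pointer update is the only step readers can observe, which is what makes the whole commit effectively atomic.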


We’ve built an end-to-end demo that turns raw network metric streams into actionable, real-time insights using simple SQL.
It combines Kafka for high-throughput data ingestion, RisingWave for real-time stream processing, Iceberg for data lake persistence, and @Minio for object storage.
For details, read the blog below: risingwave.com/blog/real-time…

Ever wondered how an engine actually reads an Iceberg table?
Iceberg read path in one line:
Catalog → Metadata → Manifest list → Manifest files → Data files
Apache Iceberg Read Path (Engine → Table)
When an engine reads an Iceberg table, it walks this chain from top to bottom:
1) Catalog
The starting point.
Stores a pointer to the table’s current metadata file, which represents the latest snapshot reference.
2) Metadata File
Defines the table schema, lists snapshots, and references the manifest list for the snapshot being read.
3) Manifest List
Tracks all manifest files associated with the selected snapshot.
4) Manifest Files
Contain metadata about data files, including partition values and file-level statistics, which help determine which files should be read.
5) Data Files
The actual table data is stored in object storage. This is what the engine ultimately reads.
Why this matters
During reads, Iceberg resolves the snapshot through the catalog and metadata layers, then uses manifest metadata to identify the exact set of data files for that snapshot.
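The top-down chain above can be sketched as a toy traversal. Again an illustrative model, not Iceberg's real file formats; the paths and dict layout are hypothetical:

```python
# Toy sketch of the Iceberg read path: catalog -> metadata ->
# manifest list -> manifest files -> data files.

storage = {
    "metadata-v2.json": {"manifest_list": "manifest-list-2.avro"},
    "manifest-list-2.avro": ["manifest-2.avro"],
    "manifest-2.avro": ["data-1.parquet", "data-2.parquet"],
    "data-1.parquet": [("a", 1)],
    "data-2.parquet": [("b", 2)],
}
catalog = {"events": "metadata-v2.json"}       # 1) catalog pointer

def read_table(table):
    metadata = storage[catalog[table]]         # 2) metadata file
    manifest_list = storage[metadata["manifest_list"]]  # 3) manifest list
    rows = []
    for manifest_path in manifest_list:        # 4) manifest files
        for data_path in storage[manifest_path]:
            rows.extend(storage[data_path])    # 5) data files
    return rows

assert read_table("events") == [("a", 1), ("b", 2)]
```

In a real engine, the manifest-level partition values and statistics would be used to prune data files before step 5; the sketch reads everything for simplicity.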


We built an end-to-end demo that shows how to turn live logistics streams into accurate, real-time ETA insights using SQL, without complex stream-processing code.
It combines Kafka for ingestion and RisingWave for continuous processing and serving.
For details, read this blog: risingwave.com/blog/real-time…

Mark your calendar for Feb 19 and join us for What’s Ahead for RisingWave: The 2026 Roadmap.
Get an inside look at what we’re building next with @ProductPasha and be part of the conversation shaping our direction.
Save your spot: luma.com/1xukxo2t

