Apache Spark

181 posts

Apache Spark
@ApacheSpark

Lightning-fast unified analytics engine

Joined June 2013
1 Following · 43.9K Followers

Apache Spark@ApacheSpark·
A simple filter-and-sort over a few thousand rows would drop from 330ms to 150ms based on early testing: the planner skips unnecessary shuffles when it knows the data fits on one node. These features, on track to merge in Spark's 4.2/4.3 releases, lower the barrier for developers who want to start small and scale up without switching tools. Huge thanks to Daniel Tenedorio and Liang-Chi Hsieh for spearheading the efforts! 🚀
0 replies · 0 reposts · 1 like · 492 views

Apache Spark@ApacheSpark·
Apache Spark is getting faster on laptops.💻 Project Feather proposes three optimizations targeting local mode: query compilation and task scheduling improvements, Arrow-based df.cache, and shuffle-free execution for single-node queries. 👉 Read the SPIP: docs.google.com/document/d/1Np… #ApacheSpark #OpenSource #DataEngineering
2 replies · 5 reposts · 35 likes · 2.7K views

Apache Spark@ApacheSpark·
Hyukjin Kwon, Apache Spark PMC member, explains why Apache Arrow is becoming the universal language of data within the Spark ecosystem: it is not only about speed. 👇 🔸 Zero-copy columnar IPC → less memory overhead 🔸 One format to inspect data across JVM + Python 🔸 Arrow as the stack’s common columnar layer Full video: youtube.com/watch?v=zvq5Ui… #ApacheSpark #ApacheArrow
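The "zero-copy" idea behind Arrow IPC can be illustrated with a stdlib-only analogy (this is not Arrow itself, just the same principle in miniature): a `memoryview` slice is a new view over the same buffer, whereas a `bytes` slice allocates and copies. Arrow applies this idea to whole columnar record batches shared between the JVM and Python.

```python
data = bytes(range(8))

copied = data[0:4]            # bytes slice: allocates and copies 4 bytes
view = memoryview(data)[0:4]  # memoryview slice: no copy, same buffer

print(view.obj is data)       # the view still points at the original buffer
print(copied == bytes(view))  # both expose the same values
```

Both lines print `True`: the view shares memory with `data`, so "reading" it costs no serialization or copying, which is the property Arrow's columnar IPC format gives Spark across process boundaries.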
2 replies · 4 reposts · 27 likes · 2.1K views

Apache Spark@ApacheSpark·
The “small batch problem” is what people actually mean when they complain about streaming latency in Spark. 👇 Pattern: shrink the trigger interval → latency drops… until you hit a floor (usually a few hundred milliseconds). Below that, driver coordination overhead dominates and you stop getting returns. That floor isn’t a bug. It’s the cost of Spark’s driver-coordinated execution model (reliability at scale has overhead). Every streaming engine has a version of this; Structured Streaming just makes it easy to push into it. 🎥 Watch the full video: youtu.be/8--H9bLaja4?si… #ApacheSpark #Streaming #StructuredStreaming #DataEngineering
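The latency floor can be sketched as a toy cost model (not Spark's scheduler; the 150ms overhead figure is an illustrative assumption standing in for the "few hundred milliseconds" of driver coordination):

```python
def batch_latency(trigger_ms: float, overhead_ms: float = 150.0) -> float:
    """Toy model of one micro-batch's end-to-end latency.

    Every batch pays the trigger wait PLUS a fixed driver coordination
    cost (planning, task launch, offset commits). The fixed cost is the
    floor: shrinking the trigger below it buys almost nothing.
    """
    return trigger_ms + overhead_ms

for trigger in (5000, 1000, 200, 50, 10):
    print(f"trigger={trigger:>5}ms -> ~{batch_latency(trigger):.0f}ms")
```

Going from a 5s to a 1s trigger saves 4s of latency; going from 50ms to 10ms saves 40ms against a ~150ms total, which is the diminishing-returns regime the tweet describes.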
0 replies · 2 reposts · 21 likes · 2.2K views

Apache Spark@ApacheSpark·
The pipeline code most data engineers write is mostly plumbing. 🔧 Task ordering, retry logic, dependency wiring, incremental state tracking. None of it is the actual transformation. All of it has to be correct. Spark Declarative Pipelines inverts the model. You define what the data should look like. The engine resolves execution order, handles retries, manages incrementalization, and recovers from failures. The surface area for bugs shrinks. The code that remains is the logic that actually matters. 🔗 Full writeup on Medium: medium.com/apache-spark/d… #ApacheSpark #DeclarativePipelines #DataEngineering
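The inversion described above — declare datasets and let the engine derive execution order — can be sketched with the stdlib's `graphlib`. The dataset names are hypothetical, and real SDP does far more (retries, incrementalization, recovery); this only shows the dependency-resolution idea:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each dataset declares only what it reads from.
definitions = {
    "raw_orders": [],
    "clean_orders": ["raw_orders"],
    "daily_revenue": ["clean_orders"],
    "top_customers": ["clean_orders"],
}

# The engine, not the author, derives a valid execution order;
# no hand-written task wiring is involved.
order = list(TopologicalSorter(definitions).static_order())
print(order)  # raw_orders first, then clean_orders, then the two views
```

The author's code shrinks to the `definitions` mapping (the transformations); ordering, the part that had to be imperatively correct before, is computed.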
2 replies · 14 reposts · 78 likes · 4.2K views

Apache Spark@ApacheSpark·
There’s a lot to unpack in Spark 4.1, from Real-Time Mode to Recursive CTEs. 🙌 Which features are you most excited about?
0 replies · 0 reposts · 0 likes · 424 views

Apache Spark@ApacheSpark·
⚡ Real-Time Mode: sub-100ms streaming latency with a single trigger config change. Same checkpoint, same exactly-once guarantees. 🔁 Recursive CTEs: native WITH RECURSIVE support. Anchor query seeds the result. Recursive term iterates until it returns zero rows. Replaces driver-side while loops and UDFs for graph traversal, org hierarchies, and bill-of-materials. 🛡️ ANSI SQL by default: stricter type coercion, null handling, and divide-by-zero behavior. Existing jobs may surface errors that were previously silent. 🧩 Declarative Pipelines: define tables and their relationships. The engine resolves the DAG, handles incrementalization, manages recovery. 🔌 Spark Connect improvements: decoupled client/server architecture. Thinner clients, better IDE integration, language-agnostic entry points. #ApacheSpark #OpenSource #DataEngineering
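The Recursive CTE semantics described above — anchor query seeds the result, recursive term iterates until it returns zero rows — can be sketched in plain Python. The org-chart data is made up for illustration; in Spark 4.1 this loop is what `WITH RECURSIVE` expresses natively in SQL:

```python
# (manager, employee) edges in a hypothetical org chart.
reports = [("ceo", "vp_eng"), ("vp_eng", "alice"), ("vp_eng", "bob"),
           ("alice", "carol")]

def descendants(root):
    """WITH RECURSIVE by hand: the anchor seeds the working set, the
    recursive term re-runs against the previous iteration's rows, and
    iteration stops when a pass produces zero new rows."""
    result, frontier = set(), {root}          # anchor: just the root
    while frontier:                           # stop at zero new rows
        frontier = {e for m, e in reports if m in frontier} - result
        result |= frontier
    return result

print(sorted(descendants("vp_eng")))  # → ['alice', 'bob', 'carol']
```

This is exactly the driver-side while loop the feature replaces; moving it into SQL lets the optimizer plan the whole traversal.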
2 replies · 4 reposts · 33 likes · 2.4K views

Apache Spark@ApacheSpark·
Spark Declarative Pipelines ➕ Airflow 3: Two declarative layers, different scopes. SDP operates at the data transformation layer. You define streaming tables, materialized views, and their dependencies. The engine resolves execution order, handles incremental processing, and manages state and recovery automatically. Airflow 3 operates at the orchestration layer. Assets replace tasks as the primary abstraction. Dependencies are inferred from data relationships. Scheduling becomes a function of asset readiness, not cron expressions. 👉 The combination eliminates two categories of imperative glue: intra-pipeline DAG wiring and inter-pipeline scheduling logic. #ApacheSpark #ApacheAirflow #DeclarativePipelines #DataEngineering
0 replies · 7 reposts · 57 likes · 2.5K views

Apache Spark@ApacheSpark·
New video with Jerry Peng on Apache Spark Structured Streaming!⚡️ In this video, we walk through how the engine actually works under the hood: the unbounded table abstraction, micro-batch execution, exactly-once semantics via the write-ahead log, and the design decisions that still get misunderstood nearly a decade in. 🎥 Dive in: youtu.be/8--H9bLaja4 This is part 1 of a two-part series. If you want to understand why Structured Streaming is the default streaming engine for data platforms at scale, and why micro-batch is a design choice, not a compromise, start here.📍 #ApacheSpark #StructuredStreaming #DataEngineering #Streaming #DataPlatform #OpenSource
1 reply · 19 reposts · 83 likes · 3.6K views

Apache Spark@ApacheSpark·
Your Spark 3 code might fail in Spark 4.0. Here is why. ⚠️ If you’ve ever found a random NULL in your production table and spent hours tracing it back through 10 transformations, you’ve dealt with "Silent Nulls." In Spark 4.0, the "permissive" era is officially ending. 𝗔𝗡𝗦𝗜 𝗦𝗤𝗟 𝗶𝘀 𝗻𝗼𝘄 𝗲𝗻𝗮𝗯𝗹𝗲𝗱 𝗯𝘆 𝗱𝗲𝗳𝗮𝘂𝗹𝘁. What changes? 🔸𝗡𝗼 𝗺𝗼𝗿𝗲 𝘀𝗶𝗹𝗲𝗻𝘁 𝗳𝗮𝗶𝗹𝘂𝗿𝗲𝘀 🔸𝗗𝗮𝘁𝗮 𝗾𝘂𝗮𝗹𝗶𝘁𝘆 𝗳𝗶𝗿𝘀𝘁 🔸𝗣𝗼𝗿𝘁𝗮𝗯𝗶𝗹𝗶𝘁𝘆 @lisancao breaks down the full transition in our latest deep dive 👉 youtu.be/wSWdmS78ENE #ApacheSpark #DataEngineering #BigData #SQL
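The behavioral change can be sketched as a toy function — this is not Spark's implementation, just the observable difference between the two modes for one case, divide-by-zero (Spark's `try_divide` is the opt-in way to keep the tolerant behavior under ANSI mode):

```python
def divide(a, b, ansi=True):
    """Permissive mode turns divide-by-zero into a silent NULL (None);
    ANSI mode raises, so bad data fails loudly at the source."""
    if b == 0:
        if ansi:
            raise ZeroDivisionError("DIVIDE_BY_ZERO")
        return None  # the "silent null" you trace back 10 transformations later
    return a / b

print(divide(10, 0, ansi=False))  # → None   (Spark 3 default behavior)
# divide(10, 0)                   # raises   (Spark 4 default behavior)
```

The same pattern applies to overflow and invalid casts: values that used to degrade silently into NULLs now surface as errors at the operation that produced them.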
1 reply · 2 reposts · 14 likes · 1.8K views

Apache Spark@ApacheSpark·
Why did the Spark community build mapInArrow? 🤔 According to Apache Spark PMC member Hyukjin Kwon, the motivation was simple: enable vectorized processing of nested data without the overhead of Pandas conversion. 🔗 Watch the full breakdown of how Spark 3.3 introduced this shift: youtu.be/zvq5UiVxpEg #ApacheSpark #DataEngineering #ApacheArrow #Python
0 replies · 3 reposts · 14 likes · 2.3K views

Apache Spark@ApacheSpark·
Where is Apache Spark heading in Spark 4.2 and beyond? 🚀 Join @lisancao and Apache Spark PMC member Hyukjin Kwon as they break down the migration from Python "Pickle" to Apache Arrow and what’s next for the ecosystem. 🔗Watch the full interview: youtu.be/zvq5UiVxpEg Key Insights: 🔥 The Serialization Breakthrough: How Arrow solved the JVM-to-Python bottleneck 🏎️ The Zero-Copy Roadmap: Direct memory access coming in Spark 4.2/4.3 🛠️ Spark 4 Updates: New byte-count batching to prevent OOM issues 💡 Expert Advice: When to enable Arrow for maximum performance #ApacheSpark #ApacheArrow #Python
1 reply · 3 reposts · 23 likes · 1.4K views

Apache Spark@ApacheSpark·
At the Data Engineering Open Forum, Andreas Neumann will explore the evolution of Spark Declarative Pipelines (SDP). Introduced in Spark 4.1, SDP separates what a pipeline does from how it runs: 🔸 Developers declare datasets and transformations. 🔸 Spark constructs and manages the execution plan. If you want to learn how to build batch and streaming pipelines faster, you don't want to miss it! 📍 San Francisco | April 16 🔗 Full agenda: dataengineeringopenforum.com #opensource #apachespark #DEOF #dataengineering
0 replies · 3 reposts · 24 likes · 1.7K views

Apache Spark@ApacheSpark·
Stop building manual DAGs. 🛑 With Spark Declarative Pipelines (SDP), your data flow graph is inferred automatically from your dataset definitions. This modular approach separates your transformation logic from your dependency management. Part 1 of our SDP series covers: 🔸 Streaming Tables vs. Materialized Views 🔸 Creating a traceable dependency chain 🔸 Simple installation via pip 🔗 Dive in: medium.com/apache-spark/d… #PySpark #ApacheSpark #SDP #DataPipelines
0 replies · 13 reposts · 57 likes · 2.9K views

Apache Spark@ApacheSpark·
Come hang out in SF on April 16 at the Data Engineering Open Forum! 🚀 In the session "Apache Spark: Structured Streaming Real-Time Mode," Jerry Peng will dive into: 🔸 The evolution of Spark Structured Streaming 🔸 How Real-Time Mode (RTM) works 🔸 Insights into extending the Structured Streaming architecture for low-latency processing 🔸 Real-world examples of how users are putting it into practice 📋 Agenda: dataengineeringopenforum.com/?utm_source=li… 🎟 RSVP: luma.com/deof2026?utm_s… #apachespark #DEOF #dataengineering #opensource
0 replies · 1 repost · 5 likes · 963 views

Apache Spark@ApacheSpark·
For a decade, PySpark developers have wrestled with a specific architectural tax: the overhead of moving data between the JVM and Python. It was the bottleneck that kept Python from feeling truly native to the Spark engine. Then came Apache Arrow. 🏹 To celebrate 10 years of Arrow, we’re diving deep into the technical journey that transformed Spark into a zero-copy performance beast. In this video, @lisancao (@databricks), Matt Topol (Columnar), and Hyukjin Kwon (Databricks) map out exactly how this convergence happened and where it’s going in Spark 4.1. 🚀 🎥 Watch the full 10-year retrospective here: youtu.be/EiEgU4m8XfM They dive deep into: 🔸 𝗕𝗲𝘆𝗼𝗻𝗱 𝗣𝗶𝗰𝗸𝗹𝗲: Why the original serialization protocols couldn't scale and how Arrow’s columnar memory layout changed the game 🔸 𝗧𝗵𝗲 𝗥𝗼𝗮𝗱𝗺𝗮𝗽: From the first integration in Spark 2.3 to the "Map in Arrow" API in 3.3 🔸 𝗧𝗵𝗲 𝗙𝗶𝗻𝗮𝗹 𝗙𝗼𝗿𝗺: In Spark 4.1, Arrow isn't just a transport layer anymore—it’s the native execution format 🔸 𝗧𝗵𝗲 𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲 𝗛𝗶𝗲𝗿𝗮𝗿𝗰𝗵𝘆: A definitive "Tier List" for 2026. Should you use Catalyst, Scala, or Arrow-Native UDFs? #ApacheSpark #ApacheArrow #PySpark #OpenSource
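The "beyond Pickle" point — generic object serialization versus a contiguous fixed-width columnar buffer — can be illustrated with the stdlib alone. This is not Arrow's actual format; it only shows the direction of the overhead that columnar layouts remove:

```python
import pickle
from array import array

values = [float(i) for i in range(10_000)]

# Row/object-style: generic pickle of Python objects (the pre-Arrow path).
# Each float carries per-object opcode overhead on the wire.
pickled = pickle.dumps(values)

# Columnar-style: one contiguous fixed-width buffer, like an Arrow
# float64 column: exactly 8 bytes per value, addressable without
# deserializing any Python objects.
columnar = array("d", values).tobytes()

print(len(pickled), len(columnar))  # the columnar buffer is smaller
```

Beyond size, the bigger win Arrow brings is that a consumer can read such a buffer in place (the zero-copy path), while a pickle stream must be fully decoded into objects first.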
1 reply · 6 reposts · 61 likes · 3.5K views

Apache Spark@ApacheSpark·
Sandy Ryza explains how Spark Declarative Pipelines (SDP) brings the same "Infrastructure as Code" maturity to data engineering that changed the world of DevOps. The shift is simple: ☁️ Kubernetes manages containers via YAML. 🏗️ Terraform manages infrastructure as code. ⚡ SDP manages Spark pipelines via declared datasets and flows. 𝗧𝗵𝗲 𝗕𝗲𝗻𝗲𝗳𝗶𝘁: By moving away from imperative scripts, you gain better maintainability, fewer side effects, and a standardized way to version and deploy your Spark logic. 🎥 Learn more: youtu.be/WNPYEZ7SMSM #ApacheSpark #DataEngineering #SDP #Kubernetes #Terraform
4 replies · 8 reposts · 61 likes · 3.6K views

Apache Spark reposted
holden karau@holdenkarau·
Want to get involved in open source (specifically @ApacheSpark)? We're running a Community Sprint in Seattle on Friday, March 13, for folks wanting to get started: luma.com/rrfvx0ey
0 replies · 2 reposts · 4 likes · 1.4K views

Apache Spark@ApacheSpark·
🛑 Stop debugging in production. Catch errors in seconds with SDP. What happens when a typo in your table name threatens to break your entire batch pipeline? With Spark Declarative Pipelines (SDP), you find out before you ever hit "run." ✅ In this demo, @lisancao intentionally breaks a pipeline to show how Spark 4.1 handles the "unhappy path." The "Fail-Fast" Breakdown: ❌ 𝗧𝗵𝗲 𝗦𝗮𝗯𝗼𝘁𝗮𝗴𝗲: A simple typo—changing "orders" to "orderz"—is introduced in the sales pipeline script 🔍 𝗧𝗵𝗲 𝗗𝗿𝘆 𝗥𝘂𝗻: Using 𝚜𝚙𝚊𝚛𝚔-𝚙𝚒𝚙𝚎𝚕𝚒𝚗𝚎𝚜 𝚍𝚛𝚢-𝚛𝚞𝚗, Spark analyzes the project without processing a single byte of data 📉 𝗜𝗻𝘀𝘁𝗮𝗻𝘁 𝗥𝗲𝘀𝗼𝗹𝘂𝘁𝗶𝗼𝗻 𝗘𝗿𝗿𝗼𝗿𝘀: SDP identifies that the Iceberg "gold" table cannot resolve because its upstream dependency is missing. 💡 𝗛𝘂𝗺𝗮𝗻-𝗥𝗲𝗮𝗱𝗮𝗯𝗹𝗲 𝗟𝗼𝗴𝘀: Instead of a massive stack trace, Spark provides a clear, digestible message pinpointing exactly which flow failed and why By identifying upstream errors during the dry run, developers save time, compute costs, and the headache of cleaning up partial data loads. 🛡️ 🎥 Catch the full SDP video here: youtu.be/WNPYEZ7SMSM #ApacheSpark #DataEngineering #BigData #Iceberg #DataQuality #SDP
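The fail-fast resolution step can be sketched as a toy resolver in the spirit of the demo's `spark-pipelines dry-run` — check every declared dependency against the defined datasets before touching any data. The dataset names mirror the demo's hypothetical sales pipeline:

```python
# Hypothetical pipeline definitions, including the demo's sabotage:
# "orders" misspelled as "orderz" in a downstream dependency.
pipeline = {
    "orders": [],
    "customers": [],
    "gold_sales": ["orderz", "customers"],
}

def dry_run(defs):
    """Resolve every dependency without processing a single byte of data,
    returning human-readable errors instead of a runtime stack trace."""
    return [
        f"flow '{name}': upstream '{dep}' is not defined"
        for name, deps in defs.items()
        for dep in deps
        if dep not in defs
    ]

print(dry_run(pipeline))  # → ["flow 'gold_sales': upstream 'orderz' is not defined"]
```

Because the whole graph is declared up front, the typo is detectable statically; an imperative script would only discover it mid-run, after upstream tasks had already written partial data.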
0 replies · 4 reposts · 22 likes · 2.9K views