Apache Spark

165 posts


@ApacheSpark

Lightning-fast unified analytics engine

Joined June 2013
1 Following · 43.5K Followers
Apache Spark @ApacheSpark
For a decade, PySpark developers have wrestled with a specific architectural tax: the overhead of moving data between the JVM and Python. It was the bottleneck that kept Python from feeling truly native to the Spark engine. Then came Apache Arrow. 🏹 To celebrate 10 years of Arrow, we’re diving deep into the technical journey that transformed Spark into a zero-copy performance beast. In this video, @lisancao (@databricks), Matt Topol (Columnar), and Hyukjin Kwon (Databricks) map out exactly how this convergence happened and where it’s going in Spark 4.1. 🚀 🎥 Watch the full 10-year retrospective here: youtu.be/EiEgU4m8XfM They dive deep into: 🔸 𝗕𝗲𝘆𝗼𝗻𝗱 𝗣𝗶𝗰𝗸𝗹𝗲: Why the original serialization protocols couldn't scale and how Arrow’s columnar memory layout changed the game 🔸 𝗧𝗵𝗲 𝗥𝗼𝗮𝗱𝗺𝗮𝗽: From the first integration in Spark 2.3 to the "Map in Arrow" API in 3.3 🔸 𝗧𝗵𝗲 𝗙𝗶𝗻𝗮𝗹 𝗙𝗼𝗿𝗺: In Spark 4.1, Arrow isn't just a transport layer anymore—it’s the native execution format 🔸 𝗧𝗵𝗲 𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲 𝗛𝗶𝗲𝗿𝗮𝗿𝗰𝗵𝘆: A definitive "Tier List" for 2026. Should you use Catalyst, Scala, or Arrow-Native UDFs? #ApacheSpark #ApacheArrow #PySpark #OpenSource
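The pickle-vs-Arrow gap the video describes can be illustrated with nothing but the standard library: serializing records one at a time (as the original Python UDF path effectively did) pays a fixed framing cost on every row, while a single columnar batch amortizes it away. A toy sketch of the idea only — not Spark's actual serialization path, and Arrow goes much further by making the columnar batch zero-copy between runtimes:

```python
import pickle

# 100k rows of (id, value) pairs, as a row-oriented list of tuples
rows = [(i, float(i) * 0.5) for i in range(100_000)]

# Old path (conceptually): serialize record-by-record across the JVM/Python boundary,
# paying pickle's protocol framing on every single row
per_row_bytes = sum(len(pickle.dumps(r)) for r in rows)

# Columnar path (conceptually): one batch of contiguous columns,
# loosely analogous to an Arrow RecordBatch
columns = {"id": [r[0] for r in rows], "value": [r[1] for r in rows]}
batched_bytes = len(pickle.dumps(columns))

# The per-row framing overhead dominates at scale
print(per_row_bytes > batched_bytes)  # True
```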
Apache Spark @ApacheSpark
Sandy Ryza explains how Spark Declarative Pipelines (SDP) brings the same "Infrastructure as Code" maturity to data engineering that changed the world of DevOps. The shift is simple: ☁️ Kubernetes manages containers via YAML. 🏗️ Terraform manages infrastructure as code. ⚡ SDP manages Spark pipelines via declared datasets and flows. 𝗧𝗵𝗲 𝗕𝗲𝗻𝗲𝗳𝗶𝘁: By moving away from imperative scripts, you gain better maintainability, fewer side effects, and a standardized way to version and deploy your Spark logic. 🎥 Learn more: youtu.be/WNPYEZ7SMSM #ApacheSpark #DataEngineering #SDP #Kubernetes #Terraform
Apache Spark retweeted
holden karau @holdenkarau
Want to get involved in open source (specifically @ApacheSpark )? We're running a Community Sprint in Seattle luma.com/rrfvx0ey for folks wanting to get started on Friday, March 13
Apache Spark @ApacheSpark
🛑 Stop debugging in production. Catch errors in seconds with SDP. What happens when a typo in your table name threatens to break your entire batch pipeline? With Spark Declarative Pipelines (SDP), you find out before you ever hit "run." ✅ In this demo, @lisancao intentionally breaks a pipeline to show how Spark 4.1 handles the "unhappy path." The "Fail-Fast" Breakdown: ❌ 𝗧𝗵𝗲 𝗦𝗮𝗯𝗼𝘁𝗮𝗴𝗲: A simple typo—changing "orders" to "orderz"—is introduced in the sales pipeline script 🔍 𝗧𝗵𝗲 𝗗𝗿𝘆 𝗥𝘂𝗻: Using 𝚜𝚙𝚊𝚛𝚔-𝚙𝚒𝚙𝚎𝚕𝚒𝚗𝚎𝚜 𝚍𝚛𝚢-𝚛𝚞𝚗, Spark analyzes the project without processing a single byte of data 📉 𝗜𝗻𝘀𝘁𝗮𝗻𝘁 𝗥𝗲𝘀𝗼𝗹𝘂𝘁𝗶𝗼𝗻 𝗘𝗿𝗿𝗼𝗿𝘀: SDP identifies that the Iceberg "gold" table cannot resolve because its upstream dependency is missing. 💡 𝗛𝘂𝗺𝗮𝗻-𝗥𝗲𝗮𝗱𝗮𝗯𝗹𝗲 𝗟𝗼𝗴𝘀: Instead of a massive stack trace, Spark provides a clear, digestible message pinpointing exactly which flow failed and why By identifying upstream errors during the dry run, developers save time, compute costs, and the headache of cleaning up partial data loads. 🛡️ 🎥 Catch the full SDP video here: youtu.be/WNPYEZ7SMSM #ApacheSpark #DataEngineering #BigData #Iceberg #DataQuality #SDP
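Conceptually, the dry run is dependency resolution over the declared flow graph: every table a flow reads must be produced somewhere in the project. A stdlib-only toy version (purely illustrative, not the actual spark-pipelines implementation) showing how the "orderz" typo surfaces before a single byte of data is read:

```python
# Each entry: table a flow produces -> upstream tables it reads.
flows = {
    "bronze_orders": [],                 # ingested from an external source
    "silver_orders": ["bronze_orders"],
    "gold_sales":    ["orderz"],         # the sabotage: should be "silver_orders"
}

def dry_run(flows):
    """Analyze the graph without touching data; return human-readable errors."""
    errors = []
    for table, upstreams in flows.items():
        for dep in upstreams:
            if dep not in flows:
                errors.append(f"Flow '{table}' failed to resolve: "
                              f"upstream table '{dep}' is not defined")
    return errors

for err in dry_run(flows):
    print(err)  # pinpoints exactly which flow failed and why
```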
Apache Spark @ApacheSpark
New Project Feature Announcement! GitHub Issues are now available for the Apache Spark repository ✅ Users can now directly create issues on GitHub, as part of a pilot project to understand how using issues can supplement the existing JIRA infrastructure on the project. 🔗 Check it out here: github.com/apache/spark/i… #apachespark #oss #opensource
Apache Spark @ApacheSpark
Traditionally, Spark Structured Streaming has supported only micro-batch execution. That architecture's latency floor made it difficult to use Spark for low-latency use cases. The new Real-Time Mode in Spark 4.1 solves this problem. ✅ You now get true continuous stream processing at sub-second latency. All you need to do is change the trigger to the new RealTimeTrigger. Latencies can get as low as single-digit milliseconds, depending on the complexity of your transformations. #apachespark #opensource #structuredstreaming #oss
Apache Spark @ApacheSpark
We've been using declarative tools like Kubernetes and Terraform for years to manage infrastructure. Spark SQL even brought that same "specify the what, not the how" logic to individual queries. But why stop there? SDP takes that logic and applies it to the entire data pipeline. By building an end-to-end dependency graph before execution, you gain a holistic understanding of your data flow. As Sandy Ryza breaks down, this means: ✅ Instant feedback on whether your pipeline can run successfully ✅ Upstream and downstream schema matching ✅ Significant savings in time and compute costs 📺 Watch the full webinar: youtu.be/WNPYEZ7SMSM #ApacheSpark #DataEngineering #DataPipelines #SDP
Apache Spark @ApacheSpark
In this demo, Lisa N. Cao showcases a high-performance, local Lakehouse stack running Apache Spark 4.1, Apache Kafka, and Apache Iceberg—all within a single Docker container. 📦 The Efficiency Breakdown: 🏗️ 𝗢𝗻𝗲-𝗹𝗶𝗻𝗲 𝘀𝗲𝘁𝘂𝗽: Using 𝚜𝚙𝚊𝚛𝚔-𝚙𝚒𝚙𝚎𝚕𝚒𝚗𝚎𝚜 𝚒𝚗𝚒𝚝, developers can generate a project structure, YAML configurations, and transformation templates in seconds. 🌊 𝗦𝗲𝗮𝗺𝗹𝗲𝘀𝘀 𝗦𝘁𝗿𝗲𝗮𝗺𝗶𝗻𝗴: Watch how Kafka order streams flow directly into a "Silver Layer" streaming table with minimal boilerplate. 💪 𝗟𝗼𝗰𝗮𝗹 𝗟𝗮𝗸𝗲𝗵𝗼𝘂𝘀𝗲: Interoperable storage using Iceberg tables—perfect for local testing of batch and streaming analytics. 🛡️ 𝗜𝗻𝘀𝘁𝗮𝗻𝘁 𝗩𝗮𝗹𝗶𝗱𝗮𝘁𝗶𝗼𝗻: The Dry Run feature allows for full data flow graph validation without processing data, catching dependency issues before they hit production. Click here for the full SDP session! 👇🔗 youtube.com/watch?v=WNPYEZ… #ApacheSpark #DataEngineering #OpenSource #BigData #Lakehouse #Iceberg
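For readers who haven't run spark-pipelines init yet: the generated scaffold centers on a pipeline spec that points at your transformation sources. A hedged sketch of roughly what that YAML looks like — the key names here are assumptions, so check the generated file and the SDP Programming Guide for the authoritative schema:

```yaml
# pipeline.yml (illustrative sketch; verify key names against the generated file)
name: sales_pipeline
definitions:
  - glob:
      include: transformations/**/*.py    # Python flows and tables
  - glob:
      include: transformations/**/*.sql   # SQL flows and tables
```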
Apache Spark @ApacheSpark
Meet Sandy Ryza, one of the core engineers behind Spark Declarative Pipelines (SDP)! Sandy has been a part of the open-source community for over a decade, contributing to Apache Spark since 2014 and Hadoop before that. In this clip, Sandy explains why SDP is such a game-changer for the ecosystem and how you can get started today. 🚀 What’s inside the full talk? 🔸 How a declarative approach simplifies your data flow. 🔸 Learn how to install it in seconds using pip. 🔸 Why a shared way of thinking leads to better development for everyone. Sandy says it best: Open source allows the community to standardize and participate in what the future looks like. 🌐 📺 Watch the full talk: youtu.be/WNPYEZ7SMSM #ApacheSpark #DataPipelines #SDP #DataEngineering #OpenSource
Apache Spark @ApacheSpark
Stop micromanaging your pipelines! 🛑✋ Apache Spark 4.1 introduces 𝗦𝗽𝗮𝗿𝗸 𝗗𝗲𝗰𝗹𝗮𝗿𝗮𝘁𝗶𝘃𝗲 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲𝘀 (𝗦𝗗𝗣), and it’s a total shift in how we build data workflows. 🔄 𝗧𝗵𝗲 𝗼𝗹𝗱 𝘄𝗮𝘆: Manual sequencing, retry logic, and dependency chaos. 😫 𝗧𝗵𝗲 𝗦𝗗𝗣 𝘄𝗮𝘆: Declare your tables, define your flow, and let Spark do the heavy lifting. ✅ Check out our latest video with a core SDP engineer, Sandy Ryza, to see how this feature automates: ⚙️ Parallelism ⚙️ Checkpointing ⚙️ Pipeline Sequencing 🎥 Watch the full talk: youtu.be/WNPYEZ7SMSM #ApacheSpark #DataPipelines #SDP #DataEngineering #OpenSource
Apache Spark @ApacheSpark
Spark Declarative Pipelines makes it faster and easier to write reliable production pipelines. Here’s what Spark Declarative Pipelines does under the hood when you launch a pipeline using the new spark-pipelines CLI: 1️⃣ Register your decorated functions without executing them. 2️⃣ Infer dependencies from any spark.table() calls in your function source. 3️⃣ Build a DAG and execute your functions in topological order. 4️⃣ Handle writes and storage automatically. The result: far less manual boilerplate and dependency bookkeeping → more time saved and less room for human error. 📙 Explore the new Spark Declarative Pipelines Programming Guide: spark.apache.org/docs/latest/de… #ApacheSpark #OpenSource #DeclarativePipelines #SDP
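Those four steps can be mimicked in a few lines of plain Python — a conceptual stand-in, not SDP's real code, with `spark` replaced by a tiny stub so the sketch runs without a SparkSession:

```python
import inspect
import re
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

registry = {}  # table name -> function (registered, NOT executed yet)

def table(fn):
    # Step 1: register the decorated function without executing it.
    registry[fn.__name__] = fn
    return fn

def infer_deps(fn):
    # Step 2: infer upstream tables from spark.table("...") calls in the source.
    return set(re.findall(r'spark\.table\(["\'](\w+)["\']\)', inspect.getsource(fn)))

@table
def bronze_orders():
    return "raw orders"                    # stand-in for a real DataFrame

@table
def silver_orders():
    df = spark.table("bronze_orders")      # detected textually by infer_deps
    return f"cleaned {df}"

# Minimal SparkSession stand-in: table() just looks up an upstream result.
results = {}
class _Spark:
    def table(self, name):
        return results[name]
spark = _Spark()

# Steps 3 + 4: build the DAG and execute functions in topological order,
# storing each function's output as its table's contents.
dag = {name: infer_deps(fn) for name, fn in registry.items()}
for name in TopologicalSorter(dag).static_order():
    results[name] = registry[name]()

print(results["silver_orders"])  # cleaned raw orders
```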
Apache Spark @ApacheSpark
Managing Spark transformations just got significantly easier with the introduction of 𝗦𝗽𝗮𝗿𝗸 𝗗𝗲𝗰𝗹𝗮𝗿𝗮𝘁𝗶𝘃𝗲 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲𝘀 (𝗦𝗗𝗣) in Apache Spark 4.1. 🚀 As Lisa N. Cao breaks down in this clip, SDP handles the "heavy lifting" in the background so you don't have to. Here is what's happening under the hood: 🔹 𝗗𝗲𝗽𝗲𝗻𝗱𝗲𝗻𝗰𝘆 𝗢𝗿𝗱𝗲𝗿𝗶𝗻𝗴: SDP automatically detects table references and ensures your dependencies resolve before execution. 🔹 𝗥𝗲𝘀𝗶𝗹𝗶𝗲𝗻𝗰𝗲: Built-in handling for checkpoints, parallelism, and retries. 🔹 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗔𝗴𝗻𝗼𝘀𝘁𝗶𝗰: Author your pipelines in Python or SQL, making your work more shareable and accessible. 🔹 𝗢𝗿𝗰𝗵𝗲𝘀𝘁𝗿𝗮𝘁𝗶𝗼𝗻: Works with Airflow to handle application-level tasks while SDP manages the Spark transformations internally. Check out the new Spark Declarative Pipelines Programming Guide in the 4.1 docs to get started ➡️ spark.apache.org/docs/latest/de… #ApacheSpark #DeclarativePipelines #SparkConnect #OpenSource #PySpark #DataOps
Apache Spark @ApacheSpark
Apache Spark 4.1 has officially landed and it’s a massive leap forward for productivity and performance! 🚀 As Lisa Cao highlights in her latest update, this release is packed with features that fundamentally change how we build and manage Spark pipelines. Key highlights include: 🌟 𝗦𝗽𝗮𝗿𝗸 𝗗𝗲𝗰𝗹𝗮𝗿𝗮𝘁𝗶𝘃𝗲 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲𝘀 (𝗦𝗗𝗣) 🌟 𝗦𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲𝗱 𝗦𝘁𝗿𝗲𝗮𝗺𝗶𝗻𝗴 𝗥𝗲𝗮𝗹-𝗧𝗶𝗺𝗲 𝗠𝗼𝗱𝗲 (𝗥𝗧𝗠) 🌟 𝗣𝘆𝗦𝗽𝗮𝗿𝗸 𝗘𝗻𝗵𝗮𝗻𝗰𝗲𝗺𝗲𝗻𝘁𝘀 🌟 𝗦𝗽𝗮𝗿𝗸 𝗖𝗼𝗻𝗻𝗲𝗰𝘁 & 𝗠𝗟 🌟 𝗦𝗤𝗟 𝗦𝗰𝗿𝗶𝗽𝘁𝗶𝗻𝗴 🔗 Check out the full release notes: spark.apache.org/releases/spark… #ApacheSpark #SparkConnect #OpenSource #PySpark #DataOps
Apache Spark @ApacheSpark
Have questions for Niranjan? Let us know in the comments. ⤵️
Apache Spark @ApacheSpark
Hey all! Thanks for being here. Where are you joining us from? 🌐
Apache Spark @ApacheSpark
Join us 𝗧𝗢𝗠𝗢𝗥𝗥𝗢𝗪, 𝗗𝗲𝗰𝗲𝗺𝗯𝗲𝗿 𝟵 at 𝟵𝗔𝗠 𝗣𝗧! 🚀 From ingestion to metadata management, scalable data platforms form the foundation of every GenAI pipeline. 🔗 RSVP: luma.com/openlakehouse-… This Open Lakehouse + AI webinar will break down the architecture of @nvidia's GPU-accelerated Data Science Platform, focusing on data processing and critical metadata management for GenAI workloads. We’ll highlight real-world use of Apache Spark™, RAPIDS, @DeltaLakeOSS, and @kubeflow to deliver reproducible, high-performance data operations purpose-built for GenAI. #opensource #oss #genai #nvidia
Apache Spark @ApacheSpark
Earlier this year, Spark Declarative Pipelines (SDP) was announced, making it dramatically easier to build robust Apache Spark pipelines with a framework that abstracts away orchestration complexity. 🚀 Want to know what's next❓ The SDP declarative framework extends beyond individual queries to enable a mix of batch and streaming pipelines, keeping multiple datasets fresh. Check out Sandy Ryza's session from Open Lakehouse + AI (Nov 13), where he shares a broader vision for the future of SDP: ✅ Understand the core concepts behind SDP ✅ Learn where the architecture is headed ✅ Discover what this shift means for existing users and Spark engineers 🎥 Dive in: youtube.com/watch?v=0XOSw7… #ApacheSpark #DataEngineering #DeclarativePipelines #SDP
Apache Spark @ApacheSpark
One week to go! 🚀 Discover how @NVIDIA’s GPU-accelerated Data Science Platform streamlines data processing and metadata capture for GenAI applications. Get a firsthand look at the platform’s architecture, APIs for ingestion, processing, and retrieval, and how it leverages 𝗥𝗔𝗣𝗜𝗗𝗦 𝗔𝗰𝗰𝗲𝗹𝗲𝗿𝗮𝘁𝗼𝗿 𝗳𝗼𝗿 𝗔𝗽𝗮𝗰𝗵𝗲 𝗦𝗽𝗮𝗿𝗸™ to accelerate pipelines with open source technologies like Apache Spark™, RAPIDS, @DeltaLakeOSS, and @kubeflow. 🗓️ Tuesday, Dec 9 🕜 9AM PT 🔗 Register: luma.com/openlakehouse-… #openlakehouse #opensource #oss #ai #genai #apachespark