Scott Haines

2.7K posts

@newfront

OSS Engineer | Speaker • Trainer | #DatabricksMVP | Author @OReillyMed | ❤️ #ApacheSpark. ❤️ #Dogs. #DatabricksMVP. Views are my own

California, USA · Joined November 2008
870 Following · 837 Followers
Scott Haines reposted
MotherDuck
MotherDuck@motherduck·
What happens when you use Claude to psychoanalyze Claude? We ran 50 BIRD-Bench questions through a testing harness using Claude Opus 4.5 and our MCP Server, harvested every chain-of-thought trace, then deployed a team of Claude sub-agents to classify what went right, what went wrong, and why. We're calling it Claude-ception. New on the blog 👇️ motherduck.com/blog/claudecep…
1 reply · 1 repost · 7 likes · 948 views
Scott Haines reposted
Delta Lake
Delta Lake@DeltaLakeOSS·
📣 𝗡𝗲𝘅𝘁 𝗢𝗽𝗲𝗻 𝗟𝗮𝗸𝗲𝗵𝗼𝘂𝘀𝗲 + 𝗔𝗜 𝗪𝗲𝗯𝗶𝗻𝗮𝗿: 𝗧𝘂𝗲𝘀𝗱𝗮𝘆, 𝗠𝗮𝗿𝗰𝗵 𝟭𝟬! Open table formats promise engine-agnostic access, but independent protocol maintenance is costly. The Delta Kernel solves this by abstracting the Delta Lake protocol behind a clean API. 🛠️ Join this session to explore how @ClickHouseDB integrated 𝚍𝚎𝚕𝚝𝚊-𝚔𝚎𝚛𝚗𝚎𝚕-𝚛𝚜 into its single-binary C++ build system. 🚀 🎟️ Register: luma.com/OLAI-310 #openlakehouse #oss #deltalake #openlakehouseai #clickhouse
0 replies · 1 repost · 6 likes · 529 views
Mim
Mim@mim_djo·
This will be written in the history books :) The Lakehouse is officially a warehouse. The more things change, the more they stay the same. And that is not necessarily a bad thing.
Quoting Delta Lake (@DeltaLakeOSS):
The Next Evolution of Delta Lake: Catalog-Managed Tables 🚀
We are excited to share that 𝗗𝗲𝗹𝘁𝗮 𝗟𝗮𝗸𝗲 𝟰.𝟭.𝟬 introduces 𝗰𝗮𝘁𝗮𝗹𝗼𝗴-𝗺𝗮𝗻𝗮𝗴𝗲𝗱 𝘁𝗮𝗯𝗹𝗲𝘀, which establish the catalog as the coordinator of table access and source of truth for table state! This evolution simplifies discovery and governance while unlocking significant performance gains.
🔹 𝗦𝘁𝗮𝗻𝗱𝗮𝗿𝗱𝗶𝘇𝗲𝗱 𝘁𝗮𝗯𝗹𝗲 𝗱𝗶𝘀𝗰𝗼𝘃𝗲𝗿𝘆 𝗮𝗻𝗱 𝘂𝗻𝗶𝗳𝗶𝗲𝗱 𝗴𝗼𝘃𝗲𝗿𝗻𝗮𝗻𝗰𝗲: The catalog facilitates access through logical table identifiers and grants clients appropriate permissions to data, dramatically simplifying how engines discover and use tables in a governed manner.
🔹 𝗘𝗻𝗳𝗼𝗿𝗰𝗲𝗮𝗯𝗹𝗲 𝗰𝗼𝗻𝘀𝘁𝗿𝗮𝗶𝗻𝘁𝘀: The catalog can authoritatively validate or reject schema and constraint changes, preventing incompatible updates that could compromise data integrity or break downstream workloads.
🔹 𝗢𝗽𝗲𝗻 𝗳𝗼𝘂𝗻𝗱𝗮𝘁𝗶𝗼𝗻: This design aligns Delta with the catalog-managed model pioneered by Apache Iceberg, making it simpler for practitioners to discover and govern data consistently regardless of format.
Check out the blog to learn more 👉 delta.io/blog/2026-02-0…
#deltalake #catalogs #unitycatalog #opensource #oss

3 replies · 6 reposts · 49 likes · 5.3K views
Scott Haines reposted
Delta Lake
Delta Lake@DeltaLakeOSS·
The Next Evolution of Delta Lake: Catalog-Managed Tables 🚀
We are excited to share that 𝗗𝗲𝗹𝘁𝗮 𝗟𝗮𝗸𝗲 𝟰.𝟭.𝟬 introduces 𝗰𝗮𝘁𝗮𝗹𝗼𝗴-𝗺𝗮𝗻𝗮𝗴𝗲𝗱 𝘁𝗮𝗯𝗹𝗲𝘀, which establish the catalog as the coordinator of table access and source of truth for table state! This evolution simplifies discovery and governance while unlocking significant performance gains.
🔹 𝗦𝘁𝗮𝗻𝗱𝗮𝗿𝗱𝗶𝘇𝗲𝗱 𝘁𝗮𝗯𝗹𝗲 𝗱𝗶𝘀𝗰𝗼𝘃𝗲𝗿𝘆 𝗮𝗻𝗱 𝘂𝗻𝗶𝗳𝗶𝗲𝗱 𝗴𝗼𝘃𝗲𝗿𝗻𝗮𝗻𝗰𝗲: The catalog facilitates access through logical table identifiers and grants clients appropriate permissions to data, dramatically simplifying how engines discover and use tables in a governed manner.
🔹 𝗘𝗻𝗳𝗼𝗿𝗰𝗲𝗮𝗯𝗹𝗲 𝗰𝗼𝗻𝘀𝘁𝗿𝗮𝗶𝗻𝘁𝘀: The catalog can authoritatively validate or reject schema and constraint changes, preventing incompatible updates that could compromise data integrity or break downstream workloads.
🔹 𝗢𝗽𝗲𝗻 𝗳𝗼𝘂𝗻𝗱𝗮𝘁𝗶𝗼𝗻: This design aligns Delta with the catalog-managed model pioneered by Apache Iceberg, making it simpler for practitioners to discover and govern data consistently regardless of format.
Check out the blog to learn more 👉 delta.io/blog/2026-02-0…
#deltalake #catalogs #unitycatalog #opensource #oss
0 replies · 10 reposts · 52 likes · 10.2K views
Scott Haines reposted
LanceDB
LanceDB@lancedb·
Our next Lance Community Sync is this Thursday 2/26 at 9am PT! Everyone is welcome to join. Subscribe to the Lance Format mailing list to receive the meeting invite and get notified: groups.google.com/a/lance.org/g/…
We’ll be keeping a running doc, so feel free to add any discussion topics here: docs.google.com/document/d/1cP…
0 replies · 2 reposts · 4 likes · 372 views
Scott Haines reposted
Delta Lake
Delta Lake@DeltaLakeOSS·
Open table formats promise engine-agnostic access, but independent protocol maintenance is costly. The Delta Kernel solves this with a clean API for optimized readers. Join us to see how @ClickHouseDB integrated delta-kernel-rs into its zero-dependency C++ build system. 🚀
We will cover:
📌 The Kernel’s architecture
📌 Real-world challenges of embedding Rust in a C++ codebase, from static linking and sanitizer support to cross-compilation failures
📌 What’s next for the project
🗓️ March 10, 2026 🕕 9:00AM PT
Save your spot! ➡️ luma.com/OLAI-310
#DeltaLake #ClickHouse #Rust #CPP #OpenSource
0 replies · 3 reposts · 7 likes · 539 views
Scott Haines reposted
Andre Landgraf
Andre Landgraf@andrelandgraf·
Releasing add-mcp: add-skill, but for installing MCP servers.
MCP suffers from the same problem as agent skills: every agent ships its own config files. This makes installing MCP servers super annoying, and add-mcp is meant to fix that.
23 replies · 20 reposts · 120 likes · 34.5K views
Scott Haines reposted
Delta Lake
Delta Lake@DeltaLakeOSS·
📣 Join us next Wednesday, Feb 11 at 9AM PT for Delta Hacks: How Delta can propel organizations to AI data readiness!
🔗 REGISTER: luma.com/deltahacks-0211
AI success starts with your data, but building reliable data pipelines shouldn’t feel like rocket science. 🚀 We’re cutting through the noise to show you how to build production-grade pipelines using the first principles of software and data engineering.
We’ll cover:
🔹 Bridging the Gap: Moving from traditional SQL thinking to scalable PySpark patterns
🔹 Architecture over Hype: Applying software engineering principles to your data flow
🔹 Practical Frameworks: How to write cleaner, more maintainable pipelines that solve data quality issues for good
#deltalake #opensource #oss #ai #dataengineering #pipelines
0 replies · 2 reposts · 10 likes · 517 views
Scott Haines reposted
Delta Lake
Delta Lake@DeltaLakeOSS·
📣 Engineering Dynamic Lineage: Column-level lineage using @OpenLineage, @ApacheSpark, and Delta Lake
Traditional static lineage tools fail to track real-time data flow in complex enterprise environments. Join us as we explore 𝗱𝘆𝗻𝗮𝗺𝗶𝗰, 𝗱𝗲𝘁𝗲𝗿𝗺𝗶𝗻𝗶𝘀𝘁𝗶𝗰 𝗹𝗶𝗻𝗲𝗮𝗴𝗲 𝗮𝘀 𝘁𝗲𝗹𝗲𝗺𝗲𝘁𝗿𝘆: treating query execution as parseable events that reveal current data state.
𝗪𝗲'𝗹𝗹 𝗰𝗼𝘃𝗲𝗿:
🔹 Dynamic, deterministic lineage vs. static maps
🔹 OpenLineage as the open source alternative to proprietary solutions
🔹 Stitching flows across 1000s of jobs via Spark listeners
🔹 Integrating lineage alongside tables in the Lakehouse
🔗 Register: luma.com/delta-0224
🗓️ Feb 24 | 9AM PT
📺 Live on LinkedIn, YouTube & X
#opensource #oss #deltalake #telemetry #openlineage #apachespark
0 replies · 5 reposts · 6 likes · 458 views
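The "stitching flows across jobs" idea in the lineage post above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not OpenLineage's event schema or client API: the event dictionaries, the `build_lineage` and `trace_sources` helpers, and the dataset names are all hypothetical. Each job reports which input columns fed each output column, and walking those edges backwards recovers the original sources.

```python
from collections import defaultdict


def build_lineage(events):
    """Map each (dataset, column) to the (dataset, column) pairs it was derived from."""
    upstream = defaultdict(set)
    for event in events:
        for column, inputs in event["columns"].items():
            upstream[(event["output"], column)].update(inputs)
    return upstream


def trace_sources(upstream, dataset, column):
    """Walk upstream edges and return the columns that have no recorded parents."""
    sources, seen, stack = set(), set(), [(dataset, column)]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        parents = upstream.get(node)
        if parents:
            stack.extend(parents)
        else:
            sources.add(node)
    return sources


# Two jobs, reported separately, that together form one flow (hypothetical data).
events = [
    {"job": "ingest", "output": "silver.orders",
     "columns": {"total": [("bronze.orders", "price"), ("bronze.orders", "qty")]}},
    {"job": "report", "output": "gold.revenue",
     "columns": {"revenue": [("silver.orders", "total")]}},
]

lineage = build_lineage(events)
roots = trace_sources(lineage, "gold.revenue", "revenue")
```

Here `roots` holds the two `bronze.orders` columns, even though no single job event mentions both `gold.revenue` and `bronze.orders`.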
Scott Haines reposted
Apache Spark
Apache Spark@ApacheSpark·
Stop micromanaging your pipelines! 🛑✋
Apache Spark 4.1 introduces 𝗦𝗽𝗮𝗿𝗸 𝗗𝗲𝗰𝗹𝗮𝗿𝗮𝘁𝗶𝘃𝗲 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲𝘀 (𝗦𝗗𝗣), and it’s a total shift in how we build data workflows. 🔄
𝗧𝗵𝗲 𝗼𝗹𝗱 𝘄𝗮𝘆: Manual sequencing, retry logic, and dependency chaos. 😫
𝗧𝗵𝗲 𝗦𝗗𝗣 𝘄𝗮𝘆: Declare your tables, define your flow, and let Spark do the heavy lifting. ✅
Check out our latest video with a core SDP engineer, Sandy Ryza, to see how this feature automates:
⚙️ Parallelism
⚙️ Checkpointing
⚙️ Pipeline Sequencing
🎥 Watch the full talk: youtu.be/WNPYEZ7SMSM
#ApacheSpark #DataPipelines #SDP #DataEngineering #OpenSource
3 replies · 34 reposts · 214 likes · 12.1K views
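To make the "declare your tables, let the engine sequence them" idea from the post above concrete, here is a minimal Python sketch. It is not the SDP API: the `Pipeline` class, its `table` decorator, and the dataset names are hypothetical, and the standard library's `graphlib` stands in for Spark's planner. Each table declares only what it reads from, and the run order is derived from those declarations.

```python
from graphlib import TopologicalSorter


class Pipeline:
    """Toy declarative pipeline (hypothetical, not the Spark SDP API)."""

    def __init__(self):
        self._builders = {}   # table name -> function that builds it
        self._deps = {}       # table name -> names of upstream tables

    def table(self, *depends_on):
        """Register a function as a table built from `depends_on`."""
        def register(fn):
            self._builders[fn.__name__] = fn
            self._deps[fn.__name__] = set(depends_on)
            return fn
        return register

    def run(self):
        """Build every table once, upstream tables first."""
        results = {}
        order = list(TopologicalSorter(self._deps).static_order())
        for name in order:
            inputs = [results[d] for d in sorted(self._deps[name])]
            results[name] = self._builders[name](*inputs)
        return order, results


pipeline = Pipeline()

# Declared out of order on purpose: sequencing comes from the dependencies.
@pipeline.table("clean_orders")
def daily_totals(clean_orders):
    return sum(row["amount"] for row in clean_orders)

@pipeline.table("raw_orders")
def clean_orders(raw_orders):
    return [row for row in raw_orders if row["amount"] > 0]

@pipeline.table()
def raw_orders():
    return [{"amount": 10}, {"amount": -1}, {"amount": 5}]

order, results = pipeline.run()
```

The caller never writes the sequence `raw_orders`, `clean_orders`, `daily_totals`; it falls out of the declared dependencies. A real engine would add the parallelism, checkpointing, and retries the post mentions on top of the same dependency graph.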
Scott Haines reposted
Matei Zaharia
Matei Zaharia@matei_zaharia·
Super excited that we’re open sourcing the Dicer autosharder! It’s become a critical piece of infrastructure in Databricks that’s made many of our systems more scalable and reliable, and it’s powered by really cool systems work. databricks.com/blog/open-sour…
5 replies · 34 reposts · 240 likes · 17.5K views
Scott Haines reposted
Delta Lake
Delta Lake@DeltaLakeOSS·
Unify your graph and lakehouse worlds in one stack. 🙌✅
🗓️ January 21 🕓 9:00AM PT
🔗 RSVP: luma.com/OLAI-121
See how Delta Lake + @puppyquery let you run Cypher and Gremlin queries, multi-hop analytics, and graph visualizations directly on your existing tables, without moving data off your lakehouse or maintaining duplicate systems. From attack-path analysis in security to transaction flows in finance and complex supply chains, learn how relationship-native analytics can unlock insights your standard SQL queries miss.
#opensource #oss #deltalake #puppygraph #openlakehouse #lakehouse
0 replies · 2 reposts · 2 likes · 322 views
Scott Haines reposted
Delta Lake
Delta Lake@DeltaLakeOSS·
How do you power high-performance, scalable analytics while maintaining an open data lakehouse? 🤔 In this deep dive, Robert Pack and Scott Haines explore the powerful intersection of Apache Arrow Flight, DataFusion, and Delta Lake.
🎥 Watch the full session recording here: youtube.com/watch?v=gsThsq…
The session unpacks:
🔹 𝗙𝗹𝗶𝗴𝗵𝘁 𝗦𝗤𝗟: Defining a semantic layer for fast, efficient SQL communication over Arrow Flight.
🔹 𝗗𝗲𝗹𝘁𝗮 𝗜𝗻𝘁𝗲𝗴𝗿𝗮𝘁𝗶𝗼𝗻: How Delta table semantics, like VACUUM and OPTIMIZE, seamlessly integrate into a Flight SQL server.
🔹 𝗦𝘁𝗿𝗲𝗮𝗺𝗹𝗶𝗻𝗲𝗱 𝗖𝗼𝗺𝗽𝘂𝘁𝗲: Unlocking low-memory streaming compute capabilities backed by DataFusion.
#deltalake #datafusion #apachearrow #flightsql #opensource #oss
0 replies · 1 repost · 6 likes · 498 views
Scott Haines reposted
MotherDuck
MotherDuck@motherduck·
If you're streaming data into DuckDB, INSERT statements become a bottleneck fast. DuckDB's Appender API bypasses the SQL layer entirely. No parsing, no query planning. You write directly to the columnar storage format, which means you can handle real-time ingestion without the usual speed/batch size trade-off.
How it works: Stream rows through a low-level API. Data caches in batches before writing to disk. You're essentially using a binary protocol instead of SQL strings.
Good for:
- Kafka consumers or message queue ingestion
- Log aggregation pipelines
- IoT sensor data collection
- Any scenario where data arrives continuously
The trade-offs: It's order and type sensitive. You match columns exactly, no inference. One constraint violation fails the entire batch, no partial inserts. And you're writing to a single table per Appender instance.
Available in C, C++, Go, Java, and Rust. For batch ETL or small datasets, regular INSERT is simpler and fine. But for streaming? This is the tool.
Check out the docs: duckdb.org/docs/stable/da…
7 replies · 16 reposts · 141 likes · 28K views
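The batching and all-or-nothing behaviour described in the Appender post above can be sketched in plain Python. This is a toy model, not DuckDB's API (the post lists C, C++, Go, Java, and Rust as the available languages): `ToyAppender`, `append_row`, `flush`, and the `readings` list are hypothetical names, and a strict type check stands in for constraint checking.

```python
class ToyAppender:
    """Toy model of an append-only batch writer (hypothetical, not DuckDB's API)."""

    def __init__(self, table, column_types, batch_size=3):
        self.table = table                  # one target table per appender
        self.column_types = column_types    # expected Python type per column, in order
        self.batch_size = batch_size
        self._pending = []                  # rows buffered since the last flush

    def append_row(self, *values):
        """Buffer one row; flush automatically once the batch is full."""
        self._pending.append(values)
        if len(self._pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Validate the whole batch, then write it; any bad row rejects all of it."""
        batch, self._pending = self._pending, []
        for row in batch:
            if len(row) != len(self.column_types):
                raise ValueError(f"expected {len(self.column_types)} columns, got {len(row)}")
            for value, expected in zip(row, self.column_types):
                if type(value) is not expected:
                    raise TypeError(f"{value!r} is not {expected.__name__}")
        self.table.extend(batch)


readings = []  # stands in for the table's storage
appender = ToyAppender(readings, (int, float), batch_size=2)

appender.append_row(1, 20.5)
appender.append_row(2, 21.0)        # batch is full: both rows are written

appender.append_row(3, 19.8)
try:
    appender.append_row("4", 22.1)  # wrong type in column 0, no inference
except TypeError:
    rejected = True                 # the whole second batch was dropped
else:
    rejected = False
```

After this runs, `readings` contains only the first two rows: the valid row `(3, 19.8)` is lost along with the bad one, which is the "no partial inserts" trade-off the post describes.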