databricksdaily

380 posts

@databricksdaily

Your daily Databricks guide | Tips, tricks & real-world use cases | Data Engineering enthusiast | Exploring Databricks opportunities

Remote · Joined September 2025
44 Following · 46 Followers
databricksdaily
databricksdaily@databricksdaily·
Opposite situation: jobs are slow, but executors aren’t maxed out. That’s a horizontal scaling moment. If data can be processed in parallel (ETL pipelines, batch ingestion, multiple users), adding more workers reduces runtime because Spark spreads partitions across nodes. Scale out when the workload is parallel, not memory-blocked. #databricks
databricksdaily@databricksdaily

Ever increased workers to speed up a Databricks job… and nothing changed? That’s usually a vertical scaling problem. If executors are running out of memory (big joins, caching, heavy shuffle), adding more machines won’t help. You need stronger machines → bigger node type. Scale up when the bottleneck is inside each node, not cluster size. #databricks
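A sketch of how those two knobs show up in a cluster spec (field names follow the Databricks Clusters API; the cluster name, node type, Spark version, and worker counts here are illustrative):

```json
{
  "cluster_name": "etl-nightly",
  "spark_version": "15.4.x-scala2.12",
  "node_type_id": "Standard_E8ds_v5",
  "autoscale": { "min_workers": 2, "max_workers": 8 }
}
```

Scaling up means changing `node_type_id` to a larger VM; scaling out means raising `max_workers` so Spark can spread partitions across more executors.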

databricksdaily
databricksdaily@databricksdaily·
When you pick a node type in Databricks, you’re choosing the kind of machine each driver and worker runs on. Standard_DS3_v2 → general-purpose node (balanced CPU + memory), a good starting point for most ETL workloads. Standard_E8ds_v5 → memory-optimized node, better when jobs rely on large joins, caching, or heavy Spark shuffle. So a node type isn’t just a name: it signals whether the machine is general, compute-heavy, or memory-heavy, which directly impacts performance and cost. #databricks
databricksdaily@databricksdaily

Before working with Databricks clusters, it helps to know that the compute underneath isn’t custom hardware; it’s cloud virtual machines. Node types like Standard_DS3_v2 or Standard_E8ds_v5 are standard Azure VM sizes. Databricks simply uses them as cluster nodes, while organizations may restrict or template which ones teams can use for cost control and governance. #databricks

databricksdaily
databricksdaily@databricksdaily·
I used to think error handling in Databricks meant wrapping everything in try/except. Then I learned the platform already handles a big part of failure: Spark automatically retries failed tasks using lineage, so node crashes rarely stop pipelines. So the real engineering question shifts from “how do I catch errors?” to: What should happen when data is wrong? Fail the pipeline? Silently drop rows? Or quarantine bad records and keep good data moving? Mature Databricks pipelines are designed to recover from system failures and make data issues visible, auditable, and fixable. That mindset changed how I design ingestion and transformation flows. #Databricks #DataEngineering #DatabricksInterviewPrep
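A minimal plain-Python sketch of the quarantine pattern (in a real pipeline this would be a DataFrame filter writing to a quarantine table; the validation rule and records below are hypothetical):

```python
def quarantine(records, is_valid):
    """Split records into (good, bad) instead of failing or silently dropping."""
    good, bad = [], []
    for rec in records:
        # Bad records are kept and routed aside, so they stay auditable and fixable
        (good if is_valid(rec) else bad).append(rec)
    return good, bad

rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": None}]
good, bad = quarantine(rows, lambda r: r["amount"] is not None)
```

Good data keeps moving; bad rows land somewhere visible instead of killing the run.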
databricksdaily
databricksdaily@databricksdaily·
You don't need a standalone Vector DB anymore. Unity Catalog Vector Search syncs directly with your Delta tables. Word of caution: The index is only as good as your embedding model. If your 'Silver' layer data is noisy, your RAG app's retrieval will be too. Clean data > Fancy models. #Databricks
databricksdaily
databricksdaily@databricksdaily·
The most underrated part of the ecosystem? Delta Sharing. Being able to share live data sets with external partners without them even having a Databricks workspace, and without copying data, is the ultimate 'no-silo' move. Is this the end of the SFTP era? #Databricks
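For context, the moving parts are just a few SQL statements (the share, recipient, and table names below are hypothetical):

```sql
-- Create a share and add a live table to it
CREATE SHARE partner_share;
ALTER SHARE partner_share ADD TABLE sales.gold.daily_revenue;

-- Register the external partner and grant access
CREATE RECIPIENT acme_corp;
GRANT SELECT ON SHARE partner_share TO RECIPIENT acme_corp;
```

The recipient reads the live table through the open Delta Sharing protocol; nothing is exported or duplicated.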
databricksdaily
databricksdaily@databricksdaily·
Even with Auto-Compact and Optimize-on-Write, the 'small file problem' can still haunt your Gold layer. If your DESCRIBE DETAIL shows thousands of 1MB files, your SELECT performance will tank. Don't rely solely on the engine; sometimes a manual OPTIMIZE is still your best friend. #Databricks
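A minimal check-and-fix sequence, with a hypothetical table name:

```sql
-- numFiles and sizeInBytes in the output reveal a small-file problem
DESCRIBE DETAIL gold.daily_revenue;

-- Manual compaction when the engine hasn't kept up
OPTIMIZE gold.daily_revenue;
```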
databricksdaily
databricksdaily@databricksdaily·
We can run Python UDFs inside Databricks SQL, but use them sparingly: because they run in a fenced environment, they don't benefit from the same Photon vectorization as native SQL functions. If you can write it in pure SQL, do it. #Databricks
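To illustrate (function, table, and column names are made up): the same masking logic as a fenced Python UDF versus pure SQL that stays on the vectorized path:

```sql
-- Python UDF: runs in a fenced sandbox, row by row
CREATE OR REPLACE FUNCTION mask_email(s STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
local, domain = s.split('@')
return local[0] + '***@' + domain
$$;

-- Pure-SQL equivalent: eligible for Photon vectorization
SELECT concat(left(email, 1), '***@', split_part(email, '@', 2)) AS masked
FROM users;
```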
databricksdaily
databricksdaily@databricksdaily·
Stop building complex DLT pipelines for simple aggregations. Materialized Views in the SQL Editor now support 'Triggered' refreshes. They act like a table but update like a stream. Caution: Ensure your source tables have Change Data Feed (CDF) enabled for the most efficient MV refreshes. #Databricks
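A minimal sketch with hypothetical table names: enable CDF on the source, then create the MV and refresh it on demand:

```sql
-- Change Data Feed lets the MV refresh incrementally instead of recomputing
ALTER TABLE silver.orders
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

CREATE MATERIALIZED VIEW gold.orders_by_day AS
SELECT order_date, count(*) AS orders
FROM silver.orders
GROUP BY order_date;

-- Triggered refresh: run manually or from a scheduled job
REFRESH MATERIALIZED VIEW gold.orders_by_day;
```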
databricksdaily
databricksdaily@databricksdaily·
Implementing Row-Level Filtering via UC functions is much cleaner than the old 'Dynamic View' hacks. But a word of caution: keep your filter logic simple. Complex lookups in a row-level filter can bloat the query plan and hurt BI concurrency. Performance and security are always a trade-off. #Databricks
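The shape of a simple one (schema, function, table, and group names here are hypothetical): a cheap group-membership check, no joins or lookups in the filter body:

```sql
-- Boolean UDF that decides row visibility per caller
CREATE OR REPLACE FUNCTION sec.region_filter(region STRING)
RETURNS BOOLEAN
RETURN is_account_group_member(concat('sales_', region));

-- Attach the filter to the table's region column
ALTER TABLE sales.gold.orders
  SET ROW FILTER sec.region_filter ON (region);
```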
databricksdaily
databricksdaily@databricksdaily·
Don't just migrate to Unity Catalog to check a box. Use the three-level namespace (catalog.schema.table) to properly segregate Prod, Staging, and Dev within a single metastore. It’s the foundation for fine-grained access control and the only way to get true cross-workspace data lineage. #Databricks
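A sketch of the environment split (catalog, schema, and group names are illustrative):

```sql
-- One catalog per environment inside the same metastore
CREATE CATALOG IF NOT EXISTS dev;
CREATE CATALOG IF NOT EXISTS prod;

-- Fine-grained access: analysts can read prod, never dev
GRANT USE CATALOG, SELECT ON CATALOG prod TO `analysts`;

-- Fully qualified three-level reference
SELECT * FROM prod.sales.orders;
```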
databricksdaily
databricksdaily@databricksdaily·
If your UPDATE and DELETE operations feel sluggish, check if Deletion Vectors are enabled. DVs stop the 'rewrite-on-change' cycle by simply marking rows as deleted in a sidecar file. It turns your Delta Lake into a high-concurrency engine that handles CDC workloads like a traditional RDBMS. #Databricks
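Checking and enabling them is one table property (table and column names below are hypothetical; DVs are on by default for newer tables):

```sql
-- Enable Deletion Vectors explicitly on an older table
ALTER TABLE silver.customers
  SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true');

-- This now marks rows in a sidecar file instead of rewriting Parquet files
DELETE FROM silver.customers WHERE is_test_account = true;
```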
databricksdaily
databricksdaily@databricksdaily·
Photon isn’t just a faster execution engine anymore. The recent Vectorized Shuffle updates are changing the game for wide aggregations. By keeping data in columnar format throughout the shuffle and optimizing for CPU cache locality, it’s hitting 1.5x throughput on heavy shuffles. #Databricks
databricksdaily
databricksdaily@databricksdaily·
Technical tip: If you’re still manually tuning ZORDER BY columns, you’re living in 2022. Liquid Clustering is the new standard for a reason. It uses Hilbert curves to allow for incremental clustering, meaning you aren't rewriting the whole table just to optimize a few new partitions. Lower write amplification, higher performance. #Databricks
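The switch is a one-line clause (table and column names are illustrative):

```sql
-- Liquid Clustering replaces PARTITIONED BY + ZORDER on new tables
CREATE TABLE gold.events (
  event_date DATE,
  user_id    BIGINT,
  payload    STRING
) CLUSTER BY (event_date, user_id);

-- Existing tables can be migrated, then clustered incrementally
ALTER TABLE gold.events_legacy CLUSTER BY (event_date, user_id);
OPTIMIZE gold.events_legacy;
```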
databricksdaily
databricksdaily@databricksdaily·
The shift from AQE to Predictive Query Execution (PQE) in the SQL Warehouse is underrated. While standard Spark AQE only re-plans between stages, PQE monitors tasks in real-time. If it detects a massive spill or data skew, it kills and re-plans mid-stage. That’s how you get that 25% latency reduction on 'ugly' joins. #Databricks
databricksdaily
databricksdaily@databricksdaily·
Stop guessing your DBU burn. The system.billing.usage table in UC is a goldmine. You can join it with system.query_metrics to find exactly which user or dashboard is running inefficient cross-joins. Data observability is built in now; use it. #Databricks
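A starting-point query against the billing system table (column names are taken from system.billing.usage as I recall them; verify against your workspace's system schema before relying on it):

```sql
-- Which warehouses burned the most DBUs in the last 7 days
-- (usage_metadata is a struct of resource ids)
SELECT usage_metadata.warehouse_id,
       sum(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_start_time >= current_date() - INTERVAL 7 DAYS
GROUP BY usage_metadata.warehouse_id
ORDER BY dbus DESC;
```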
databricksdaily
databricksdaily@databricksdaily·
Big fan of where the Photon engine is headed. My only word of caution for teams diving in: make sure your governance is tight. These performance gains really rely on Unity Catalog and Liquid Clustering; if you’re still clinging to legacy Hive Metastore or manual Z-Ordering, you won't see the full 'no-tuning' magic yet. #Databricks
Databricks@databricks

Analytics teams want to spend more time answering questions and less time tuning systems or tracking down costs. Recent improvements have made Databricks SQL 5x faster on average, with performance delivered automatically and no index or parameter management required. Learn how Databricks SQL delivers faster analytics with no tuning. databricks.com/blog/sql-datab…

databricksdaily
databricksdaily@databricksdaily·
A quick daily check that helps a lot:
• Look for unexpected compute spikes
• Scan query plans for bad joins
• Make sure prod notebooks aren’t installing packages on the fly
Small checks add up over time. #DataArchitect #TechTips #Databricks
databricksdaily
databricksdaily@databricksdaily·
Sharing data shouldn’t mean copying files around. Delta Sharing lets partners access live data without needing to be on Databricks. No exports, no duplicates, fewer things to break. #DataSharing #OpenSource #Databricks