Firstly, let’s talk about me. A bit of background on why I’m here writing this blog post:
For those who know me and my career journey, I spent the first decade of my career working in various analyst roles at Whitbread. Towards the tail end of my time there, Whitbread invested in Databricks, which felt like a huge step forward — albeit a daunting one. As someone who was already confident in SQL, suddenly I wasn’t just querying curated datasets; I could ingest data, build transformations, create tables, and schedule jobs myself. It opened the door to what looked and felt like data engineering, without needing to change tools or languages. Like many analysts, I learned Databricks by doing: experimenting in notebooks, copying patterns from colleagues, and focusing on getting correct results out the other end.
What I didn’t fully appreciate at the time was that Databricks isn’t “a better SQL engine”; it’s a distributed compute platform. Most of the performance and cost issues teams run into don’t come from bad SQL, but from reasonable assumptions carried over from traditional data warehouses: that file sizes don’t matter, that the engine will always optimise execution for you, or that if a query runs once, it will continue to scale. Databricks rarely fails loudly when you get these things wrong; jobs usually still run, just slower and more expensive as data grows.
Whilst studying for my master’s, I met analysts from across different organisations and saw that my experience was a common theme. Analysts are given powerful platforms like Databricks, encouraged to move faster and own more of the data pipeline, but largely left to self-teach the engineering principles that sit underneath. That’s not a criticism; Databricks has intentionally been designed to be accessible, which is naturally a double-edged sword, with empowerment on one side and a steep, often invisible learning curve on the other.
During my time at Coeo, I’ve had the opportunity to learn by working alongside specialists who deal with these challenges every day. The techniques in this post are the lessons I wish I’d learned earlier: simple ways to bridge the gap between strong SQL skills and efficient, scalable Databricks workloads.
And finally, what you’re all here for – my recommendations for performance tuning techniques:
- Use job clusters instead of all-purpose clusters
All-purpose clusters are great for exploration, but for scheduled workloads, job clusters are leaner and start fresh each run. They reduce resource contention, isolate jobs, and often cut costs, especially for workloads that run periodically.
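As a concrete sketch, this is roughly what a Jobs API 2.1 create-job payload looks like when the task runs on its own job cluster rather than pointing at a shared all-purpose cluster. The job name, notebook path, node type, and runtime version below are placeholders, not values from any real workspace.

```python
# Sketch of a Databricks Jobs API 2.1 payload using a job cluster.
# Placeholder values throughout; the key point is "job_cluster_key"
# referencing a "job_clusters" definition instead of "existing_cluster_id".
job_payload = {
    "name": "nightly-sales-refresh",
    "tasks": [
        {
            "task_key": "transform",
            "notebook_task": {"notebook_path": "/Repos/analytics/nightly_refresh"},
            # A fresh cluster per run, defined below, instead of an
            # "existing_cluster_id" pointing at an all-purpose cluster.
            "job_cluster_key": "nightly_cluster",
        }
    ],
    "job_clusters": [
        {
            "job_cluster_key": "nightly_cluster",
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",  # placeholder runtime
                "node_type_id": "Standard_DS3_v2",    # placeholder node type
                "num_workers": 4,
            },
        }
    ],
}
```

Because the cluster exists only for the duration of the run, you get isolation and a clean environment each time, and you stop paying for idle all-purpose compute between runs.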
- Turn on autoscaling (and set sensible bounds)
Autoscaling allows your cluster to grow and shrink with demand, but defaults can be too aggressive or too conservative. Setting sensible min/max nodes ensures your jobs scale efficiently without over-provisioning or hitting limits that slow everything down. You can find sensible bounds by looking at historical job metrics — CPU, memory, task duration — and testing a few runs to see what handles both your smallest and largest workloads comfortably.
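One way to turn those historical metrics into bounds is a simple heuristic: size the maximum to your peak observed task parallelism (plus some headroom), and let the minimum be a fraction of that so the cluster can shrink between bursts. This is my own rough rule of thumb, not a Databricks formula:

```python
import math

def suggest_autoscale_bounds(peak_parallel_tasks, cores_per_worker, headroom=1.2):
    """Rough heuristic for autoscale min/max workers.

    max_workers covers the peak concurrent tasks seen in job metrics,
    with some headroom; min_workers is a quarter of that so the cluster
    can scale down during quiet stretches.
    """
    max_workers = max(1, math.ceil(peak_parallel_tasks * headroom / cores_per_worker))
    min_workers = max(1, max_workers // 4)
    return min_workers, max_workers

# e.g. a peak of 160 concurrent tasks observed on 8-core workers
suggest_autoscale_bounds(160, 8)  # -> (6, 24)
```

The resulting pair maps onto the cluster spec’s `"autoscale": {"min_workers": ..., "max_workers": ...}` block. Re-run the check whenever data volumes shift noticeably.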
- Pick the latest stable Databricks Runtime
Newer runtimes bring optimisations, bug fixes, and better connector support. Upgrading can make queries faster “for free,” but always test before switching — some libraries or workflows may behave slightly differently.
- Test and tune for scale; don’t assume small-dataset performance will hold
A query that runs fine on thousands of rows may become very slow on millions or billions. Distributed systems introduce hidden costs: shuffle, partition skew, and network I/O. Monitor job execution, test at scale, and tune accordingly. Small mistakes in assumptions can turn fast queries into slow and costly ones.
- Use Delta Lake features, not just Delta format
Saving a table as Delta is just the start. Take advantage of Delta Lake features like Z-ordering, data skipping, optimised writes, and time travel. These features reduce unnecessary reading and writing of data, speed up queries, and make your pipelines more resilient.
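These are the kinds of statements that paragraph is referring to, held as plain strings here so the sketch runs anywhere; in a notebook you would pass each to `spark.sql()`. The table, column, and version number are invented for illustration:

```python
# Delta Lake maintenance and time-travel statements (illustrative names).
maintenance_sql = [
    # Compact small files and co-locate rows by a commonly filtered column,
    # which strengthens data skipping on reads.
    "OPTIMIZE sales.orders ZORDER BY (customer_id)",
    # Remove data files no longer referenced by the table
    # (subject to the configured retention period).
    "VACUUM sales.orders",
    # Time travel: query the table as it stood at an earlier version.
    "SELECT * FROM sales.orders VERSION AS OF 42",
]
```

Running `OPTIMIZE` on a schedule after heavy write activity is a common pattern; Z-order only by columns you actually filter on, since the clustering benefit is per-column.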
- Avoid over-partitioning by date
It’s tempting to create one file per day or hour, but too many small files increase shuffle overhead and slow queries. Instead, aim for a few larger files per partition. This often leads to faster aggregation and joins, especially at scale.
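A quick sanity check makes the problem visible: divide your daily data volume by the number of partitions it’s split into, and compare against a target file size. The helper below is a back-of-envelope sketch of that arithmetic, with 128 MB as an assumed target:

```python
def partitions_too_fine(daily_bytes, partitions_per_day, target_file_mb=128):
    """True if each partition holds well under the target file size,
    suggesting the date partitioning is too fine-grained."""
    per_partition_mb = daily_bytes / partitions_per_day / (1024 ** 2)
    return per_partition_mb < target_file_mb

# 2 GB/day split into hourly partitions -> ~85 MB each, under target
partitions_too_fine(2 * 1024**3, 24)  # -> True

# The same 2 GB/day in 8 partitions -> 256 MB each, fine
partitions_too_fine(2 * 1024**3, 8)   # -> False
```

If the check trips, coarsen the partition column (day instead of hour, month instead of day) or drop explicit partitioning and rely on compaction and data skipping instead.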
- Control shuffle behaviour (spark.sql.shuffle.partitions)
By default, Spark may create hundreds or thousands of partitions during joins or aggregations. That can overwhelm the cluster with tiny tasks. Adjusting spark.sql.shuffle.partitions based on your dataset size balances task execution and reduces wasted CPU cycles. Fewer, well-sized tasks often beat lots of tiny ones.
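A simple way to choose the value is to aim for a fixed amount of data per shuffle partition. This sketch assumes a ~200 MB-per-partition target, which is a reasonable starting point rather than an official number:

```python
import math

def suggest_shuffle_partitions(shuffle_input_bytes, target_partition_mb=200):
    """Size spark.sql.shuffle.partitions so each shuffle partition
    holds roughly target_partition_mb of data."""
    return max(1, math.ceil(shuffle_input_bytes / (target_partition_mb * 1024**2)))

# ~50 GB being shuffled at ~200 MB per partition -> 256 partitions,
# applied with spark.conf.set("spark.sql.shuffle.partitions", 256)
suggest_shuffle_partitions(50 * 1024**3)  # -> 256
```

On recent runtimes, adaptive query execution (`spark.sql.adaptive.enabled`) can coalesce shuffle partitions for you at runtime, so treat manual tuning as a fallback for cases where AQE’s choices still underperform.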
- Understand that file sizes matter
Small files create overhead; huge files create uneven workloads. Databricks and Spark best practices recommend aiming for 100–500 MB per file for optimal distributed performance.
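To see where a table stands against that guidance, you can classify its data files by size. The helper below is a minimal sketch; on Databricks the file sizes themselves could come from a listing such as `DESCRIBE DETAIL` plus the table’s file metadata:

```python
def file_size_report(file_sizes_bytes, low_mb=100, high_mb=500):
    """Count files falling outside the ~100-500 MB guidance."""
    mb = 1024 ** 2
    small = sum(1 for s in file_sizes_bytes if s < low_mb * mb)
    large = sum(1 for s in file_sizes_bytes if s > high_mb * mb)
    return {"files": len(file_sizes_bytes), "too_small": small, "too_large": large}

# A table with a hundred 5 MB files and one 2 GB file has both problems
file_size_report([5 * 1024**2] * 100 + [2 * 1024**3])
# -> {'files': 101, 'too_small': 100, 'too_large': 1}
```

Lots of `too_small` files points to compaction (`OPTIMIZE`) or coarser partitioning; `too_large` files point to uneven task workloads.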
- Avoid driver-side behaviour (even in SQL-first workflows)
The driver node coordinates the cluster, so any operation that pulls data back to the driver can quickly become a bottleneck. This includes actions like collect() or toPandas(), which move your entire dataset onto a single node. Whenever possible, stick to distributed transformations using Spark SQL or DataFrame operations, and prefer built-in Spark functions over row-by-row Python UDFs, which add serialisation overhead.
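The contrast looks something like the sketch below. The two patterns are held as strings so this runs without a Spark session; on a real cluster the first funnels every row through the driver before aggregating, while the second keeps the aggregation on the executors. Table and column names are made up:

```python
# Anti-pattern: pull everything to the driver, then aggregate in pandas.
driver_side_antipattern = """
pdf = spark.table("sales.orders").toPandas()   # ships ALL rows to the driver
total = pdf.groupby("region")["amount"].sum()
"""

# Better: express the same aggregation as a distributed DataFrame operation.
distributed_alternative = """
total = (spark.table("sales.orders")
              .groupBy("region")
              .agg({"amount": "sum"}))          # runs on the executors
"""
```

Reserve collect()/toPandas() for small, already-aggregated results, such as the final output of a chart or report.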
- Treat Databricks as a distributed system, not a database
Databricks isn’t a traditional RDBMS; it’s cluster-based, distributed compute. Queries involve planning, shuffling, and coordination across nodes. Optimisations that work on a single node may fail at scale. Think in terms of tasks, partitions, and cluster resources, not rows and indexes. Adopting this mindset ties together all the above, ensuring your pipelines remain optimal as you scale.