Written by Raja Chirala, Senior Data Engineer
Apache Spark is a powerful distributed computing engine, but configuring a Spark cluster can seem daunting at first. Whether you’re setting it up on-premises or in the cloud, getting the cluster configuration right is critical for performance, scalability, and stability, and it is often where projects win or lose. In this post, let’s break down Spark cluster configuration with simple calculations you can apply today.
Understanding Spark Cluster Components
A typical Spark cluster includes:
- Driver: The brain that plans and coordinates tasks.
- Executors: The workers that run tasks and return results.
- Cluster Manager: A service (like YARN, Kubernetes, or Spark’s standalone manager) that handles resource allocation.
Knowing how these components interact will guide your configuration choices.
Configuration Calculations
Let’s say we have estimated the daily data input load for Spark at around 50 GB. The following calculations help arrive at a minimal cluster configuration to handle such a load.
In Hadoop 1, the default block size was 64 MB. With the lessons learned from many implementations and trials, Hadoop 2 raised the default block size to 128 MB, and it remains 128 MB in the current Hadoop 3. Apart from this, a few other notable differences between the Hadoop versions are:
- Hadoop 1 managed both resources and data processing with MapReduce, whereas Hadoop 2 introduced YARN for resource management, leaving MapReduce for processing.
- Hadoop 1 has a limit of about 4,000 nodes per cluster, whereas Hadoop 2 scales up to about 10,000 nodes; Hadoop 3 raises this further, with additional improvements in fault tolerance and scalability.
We will assume the data is loaded to blob storage, a data lake, or similar HDFS-compliant storage. So, we can typically keep the partition size equal to the block size, which is 128 MB.
- Hence, no. of partitions for 50 GB of data = (50 * 1024 MB) / 128 MB = 400
Each partition or task runs on a core.
- Hence, no. of cores needed for full parallelism = 400
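As a quick sanity check, here is the same arithmetic as a small Python sketch; the 50 GB load and 128 MB partition size are the assumptions from this example.

```python
# Partition/core arithmetic for the example above.
data_size_mb = 50 * 1024        # 50 GB of daily input, expressed in MB
partition_size_mb = 128         # partition size kept equal to the block size

num_partitions = data_size_mb // partition_size_mb  # 51200 / 128 = 400
cores_for_full_parallelism = num_partitions         # one task (partition) per core

print(num_partitions, cores_for_full_parallelism)   # 400 400
```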
Also, it is worth remembering what happens when we invoke an action on an RDD (Resilient Distributed Dataset, the fundamental data structure in Spark, known for its fault tolerance and automatic recovery from failures; DataFrames and Datasets are higher-level abstractions built on top of RDDs for ease of programming). The following operations take place:
- A “Job” is created for it.
- The job is divided into “stages” at shuffle boundaries.
- Each stage is further divided into tasks based on the number of partitions of the RDD. So, a task is the smallest unit of work in Spark.
- Now, how many of these tasks can be executed simultaneously depends on the number of cores available across the executors, as the sketch below illustrates.
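To make the job/stage/task hierarchy concrete, here is a minimal PySpark sketch; the input path and the customer_id column are placeholders, not part of the example above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job-stage-task-demo").getOrCreate()

# Hypothetical ~50 GB input on HDFS-compliant storage, read as ~400 partitions.
df = spark.read.parquet("s3://my-bucket/daily-load/")

# groupBy introduces a shuffle, so the job triggered by the action below
# is split into stages at that shuffle boundary.
counts = df.groupBy("customer_id").count()

# count() is the action: it creates a job, the job is split into stages,
# and each stage runs one task per partition.
counts.count()
```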
Studies have shown that an executor’s I/O throughput maxes out at around 5 cores (Reference: https://www.cloudera.com/blog/technical/how-to-tune-your-apache-spark-jobs-part-2.html).
Each task, i.e., each partition, is executed on a core. Hence, if we go with 5 cores per executor:
- The total no. of executors we need is 400/5 = 80 executors.
This is the minimum number of executors needed if we want to process all the partitions in parallel at a single point in time. That means if each partition takes 10 minutes to process, it takes 10 minutes to process the entire 50 GB of data.
One may reduce the number of executors to 40, 20, or even fewer, and the processing time for the entire dataset increases roughly in proportion to the decrease in executors. For example, if we reduce executors to 40, which means 200 cores, only 200 partitions are executed in parallel at a time; the remaining 200 are queued and processed afterwards.
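The executor arithmetic, including the effect of shrinking the cluster, can be sketched the same way; the 400 partitions and 5 cores per executor come from the example above.

```python
import math

num_partitions = 400
cores_per_executor = 5

# Minimum executors needed to run every partition in parallel.
executors_full_parallelism = num_partitions // cores_per_executor   # 80

# Halving the executors halves the cores, so tasks run in two "waves"
# and the wall-clock time roughly doubles.
executors = 40
total_cores = executors * cores_per_executor                        # 200
waves = math.ceil(num_partitions / total_cores)                     # 2

print(executors_full_parallelism, total_cores, waves)               # 80 200 2
```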
The expected memory needed on each core is typically 4 times the default partition size = 4 * 128 MB = 512 MB.
- Memory needed on each executor = 512 MB * 5 cores = 2560 MB ≈ 2.5 GB
- Total memory needed = 2.5 GB * 80 executors = 200 GB
So, to process 50 GB of data fully in parallel, we need around 200 GB of total executor memory, which is, in general, around 4 times the data size. Also, executor memory should not be less than 1.5 times Spark’s reserved memory (300 MB), which means it should not be less than 450 MB.
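The memory arithmetic follows the same pattern; the 4x-per-partition rule of thumb and Spark’s 300 MB reserved memory are the assumptions here.

```python
partition_size_mb = 128
memory_per_core_mb = 4 * partition_size_mb            # 512 MB per core
memory_per_executor_mb = memory_per_core_mb * 5       # 2560 MB ≈ 2.5 GB
total_executor_memory_gb = memory_per_executor_mb * 80 / 1024   # 200 GB

reserved_memory_mb = 300                               # Spark's reserved memory
min_executor_memory_mb = 1.5 * reserved_memory_mb      # 450 MB lower bound

print(memory_per_executor_mb, total_executor_memory_gb, min_executor_memory_mb)
```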
What about driver memory?
If the queries don’t bring data back to the driver node, the typical 1 GB default driver memory is enough. However, actions such as collect() bring the entire dataset into the driver node, and if driver memory is not sufficient to hold it, we encounter an out-of-memory error. Hence, one needs to use such actions judiciously when writing PySpark queries.
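Here is a minimal sketch of the difference, reusing the same hypothetical input path as before; collect() is shown commented out precisely because it would pull the full dataset into the driver.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-memory-demo").getOrCreate()
df = spark.read.parquet("s3://my-bucket/daily-load/")   # hypothetical ~50 GB input

# rows = df.collect()   # avoid on large data: every row lands in driver memory

df.show(20)                          # brings only 20 rows to the driver
sample = df.limit(1000).toPandas()   # bounded result with an explicit limit
df.write.mode("overwrite").parquet("s3://my-bucket/output/")  # never touches the driver
```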
All of the above can be set through the following properties, either in a configuration file or on the command line when submitting the job (a short sketch follows the list).
In spark-defaults.conf or via command-line flags:
- spark.master: Points to your cluster manager (e.g., spark://<master-url> or yarn).
- spark.executor.memory: Amount of memory per executor.
- spark.executor.cores: Number of CPU cores per executor.
- spark.driver.memory: Memory available for the driver.
- spark.driver.cores: CPU cores allocated to the driver (especially relevant in cluster mode).
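For reference, here is a minimal sketch of the same sizing expressed through SparkSession.builder; the values come from the example above, and spark.executor.instances (not in the list) is the property that fixes the executor count on YARN or Kubernetes. When launching with spark-submit, the same properties are usually supplied as --conf flags or in spark-defaults.conf instead.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("daily-50gb-load")
    .master("yarn")                              # or spark://<master-url>, k8s://...
    .config("spark.executor.memory", "2560m")    # ~2.5 GB per executor
    .config("spark.executor.cores", "5")
    .config("spark.executor.instances", "80")    # full parallelism for 400 partitions
    .config("spark.driver.memory", "1g")
    .config("spark.driver.cores", "1")
    .getOrCreate()
)
```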
Conclusion
Every project is different.
Starting with calculations like these gives you a solid baseline, and through monitoring, experimentation, and fine-tuning, you can adapt your cluster to match your workload, whether it’s batch processing, streaming, or machine learning.
Remember: there’s no one-size-fits-all configuration; tuning is an art informed by your data and workload.
I’d love to hear from you:
How do you typically approach Spark cluster sizing in your projects?
What lessons have you learned through tuning Spark workloads?