TechAmalgam· Apache Spark

Partitions

How Spark splits data across the cluster for parallelism.

To allow every executor to perform work in parallel, Spark breaks the data into chunks called partitions. A partition is a collection of rows that sits on one physical machine in your cluster. A DataFrame's partitions represent how the data is physically distributed across the cluster during execution.
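To make the idea of "chunks of rows" concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API or its actual partitioning logic: the helper `split_into_partitions` is a hypothetical name, and it simply cuts a list of rows into contiguous, near-equal chunks.

```python
def split_into_partitions(rows, num_partitions):
    """Split rows into num_partitions contiguous chunks of near-equal size.

    Toy illustration only: Spark chooses partitions based on the data
    source and its own partitioners, not with this function.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # Every partition gets `base` rows; the first `extra` get one more.
    base, extra = divmod(len(rows), num_partitions)
    partitions = []
    start = 0
    for i in range(num_partitions):
        size = base + (1 if i < extra else 0)
        partitions.append(rows[start:start + size])
        start += size
    return partitions


rows = list(range(10))  # stand-in for ten rows of a DataFrame
for index, partition in enumerate(split_into_partitions(rows, 4)):
    print(f"partition {index}: {partition}")
```

In this model each chunk could be handed to a different machine and processed independently, which is the property that makes parallel execution possible.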

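If you have one partition, Spark has a parallelism of only one, even with thousands of executors. If you have many partitions but only one executor running one task at a time, Spark still has a parallelism of one, because there is only one computation resource. In other words, parallelism is limited by whichever is smaller: the number of partitions or the number of tasks the cluster can run at once.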
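The two limiting cases above can be sketched as a small calculation. This is a simplified model under stated assumptions, not something Spark exposes: it assumes one task per partition and that each executor core runs one task at a time. The function names and the `cores_per_executor` parameter are hypothetical and introduced only for illustration.

```python
import math


def effective_parallelism(num_partitions, num_executors, cores_per_executor=1):
    """Upper bound on tasks running at the same moment in this model."""
    if min(num_partitions, num_executors, cores_per_executor) < 1:
        raise ValueError("all arguments must be at least 1")
    slots = num_executors * cores_per_executor  # tasks the cluster can run at once
    return min(num_partitions, slots)


def task_waves(num_partitions, num_executors, cores_per_executor=1):
    """Rounds of tasks needed to cover every partition (assumes equal task times)."""
    if min(num_partitions, num_executors, cores_per_executor) < 1:
        raise ValueError("all arguments must be at least 1")
    slots = num_executors * cores_per_executor
    return math.ceil(num_partitions / slots)


# One partition, thousands of executors: only one task can run.
print(effective_parallelism(num_partitions=1, num_executors=1000))

# Many partitions, one single-slot executor: still one task at a time.
print(effective_parallelism(num_partitions=1000, num_executors=1))

# A more balanced hypothetical setup: 200 partitions, 10 executors with 4 cores each.
print(effective_parallelism(200, 10, cores_per_executor=4))
print(task_waves(200, 10, cores_per_executor=4))
```

Under these assumptions, adding executors only helps while there are more partitions than slots, and adding partitions only helps while there are idle slots.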