Transformations

Narrow vs. wide dependencies, and what a shuffle costs.

In Spark, the core data structures are immutable — they cannot be changed after creation. To "change" a DataFrame, you instruct Spark how you would like to modify it. These instructions are called transformations, and they are the core of how you express business logic.

Transformations come in two types. Narrow transformations have narrow dependencies: each input partition contributes to only one output partition, as with a filter or a map. Wide transformations have wide dependencies: an input partition contributes to many output partitions, as with a group-by or a sort. A wide transformation is often called a shuffle, because Spark exchanges data between partitions across the cluster.
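The difference between the two dependency types can be sketched with a toy model in plain Python. This is not Spark's API: partitions are modeled as lists of (key, value) rows, and the function names narrow_filter and wide_group_by_key are hypothetical, chosen only for illustration. Routing by key modulo the partition count is an assumption standing in for a hash partitioner.

```python
from collections import defaultdict


def narrow_filter(partitions, predicate):
    """Narrow dependency: each output partition is built from exactly one input partition."""
    return [[row for row in part if predicate(row)] for part in partitions]


def wide_group_by_key(partitions, num_output_partitions):
    """Wide dependency: rows are routed by key, so one input partition can feed many outputs."""
    outputs = [defaultdict(list) for _ in range(num_output_partitions)]
    for part in partitions:
        for key, value in part:
            # Assumption: small integer keys, routed by modulo (stand-in for hashing).
            target = key % num_output_partitions
            outputs[target][key].append(value)
    return [dict(out) for out in outputs]


# Two input partitions of (key, value) rows.
partitions = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]

# Narrow: partition i of the result depends only on partition i of the input.
filtered = narrow_filter(partitions, lambda row: row[0] != 2)

# Wide: key 1 appears in both input partitions, so its values must be brought together.
grouped = wide_group_by_key(partitions, 2)
```

In this sketch the filter never looks outside the partition it is processing, while the grouping has to collect the values for key 1 from both input partitions into a single output partition. That movement of rows between partitions is what a shuffle does across the cluster.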

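With narrow transformations, Spark automatically performs pipelining: if you chain several filters on a DataFrame, they are all applied in memory, in a single pass over each partition, without writing out an intermediate result. The same is not true for a shuffle. When Spark performs a shuffle, it writes the shuffle output to disk so that tasks in the next stage can read it, which adds disk and network I/O. That I/O is the main cost of a wide transformation.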
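The contrast between pipelining and a shuffle can be illustrated with another toy model in plain Python. It is only a sketch under stated assumptions: the helper names pipelined, shuffle_write, and shuffle_read are hypothetical, the JSON files are a stand-in for Spark's actual shuffle file format, and buckets are chosen by modulo rather than by Spark's partitioner.

```python
import json
import os
import tempfile


def pipelined(partition, *predicates):
    """Apply several filters to one partition in a single pass, row by row."""
    for row in partition:
        if all(p(row) for p in predicates):
            yield row  # no intermediate list is built between the filters


def shuffle_write(rows, num_buckets, directory, map_id):
    """Map side of the toy shuffle: write one file per target bucket."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[row % num_buckets].append(row)
    for bucket_id, bucket_rows in enumerate(buckets):
        path = os.path.join(directory, f"map{map_id}_bucket{bucket_id}.json")
        with open(path, "w") as f:
            json.dump(bucket_rows, f)


def shuffle_read(directory, num_maps, bucket_id):
    """Reduce side of the toy shuffle: read this bucket's file from every map task."""
    rows = []
    for map_id in range(num_maps):
        path = os.path.join(directory, f"map{map_id}_bucket{bucket_id}.json")
        with open(path) as f:
            rows.extend(json.load(f))
    return rows


partitions = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]

with tempfile.TemporaryDirectory() as tmp:
    for map_id, part in enumerate(partitions):
        # Two filters are fused into one in-memory pass over the partition.
        kept = pipelined(part, lambda x: x > 2, lambda x: x % 2 == 0)
        # The shuffle boundary forces the rows out to disk.
        shuffle_write(kept, 3, tmp, map_id)
    # Each output bucket gathers rows written by every input partition.
    reduced = [shuffle_read(tmp, len(partitions), b) for b in range(3)]
```

The two filters cost a single pass over each partition and touch no storage, whereas the shuffle step writes every surviving row to a file and reads it back before the next step can start.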