
DataFrame

Spark's distributed, table-like collection of data.

A DataFrame is the most common Structured API and represents a table of data with rows and columns. Imagine creating a DataFrame with a single column of 1,000 rows holding the values 0 to 999. This range is a distributed collection: when run on a cluster, the rows are split into parts, and each part can live on a different executor.
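To make the idea of "parts of a range on different executors" concrete, here is a minimal plain-Python sketch. It does not use Spark and is not how Spark is implemented; it only illustrates one plausible way the values 0 to 999 could be divided into contiguous chunks. The function name `split_range`, the executor names, and the choice of four executors are assumptions made for illustration.

```python
def split_range(start, end, num_partitions):
    """Split the half-open range [start, end) into contiguous,
    near-equal chunks. Illustrative only, not Spark's own code."""
    total = end - start
    partitions = []
    for i in range(num_partitions):
        lo = start + (i * total) // num_partitions
        hi = start + ((i + 1) * total) // num_partitions
        partitions.append(range(lo, hi))
    return partitions


# Hypothetical cluster: four executors, one partition each.
executors = ["executor-0", "executor-1", "executor-2", "executor-3"]
partitions = split_range(0, 1000, len(executors))
placement = dict(zip(executors, partitions))

for name, part in placement.items():
    print(f"{name}: rows {part.start}..{part.stop - 1} ({len(part)} rows)")

# In PySpark, the DataFrame described above would typically be created with
# something like: spark.range(1000).toDF("number")
# (assuming an existing SparkSession named `spark`).
```

In a real cluster the number of partitions and where they land are decided by Spark and its configuration, so the even four-way split here is only a stand-in for that behavior.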

This is what makes a Spark DataFrame different from a single-machine table: it can span thousands of machines, either because the data is too large to fit on one machine or because processing it on a single machine would take too long.