Spark shuffle

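In Apache Spark, the shuffle is the step that sits between the map tasks and the reduce tasks of a job. Shuffling redistributes data across partitions so that all records with the same key end up in the same partition. It is generally the costliest operation in a Spark job, because it involves serialization, disk I/O, and network transfer. Parallelising the shuffle effectively is therefore important for good Spark job performance.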
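
The idea can be illustrated with a minimal sketch in plain Python. It is not Spark code and does not use Spark's API: the function names (partition_for, map_side, reduce_side, word_count), the byte-sum partitioner, and the choice of three reduce partitions are all assumptions made for illustration. Each simulated map task splits its output into one bucket per reduce partition, and each simulated reduce task then fetches its bucket from every map task and aggregates the values per key.

```python
from collections import defaultdict

NUM_REDUCERS = 3  # hypothetical number of reduce-side partitions


def partition_for(key, num_partitions):
    """Choose a reduce partition for a key.

    Simplified, deterministic stand-in for a hash partitioner
    (an assumption for this sketch, not Spark's actual hash function).
    """
    return sum(key.encode("utf-8")) % num_partitions


def map_side(records, num_partitions):
    """One map task: split its (key, value) output into per-partition buckets."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition_for(key, num_partitions)].append((key, value))
    return buckets


def reduce_side(partition_id, all_map_outputs):
    """One reduce task: fetch its bucket from every map task and sum per key."""
    totals = defaultdict(int)
    for buckets in all_map_outputs:
        for key, value in buckets.get(partition_id, []):
            totals[key] += value
    return dict(totals)


def word_count(map_inputs, num_partitions=NUM_REDUCERS):
    """Run all map tasks, then all reduce tasks, and merge the results."""
    map_outputs = [map_side(records, num_partitions) for records in map_inputs]
    result = {}
    for partition_id in range(num_partitions):
        result.update(reduce_side(partition_id, map_outputs))
    return result


if __name__ == "__main__":
    # Two simulated map tasks, each holding part of the input.
    inputs = [
        [("a", 1), ("b", 1), ("a", 1)],
        [("b", 1), ("c", 1)],
    ]
    print(word_count(inputs))
```

In this sketch every reduce task reads from every map task, which hints at why the real operation is expensive: the number of data transfers grows with the product of map tasks and reduce tasks.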

Operations that trigger shuffling

In depth

Sort-based shuffling

Hash-based shuffling

Tungsten sort

Cost factors

Ways to optimize
