Introduction
Apache Spark is a powerful distributed computing engine, widely used for big data processing, machine learning, ETL pipelines, and real-time
analytics. But unlocking Spark’s full potential requires deep optimization skills — especially as your datasets grow from gigabytes to terabytes.
This cheat sheet is a field-tested guide for developers and data engineers to:
• Tune Spark jobs for performance and scalability
• Understand when and how to tweak configurations
• Avoid common pitfalls that slow down Spark applications
• Make Spark cost-effective and cloud-friendly
Who is this for?
This guide is designed for:
• Data engineers running batch or streaming pipelines
• Developers working with PySpark or Spark SQL
• Platform engineers tuning Spark on AWS EMR, Databricks, GCP, or standalone clusters
• Anyone handling 1GB to 1TB+ of data in production
What’s Inside?
• Optimization table by data size: from 1GB to 1TB+, with settings for memory, shuffle, joins, file formats, and more
• Config keys explained: Spark settings like spark.sql.shuffle.partitions, spark.memory.fraction, etc.
• Advanced developer tips: file format best practices, join strategies, memory tuning, and partition management
• Debugging & performance monitoring: tips on using the Spark UI, .explain(), and event logs to find bottlenecks fast
Why Optimization Matters
Without tuning, even a small Spark job can:
• Blow up memory and fail due to GC overhead
• Run for hours due to inefficient shuffles
• Cost 💰💰💰 in cloud environments
Apache Spark Optimization Cheat Sheet
For Developers & Data Engineers
DATA DECODE By: Divyansh Patel
The table below is a comprehensive PySpark optimization reference tailored to data volumes ranging from small-scale (1 GB) to large-scale (1 TB+). It covers key configurations for memory management, shuffle behavior, partitioning, parallelism, and file formats, ensuring Spark jobs are optimized for performance and resource utilization. A configuration sketch applying one of these tiers follows the table.
Data Volume: 1–10 GB
• Executor Memory: 2–4 GB
• Executor Cores: 2
• Shuffle Partitions (spark.sql.shuffle.partitions): 50–100
• Broadcast Join Threshold (spark.sql.autoBroadcastJoinThreshold): 10 MB (default)
• Key Configs & Settings: spark.sql.adaptive.enabled = true; spark.dynamicAllocation.enabled = true
• File Format: Parquet
• Caching: Optional
• Notes: Great for dev or small batch jobs; use default partitions

Data Volume: 10–50 GB
• Executor Memory: 4–8 GB
• Executor Cores: 2–3
• Shuffle Partitions (spark.sql.shuffle.partitions): 200–400
• Broadcast Join Threshold (spark.sql.autoBroadcastJoinThreshold): 50 MB
• Key Configs & Settings: spark.sql.adaptive.enabled = true; spark.sql.files.maxPartitionBytes = 128MB
• File Format: Parquet + Snappy
• Caching: Optional
• Notes: Good for lookup joins using broadcast; use repartition() if needed

Data Volume: 50–100 GB
• Executor Memory: 6–12 GB
• Executor Cores: 3–4
• Shuffle Partitions (spark.sql.shuffle.partitions): 400–800
• Broadcast Join Threshold (spark.sql.autoBroadcastJoinThreshold): 100 MB
• Key Configs & Settings: spark.executor.memoryOverhead = 1024MB; spark.sql.adaptive.skewJoin.enabled = true; spark.sql.files.openCostInBytes = 4MB
• File Format: Parquet + ZSTD
• Caching: Yes
• Notes: Use caching for repeated reads; monitor via the Spark UI

Data Volume: 100–500 GB
• Executor Memory: 8–16 GB
• Executor Cores: 4–5
• Shuffle Partitions (spark.sql.shuffle.partitions): 1000–2000
• Broadcast Join Threshold (spark.sql.autoBroadcastJoinThreshold): 200 MB
• Key Configs & Settings: spark.sql.adaptive.enabled = true; spark.sql.adaptive.coalescePartitions.enabled = true; spark.sql.adaptive.skewJoin.enabled = true; spark.memory.fraction = 0.6
• File Format: Parquet + ZSTD
• Caching: Yes
• Notes: Avoid collect() and wide shuffles; salting or bucketing recommended

Data Volume: 500 GB – 1 TB
• Executor Memory: 12–24 GB
• Executor Cores: 5–6
• Shuffle Partitions (spark.sql.shuffle.partitions): 2000–4000
• Broadcast Join Threshold (spark.sql.autoBroadcastJoinThreshold): 300 MB
• Key Configs & Settings: enable dynamic allocation; spark.sql.broadcastTimeout = 600; spark.memory.storageFraction = 0.3
• File Format: Parquet + ZSTD
• Caching: Yes
• Notes: Use persist(StorageLevel.DISK_ONLY) if memory is tight

Data Volume: 1 TB+
• Executor Memory: 16–32 GB
• Executor Cores: 6–8
• Shuffle Partitions (spark.sql.shuffle.partitions): 4000–8000+
• Broadcast Join Threshold (spark.sql.autoBroadcastJoinThreshold): disabled (use sort-merge join)
• Key Configs & Settings: spark.sql.autoBroadcastJoinThreshold = -1 (disabled); spark.sql.adaptive.enabled = true; spark.sql.adaptive.localShuffleReader.enabled = true
• File Format: Parquet + ZSTD
• Caching: Yes
• Notes: Prefer sort-merge joins; leverage partition pruning
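
As a concrete illustration, here is a minimal sketch of how the 100–500 GB tier above might be applied when building a PySpark session. The application name and the specific memory, core, and partition values are placeholder choices within the ranges from the table, not fixed recommendations; on YARN or Kubernetes, executor sizing is normally passed at submit time (for example via spark-submit --conf) rather than set once the application is running.

from pyspark.sql import SparkSession

# Sketch only: values picked from the 100–500 GB row of the table above.
# Executor memory/cores usually need to be set before the application starts
# (spark-submit or cluster config); they are shown here for completeness.
spark = (
    SparkSession.builder
    .appName("example-100-500gb-job")                       # placeholder name
    .config("spark.executor.memory", "12g")
    .config("spark.executor.cores", "4")
    .config("spark.sql.shuffle.partitions", "1500")
    .config("spark.sql.autoBroadcastJoinThreshold", str(200 * 1024 * 1024))  # ~200 MB
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .config("spark.memory.fraction", "0.6")
    .getOrCreate()
)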
Understanding and Debugging Performance
• Use the Spark UI: Track stage breakdowns, time spent in shuffle, skewed tasks, GC, etc.
• Stage-level metrics: Look for long-running or skewed stages, which hint at bad joins or data skew
• Use .explain(mode="formatted"): Understand the execution plan and join strategies (see the sketch after this list)
• Enable event logging: spark.eventLog.enabled = true helps with Spark History Server analysis
• Avoid wide transformations: groupBy, join, distinct; always review partitioning before and after
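
A minimal sketch of the .explain() tip, using two tiny in-memory DataFrames (the table names and columns are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data; in a real job these would be read from storage.
orders = spark.createDataFrame([(1, 250.0), (2, 80.0)], ["customer_id", "amount"])
customers = spark.createDataFrame([(1, "Ada"), (2, "Grace")], ["customer_id", "name"])

# The formatted plan shows which join strategy Spark chose (BroadcastHashJoin vs.
# SortMergeJoin) and where shuffles (Exchange nodes) occur.
orders.join(customers, "customer_id").explain(mode="formatted")

Reading the plan before launching a heavy job is often the quickest way to catch an unintended shuffle or a missed broadcast.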
File Format & Storage Optimizations
• Columnar formats (Parquet): Optimized for columnar reads + predicate pushdown
• Compression codec: Snappy = faster; ZSTD = better compression
• Small files: Consolidate using coalesce() or optimize file size to 100–200 MB (see the sketch after this list)
• Transactional table formats (e.g., Delta Lake, Apache Hudi, Apache Iceberg): Great for upserts and incremental pipelines
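
A short sketch of the file-format tips above: writing ZSTD-compressed Parquet and consolidating output files with coalesce(). The dataset, output path, and file count are placeholders; ZSTD for Parquet requires a Spark build that bundles the codec (recent 3.x releases do).

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder dataset; in practice this would be the output of a pipeline stage.
df = spark.range(0, 1_000_000).withColumnRenamed("id", "event_id")

(
    df.coalesce(8)                      # fewer, larger output files (target 100–200 MB each)
      .write
      .mode("overwrite")
      .option("compression", "zstd")    # better compression ratio; use "snappy" for speed
      .parquet("/tmp/events_parquet")   # placeholder path
)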
Partitioning & Skew Management
• Use repartition() wisely: Apply it before shuffle-heavy ops; avoid massive over-partitioning
• Salting keys: Add randomness to skewed join keys (e.g., zip codes, customer_id); see the sketch after this list
• Use bucketing: For repeated joins or large groupBy ops, pre-bucket the tables
• Enable skew join handling: spark.sql.adaptive.skewJoin.enabled = true auto-fixes skewed partitions
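
A minimal salting sketch for the skewed-join tip above. The facts/dims DataFrames, the key column, and the number of salt buckets are all hypothetical; the idea is to add a random salt to the skewed side and replicate the small side across every salt value.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical skewed data: key 1 is a "hot" key with many rows.
facts = spark.createDataFrame([(1, "a")] * 1000 + [(2, "b")], ["key", "payload"])
dims = spark.createDataFrame([(1, "hot"), (2, "cold")], ["key", "label"])

SALT_BUCKETS = 8

# Add a random salt to the large, skewed side so the hot key spreads across buckets.
salted_facts = facts.withColumn("salt", (F.rand(seed=7) * SALT_BUCKETS).cast("int"))

# Replicate the small side once per salt value so every salted row finds a match.
salted_dims = dims.crossJoin(
    spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
)

result = salted_facts.join(salted_dims, ["key", "salt"]).drop("salt")
result.groupBy("label").count().show()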
Memory & Execution Optimization
• Memory fraction: spark.memory.fraction = 0.6 controls the share of heap used for execution and storage
• Memory overhead: For Python UDF-heavy jobs, increase spark.executor.memoryOverhead
• Storage level in cache: Use .persist(StorageLevel.DISK_ONLY) when memory is tight (see the sketch after this list)
• Avoid caching everything: Cache only reused large datasets, and unpersist them when done
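
A small sketch of the caching tips above, assuming a DataFrame that is reused by several downstream steps (the data itself is a placeholder):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder for an expensive intermediate result that is reused more than once.
df = spark.range(0, 10_000_000)

# Spill to disk instead of memory when executors are memory-constrained.
df.persist(StorageLevel.DISK_ONLY)

first_pass = df.count()                          # materializes the persisted data
second_pass = df.filter("id % 2 = 0").count()    # reuses the persisted copy

df.unpersist()                                   # release the storage once it is no longer needed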
Testing & Development Tips
• Use limit() in dev mode: Prevents accidentally pulling full datasets
• Sample your data: Use .sample(0.1) or .limit() to test pipeline logic fast
• Local mode debugging: spark.master = local[*] is good for unit tests and dry runs (see the sketch after this list)
• Reduce log noise: Set log4j.rootCategory=ERROR, console in dev for cleaner logs
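
A development-mode sketch combining the tips above: a local[*] session, quieter logs, and a sampled slice of the data. The app name, dataset, and sampling fraction are illustrative.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")              # run on all local cores; good for unit tests and dry runs
    .appName("dev-dry-run")          # placeholder name
    .getOrCreate()
)
spark.sparkContext.setLogLevel("ERROR")   # cleaner console output while developing

full_df = spark.range(0, 1_000_000)                            # stand-in for the real input
dev_df = full_df.sample(fraction=0.1, seed=42).limit(1_000)    # small slice for fast iteration
dev_df.show(5)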
Best Practices Checklist
• Avoid collect() on large DataFrames
• Write checkpoints when doing complex transformations
• Use broadcast() only when you are sure the table is small (see the sketch below)
• Compress outputs (Parquet + Snappy/ZSTD)
• Monitor the Spark UI during peak workloads
• Prefer withColumn() over .rdd.map()
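
To illustrate the broadcast() item in the checklist, here is a hedged sketch with made-up DataFrames; the hint is only safe when the small table comfortably fits in executor memory.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical tables: a large fact table and a small dimension table.
large = spark.range(0, 1_000_000).withColumnRenamed("id", "customer_id")
small = spark.createDataFrame([(1, "gold"), (2, "silver")], ["customer_id", "tier"])

# broadcast() ships the small table to every executor, avoiding a shuffle of the
# large side. Verify that the table really is small before forcing this.
joined = large.join(F.broadcast(small), "customer_id", "left")
joined.explain()   # should show a BroadcastHashJoin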