Introduction
Apache Spark is a powerful distributed computing engine, widely used for big data processing, machine learning, ETL pipelines, and real-time
analytics. But unlocking Spark’s full potential requires deep optimization skills — especially as your datasets grow from gigabytes to terabytes.
This cheat sheet is a field-tested guide for developers and data engineers to:
• Tune Spark jobs for performance and scalability
• Understand when and how to tweak configurations
• Avoid common pitfalls that slow down Spark applications
• Make Spark cost-effective and cloud-friendly
Who is this for?
This guide is designed for:
• Data engineers running batch or streaming pipelines
• Developers working with PySpark or Spark SQL
• Platform engineers tuning Spark on AWS EMR, Databricks, GCP, or standalone clusters
• Anyone handling 1GB to 1TB+ of data in production
What’s Inside?
• Optimization table by data size: from 1GB to 1TB+, with settings for memory, shuffle, joins, file formats, and more
• Config keys explained: Spark settings like spark.sql.shuffle.partitions, spark.memory.fraction, etc.
• Advanced developer tips: file format best practices, join strategies, memory tuning, and partition management
• Debugging & performance monitoring: tips on using the Spark UI, .explain(), and event logs to find bottlenecks fast
Why Optimization Matters
Without tuning, even a small Spark job can:
• Blow up memory and fail due to GC overhead
• Run for hours due to inefficient shuffles
• Cost 💰💰💰 in cloud environments
Apache Spark Optimization Cheat Sheet
For Developers & Data Engineers
DATA DECODE By: Divyansh Patel
The table below is a comprehensive PySpark optimization reference tailored to data volumes ranging from small-scale (1 GB) to large-scale (1 TB+). It covers key configurations for memory management, shuffle behavior, partitioning, parallelism, and file formats, ensuring Spark jobs are optimized for performance and resource utilization. A configuration sketch applying one of these tiers follows the table.
Data Volume: 1–10 GB
• Executor Memory: 2–4 GB
• Executor Cores: 2
• Shuffle Partitions (spark.sql.shuffle.partitions): 50–100
• Broadcast Join Threshold (spark.sql.autoBroadcastJoinThreshold): 10 MB (default)
• Key Configs & Settings: spark.sql.adaptive.enabled = true; spark.dynamicAllocation.enabled = true
• File Format: Parquet
• Caching: Optional
• Notes: Great for dev or small batch jobs; use default partitions

Data Volume: 10–50 GB
• Executor Memory: 4–8 GB
• Executor Cores: 2–3
• Shuffle Partitions (spark.sql.shuffle.partitions): 200–400
• Broadcast Join Threshold (spark.sql.autoBroadcastJoinThreshold): 50 MB
• Key Configs & Settings: spark.sql.adaptive.enabled = true; spark.sql.files.maxPartitionBytes = 128MB
• File Format: Parquet + Snappy
• Caching: Optional
• Notes: Good for lookup joins using broadcast; use repartition() if needed

Data Volume: 50–100 GB
• Executor Memory: 6–12 GB
• Executor Cores: 3–4
• Shuffle Partitions (spark.sql.shuffle.partitions): 400–800
• Broadcast Join Threshold (spark.sql.autoBroadcastJoinThreshold): 100 MB
• Key Configs & Settings: spark.executor.memoryOverhead = 1024MB; spark.sql.adaptive.skewJoin.enabled = true; spark.sql.files.openCostInBytes = 4MB
• File Format: Parquet + ZSTD
• Caching: Yes
• Notes: Use caching for repeated reads; monitor via the Spark UI

Data Volume: 100–500 GB
• Executor Memory: 8–16 GB
• Executor Cores: 4–5
• Shuffle Partitions (spark.sql.shuffle.partitions): 1000–2000
• Broadcast Join Threshold (spark.sql.autoBroadcastJoinThreshold): 200 MB
• Key Configs & Settings: spark.sql.adaptive.enabled = true; spark.sql.adaptive.coalescePartitions.enabled = true; spark.sql.adaptive.skewJoin.enabled = true; spark.memory.fraction = 0.6
• File Format: Parquet + ZSTD
• Caching: Yes
• Notes: Avoid collect() and wide shuffles; salting or bucketing recommended

Data Volume: 500 GB – 1 TB
• Executor Memory: 12–24 GB
• Executor Cores: 5–6
• Shuffle Partitions (spark.sql.shuffle.partitions): 2000–4000
• Broadcast Join Threshold (spark.sql.autoBroadcastJoinThreshold): 300 MB
• Key Configs & Settings: enable dynamic allocation; spark.sql.broadcastTimeout = 600; spark.memory.storageFraction = 0.3
• File Format: Parquet + ZSTD
• Caching: Yes
• Notes: Use persist(StorageLevel.DISK_ONLY) if memory is tight

Data Volume: 1 TB+
• Executor Memory: 16–32 GB
• Executor Cores: 6–8
• Shuffle Partitions (spark.sql.shuffle.partitions): 4000–8000+
• Broadcast Join Threshold (spark.sql.autoBroadcastJoinThreshold): disabled (use sort-merge join)
• Key Configs & Settings: spark.sql.autoBroadcastJoinThreshold = -1 (disabled); spark.sql.adaptive.enabled = true; spark.sql.adaptive.localShuffleReader.enabled = true
• File Format: Parquet + ZSTD
• Caching: Yes
• Notes: Prefer sort-merge joins; leverage partition pruning
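
As a concrete illustration, here is a minimal sketch of how the 100–500 GB tier above might be applied when building a PySpark session. The application name and the specific memory, core, and partition values are placeholder choices within the ranges from the table, not fixed recommendations; on YARN or Kubernetes, executor sizing is normally passed at submit time (for example via spark-submit --conf) rather than set once the application is running.

from pyspark.sql import SparkSession

# Sketch only: values picked from the 100–500 GB row of the table above.
# Executor memory/cores usually need to be set before the application starts
# (spark-submit or cluster config); they are shown here for completeness.
spark = (
    SparkSession.builder
    .appName("example-100-500gb-job")                       # placeholder name
    .config("spark.executor.memory", "12g")
    .config("spark.executor.cores", "4")
    .config("spark.sql.shuffle.partitions", "1500")
    .config("spark.sql.autoBroadcastJoinThreshold", str(200 * 1024 * 1024))  # ~200 MB
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .config("spark.memory.fraction", "0.6")
    .getOrCreate()
)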
Understanding and Debugging Performance
• Use the Spark UI: Track stage breakdowns, time spent in shuffle, skewed tasks, GC, etc.
• Stage-level metrics: Look for long-running or skewed stages, which hint at bad joins or data skew
• Use .explain(mode="formatted"): Understand the execution plan and join strategies (see the sketch after this list)
• Enable event logging: spark.eventLog.enabled = true helps with Spark History Server analysis
• Avoid wide transformations: groupBy, join, distinct; always review partitioning before and after
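
A minimal sketch of the .explain() tip, using two tiny in-memory DataFrames (the table names and columns are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data; in a real job these would be read from storage.
orders = spark.createDataFrame([(1, 250.0), (2, 80.0)], ["customer_id", "amount"])
customers = spark.createDataFrame([(1, "Ada"), (2, "Grace")], ["customer_id", "name"])

# The formatted plan shows which join strategy Spark chose (BroadcastHashJoin vs.
# SortMergeJoin) and where shuffles (Exchange nodes) occur.
orders.join(customers, "customer_id").explain(mode="formatted")

Reading the plan before launching a heavy job is often the quickest way to catch an unintended shuffle or a missed broadcast.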
File Format & Storage Optimizations
• Columnar formats (Parquet): Optimized for columnar reads + predicate pushdown
• Compression codec: Snappy = faster; ZSTD = better compression
• Small files: Consolidate using coalesce() or optimize file size to 100–200 MB (see the sketch after this list)
• Transactional table formats (e.g., Delta Lake, Apache Hudi, Apache Iceberg): Great for upserts and incremental pipelines
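
A short sketch of the file-format tips above: writing ZSTD-compressed Parquet and consolidating output files with coalesce(). The dataset, output path, and file count are placeholders; ZSTD for Parquet requires a Spark build that bundles the codec (recent 3.x releases do).

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder dataset; in practice this would be the output of a pipeline stage.
df = spark.range(0, 1_000_000).withColumnRenamed("id", "event_id")

(
    df.coalesce(8)                      # fewer, larger output files (target 100–200 MB each)
      .write
      .mode("overwrite")
      .option("compression", "zstd")    # better compression ratio; use "snappy" for speed
      .parquet("/tmp/events_parquet")   # placeholder path
)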
Partitioning & Skew Management
• Use repartition() wisely: Apply it before shuffle-heavy ops; avoid massive over-partitioning
• Salting keys: Add randomness to skewed join keys (e.g., zip codes, customer_id); see the sketch after this list
• Use bucketing: For repeated joins or large groupBy ops, pre-bucket the tables
• Enable skew join handling: spark.sql.adaptive.skewJoin.enabled = true auto-fixes skewed partitions
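
A minimal salting sketch for the skewed-join tip above. The facts/dims DataFrames, the key column, and the number of salt buckets are all hypothetical; the idea is to add a random salt to the skewed side and replicate the small side across every salt value.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical skewed data: key 1 is a "hot" key with many rows.
facts = spark.createDataFrame([(1, "a")] * 1000 + [(2, "b")], ["key", "payload"])
dims = spark.createDataFrame([(1, "hot"), (2, "cold")], ["key", "label"])

SALT_BUCKETS = 8

# Add a random salt to the large, skewed side so the hot key spreads across buckets.
salted_facts = facts.withColumn("salt", (F.rand(seed=7) * SALT_BUCKETS).cast("int"))

# Replicate the small side once per salt value so every salted row finds a match.
salted_dims = dims.crossJoin(
    spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
)

result = salted_facts.join(salted_dims, ["key", "salt"]).drop("salt")
result.groupBy("label").count().show()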
Memory & Execution Optimization
• Memory fraction: spark.memory.fraction = 0.6 controls the share of heap used for execution and storage
• Memory overhead: For Python UDF-heavy jobs, increase spark.executor.memoryOverhead
• Storage level in cache: Use .persist(StorageLevel.DISK_ONLY) when memory is tight (see the sketch after this list)
• Avoid caching everything: Cache only reused large datasets, and unpersist them when done
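
A small sketch of the caching tips above, assuming a DataFrame that is reused by several downstream steps (the data itself is a placeholder):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder for an expensive intermediate result that is reused more than once.
df = spark.range(0, 10_000_000)

# Spill to disk instead of memory when executors are memory-constrained.
df.persist(StorageLevel.DISK_ONLY)

first_pass = df.count()                          # materializes the persisted data
second_pass = df.filter("id % 2 = 0").count()    # reuses the persisted copy

df.unpersist()                                   # release the storage once it is no longer needed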
Testing & Development Tips
• Use limit() in dev mode: Prevents accidentally pulling full datasets
• Sample your data: Use .sample(0.1) or .limit() to test pipeline logic fast
• Local mode debugging: spark.master = local[*] is good for unit tests and dry runs (see the sketch after this list)
• Reduce log noise: Set log4j.rootCategory=ERROR, console in dev for cleaner logs
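
A development-mode sketch combining the tips above: a local[*] session, quieter logs, and a sampled slice of the data. The app name, dataset, and sampling fraction are illustrative.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")              # run on all local cores; good for unit tests and dry runs
    .appName("dev-dry-run")          # placeholder name
    .getOrCreate()
)
spark.sparkContext.setLogLevel("ERROR")   # cleaner console output while developing

full_df = spark.range(0, 1_000_000)                            # stand-in for the real input
dev_df = full_df.sample(fraction=0.1, seed=42).limit(1_000)    # small slice for fast iteration
dev_df.show(5)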
Best Practices Checklist
• Avoid collect() on large DataFrames
• Write checkpoints when doing complex transformations
• Use broadcast() only when you are sure the table is small (see the sketch below)
• Compress outputs (Parquet + Snappy/ZSTD)
• Monitor the Spark UI during peak workloads
• Prefer withColumn() over .rdd.map()
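
To illustrate the broadcast() item in the checklist, here is a hedged sketch with made-up DataFrames; the hint is only safe when the small table comfortably fits in executor memory.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical tables: a large fact table and a small dimension table.
large = spark.range(0, 1_000_000).withColumnRenamed("id", "customer_id")
small = spark.createDataFrame([(1, "gold"), (2, "silver")], ["customer_id", "tier"])

# broadcast() ships the small table to every executor, avoiding a shuffle of the
# large side. Verify that the table really is small before forcing this.
joined = large.join(F.broadcast(small), "customer_id", "left")
joined.explain()   # should show a BroadcastHashJoin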