Broadcast Variables & Accumulators in Apache Spark and PySpark

8/17/2025

When working with Apache Spark, performance optimization is crucial for handling large-scale data. Two important features that make Spark efficient are Broadcast Variables and Accumulators. These tools help reduce data shuffling, improve speed, and support better cluster-wide communication.

In this tutorial, we’ll explore what broadcast variables and accumulators are, their importance, and practical examples in PySpark and Scala.


What are Broadcast Variables in Spark?

A broadcast variable allows the programmer to keep a read-only copy of data cached on each worker node rather than shipping a copy with every task.

Instead of sending the same data multiple times, Spark broadcasts it once and reuses it across nodes. This improves efficiency, especially when working with lookup tables, configurations, or reference data.

Example: Broadcast Variable in PySpark

from pyspark import SparkContext

sc = SparkContext("local", "Broadcast Example")

# Broadcast variable
lookup_data = {"A": "Apple", "B": "Banana", "C": "Cherry"}
broadcast_var = sc.broadcast(lookup_data)

# Sample RDD
rdd = sc.parallelize(["A", "B", "C", "A", "B"])

# Use broadcast variable
result = rdd.map(lambda x: (x, broadcast_var.value[x])).collect()

print(result)

✅ Output: [('A', 'Apple'), ('B', 'Banana'), ('C', 'Cherry'), ('A', 'Apple'), ('B', 'Banana')]
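
If a key might be missing from the lookup table, dict.get() avoids a KeyError on the workers. Below is a minimal follow-up sketch reusing sc and broadcast_var from the example above (the extra key "D" is made up for illustration):

# Keys missing from the lookup are handled with dict.get() instead of failing
rdd2 = sc.parallelize(["A", "D"])  # "D" has no entry in the lookup table
safe_result = rdd2.map(lambda x: (x, broadcast_var.value.get(x, "Unknown"))).collect()
print(safe_result)  # [('A', 'Apple'), ('D', 'Unknown')]

# Release the cached copies on the executors once the broadcast is no longer needed
broadcast_var.unpersist()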


What are Accumulators in Spark?

An accumulator is a variable that workers can “add” to using an associative operation, and the result is only available to the driver program.

They are often used for counters, sums, and debugging. Unlike regular variables, accumulators are write-only from workers and readable on the driver.

Example: Accumulator in PySpark

from pyspark import SparkContext

sc = SparkContext("local", "Accumulator Example")

# Create an accumulator
acc = sc.accumulator(0)

rdd = sc.parallelize([1, 2, 3, 4, 5])

# Add values to accumulator
def add_func(x):
    global acc
    acc += x

rdd.foreach(add_func)

print("Accumulator Value:", acc.value)

✅ Output: Accumulator Value: 15
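
The "add" operation only needs to be associative, and PySpark lets you supply custom merge behaviour through AccumulatorParam. Here is a minimal sketch, reusing sc and rdd from the example above, of a hypothetical accumulator that collects values into a list (the class name is purely illustrative):

from pyspark.accumulators import AccumulatorParam

class ListAccumulatorParam(AccumulatorParam):
    # zero() defines the starting value used on each task
    def zero(self, initial_value):
        return []

    # addInPlace() must be an associative merge of two partial results
    def addInPlace(self, v1, v2):
        return v1 + (v2 if isinstance(v2, list) else [v2])

list_acc = sc.accumulator([], ListAccumulatorParam())

rdd.foreach(lambda x: list_acc.add(x * 10))

print("Collected:", sorted(list_acc.value))  # [10, 20, 30, 40, 50]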


Broadcast Variables vs Accumulators

| Feature  | Broadcast Variable                            | Accumulator                                      |
| -------- | --------------------------------------------- | ------------------------------------------------ |
| Purpose  | Distribute read-only data across nodes        | Aggregate values across nodes                    |
| Access   | Read-only                                     | Write-only from workers, readable on the driver  |
| Use Case | Lookup tables, configurations, reference data | Counters, sums, debugging                        |
| Scope    | Available to all workers                      | Updated by workers, collected at the driver      |

Use Cases in Real-World Applications

  1. Broadcast Variables

    • Distributing machine learning models to all worker nodes.

    • Sharing lookup/reference tables in ETL jobs.

    • Caching common datasets for repeated access.

  2. Accumulators

    • Counting bad records in data pipelines (see the sketch after this list).

    • Monitoring metrics (e.g., number of null values).

    • Summing transaction amounts across a cluster.
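
As a rough sketch of the bad-record use case above (the codes and record values are made up for illustration), a broadcast set of valid codes can drive the filtering while an accumulator counts the rejected records:

from pyspark import SparkContext

sc = SparkContext("local", "Bad Record Counter")

valid_codes = sc.broadcast({"A", "B", "C"})   # small reference set, shared read-only
bad_records = sc.accumulator(0)               # counter, written by workers, read on driver

records = sc.parallelize(["A", "B", "X", "C", "Y"])

def keep_valid(code):
    if code in valid_codes.value:
        return True
    bad_records.add(1)   # note: inside a transformation, so task retries may over-count
    return False

clean = records.filter(keep_valid)

print("Clean records:", clean.collect())       # ['A', 'B', 'C']
print("Bad records seen:", bad_records.value)  # 2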


 

 

Example: Broadcast Variable in Scala

import org.apache.spark.sql.SparkSession

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Broadcast Example").getOrCreate()
    val sc = spark.sparkContext

    // Large lookup map
    val lookupMap = Map(1 -> "Apple", 2 -> "Banana", 3 -> "Cherry")
    val broadcastVar = sc.broadcast(lookupMap)

    val data = sc.parallelize(Seq(1, 2, 3, 4))

    val result = data.map(x => broadcastVar.value.getOrElse(x, "Unknown"))
    result.collect().foreach(println)

    spark.stop()
  }
}

Output:

Apple
Banana
Cherry
Unknown

Example: Accumulator in Scala

As noted above, accumulators are write-only for workers and readable only on the driver. They are commonly used for monitoring, debugging, or implementing custom metrics.

import org.apache.spark.sql.SparkSession

object AccumulatorExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Accumulator Example").getOrCreate()
    val sc = spark.sparkContext

    // Define an accumulator
    val accumulator = sc.longAccumulator("My Accumulator")

    val data = sc.parallelize(Seq(1, 2, 3, 4, 5))

    data.foreach(x => accumulator.add(x))

    println("Accumulator Value: " + accumulator.value)

    spark.stop()
  }
}

Output:

Accumulator Value: 15

Key Differences Between Broadcast Variables and Accumulators

| Feature          | Broadcast Variables               | Accumulators                    |
| ---------------- | --------------------------------- | ------------------------------- |
| Purpose          | Share read-only data across nodes | Aggregate values across tasks   |
| Accessibility    | Read-only on workers              | Write-only on workers           |
| Example Use Case | Lookup tables, configuration data | Counting, summation, metrics    |
| Data Movement    | Cached on each worker node        | Values sent back to the driver  |

 

Best Practices

  • Use broadcast variables for reference datasets instead of repeatedly joining them, as long as the data fits comfortably in executor memory (see the sketch after this list).

  • Use accumulators for counters and debugging, not for main business logic, since their updates are not guaranteed to happen exactly once.

  • Rely on accumulator values carefully in production: if a task fails and is retried, or a stage is recomputed, updates made inside transformations may be applied more than once. Only updates made inside actions (such as foreach) are guaranteed to be applied exactly once.
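
As a rough illustration of the first point, a small reference table can be collected once, broadcast, and applied as a map-side lookup instead of being joined into every query. This is only a sketch under assumed data (the column names and the country_names dictionary are invented); for reference data that is itself a DataFrame, Spark's broadcast join hint (pyspark.sql.functions.broadcast) is another common option.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("Broadcast Lookup Instead Of Join").getOrCreate()

# Hypothetical small reference table, broadcast once to every executor
country_names = {"IN": "India", "US": "United States", "DE": "Germany"}
country_bc = spark.sparkContext.broadcast(country_names)

orders = spark.createDataFrame(
    [(1, "IN"), (2, "US"), (3, "FR")],
    ["order_id", "country_code"],
)

# Map-side lookup through a UDF: no shuffle join against the reference table
@udf(returnType=StringType())
def country_name(code):
    return country_bc.value.get(code, "Unknown")

orders.withColumn("country", country_name("country_code")).show()

spark.stop()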


Conclusion

Both Broadcast Variables and Accumulators are essential tools in Apache Spark for optimizing performance and managing distributed data processing.

  • Use Broadcast Variables when you need to efficiently share large read-only datasets across executors.

  • Use Accumulators when you need to collect metrics or perform aggregations from worker nodes back to the driver.

Mastering these concepts helps in writing efficient, scalable, and optimized Spark applications.
