pyspark filter - apply a filter to a section of a PySpark DataFrame
In PySpark, the filter transformation selects the elements of an RDD (Resilient Distributed Dataset) that satisfy a given condition. Like Python's built-in filter function, it is a transformation: it returns a new RDD and leaves the original unchanged.
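As a quick sketch of the RDD form (the sample numbers and predicate here are illustrative, not part of the example below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-demo").getOrCreate()

# filter is lazy: it returns a new RDD and leaves the original unchanged
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
evens = rdd.filter(lambda x: x % 2 == 0)  # keep only even numbers
print(evens.collect())  # [2, 4]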
We now build a filter on top of a DataFrame: keep only the rows where the salary column is greater than 5000 and the gender column is 'M'.
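A minimal sketch to recreate the sample data as a DataFrame (the SparkSession setup is assumed; the values are taken from the table below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-demo").getOrCreate()

# sample rows matching the table below
data = [
    ("Michael", "Rose", "", 40288, "M", 4000),
    ("shubham", "", "Williams", 42114, "M", 6000),
    ("Maria", "Anne", "Jones", 39192, "F", 4000),
    ("Jen", "Mary", "Brown", 30001, "F", 2000),
]
columns = ["firstname", "middlename", "lastname", "id", "gender", "salary"]
df = spark.createDataFrame(data, columns)
df.show()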
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|Michael  |Rose      |        |40288|M     |4000  |
|shubham  |          |Williams|42114|M     |6000  |
|Maria    |Anne      |Jones   |39192|F     |4000  |
|Jen      |Mary      |Brown   |30001|F     |2000  |
+---------+----------+--------+-----+------+------+
df.where("gender == 'M' and salary > 5000").show()
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|shubham  |          |Williams|42114|M     |6000  |
+---------+----------+--------+-----+------+------+
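The same condition can also be written with Column expressions instead of a SQL string; this is an equivalent alternative, not what the snippet above uses:

from pyspark.sql.functions import col

# & combines the two predicates; the parentheses around each comparison are required
df.filter((col("gender") == "M") & (col("salary") > 5000)).show()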
In this DataFrame example, we filter rows on a compound condition: the salary column must be greater than 5000 and the gender column must equal 'M'. where is simply an alias for the filter transformation, so, just as with an RDD, it is applied lazily to the original DataFrame; the show action then triggers execution and displays the matching rows.