pyspark filter- apply a filter to a section of a Pyspark dataframe

1/26/2024


In PySpark, the filter transformation selects elements from an RDD (Resilient Distributed Dataset) that satisfy a given condition. Because it returns a new RDD without changing the original, filter is a transformation, much like Python's built-in filter function. The same idea carries over to DataFrames through the filter and where methods.

We are creating a filter on top of a DataFrame. Here, we want rows where the salary field is greater than 5000 and the gender is male.

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|Michael  |Rose      |        |40288|M     |4000  |
|shubham  |          |Williams|42114|M     |6000  |
|Maria    |Anne      |Jones   |39192|F     |4000  |
|Jen      |Mary      |Brown   |30001|F     |2000  |
+---------+----------+--------+-----+------+------+

df.where(''' gender == 'M' and salary > 5000 ''').show()
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|shubham  |          |Williams|42114|M     |6000  |
+---------+----------+--------+-----+------+------+


Conclusion:

In this example, we filtered DataFrame rows on two conditions: the 'salary' column must be greater than 5000 and the 'gender' column must be 'M'. The filter (or where) transformation returns a new DataFrame without modifying the original, and the show action is then used to display the results.
