pyspark filter - apply a filter to a section of a PySpark DataFrame
In PySpark, the filter transformation selects the elements of an RDD (Resilient Distributed Dataset) that satisfy a given condition. Like Python's built-in filter function, it is a transformation: it returns a new RDD and leaves the original unchanged.
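As a quick sketch of the RDD form (the sample numbers and predicate here are illustrative, not part of the example below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-demo").getOrCreate()

# filter is lazy: it returns a new RDD and leaves the original unchanged
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
evens = rdd.filter(lambda x: x % 2 == 0)  # keep only even numbers
print(evens.collect())  # [2, 4]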
We now build a filter on top of a DataFrame: keep only the rows where the salary column is greater than 5000 and the gender column is 'M'.
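A minimal sketch to recreate the sample data as a DataFrame (the SparkSession setup is assumed; the values are taken from the table below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-demo").getOrCreate()

# sample rows matching the table below
data = [
    ("Michael", "Rose", "", 40288, "M", 4000),
    ("shubham", "", "Williams", 42114, "M", 6000),
    ("Maria", "Anne", "Jones", 39192, "F", 4000),
    ("Jen", "Mary", "Brown", 30001, "F", 2000),
]
columns = ["firstname", "middlename", "lastname", "id", "gender", "salary"]
df = spark.createDataFrame(data, columns)
df.show()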
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|Michael  |Rose      |        |40288|M     |4000  |
|shubham  |          |Williams|42114|M     |6000  |
|Maria    |Anne      |Jones   |39192|F     |4000  |
|Jen      |Mary      |Brown   |30001|F     |2000  |
+---------+----------+--------+-----+------+------+
df.where("gender == 'M' and salary > 5000").show()
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|shubham  |          |Williams|42114|M     |6000  |
+---------+----------+--------+-----+------+------+
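The same condition can also be written with Column expressions instead of a SQL string; this is an equivalent alternative, not what the snippet above uses:

from pyspark.sql.functions import col

# & combines the two predicates; the parentheses around each comparison are required
df.filter((col("gender") == "M") & (col("salary") > 5000)).show()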
In this DataFrame example, we filter rows on a compound condition: the salary column must be greater than 5000 and the gender column must equal 'M'. where is simply an alias for the filter transformation, so, just as with an RDD, it is applied lazily to the original DataFrame; the show action then triggers execution and displays the matching rows.