Find the companies that have at least 2 users who speak both English and German, using a PySpark DataFrame
schema = ["company_id", "user_id", "language"]
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
cSchema = StructType([
    StructField("company_id", StringType()),
    StructField("user_id", IntegerType()),
    StructField("language", StringType()),
])
data = [("A", 1, "English"),
("A", 1, "German"),
("A", 2, "English"),
("A", 2, "German"),
("A", 3, "German"),
("B", 1, "English"),
("B", 2, "German"),
("C", 1, "English"),
("C", 2, "German")]
Solution:
The steps below build the result in layers, one intermediate DataFrame per step. The walkthrough itself uses pandas for clarity; a PySpark version of the same steps follows the code.
- Pivot Table: Pivot the data so that each (company, user) pair gets a count per language, which makes it easy to check whether a user speaks both English and German.
- Filter Bilingual Users: Keep only the users who speak both English and German.
- Group by Company: Group the bilingual users by company and count them for each company.
- Filter Companies: Return only the companies with at least two such users.
import pandas as pd

df = pd.DataFrame(data, columns=["Company", "User", "Language"])
# Pivot table to find out which user speaks both English and German
pivot_df = df.pivot_table(index=["Company", "User"], columns="Language", aggfunc="size", fill_value=0)
# Filter out users who speak both English and German
bilingual_users = pivot_df[(pivot_df["English"] > 0) & (pivot_df["German"] > 0)].reset_index()
# Count users per company who speak both languages
result = bilingual_users.groupby("Company").size().reset_index(name="Bilingual_Users")
# Filter companies with at least 2 bilingual users
final_result = result[result["Bilingual_Users"] >= 2]["Company"]
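Since the question asks for PySpark, here is one way to express the same four steps with DataFrame operations. This is a sketch rather than the original author's code; it assumes the sdf DataFrame created earlier from data and cSchema:

from pyspark.sql import functions as F

# Step 1: pivot so each (company_id, user_id) row has a count per language.
pivot_sdf = (sdf.groupBy("company_id", "user_id")
                .pivot("language", ["English", "German"])
                .count()
                .fillna(0))

# Step 2: keep only users who speak both English and German.
bilingual = pivot_sdf.filter((F.col("English") > 0) & (F.col("German") > 0))

# Steps 3-4: count bilingual users per company, keep companies with >= 2.
result = (bilingual.groupBy("company_id")
                   .agg(F.count("user_id").alias("bilingual_users"))
                   .filter(F.col("bilingual_users") >= 2)
                   .select("company_id"))

result.show()

On the sample data this returns company A only: users 1 and 2 there speak both languages, while companies B and C have no bilingual users.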
Conclusion:
Group the data by company and user, check whether each user speaks both English and German, and then keep only the companies with at least two such users. The same four-step pattern works in both pandas and PySpark, as shown above.
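As a side note, the grouping approach described in the conclusion can also be written without a pivot, for example by counting distinct matching languages per user. Again a sketch, under the same assumptions about sdf:

from pyspark.sql import functions as F

# Keep only the two languages of interest, then count distinct languages
# per (company_id, user_id); a count of 2 means the user speaks both.
alt = (sdf.filter(F.col("language").isin("English", "German"))
          .groupBy("company_id", "user_id")
          .agg(F.countDistinct("language").alias("n_langs"))
          .filter(F.col("n_langs") == 2)
          .groupBy("company_id")
          .count()
          .filter(F.col("count") >= 2)
          .select("company_id"))

alt.show()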