Find the companies who have atleast 2 users who speak both English and German

8/27/2024
All Articles

#Spark #bigdataSql #interviewquestion #advancesql

Find the companies who have atleast 2 users who speak both English and German

  Find the companies who have atleast 2 users who speak both English and German using pyspark dataframe

schema =["company_id", "user_id", "language"]
 
from pyspark.sql.types import *
 
cSchema = StructType([StructField("company_id", StringType())\
,StructField("user_id", IntegerType()) , StructField("language" ,StringType()) ])
 
data = [("A", 1, "English"),
("A", 1, "German"),
("A", 2, "English"),
("A", 2, "German"),
("A", 3, "German"),
("B", 1, "English"),
("B", 2, "German"),
("C", 1, "English"),
("C", 2, "German")] 

Solution :
As per below steps ,try to create multiple layer of dataframe.

  •     Pivot Table: The pivot_table is used to restructure the data so that each user in a company is checked for speaking both English and German.
  •     Filter Bilingual Users: We then filter the users who speak both English and German.
  •     Group by Company: After filtering bilingual users, we group by company and count the number of such users for each company.
  •     Filter Companies: Finally, we only return the companies that have at least two users speaking both languages.
 
df = pd.DataFrame(data, columns=["Company", "User", "Language"])
 
# Pivot table to find out which user speaks both English and German
pivot_df = df.pivot_table(index=["Company", "User"], columns="Language", aggfunc="size", fill_value=0)
 
# Filter out users who speak both English and German
bilingual_users = pivot_df[(pivot_df["English"] > 0) & (pivot_df["German"] > 0)].reset_index()
 
# Count users per company who speak both languages
result = bilingual_users.groupby("Company").size().reset_index(name="Bilingual_Users")
 
# Filter companies with at least 2 bilingual users
final_result = result[result["Bilingual_Users"] >= 2]["Company"]
 

Conclusion: 

 
You can solve this problem by grouping the data by company and user, then checking if a user speaks both English and German. After that, you can filter the companies that have at least two such users.
 

 

Article