Find the companies that have at least 2 users who speak both English and German
PySpark DataFrame #Spark #bigdataSql #interviewquestion
In today’s globalized world, companies often have employees who speak multiple languages. Identifying companies with bilingual users can be crucial for tasks like localization, customer support, and team collaboration. In this article, we’ll explore how to use PySpark DataFrame to find companies with at least two users who speak both English and German.
Given a dataset with three columns, company_id (string), user_id (integer), and language (string), we need to identify companies that have at least two users who speak both English and German.
Here’s a sample dataset:
data = [("A", 1, "English"),
("A", 1, "German"),
("A", 2, "English"),
("A", 2, "German"),
("A", 3, "German"),
("B", 1, "English"),
("B", 2, "German"),
("C", 1, "English"),
("C", 2, "German")]
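Before turning to Spark, it helps to sanity-check the expected answer with plain Python (a quick sketch, not part of the PySpark solution): group the rows by (company, user), keep users whose language set contains both English and German, then count those users per company.

```python
from collections import defaultdict

data = [("A", 1, "English"), ("A", 1, "German"), ("A", 2, "English"),
        ("A", 2, "German"), ("A", 3, "German"), ("B", 1, "English"),
        ("B", 2, "German"), ("C", 1, "English"), ("C", 2, "German")]

# Collect each user's set of languages per (company, user) pair
languages = defaultdict(set)
for company, user, lang in data:
    languages[(company, user)].add(lang)

# Count bilingual (English + German) users per company
bilingual_counts = defaultdict(int)
for (company, _user), langs in languages.items():
    if {"English", "German"} <= langs:
        bilingual_counts[company] += 1

# Keep companies with at least 2 bilingual users
result = sorted(c for c, n in bilingual_counts.items() if n >= 2)
print(result)  # ['A'] -- only company A has two users (1 and 2) who speak both
```

Only company A qualifies: its users 1 and 2 each appear with both languages, while companies B and C have no bilingual users at all. The PySpark solution below should reproduce this answer.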
First, import the necessary libraries and create a PySpark DataFrame.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Initialize Spark session
spark = SparkSession.builder.appName("BilingualUsers").getOrCreate()
# Define schema
schema = StructType([
StructField("company_id", StringType(), True),
StructField("user_id", IntegerType(), True),
StructField("language", StringType(), True)
])
# Create DataFrame
data = [("A", 1, "English"), ("A", 1, "German"), ("A", 2, "English"),
("A", 2, "German"), ("A", 3, "German"), ("B", 1, "English"),
("B", 2, "German"), ("C", 1, "English"), ("C", 2, "German")]
df = spark.createDataFrame(data, schema)
Next, we need to identify users who speak both languages. This can be done by grouping the data by company_id and user_id, then filtering for users who have entries for both languages.
from pyspark.sql.functions import collect_set, array_contains, col
# Group by company_id and user_id, and collect languages
grouped_df = df.groupBy("company_id", "user_id").agg(collect_set("language").alias("languages"))
# Filter users who speak both English and German
bilingual_users = grouped_df \
    .filter(array_contains(col("languages"), "English")) \
    .filter(array_contains(col("languages"), "German"))
Next, count the number of bilingual users for each company.
from pyspark.sql.functions import count
# Count bilingual users per company
bilingual_count = bilingual_users.groupBy("company_id").agg(count("user_id").alias("bilingual_users"))
Finally, filter companies that have at least two bilingual users.
# Filter companies with at least 2 bilingual users
result = bilingual_count.filter(col("bilingual_users") >= 2).select("company_id")
Show the final result.
result.show()
Here’s the complete code for the solution:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import collect_set, array_contains, count, col
# Initialize Spark session
spark = SparkSession.builder.appName("BilingualUsers").getOrCreate()
# Define schema
schema = StructType([
StructField("company_id", StringType(), True),
StructField("user_id", IntegerType(), True),
StructField("language", StringType(), True)
])
# Create DataFrame
data = [("A", 1, "English"), ("A", 1, "German"), ("A", 2, "English"),
("A", 2, "German"), ("A", 3, "German"), ("B", 1, "English"),
("B", 2, "German"), ("C", 1, "English"), ("C", 2, "German")]
df = spark.createDataFrame(data, schema)
# Group by company_id and user_id, and collect languages
grouped_df = df.groupBy("company_id", "user_id").agg(collect_set("language").alias("languages"))
# Filter users who speak both English and German
bilingual_users = grouped_df \
    .filter(array_contains(col("languages"), "English")) \
    .filter(array_contains(col("languages"), "German"))
# Count bilingual users per company
bilingual_count = bilingual_users.groupBy("company_id").agg(count("user_id").alias("bilingual_users"))
# Filter companies with at least 2 bilingual users
result = bilingual_count.filter(col("bilingual_users") >= 2).select("company_id")
# Display the result
result.show()
To summarize: the data is grouped by company_id and user_id, each user's languages are collected into a set, and users who speak both English and German are kept. Using the PySpark DataFrame API, we can efficiently identify companies with at least two users who speak both languages. This approach leverages Spark's distributed data processing, making it scalable for large datasets, and the steps outlined in this guide can easily be adapted to similar problems in your own data analysis workflows.