Find the companies that have at least 2 users who speak both English and German

8/27/2024

PySpark DataFrame #Spark #bigdataSql #interviewquestion


How to Find Companies with Bilingual Users (English & German) Using PySpark DataFrames: A Step-by-Step Guide

Introduction

In today’s globalized world, companies often have employees who speak multiple languages. Identifying companies with bilingual users can be crucial for tasks like localization, customer support, and team collaboration. In this article, we’ll explore how to use PySpark DataFrame to find companies with at least two users who speak both English and German.

Problem Statement

Given a dataset with the following schema:

  • company_id: The ID of the company.
  • user_id: The ID of the user.
  • language: The language spoken by the user.

We need to identify companies that have at least two users who each speak both English and German, i.e., users with a row in the dataset for each of the two languages.

Dataset Example

Here’s a sample dataset:


data = [("A", 1, "English"),
        ("A", 1, "German"),
        ("A", 2, "English"),
        ("A", 2, "German"),
        ("A", 3, "German"),
        ("B", 1, "English"),
        ("B", 2, "German"),
        ("C", 1, "English"),
        ("C", 2, "German")]
    

Step-by-Step Solution Using PySpark

Step 1: Import Required Libraries

First, import the necessary libraries and create a PySpark DataFrame.


from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Initialize Spark session
spark = SparkSession.builder.appName("BilingualUsers").getOrCreate()

# Define schema
schema = StructType([
    StructField("company_id", StringType(), True),
    StructField("user_id", IntegerType(), True),
    StructField("language", StringType(), True)
])

# Create DataFrame
data = [("A", 1, "English"), ("A", 1, "German"), ("A", 2, "English"),
        ("A", 2, "German"), ("A", 3, "German"), ("B", 1, "English"),
        ("B", 2, "German"), ("C", 1, "English"), ("C", 2, "German")]

df = spark.createDataFrame(data, schema)
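
To sanity-check the setup, display the DataFrame; it should print all nine rows with the columns company_id, user_id, and language:


df.show()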
    

Step 2: Filter Users Who Speak Both English and German

We first identify users who speak both languages. This can be done by grouping the data by company_id and user_id, collecting each user's languages into a set with collect_set, and keeping only users whose set contains both languages.


from pyspark.sql.functions import array_contains, col, collect_set

# Group by company_id and user_id, and collect each user's languages into a set
grouped_df = df.groupBy("company_id", "user_id").agg(collect_set("language").alias("languages"))

# Keep only users whose language set contains both English and German
bilingual_users = grouped_df.filter(array_contains(col("languages"), "English")) \
                            .filter(array_contains(col("languages"), "German"))
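
For the sample data, this should leave exactly two rows, both for company A. The exact row order, and the order of languages inside each collected set, may vary:


bilingual_users.show(truncate=False)
# +----------+-------+------------------+
# |company_id|user_id|languages         |
# +----------+-------+------------------+
# |A         |1      |[English, German] |
# |A         |2      |[English, German] |
# +----------+-------+------------------+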
    

Step 3: Count Bilingual Users per Company

Next, count the number of bilingual users for each company.


from pyspark.sql.functions import count

# Count bilingual users per company
bilingual_count = bilingual_users.groupBy("company_id").agg(count("user_id").alias("bilingual_users"))
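
With the sample data, bilingual_count contains a single row: company A with bilingual_users = 2. Companies B and C drop out entirely, since none of their users speaks both languages.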
    

Step 4: Filter Companies with At Least Two Bilingual Users

Finally, filter companies that have at least two bilingual users.


# Filter companies with at least 2 bilingual users
result = bilingual_count.filter(col("bilingual_users") >= 2).select("company_id")
    

Step 5: Display the Result

Show the final result.


result.show()
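
For the sample dataset, the output should be:


+----------+
|company_id|
+----------+
|         A|
+----------+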
    

Full PySpark Code

Here’s the complete code for the solution:


from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import collect_set, array_contains, count, col

# Initialize Spark session
spark = SparkSession.builder.appName("BilingualUsers").getOrCreate()

# Define schema
schema = StructType([
    StructField("company_id", StringType(), True),
    StructField("user_id", IntegerType(), True),
    StructField("language", StringType(), True)
])

# Create DataFrame
data = [("A", 1, "English"), ("A", 1, "German"), ("A", 2, "English"),
        ("A", 2, "German"), ("A", 3, "German"), ("B", 1, "English"),
        ("B", 2, "German"), ("C", 1, "English"), ("C", 2, "German")]

df = spark.createDataFrame(data, schema)

# Group by company_id and user_id, and collect languages
grouped_df = df.groupBy("company_id", "user_id").agg(collect_set("language").alias("languages"))

# Filter users who speak both English and German
bilingual_users = grouped_df.filter(array_contains(col("languages"), "English")) \
                            .filter(array_contains(col("languages"), "German"))

# Count bilingual users per company
bilingual_count = bilingual_users.groupBy("company_id").agg(count("user_id").alias("bilingual_users"))

# Filter companies with at least 2 bilingual users
result = bilingual_count.filter(col("bilingual_users") >= 2).select("company_id")

# Display the result
result.show()
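
Equivalent Spark SQL Version

Since this problem is often asked in SQL form as well, here is a sketch of an equivalent Spark SQL query run against the same DataFrame. The temporary view name users is our own choice, not part of the original problem:


# Register the DataFrame as a temporary view so we can query it with SQL
df.createOrReplaceTempView("users")

result_sql = spark.sql("""
    SELECT company_id
    FROM (
        -- Users with rows for both languages
        SELECT company_id, user_id
        FROM users
        WHERE language IN ('English', 'German')
        GROUP BY company_id, user_id
        HAVING COUNT(DISTINCT language) = 2
    ) bilingual
    GROUP BY company_id
    HAVING COUNT(*) >= 2
""")

result_sql.show()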
    

Explanation of the Solution

  1. Data Preparation: The dataset is loaded into a PySpark DataFrame with the appropriate schema.
  2. Grouping and Filtering: Rows are grouped by company_id and user_id, each user's languages are collected into a set, and only users whose set contains both English and German are kept.
  3. Counting Bilingual Users: The number of bilingual users is counted for each company.
  4. Final Filtering: Companies with at least two bilingual users are identified and displayed.
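
The same pipeline can also be written as a single chained expression. This is just a condensed restatement of Steps 2–4, not a different algorithm:


from pyspark.sql.functions import array_contains, col, collect_set, count

result_compact = (
    df.groupBy("company_id", "user_id")
      .agg(collect_set("language").alias("languages"))          # languages per user
      .filter(array_contains(col("languages"), "English")
              & array_contains(col("languages"), "German"))     # bilingual users only
      .groupBy("company_id")
      .agg(count("user_id").alias("bilingual_users"))           # bilingual users per company
      .filter(col("bilingual_users") >= 2)
      .select("company_id")
)

result_compact.show()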

Conclusion

Using PySpark DataFrames, we can efficiently identify companies with at least two users who speak both English and German. This approach leverages PySpark's powerful data manipulation capabilities, making it scalable for large datasets. By following the steps outlined in this guide, you can easily adapt this solution to similar problems in your data analysis workflows.
