Uncategorized

Mastering SQL & PySpark Interview Questions: How to Find Users’ Most Visited Floor, Total Visits & Unique Resources

Mastering SQL & PySpark Interview Questions: How to Find Users’ Most Visited Floor, Total Visits & Unique Resources

Introduction

In data engineering, understanding user interaction patterns across different locations or resources is crucial for business insights, optimizing experiences, and resource allocation. This problem demonstrates how to aggregate user visit data, identify the most frequented floor per user, calculate total visits, and list unique resources used, using powerful SQL and PySpark techniques.

Problem Statement

For each user in the dataset, determine the following:

  • The most visited floor number.
  • The total number of visits made.
  • The list of unique resources used by the user.

This helps to summarize user activity and resources interaction by aggregating across various visits.

Dataset (SQL – Notebook Friendly)

-- Create table
CREATE TABLE user_visits (
    name STRING,
    location STRING,
    email STRING,
    floor INT,
    resource STRING
);

-- Insert sample data
INSERT INTO user_visits VALUES
('John','New York','john@example.com',3,'Laptop'),
('John','New York','john.doe@example.com',2,'Monitor'),
('John','New York','johnd@example.com',3,'Laptop'),
('Emily','San Francisco','emily.sf@example.com',1,'Desktop'),
('Emily','San Francisco','emily.e@example.com',1,'Laptop'),
('Emily','San Francisco','emily.sf2@example.com',2,'Tablet'),
('Michael','Chicago','michael.c@example.com',4,'Projector'),
('Michael','Chicago','mike@example.com',4,'Projector'),
('Michael','Chicago','michael.c@example.com',3,'Desktop');

Sample Data (Tabular View)

name location email floor resource
John New York john@example.com 3 Laptop
John New York john.doe@example.com 2 Monitor
John New York johnd@example.com 3 Laptop
Emily San Francisco emily.sf@example.com 1 Desktop
Emily San Francisco emily.e@example.com 1 Laptop
Emily San Francisco emily.sf2@example.com 2 Tablet
Michael Chicago michael.c@example.com 4 Projector
Michael Chicago mike@example.com 4 Projector
Michael Chicago michael.c@example.com 3 Desktop

Expected Output

name most_visited_floor total_visits used_resources
John 3 3 “Laptop,Monitor”
Emily 1 3 “Desktop,Laptop,Tablet”
Michael 4 3 “Desktop,Projector”

SQL Solution (Notebook Ready)

WITH floor_counts AS (
    SELECT
        name,
        floor,
        COUNT(*) AS visit_count
    FROM user_visits
    GROUP BY name, floor
),
ranked_floors AS (
    SELECT
        name,
        floor,
        visit_count,
        ROW_NUMBER() OVER (PARTITION BY name ORDER BY visit_count DESC, floor ASC) AS rn
    FROM floor_counts
),
total_visits AS (
    SELECT
        name,
        COUNT(*) AS total_visits
    FROM user_visits
    GROUP BY name
),
resources_list AS (
    SELECT
        name,
        STRING_AGG(DISTINCT resource, ',') AS used_resources
    FROM user_visits
    GROUP BY name
)
SELECT
    t.name,
    rf.floor AS most_visited_floor,
    t.total_visits,
    r.used_resources
FROM total_visits t
JOIN ranked_floors rf ON t.name = rf.name AND rf.rn = 1
JOIN resources_list r ON t.name = r.name
ORDER BY t.name;

PySpark Solution (Databricks Ready)

# PySpark DataFrame solution
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, row_number, collect_set, concat_ws
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("UserVisitsAnalysis").getOrCreate()

# Sample data
data = [
    ('John','New York','john@example.com',3,'Laptop'),
    ('John','New York','john.doe@example.com',2,'Monitor'),
    ('John','New York','johnd@example.com',3,'Laptop'),
    ('Emily','San Francisco','emily.sf@example.com',1,'Desktop'),
    ('Emily','San Francisco','emily.e@example.com',1,'Laptop'),
    ('Emily','San Francisco','emily.sf2@example.com',2,'Tablet'),
    ('Michael','Chicago','michael.c@example.com',4,'Projector'),
    ('Michael','Chicago','mike@example.com',4,'Projector'),
    ('Michael','Chicago','michael.c@example.com',3,'Desktop')
]

columns = ['name', 'location', 'email', 'floor', 'resource']
df = spark.createDataFrame(data, schema=columns)

# Calculate visits per floor per user
floor_visits = df.groupBy('name', 'floor').agg(count('*').alias('visit_count'))

# Identify most visited floor using window function
window_spec = Window.partitionBy('name').orderBy(col('visit_count').desc(), col('floor').asc())
floor_ranked = floor_visits.withColumn('rank', row_number().over(window_spec))
most_visited_floor = floor_ranked.filter(col('rank') == 1).select('name', col('floor').alias('most_visited_floor'))

# Total visits per user
total_visits = df.groupBy('name').agg(count('*').alias('total_visits'))

# Unique resources per user
resources = df.groupBy('name').agg(concat_ws(",", collect_set('resource')).alias('used_resources'))

# Join all results
result = total_visits.join(most_visited_floor, 'name').join(resources, 'name').orderBy('name')

result.show(truncate=False)

Explanation

– First, we count the number of visits each user made to each floor. This helps us identify the floor with the highest visits per user using a window function ordered by visit count descending and floor ascending (to break ties).

– Next, we calculate total visits for each user by counting their visits across all floors.

– To get a unique list of resources used by each user, we collect distinct resource values and concatenate them into a comma-separated string.

– Finally, we join these intermediate results on the user’s name to produce the final output showing the most visited floor, total visits, and used resources.

Run in Notebook

Download full notebook here:
https://example.com/sql-pyspark-user-visits-notebook

Conclusion

This problem demonstrates critical SQL and PySpark concepts like window functions, aggregation, string aggregation, and joins in a straightforward use case similar to real-world user activity summaries. It’s a great interview question to understand your ability to manipulate grouped data and derive meaningful insights efficiently. Make sure you practice both SQL and PySpark approaches to stand out in data engineering interviews!

5 thoughts on “Mastering SQL & PySpark Interview Questions: How to Find Users’ Most Visited Floor, Total Visits & Unique Resources

  1. imkerei says:

    Sweet blog! I found it while surfing around on Yahoo News.
    Do you have any tips on how to get listed in Yahoo
    News? I’ve been trying for a while but I never seem to get
    there! Thank you

  2. imker portal says:

    Greetings, I do think your website may be having internet browser compatibility
    problems. When I look at your web site in Safari, it
    looks fine however, if opening in I.E., it has some overlapping issues.
    I simply wanted to give you a quick heads up!

    Apart from that, great site!

  3. My spouse and I absolutely love your blog and find the majority of your post’s to be exactly what I’m
    looking for. Do you offer guest writers to write content in your case?
    I wouldn’t mind producing a post or elaborating on a number of the subjects you write related to here.

    Again, awesome web log!

  4. Great article, totawlly what I wasnted too find.

    Feel frde too visit myy wbpage :: xmxxtube bokep

  5. cnhub.xyz says:

    Petty niice post. I just stumbped upoin your weblpog andd
    wished tto ssay that I havge reallly enjoyed surfiing aroumd yoyr blig posts.
    After alll I’ll be subscribiing tto your feeed aand I hhope you
    rite agaun soon!

    my wweb page cnhub.xyz

Leave a Reply

Your email address will not be published. Required fields are marked *