Mastering SQL & PySpark Interview Questions: Identify Top 80% Sales-Contributing Products with Window Functions
Introduction
In the world of data engineering, understanding how to analyze sales data efficiently is critical for driving business decisions. One common real-world task is identifying which products contribute to a large fraction of total revenue—say, the top 80% of sales. This analysis empowers companies to focus on high-impact products, optimize inventory, and launch informed marketing campaigns. Leveraging SQL and PySpark, data engineers can process large datasets with ease to uncover actionable insights.
Problem Statement
Find the products that together contribute to the top 80% of total sales.
—
Dataset (SQL – Notebook Friendly)
-- Create table
CREATE TABLE product_sales (
product_id STRING,
sales DECIMAL(10,3)
);
-- Insert sample data
INSERT INTO product_sales VALUES
('OFF-HP-10004552', 150.523),
('TEC-LA-10011223', 480.215),
('OFF-EN-10008123', 320.875),
('TEC-AC-10001900', 720.100),
('FUR-TA-10004100', 100.650),
('FUR-CH-10009999', 890.350),
('OFF-BI-10007777', 60.425),
('TEC-PH-10004567', 530.890),
('OFF-TE-10006987', 240.330);
Sample Data (Tabular View)
| product_id | sales |
|---|---|
| OFF-HP-10004552 | 150.523 |
| TEC-LA-10011223 | 480.215 |
| OFF-EN-10008123 | 320.875 |
| TEC-AC-10001900 | 720.100 |
| FUR-TA-10004100 | 100.650 |
| FUR-CH-10009999 | 890.350 |
| OFF-BI-10007777 | 60.425 |
| TEC-PH-10004567 | 530.890 |
| OFF-TE-10006987 | 240.330 |
Expected Output
| product_id | sales | running_sales | total_sales |
|---|---|---|---|
| FUR-CH-10009999 | 890.350 | 890.350 | 3494.358 |
| TEC-AC-10001900 | 720.100 | 1610.450 | 3494.358 |
| TEC-PH-10004567 | 530.890 | 2141.340 | 3494.358 |
| TEC-LA-10011223 | 480.215 | 2621.555 | 3494.358 |
| OFF-EN-10008123 | 320.875 | 2942.430 | 3494.358 |
SQL Solution (Notebook Ready)
WITH sales_with_total AS (
SELECT
product_id,
sales,
SUM(sales) OVER () AS total_sales
FROM product_sales
),
ranked_sales AS (
SELECT
product_id,
sales,
total_sales,
SUM(sales) OVER (ORDER BY sales DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_sales
FROM sales_with_total
)
SELECT
product_id,
sales,
running_sales,
total_sales
FROM ranked_sales
WHERE running_sales <= 0.80 * total_sales
ORDER BY running_sales;
PySpark Solution (Databricks Ready)
# Import necessary functions
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# Initialize Spark session
spark = SparkSession.builder.appName("Top80PercentSales").getOrCreate()
# Sample Data
data = [
("OFF-HP-10004552", 150.523),
("TEC-LA-10011223", 480.215),
("OFF-EN-10008123", 320.875),
("TEC-AC-10001900", 720.100),
("FUR-TA-10004100", 100.650),
("FUR-CH-10009999", 890.350),
("OFF-BI-10007777", 60.425),
("TEC-PH-10004567", 530.890),
("OFF-TE-10006987", 240.330)
]
columns = ["product_id", "sales"]
# Create DataFrame
df = spark.createDataFrame(data, schema=columns)
# Calculate total sales
total_sales = df.agg(F.sum("sales").alias("total_sales")).collect()[0]["total_sales"]
# Define window specification ordered by descending sales
windowSpec = Window.orderBy(F.desc("sales")).rowsBetween(Window.unboundedPreceding, Window.currentRow)
# Add running sum column
df_with_running = df.withColumn("running_sales", F.sum("sales").over(windowSpec))
# Add total sales column
df_final = df_with_running.withColumn("total_sales", F.lit(total_sales))
# Filter to top 80% sales contributors
top_80_df = df_final.filter(F.col("running_sales") <= 0.8 * total_sales).orderBy("running_sales")
top_80_df.show()
Explanation
- We start by calculating the total sales across all products.
- Using window functions, we sort the products in descending order by sales.
- We then calculate a running total (cumulative sum) of sales for each product using the window frame from the start until the current row.
- The products contributing to the running total that does not exceed 80% of total sales are filtered and selected.
- This approach is efficient and scalable, leveraging analytic window functions available in both SQL and PySpark.
Run in Notebook
Download full notebook here: Mastering SQL & PySpark Top Sales Notebook
Conclusion
Identifying the top 80% sales contributors using cumulative sums and window functions is a common but vital task in data analysis and engineering. Mastery of these techniques is essential for interview success and real-world problem solving. Remember to practice writing both clean SQL and PySpark code with window functions—it’s a skill highly valued in data engineering roles.