Python PySpark sum() Function
PySpark, the Python API for Apache Spark, is a powerful tool for big data processing and analytics. One of its essential functions is sum(), which is part of the pyspark.sql.functions module. This function allows us to compute the sum of a column's values in a DataFrame, enabling efficient data analysis on large datasets.
Overview of the PySpark sum() Function
The sum() function in PySpark calculates the sum of a numeric column across the rows of a DataFrame. It can be applied as a DataFrame-wide aggregate or inside grouped (groupBy()) operations.
Syntax:
pyspark.sql.functions.sum(col)
Here, col is the column name or column expression for which we want to compute the sum.
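The column can be passed either as a string name or as a Column expression; here is a minimal sketch of both forms (assuming an existing DataFrame df with a numeric "Marks" column):
from pyspark.sql.functions import sum, col
# Both calls are equivalent: pass the column name as a string,
# or pass a Column expression built with col()
df.select(sum("Marks"))
df.select(sum(col("Marks")))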
To illustrate the use of sum(), let’s start with a simple example.
Setting Up PySpark
First, ensure we have PySpark installed. We can install it using pip if we haven't done so:
pip install pyspark
Example 1: Basic Sum Calculation
Let’s create a simple DataFrame and compute the sum of a numerical column.
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum
# Create a Spark session
spark = SparkSession.builder.appName("SumExample").getOrCreate()
# Sample data
data = [
("Science", 93),
("Physics", 72),
("Operation System", 81)
]
# Create DataFrame
df = spark.createDataFrame(data, ["Name", "Marks"])
# Show the DataFrame
df.show()
# Calculate the sum of the 'Marks' column
total_marks = df.select(sum("Marks")).collect()[0][0]
print(f"Total Marks: {total_marks}")
Output:
+----------------+-----+
|            Name|Marks|
+----------------+-----+
|         Science|   93|
|         Physics|   72|
|Operating System|   81|
+----------------+-----+

Total Marks: 246
Explanation:
- DataFrame Creation: We create a DataFrame with subject names and their marks.
- Sum Calculation: We use the sum() function to compute the total of the "Marks" column; collect() returns a list of Row objects, and [0][0] extracts the single aggregated value.
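Because the example imports sum directly, Python's built-in sum() is shadowed in that namespace; a common alternative is to import pyspark.sql.functions under an alias. The same total can also be computed with agg() instead of select(). A minimal sketch reusing the df from above:
from pyspark.sql import functions as F
# The module alias keeps Spark's sum() distinct from Python's built-in sum()
# agg() applies the aggregate over the whole DataFrame and returns a one-row DataFrame
total_marks = df.agg(F.sum("Marks").alias("Total_Marks")).first()["Total_Marks"]
print(f"Total Marks: {total_marks}")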
Example 2: sum() with groupBy() in Sales Data Analysis
Let's consider a more realistic scenario: analyzing sales data for a retail store. Suppose we have a DataFrame with sales records, including the product name and the total sales amount.
# Sample sales data
sales_data = [
("Laptop", 1200.00),
("Smartphone", 800.00),
("Tablet", 600.00),
("Laptop", 1500.00),
("Smartphone", 950.00)
]
# Create DataFrame
sales_df = spark.createDataFrame(sales_data, ["Product", "Sales"])
# Show the DataFrame
sales_df.show()
# Calculate total sales for each product
total_sales_by_product = sales_df.groupBy("Product").agg(sum("Sales").alias("Total_Sales"))
total_sales_by_product.show()
Output:
+----------+------+
|   Product| Sales|
+----------+------+
|    Laptop|1200.0|
|Smartphone| 800.0|
|    Tablet| 600.0|
|    Laptop|1500.0|
|Smartphone| 950.0|
+----------+------+

+----------+-----------+
|   Product|Total_Sales|
+----------+-----------+
|    Laptop|     2700.0|
|Smartphone|     1750.0|
|    Tablet|      600.0|
+----------+-----------+
(The row order of the grouped result may vary.)
Explanation:
- Group By: We use groupBy("Product") to group the sales records by product name.
- Aggregation: The agg(sum("Sales").alias("Total_Sales")) computes the total sales for each product, renaming the result to "Total_Sales".
Example 3: Using sum() with Conditions
We can also compute sums over only the rows that satisfy a condition. One straightforward approach is to filter the DataFrame with where() (or filter()) before aggregating; for instance, to total only the sales records that exceeded a certain threshold.
# ...
# Calculate the total of sales records greater than 1000
conditional_sum = sales_df.where("Sales > 1000").agg(sum("Sales").alias("Total_Sales_Above_1000"))
conditional_sum.show()
Output:
+----------------------+
|Total_Sales_Above_1000|
+----------------------+
|                2700.0|
+----------------------+
Explanation:
- Filtering: The where("Sales > 1000") filters the records to include only those with sales over 1000.
- Aggregation: The sum of the filtered records is computed.
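An alternative to filtering first is to embed the condition in the aggregation itself with when(): rows that fail the condition become null, and sum() ignores nulls. A minimal sketch of this variant using sales_df:
from pyspark.sql.functions import when, col
# Sum only the Sales values above 1000; non-matching rows yield null,
# which sum() skips, so the result matches the where()-based version
conditional_sum_when = sales_df.agg(
    sum(when(col("Sales") > 1000, col("Sales"))).alias("Total_Sales_Above_1000")
)
conditional_sum_when.show()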
Conclusion
The sum() function in PySpark is a fundamental tool for performing aggregations on large datasets. Whether you're calculating total values across a DataFrame or aggregating data based on groups, sum() provides a flexible and efficient way to handle numerical data.
In real-world applications, this function can be used extensively in data analysis tasks such as sales reporting, financial analysis, and performance tracking. With its ability to process massive amounts of data quickly, PySpark's sum() function plays a crucial role in the analytics landscape.