Python PySpark sum() Function
PySpark, the Python API for Apache Spark, is a powerful tool for big data processing and analytics. One of its essential functions is sum(), which is part of the pyspark.sql.functions module. This function allows us to compute the sum of a column's values in a DataFrame, enabling efficient data analysis on large datasets.
Overview of the PySpark sum() Function
The sum() function in PySpark calculates the sum of a numeric column across the rows of a DataFrame. It can be applied as a DataFrame-wide aggregate or inside grouped (groupBy()) operations.
Syntax:
pyspark.sql.functions.sum(col)
Here, col is the column name or column expression for which we want to compute the sum.
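The column can be passed either as a string name or as a Column expression; here is a minimal sketch of both forms (assuming an existing DataFrame df with a numeric "Marks" column):
from pyspark.sql.functions import sum, col
# Both calls are equivalent: pass the column name as a string,
# or pass a Column expression built with col()
df.select(sum("Marks"))
df.select(sum(col("Marks")))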
To illustrate the use of sum(), let’s start with a simple example.
Setting Up PySpark
First, ensure we have PySpark installed. We can install it using pip if we haven't done so:
pip install pyspark
Example 1: Basic Sum Calculation
Let’s create a simple DataFrame and compute the sum of a numerical column.
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum
# Create a Spark session
spark = SparkSession.builder.appName("SumExample").getOrCreate()
# Sample data
data = [
("Science", 93),
("Physics", 72),
("Operation System", 81)
]
# Create DataFrame
df = spark.createDataFrame(data, ["Name", "Marks"])
# Show the DataFrame
df.show()
# Calculate the sum of the 'Marks' column
total_marks = df.select(sum("Marks")).collect()[0][0]
print(f"Total Marks: {total_marks}")
Output:
+----------------+-----+
|            Name|Marks|
+----------------+-----+
|         Science|   93|
|         Physics|   72|
|Operating System|   81|
+----------------+-----+

Total Marks: 246
Explanation:
- DataFrame Creation: We create a DataFrame with subject names and their marks.
- Sum Calculation: We use the sum() function to compute the total of the "Marks" column; collect() returns a list of Row objects, and [0][0] extracts the single aggregated value.
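Because the example imports sum directly, Python's built-in sum() is shadowed in that namespace; a common alternative is to import pyspark.sql.functions under an alias. The same total can also be computed with agg() instead of select(). A minimal sketch reusing the df from above:
from pyspark.sql import functions as F
# The module alias keeps Spark's sum() distinct from Python's built-in sum()
# agg() applies the aggregate over the whole DataFrame and returns a one-row DataFrame
total_marks = df.agg(F.sum("Marks").alias("Total_Marks")).first()["Total_Marks"]
print(f"Total Marks: {total_marks}")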
Example 2: sum() with groupBy() in Sales Data Analysis
Let's consider a more realistic scenario: analyzing sales data for a retail store. Suppose we have a DataFrame with sales records, including the product name and the total sales amount.
# Sample sales data
sales_data = [
("Laptop", 1200.00),
("Smartphone", 800.00),
("Tablet", 600.00),
("Laptop", 1500.00),
("Smartphone", 950.00)
]
# Create DataFrame
sales_df = spark.createDataFrame(sales_data, ["Product", "Sales"])
# Show the DataFrame
sales_df.show()
# Calculate total sales for each product
total_sales_by_product = sales_df.groupBy("Product").agg(sum("Sales").alias("Total_Sales"))
total_sales_by_product.show()
Output:
+----------+------+
|   Product| Sales|
+----------+------+
|    Laptop|1200.0|
|Smartphone| 800.0|
|    Tablet| 600.0|
|    Laptop|1500.0|
|Smartphone| 950.0|
+----------+------+

+----------+-----------+
|   Product|Total_Sales|
+----------+-----------+
|    Laptop|     2700.0|
|Smartphone|     1750.0|
|    Tablet|      600.0|
+----------+-----------+
(The row order of the grouped result may vary.)
Explanation:
- Group By: We use groupBy("Product") to group the sales records by product name.
- Aggregation: The agg(sum("Sales").alias("Total_Sales")) computes the total sales for each product, renaming the result to "Total_Sales".
Example 3: Using sum() with Conditions
We can also compute sums over only the rows that satisfy a condition. One straightforward approach is to filter the DataFrame with where() (or filter()) before aggregating; for instance, to total only the sales records that exceeded a certain threshold.
# ...
# Calculate the total of sales records greater than 1000
conditional_sum = sales_df.where("Sales > 1000").agg(sum("Sales").alias("Total_Sales_Above_1000"))
conditional_sum.show()
Output:
+----------------------+
|Total_Sales_Above_1000|
+----------------------+
|                2700.0|
+----------------------+
Explanation:
- Filtering: The where("Sales > 1000") filters the records to include only those with sales over 1000.
- Aggregation: The sum of the filtered records is computed.
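An alternative to filtering first is to embed the condition in the aggregation itself with when(): rows that fail the condition become null, and sum() ignores nulls. A minimal sketch of this variant using sales_df:
from pyspark.sql.functions import when, col
# Sum only the Sales values above 1000; non-matching rows yield null,
# which sum() skips, so the result matches the where()-based version
conditional_sum_when = sales_df.agg(
    sum(when(col("Sales") > 1000, col("Sales"))).alias("Total_Sales_Above_1000")
)
conditional_sum_when.show()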
Conclusion
The sum() function in PySpark is a fundamental tool for performing aggregations on large datasets. Whether you're calculating total values across a DataFrame or aggregating data based on groups, sum() provides a flexible and efficient way to handle numerical data.
In real-world applications, this function can be used extensively in data analysis tasks such as sales reporting, financial analysis, and performance tracking. With its ability to process massive amounts of data quickly, PySpark's sum() function plays a crucial role in the analytics landscape.