Calculate and Plot a Cumulative Distribution Function with Matplotlib in Python
Cumulative Distribution Functions (CDFs) show the probability that a variable is less than or equal to a given value, which helps us understand how data is distributed. For example, a CDF of test scores reveals the percentage of students scoring at or below a certain mark. Let’s explore simple and efficient ways to calculate and plot CDFs using Matplotlib in Python.
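As a quick illustration of the test-score example, the sketch below computes the share of students at or below a mark directly from the definition. The scores, the cutoff of 70, and the helper name fraction_at_or_below are all made up for this illustration.

```python
import numpy as np

# Hypothetical test scores, invented purely for illustration.
scores = np.array([48, 55, 62, 70, 70, 81, 90, 95])

def fraction_at_or_below(values, mark):
    """Empirical CDF at `mark`: share of values less than or equal to it."""
    values = np.asarray(values)
    return np.mean(values <= mark)

share = fraction_at_or_below(scores, 70)
print(f"{share:.1%} of students scored 70 or less")
```

Five of the eight scores are 70 or lower, so the empirical CDF at 70 is 5/8.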
Using np.arange
This is a simple way to compute the empirical CDF. The data is sorted, and np.arange then assigns each sorted value a cumulative probability in evenly spaced steps of 1/n. It is fast and needs nothing beyond NumPy and Matplotlib.
import numpy as np
import matplotlib.pyplot as plt
d = np.sort(np.random.randn(500))
c = np.arange(1, len(d)+1) / len(d)
plt.plot(d, c, '.', color='blue')
plt.xlabel('Data Values')
plt.ylabel('CDF')
plt.title('CDF via Sorting')
plt.grid()
plt.show()
Output

Explanation:
- np.random.randn(500) generates 500 random data points from a standard normal distribution (mean = 0, std = 1).
- np.sort(...) sorts the generated data in ascending order, so each value's rank gives its cumulative position.
- np.arange(1, len(d)+1) / len(d) creates evenly spaced cumulative probabilities ranging from 1/500 to 500/500.
- plt.plot(d, c, '.', color='blue') plots the sorted data values against their corresponding CDF values as blue dots on the graph.
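Once the data is sorted, the same arrays can answer "what fraction of the sample is at or below x?" for any x. The following is a minimal sketch of that lookup using np.searchsorted; the helper name cdf_at and the fixed seed are assumptions added for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded so the run is repeatable
d = np.sort(rng.standard_normal(500))   # sorted sample, as in the example above
c = np.arange(1, len(d) + 1) / len(d)   # cumulative probabilities 1/n ... 1

def cdf_at(sorted_data, x):
    """Fraction of the sample that is <= x (hypothetical helper)."""
    # side='right' counts every element less than or equal to x
    return np.searchsorted(sorted_data, x, side='right') / len(sorted_data)

print(cdf_at(d, 0.0))      # expected to be near 0.5 for standard normal data
print(cdf_at(d, d[-1]))    # the largest sample value always maps to 1.0
```

Evaluated at the sample points themselves, cdf_at reproduces the array c, provided the sample has no repeated values.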
Using the statsmodels ECDF class
The statsmodels library provides a built-in ECDF class for empirical CDFs. It sorts the data for you and returns a ready-to-use step function. If you're okay with an extra dependency, this is one of the cleanest methods.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.distributions.empirical_distribution import ECDF
d = np.random.randn(500)
e = ECDF(d)
plt.step(e.x, e.y, color='green')
plt.xlabel('Data Values')
plt.ylabel('CDF')
plt.title('CDF via ECDF')
plt.grid()
plt.show()
Output

Explanation:
- np.random.randn(500) generates 500 random data points from a standard normal distribution (mean = 0, std = 1).
- ECDF(d) computes the empirical cumulative distribution function of the data d.
- plt.step(e.x, e.y, color='green') plots the ECDF as a green step function; e.x holds the sorted data values and e.y the corresponding cumulative probabilities.
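Besides exposing x and y for plotting, the ECDF object can be called like a function to evaluate the CDF at new points. A small sketch, assuming statsmodels is installed and using a tiny made-up sample so the results are easy to verify by hand:

```python
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF

sample = np.array([3, 3, 1, 4])   # made-up sample, chosen for easy checking
e = ECDF(sample)

# Calling the object evaluates the step function at the given points.
probs = e([0.5, 1.5, 3, 55])
print(probs)   # fraction of the sample <= each query point
```

With this sample, no value is at or below 0.5, one of four is at or below 1.5, three of four are at or below 3, and all are at or below 55.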
Using scipy.stats.cumfreq
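scipy.stats.cumfreq computes a cumulative frequency histogram: the running count of values up to the end of each bin. This method is slightly more advanced, giving you control over the number of bins and their limits, but the result is a binned approximation of the CDF. It's a good option if you already use SciPy in your workflow.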
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import cumfreq
d = np.random.randn(500)
r = cumfreq(d, numbins=25)
# Upper edge of each bin: cumcount[i] counts the values up to this point
x = r.lowerlimit + r.binsize * np.arange(1, r.cumcount.size + 1)
c = r.cumcount / r.cumcount[-1]
plt.plot(x, c, color='purple')
plt.xlabel('Data Values')
plt.ylabel('CDF')
plt.title('CDF via cumfreq')
plt.grid()
plt.show()
Output

Explanation:
- np.random.randn(500) generates 500 random data points from a standard normal distribution (mean = 0, std = 1).
- cumfreq(d, numbins=25) computes the cumulative frequency distribution of the data d using 25 bins.
- r.lowerlimit + r.binsize * np.arange(1, r.cumcount.size + 1) computes the upper edge of each bin, which is the x position where each cumulative count applies.
- r.cumcount / r.cumcount[-1] normalizes the cumulative counts to get cumulative probabilities (CDF values).
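To make the binning control concrete, here is a minimal sketch that fixes the bin range with the defaultreallimits parameter of cumfreq. The five data values and the choice of four unit-width bins are made up so the counts can be checked by hand.

```python
import numpy as np
from scipy.stats import cumfreq

data = np.array([0.5, 1.5, 1.7, 2.2, 3.9])   # made-up values for easy checking

# Four bins of width 1 covering [0, 4]; defaultreallimits fixes the bin range.
r = cumfreq(data, numbins=4, defaultreallimits=(0, 4))

# Each cumulative count applies at the upper edge of its bin.
upper_edges = r.lowerlimit + r.binsize * np.arange(1, r.cumcount.size + 1)
cdf = r.cumcount / data.size

for edge, p in zip(upper_edges, cdf):
    print(f"P(X <= {edge:.0f}) ~ {p:.1f}")
```

Here the bins hold 1, 2, 1, and 1 values, so the cumulative counts are 1, 3, 4, and 5 at the edges 1, 2, 3, and 4.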
Using a histogram
This is the most visual and basic method. You create a histogram, normalize the bin counts, and take their running sum to estimate the CDF. The estimate is only as fine as the bin width, but plotting it alongside the histogram is helpful for understanding the concept of cumulative distribution.
import numpy as np
import matplotlib.pyplot as plt
d = np.random.randn(500)
cnt, b = np.histogram(d, bins=10)
p = cnt / cnt.sum()
c = np.cumsum(p)
fig, a1 = plt.subplots(figsize=(8, 6))
a1.bar(b[:-1], p, width=np.diff(b), align='edge', color='red', alpha=0.6)
a1.set_ylabel('PDF', color='red')
a2 = a1.twinx()
a2.plot(b[1:], c, color='blue')
a2.set_ylabel('CDF', color='blue')
plt.title("PDF & CDF via Histogram")
plt.show()
Output

Explanation:
- np.random.randn(500) generates 500 random data points from a standard normal distribution.
- np.histogram(d, bins=10) bins the data into 10 intervals and counts occurrences.
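- cnt / cnt.sum() normalizes the counts so they sum to 1, giving the probability of each bin (a discrete stand-in for the PDF).
- np.cumsum(p) calculates the running sum of those probabilities to obtain the CDF at each bin's right edge.
- a1.bar() plots the per-bin probabilities, and a1.twinx() with a2.plot() plots the CDF on a second y-axis.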
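The histogram-based CDF loses detail only between bin edges. The sketch below, which uses a fixed seed as an added assumption, compares the binned CDF with the exact empirical CDF evaluated at the same right edges.

```python
import numpy as np

rng = np.random.default_rng(1)      # seeded for repeatability
d = rng.standard_normal(500)

cnt, b = np.histogram(d, bins=10)
binned_cdf = np.cumsum(cnt) / cnt.sum()   # CDF at each bin's right edge

# Exact empirical CDF evaluated at the same right edges.
exact_cdf = np.searchsorted(np.sort(d), b[1:], side='right') / d.size

print(np.allclose(binned_cdf, exact_cdf))
```

The two should agree at the edges unless a data point happens to fall exactly on an interior edge. Using more bins narrows the gaps in which the binned curve can only interpolate.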