How to Calculate Correlation Between Two Columns in Pandas?
Correlation summarizes the strength and direction of the linear association between two quantitative variables. The Pearson correlation coefficient is denoted by r and takes values between -1 and +1. A positive r indicates a positive association, and a negative r indicates a negative association. Let's explore several methods to calculate the correlation between columns in a pandas DataFrame.
Using Series.corr()
Series.corr() calculates the correlation between two individual columns (Series) of a pandas DataFrame, using the Pearson coefficient by default. It’s simple and quick when you want to check the correlation between just two variables.
import pandas as pd
data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})
corr = data['math_score'].corr(data['science_score'])
print(corr)
Output
0.9931532689569343
Explanation: This code computes the Pearson correlation coefficient between math and science scores. The result, about 0.993, indicates a very strong positive linear relationship. A value of +1 indicates a perfect positive linear correlation, -1 a perfect negative one, and 0 no linear correlation.
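To see what Series.corr() computes, the coefficient can be reproduced from the definition of Pearson's r: the sum of the products of the deviations from each mean, divided by the square root of the product of the two sums of squared deviations. The following is a minimal sketch for illustration only; the names dx, dy and r_manual are introduced here and are not part of pandas.

```python
import math
import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
})

x = data['math_score']
y = data['science_score']

# Deviations of each value from its column mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r: co-variation divided by the product of the spreads
r_manual = (dx * dy).sum() / math.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(r_manual)   # manual calculation
print(x.corr(y))  # pandas result, expected to agree up to rounding
```

Both printed values should agree up to floating-point rounding, which confirms that the built-in method follows the standard formula.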
Using DataFrame.corr()
DataFrame.corr() computes the pairwise correlation matrix for the numeric columns of a DataFrame, making it easy to see relationships across multiple variables at once. In recent pandas versions, pass numeric_only=True if the DataFrame also contains non-numeric columns.
import pandas as pd
data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})
res = data.corr()
print(res)
Output
A 4×4 symmetric matrix of correlation coefficients, with rows and columns labeled by the four score columns and 1.0 on the diagonal.
Explanation: This code calculates the correlation matrix for all numeric columns in the DataFrame, showing the pairwise correlation coefficients between each pair of subjects. Each value ranges from -1 to +1, indicating the strength and direction of the linear relationship between the two columns.
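When only one pair of columns matters, the coefficient can be read out of the matrix by label, or the DataFrame can be narrowed to the two columns before calling corr(). The snippet below is a small sketch of both options; the variable names matrix, math_vs_science and pair_matrix are chosen here for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

matrix = data.corr()

# Option 1: look up a single pair by row and column label
math_vs_science = matrix.loc['math_score', 'science_score']
print(math_vs_science)

# Option 2: restrict the DataFrame to the two columns first,
# which yields a 2x2 matrix for just that pair
pair_matrix = data[['math_score', 'science_score']].corr()
print(pair_matrix)
```

The first option is convenient when the full matrix is needed anyway; the second avoids computing coefficients for columns that are not of interest.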
Using numpy.corrcoef()
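corrcoef() from the NumPy library returns the Pearson correlation coefficient matrix for its inputs. For two 1-D arrays, the result is a 2×2 matrix whose off-diagonal entries hold the correlation between them. It is useful when working directly with NumPy arrays or when pandas is not required.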
import numpy as np
import pandas as pd
data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})
corr = np.corrcoef(data['math_score'], data['english_score'])[0, 1]
print(corr)
Output
0.976632340152094
Explanation: This code calculates the Pearson correlation coefficient between math and English scores using NumPy’s corrcoef function. Indexing the returned 2×2 matrix with [0, 1] extracts the coefficient for the pair, about 0.977, which indicates a strong positive linear relationship between these two columns.
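The [0, 1] index is easier to follow once the full return value is visible. The sketch below prints the 2×2 matrix for the same pair and, as an additional illustration, passes the whole DataFrame as a 2-D array with rowvar=False so that each column is treated as a variable. The names matrix and full are chosen here for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

# Two 1-D inputs produce a 2x2 matrix with ones on the diagonal
matrix = np.corrcoef(data['math_score'], data['english_score'])
print(matrix)

# The two off-diagonal entries hold the same coefficient
print(matrix[0, 1], matrix[1, 0])

# For a 2-D array, rowvar=False treats each column as a variable
full = np.corrcoef(data.to_numpy(), rowvar=False)
print(full.shape)  # (4, 4)
```

By default corrcoef treats each row as a variable, so rowvar=False is needed when the variables are stored in columns, as they are in a DataFrame.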
Using scipy.stats.pearsonr()
pearsonr() from scipy.stats calculates the Pearson correlation coefficient along with a p-value for testing the null hypothesis of no correlation. It is helpful if you want to know both the strength of the correlation and its statistical significance.
from scipy.stats import pearsonr
import pandas as pd
data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})
corr, p_value = pearsonr(data['science_score'], data['history_score'])
print(corr)
print(p_value)
Output
0.9045939369328619
0.03486446724084317
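Explanation: This code calculates the Pearson correlation and p-value between science and history scores. The coefficient of about 0.905 indicates a strong positive linear relationship, and the p-value of about 0.035 is below the commonly used 0.05 threshold. With only five observations, the result should be interpreted with caution.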
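To make the p-value less of a black box, it can be reproduced from r and the sample size. Under the usual assumption of normally distributed data, the statistic t = r * sqrt((n - 2) / (1 - r^2)) follows a Student's t distribution with n - 2 degrees of freedom when there is no correlation. The sketch below applies that textbook relationship; it is an illustration rather than a description of SciPy's internal implementation, and the names t_stat and p_manual are introduced here.

```python
import math
import pandas as pd
from scipy import stats

data = pd.DataFrame({
    'science_score': [89, 81, 94, 90, 80],
    'history_score': [70, 68, 80, 72, 65]
})

x = data['science_score']
y = data['history_score']
r, p_value = stats.pearsonr(x, y)

n = len(x)

# Test statistic for the null hypothesis of no linear correlation
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

# Two-sided p-value from Student's t with n - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(p_value)   # p-value reported by pearsonr
print(p_manual)  # p-value from the t statistic, expected to agree up to rounding
```

Because the degrees of freedom are n - 2, a sample of five rows leaves only three, which is why even a coefficient above 0.9 produces a p-value that is not far below 0.05.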