Visualizing Multiple Datasets on the Same Scatter Plot
Seaborn is a powerful Python visualization library built on top of Matplotlib, designed for making statistical graphics easier and more attractive. One common requirement in data visualization is to compare two datasets on the same scatter plot to identify patterns, correlations, or differences. This article will guide you through the process of plotting two datasets on the same scatter plot using Seaborn.
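Concatenating Datasets for a Scatter Plot
The first step in plotting two datasets on the same scatter plot is to concatenate them into a single DataFrame with the pd.concat function from Pandas. To keep track of which row came from which dataset, pass the keys argument: each key becomes an outer index level that labels the rows of the corresponding dataset. Resetting that index level turns the labels into an ordinary column, which Seaborn can then use to distinguish the datasets. Example:
import pandas as pd
# set1 and set2 are two existing DataFrames with the same columns
# Concatenate them; keys labels each block of rows in an outer index level
combined_data = pd.concat([set1, set2], keys=['set1', 'set2'], names=['dataset', 'row'])
# Move the labels out of the index into a regular 'dataset' column
combined_data = combined_data.reset_index(level='dataset').reset_index(drop=True)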
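To see what keys does in practice, here is a minimal runnable sketch. The two small DataFrames are hypothetical stand-ins for set1 and set2, and the level names 'dataset' and 'row' are chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for set1 and set2 (not from the original example)
set1 = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 6]})
set2 = pd.DataFrame({"x": [1, 2, 3], "y": [1, 4, 9]})

# keys= adds an outer index level; names= gives both index levels a name
combined = pd.concat([set1, set2], keys=["set1", "set2"], names=["dataset", "row"])
print(combined.index.nlevels)  # 2

# Turn the outer level into a column and discard the old row numbers
tidy = combined.reset_index(level="dataset").reset_index(drop=True)
print(tidy.columns.tolist())  # ['dataset', 'x', 'y']
```

The resulting 'dataset' column is what a plotting call would typically reference through hue.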
Understanding the Importance of Overlaying Scatterplots
Overlaying scatterplots is valuable for several reasons:
- Comparison: Visualizing two datasets simultaneously allows for a direct comparison of their distributions, identifying potential clusters, outliers, or trends that might be unique to each dataset.
- Correlation Analysis: By examining how the points from each dataset are distributed relative to each other, you can gain insights into potential correlations or lack thereof.
- Interactive Exploration: Seaborn itself draws static Matplotlib figures, but when the figure is shown in an interactive Matplotlib backend you can zoom, pan, and read off the coordinates of individual points from either dataset, which helps when inspecting dense regions.
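The correlation point above can be backed by numbers as well as by the picture. The following is a rough sketch that computes the Pearson correlation between x and y separately for each dataset, using the same sample values as this article's setup step. The variable name per_dataset_corr is an illustrative choice.

```python
import pandas as pd

# Sample values taken from the article's setup step
data1 = pd.DataFrame({"x": range(10),
                      "y": [2, 3, 5, 7, 11, 13, 17, 19, 23, 29],
                      "label": "Dataset 1"})
data2 = pd.DataFrame({"x": range(10),
                      "y": [1, 4, 9, 16, 25, 36, 49, 64, 81, 100],
                      "label": "Dataset 2"})
combined = pd.concat([data1, data2], ignore_index=True)

# Pearson correlation of x and y, computed within each dataset
per_dataset_corr = combined.groupby("label")[["x", "y"]].apply(
    lambda g: g["x"].corr(g["y"])
)
print(per_dataset_corr.round(3))
```

Both values come out close to 1 here because each sample series increases steadily with x. A scatter plot remains useful alongside the number, since a high linear correlation does not reveal that the second series is actually curved.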
Step-by-Step Guide for Plotting Dual Datasets
Setting Up the Environment: Importing Libraries and Creating Sample Data
First, let's import the necessary libraries and create some sample data:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
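# Creating sample data: two DataFrames with the same columns.
# The scalar 'label' value is repeated for every row and names the source dataset.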
data1 = pd.DataFrame({
'x': range(10),
'y': [2, 3, 5, 7, 11, 13, 17, 19, 23, 29],
'label': 'Dataset 1'
})
data2 = pd.DataFrame({
'x': range(10),
'y': [1, 4, 9, 16, 25, 36, 49, 64, 81, 100],
'label': 'Dataset 2'
})
Step 1: Combining the Datasets Using pd.concat()
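To plot both datasets on the same scatter plot, combine them into a single DataFrame. Seaborn works best with long-form (tidy) data, where each row is one observation and a column identifies the dataset it came from. Passing ignore_index=True gives the combined rows a fresh index instead of repeating each dataset's original row numbers:
combined_data = pd.concat([data1, data2], ignore_index=True)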
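As a quick check on the combined frame, the sketch below compares the index produced by a plain pd.concat with the one produced when ignore_index=True is passed. The shortened three-row datasets are illustrative only.

```python
import pandas as pd

# Shortened, illustrative versions of the two sample datasets
data1 = pd.DataFrame({"x": range(3), "y": [2, 3, 5], "label": "Dataset 1"})
data2 = pd.DataFrame({"x": range(3), "y": [1, 4, 9], "label": "Dataset 2"})

default_concat = pd.concat([data1, data2])
fresh_index = pd.concat([data1, data2], ignore_index=True)

print(default_concat.index.tolist())  # [0, 1, 2, 0, 1, 2]
print(fresh_index.index.tolist())     # [0, 1, 2, 3, 4, 5]

# Each dataset should contribute the expected number of rows
print(fresh_index["label"].value_counts())
```

A unique index is not strictly required for plotting, but it avoids surprises in later operations that align rows by index label.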
Step 2: Plotting the Scatter Plot
Now that we have the combined data, we can plot it with Seaborn's scatterplot function. Passing the 'label' column as the hue parameter gives each dataset its own color and adds a matching legend entry:
# Creating the scatter plot
sns.scatterplot(data=combined_data, x='x', y='y', hue='label')
# Adding labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot of Two Datasets')
plt.legend(title='Dataset')
# Display the plot
plt.show()
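If concatenating is inconvenient, for example because the two datasets have different columns, a possible alternative is to call scatterplot once per dataset on the same Matplotlib Axes. This is a sketch of that variation rather than the approach used above. The explicit colors are an illustrative choice.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

data1 = pd.DataFrame({"x": range(10), "y": [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]})
data2 = pd.DataFrame({"x": range(10), "y": [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]})

# Create one Axes and draw both datasets onto it
fig, ax = plt.subplots()
sns.scatterplot(data=data1, x="x", y="y", color="tab:blue", label="Dataset 1", ax=ax)
sns.scatterplot(data=data2, x="x", y="y", color="tab:orange", label="Dataset 2", ax=ax)

# Label the axes and build the legend from the two labelled scatter layers
ax.set(xlabel="X-axis", ylabel="Y-axis", title="Scatter Plot of Two Datasets")
ax.legend(title="Dataset")
```

Call plt.show() afterwards to display the figure. The hue-based version is usually more compact, while this form gives per-dataset control over every keyword argument.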
Customizing the Plot For Two Data Sets
Seaborn offers several options for refining the visualization. You can choose the marker shape and color used for each dataset, and you can add elements such as regression lines.
1. Customizing Markers and Colors
# Customizing markers and colors
sns.scatterplot(data=combined_data, x='x', y='y', hue='label', style='label', palette=['blue', 'red'], markers=['o', 's'])
# Adding labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Customized Scatter Plot of Two Datasets')
plt.legend(title='Dataset')
# Display the plot
plt.show()
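When colors and markers are given as lists, they are matched to the datasets by order. A slightly more explicit variation, sketched below, passes dictionaries keyed by the label values so that each dataset is tied to its color and marker by name. The marker size s=80 is an arbitrary illustrative value.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

data1 = pd.DataFrame({"x": range(10),
                      "y": [2, 3, 5, 7, 11, 13, 17, 19, 23, 29],
                      "label": "Dataset 1"})
data2 = pd.DataFrame({"x": range(10),
                      "y": [1, 4, 9, 16, 25, 36, 49, 64, 81, 100],
                      "label": "Dataset 2"})
combined = pd.concat([data1, data2], ignore_index=True)

fig, ax = plt.subplots()
sns.scatterplot(
    data=combined, x="x", y="y",
    hue="label", style="label",
    # Dictionaries map each label value to its color and marker by name
    palette={"Dataset 1": "blue", "Dataset 2": "red"},
    markers={"Dataset 1": "o", "Dataset 2": "s"},
    s=80,  # marker size, forwarded to Matplotlib's scatter
    ax=ax,
)
ax.legend(title="Dataset")
```

With dictionaries, the mapping stays the same even if the row order of the combined DataFrame changes.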
2. Adding a Regression Line
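You can use sns.lmplot to draw a scatter plot together with a fitted regression line. Note that lmplot is a figure-level function: it creates its own figure rather than drawing onto an existing plot. When hue is set, it fits and draws a separate linear model for each dataset: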
# Adding regression lines
sns.lmplot(data=combined_data, x='x', y='y', hue='label', markers=['o', 's'], palette=['blue', 'red'], height=6, aspect=1.5)
# Adding labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot with Regression Lines')
# Display the plot
plt.show()
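Because lmplot returns a FacetGrid, the labels and title can also be set through that object instead of through plt. The sketch below shows this variation. Passing ci=None to skip the confidence bands is an illustrative choice, not part of the example above.

```python
import pandas as pd
import seaborn as sns

data1 = pd.DataFrame({"x": range(10),
                      "y": [2, 3, 5, 7, 11, 13, 17, 19, 23, 29],
                      "label": "Dataset 1"})
data2 = pd.DataFrame({"x": range(10),
                      "y": [1, 4, 9, 16, 25, 36, 49, 64, 81, 100],
                      "label": "Dataset 2"})
combined = pd.concat([data1, data2], ignore_index=True)

# lmplot returns a FacetGrid that owns the figure and its single Axes
g = sns.lmplot(
    data=combined, x="x", y="y", hue="label",
    markers=["o", "s"], palette=["blue", "red"],
    ci=None,  # omit the bootstrapped confidence bands
    height=6, aspect=1.5,
)
g.set_axis_labels("X-axis", "Y-axis")
g.ax.set_title("Scatter Plot with Regression Lines")
```

Working through the returned grid avoids relying on whichever figure happens to be current, which matters once a script creates more than one plot.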
Handling Larger Multiple Datasets
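When dealing with larger datasets, scatter plots can become cluttered and difficult to interpret. Two common remedies are to make the markers semi-transparent with the alpha parameter, so that overlapping points show up as darker regions, and to aggregate the data points before plotting. The example below uses transparency: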
# Sample larger dataset
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Generate random data
np.random.seed(42)
large_data1 = pd.DataFrame({
'x': np.random.rand(100),
'y': np.random.rand(100),
'dataset': ['set1'] * 100
})
large_data2 = pd.DataFrame({
'x': np.random.rand(100),
'y': np.random.rand(100),
'dataset': ['set2'] * 100
})
# Combine the larger datasets, renumbering the rows from 0
large_combined_data = pd.concat([large_data1, large_data2], ignore_index=True)
# Create the scatter plot with transparency
sns.scatterplot(data=large_combined_data, x='x', y='y', hue='dataset', alpha=0.5)
plt.show()
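Aggregation can be sketched with Pandas alone. The example below is one possible approach, under the assumption that averaging y within equal-width x bins is an acceptable summary for your data. The bin count of 10, the sample size of 1000, and the column names x_bin and n are illustrative choices.

```python
import numpy as np
import pandas as pd

# Generate two larger random datasets (illustrative sizes)
rng = np.random.default_rng(42)
frames = [
    pd.DataFrame({"x": rng.random(1000), "y": rng.random(1000), "dataset": name})
    for name in ("set1", "set2")
]
large = pd.concat(frames, ignore_index=True)

# Assign each point to one of 10 equal-width bins along x
large["x_bin"] = pd.cut(large["x"], bins=np.linspace(0, 1, 11), include_lowest=True)

# One summary row per dataset and bin: mean position plus point count
binned = (
    large.groupby(["dataset", "x_bin"], observed=True)
    .agg(x=("x", "mean"), y=("y", "mean"), n=("y", "size"))
    .reset_index()
)
print(binned.head())
```

The summary frame can then be plotted in the same way as the raw data, for example with sns.scatterplot(data=binned, x='x', y='y', hue='dataset', size='n'), which replaces 2000 markers with 20.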
Conclusion
Plotting two datasets on the same scatter plot using Seaborn is a straightforward process that involves combining the datasets into a single DataFrame and leveraging Seaborn's powerful plotting functions. By customizing markers, colors, and adding elements like regression lines, you can create informative and attractive visualizations that clearly convey the relationships between your datasets.