Exploratory Data Analysis (EDA) with NumPy, Pandas, Matplotlib and Seaborn
Exploratory Data Analysis (EDA) serves as the foundation of any data science project. It is an essential step in which data scientists investigate datasets to understand their structure, identify patterns, and uncover insights. It goes hand in hand with data preparation, which involves cleaning, transforming, and exploring data to make it suitable for analysis.
Why is EDA Important in Data Science?
To effectively work with data, it’s essential to first understand the nature and structure of data. EDA helps answer critical questions about the dataset and guides the necessary preprocessing steps before applying any algorithms. For instance:
- What type of data do we have? Are we working with numbers, text, or dates?
- Are there outliers? These are unusual values that are very different from the rest.
- Is anything missing? Are some parts of the dataset empty or incomplete?
Imagine you’re working with a student performance dataset. If some rows are missing test scores, or the names of subjects are inconsistently spelled (e.g., "Math" and "Mathematics"), you’ll need to address these issues before proceeding. EDA helps to identify such problems and clean the data to ensure reliable analysis.
Now, let's look at the core packages for exploratory data analysis (EDA): NumPy, Pandas, Matplotlib, and Seaborn.
1. NumPy for Numerical Operations
NumPy is used for working with numerical data in Python.
- Handles Large Datasets Efficiently: NumPy lets you work with large, multi-dimensional arrays and matrices of numerical data, and provides functions for mathematical operations such as linear algebra and statistical analysis.
- Facilitates Data Transformation: Helps in sorting, reshaping, and aggregating data.
Example: Let's consider a simple example where we analyze the distribution of a dataset containing student exam scores using NumPy:
import numpy as np
# Dataset: Exam scores
scores = np.array([45, 50, 55, 60, 65, 70, 75, 80, 200]) # Note: One extreme value (200)
# Calculate basic statistics
mean_score = np.mean(scores)
median_score = np.median(scores)
std_dev_score = np.std(scores)
print(f"Mean: {mean_score}, Median: {median_score}, Standard Deviation: {std_dev_score}")
Output
Mean: 77.77777777777777, Median: 65.0, Standard Deviation: 44.541560561838764
This example demonstrates how NumPy can quickly compute summary statistics. Notice how the single extreme value (200) pulls the mean well above the median; we can flag such anomalies using z-scores, as sketched after the resource list below. For an in-depth understanding, follow these resources:
- Introduction to NumPy
- Basics of NumPy Arrays
- Broadcasting - Perform operations on arrays with different shapes
- Linear algebra operations: Solving Mathematical Problems
- Saving and loading NumPy arrays
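As mentioned above, z-scores offer a quick way to flag unusual values. Here is a minimal sketch that reuses the scores array from the earlier example and flags anything more than two standard deviations from the mean (the threshold of 2 is a common rule of thumb, not a fixed rule):
import numpy as np
# Same exam scores as before, including the extreme value (200)
scores = np.array([45, 50, 55, 60, 65, 70, 75, 80, 200])
# z-score: how many standard deviations each value lies from the mean
z_scores = (scores - np.mean(scores)) / np.std(scores)
# Flag values more than 2 standard deviations away as potential outliers
outliers = scores[np.abs(z_scores) > 2]
print("Z-scores:", np.round(z_scores, 2))
print("Potential outliers:", outliers)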
2. Pandas for Data Manipulation
Built on top of NumPy, Pandas excels at handling tabular data (data organized in rows and columns) through its core data structures: Series (1D) and DataFrame (2D). Pandas simplifies the process of working with structured data by:
- Easy loading and saving of datasets in formats like CSV, Excel, SQL, or JSON
- Data Processing with Pandas
- Slicing rows with pandas Indexing
- Data Aggregation and Grouping
- Working with Date and Time
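Putting a few of these operations together, here is a minimal sketch on a small, hand-made student dataset (the column names and values are made up for illustration):
import pandas as pd
# Hypothetical student dataset (values are illustrative only)
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dia"],
    "subject": ["Math", "Mathematics", "Science", "Math"],
    "score": [88, 92, None, 75]
})
# Inspect the first rows and summary statistics
print(df.head())
print(df.describe())
# Count missing values per column
print(df.isnull().sum())
# Standardize inconsistent labels, then aggregate the average score per subject
df["subject"] = df["subject"].replace({"Mathematics": "Math"})
print(df.groupby("subject")["score"].mean())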
3. Matplotlib for Data Visualization
Matplotlib is a powerful and versatile open-source plotting library for Python, designed to help users visualize data in a wide variety of formats.
- Introduction to Matplotlib
- Pyplot in Matplotlib
- Matplotlib – Axes Class
- Matplotlib for 3D Plotting
- Exploratory Data Analysis with matplotlib
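As a minimal sketch, the histogram below reuses the exam scores from the NumPy example; a single plot makes the outlier at 200 easy to spot:
import numpy as np
import matplotlib.pyplot as plt
# Same exam scores as in the NumPy example
scores = np.array([45, 50, 55, 60, 65, 70, 75, 80, 200])
# Histogram of exam scores; the lone bar near 200 exposes the outlier
plt.hist(scores, bins=10, edgecolor="black")
plt.xlabel("Exam score")
plt.ylabel("Number of students")
plt.title("Distribution of Exam Scores")
plt.show()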
4. Seaborn for Statistical Data Visualization
Seaborn is built on top of Matplotlib and is specifically designed for statistical data visualization. It provides a high-level interface for drawing attractive and informative statistical graphics.
- Introduction to Seaborn
- Types Of Seaborn Plots
- Pairplot function in seaborn
- FacetGrid in Seaborn
- Time Series Visualization with Seaborn: Line Plot
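Here is a minimal sketch using a small hand-made DataFrame (values are made up for illustration); a single Seaborn box plot summarizes the spread of scores per subject and marks outliers automatically:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Small illustrative dataset
df = pd.DataFrame({
    "subject": ["Math", "Math", "Math", "Science", "Science", "Science"],
    "score": [45, 60, 200, 55, 70, 80]
})
# Box plot: median, quartiles and outliers for each subject in one figure
sns.boxplot(x="subject", y="score", data=df)
plt.title("Score Distribution by Subject")
plt.show()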
Complete EDA Workflow Using NumPy, Pandas, and Seaborn
Let's implement a complete EDA workflow: numerical analysis with NumPy and Pandas first, followed by insightful visualizations with Seaborn to support data-driven decisions.
- Performing EDA with Numpy and Pandas - Set 1
- After analysis: Visualizing with Seaborn - Set 2
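The two articles above walk through the full process; the sketch below condenses the main steps into one short script, using a small made-up dataset so it runs on its own (column names and values are illustrative):
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Step 1: Build a small illustrative dataset with Pandas (values are made up)
df = pd.DataFrame({
    "subject": ["Math", "Math", "Science", "Science", "English", "English"],
    "score": [45, 200, 55, 70, 65, np.nan]
})
# Step 2: Numerical EDA - structure, summary statistics and missing values
df.info()
print(df.describe())
print(df.isnull().sum())
# Step 3: Simple cleaning - fill the missing score with the median
df["score"] = df["score"].fillna(df["score"].median())
# Step 4: Visual EDA with Seaborn - the histogram makes the outlier (200) obvious
sns.histplot(data=df, x="score", bins=10)
plt.title("Score Distribution After Cleaning")
plt.show()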
For more hands-on practice, explore the projects below:
- Titanic Data EDA using Seaborn
- Uber Rides Data Analysis
- Zomato Data Analysis Using Python
- Global Covid-19 Data Analysis and Visualizations
- iPhone Sales Analysis
- Google Search Analysis
Web Scraping For EDA
What is web scraping? It is the automated process of extracting data from websites for later analysis. A minimal sketch follows the resource list below.
- How to Extract Weather Data from Google in Python?
- Movies Review Scraping And Analysis
- Product Price Scraping and Analysis
- News Scraping and Analysis
- Real-time Share Price Scraping and Analysis
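As a minimal sketch of the idea (the URL and CSS class below are placeholders, not a real site), a typical scraper fetches a page with requests, parses it with BeautifulSoup, and loads the extracted values into a Pandas DataFrame for EDA:
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Placeholder URL - replace with the page you actually want to scrape
url = "https://example.com/products"
response = requests.get(url, timeout=10)
# Parse the HTML and pull out elements by a (hypothetical) CSS class
soup = BeautifulSoup(response.text, "html.parser")
prices = [tag.get_text(strip=True) for tag in soup.find_all("span", class_="price")]
# Load the scraped values into a DataFrame for further EDA
df = pd.DataFrame({"price": prices})
print(df.head())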