Pandas - get_dummies() method
In Pandas, the get_dummies() function converts categorical variables into dummy/indicator variables (known as one-hot encoding). This method is especially useful when preparing data for machine learning algorithms that require numeric input.
Syntax: pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
The function returns a DataFrame in which each unique category in the original data becomes a separate column. In pandas 2.0 and later, the values default to True (category present) or False (category absent); earlier versions returned 1 and 0 (dtype uint8) by default.
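The syntax above lists several parameters that are easier to understand in action. The following is a minimal sketch of columns, prefix, prefix_sep and drop_first; the DataFrame and its column names (City, Plan, Age) are invented purely for illustration.

```python
import pandas as pd

# Hypothetical data, made up for this illustration
df = pd.DataFrame({
    'City': ['Paris', 'Rome', 'Paris'],
    'Plan': ['Free', 'Pro', 'Pro'],
    'Age': [31, 45, 27],
})

# columns: encode only 'City'; 'Plan' and 'Age' are left untouched
only_city = pd.get_dummies(df, columns=['City'], dtype=int)
print(only_city.columns.tolist())
# ['Plan', 'Age', 'City_Paris', 'City_Rome']

# prefix / prefix_sep: control the names of the generated columns
renamed = pd.get_dummies(df, columns=['City'], prefix='c', prefix_sep='-', dtype=int)
print(renamed.columns.tolist())
# ['Plan', 'Age', 'c-Paris', 'c-Rome']

# drop_first: drop the first category of every encoded column
reduced = pd.get_dummies(df, columns=['City', 'Plan'], drop_first=True, dtype=int)
print(reduced.columns.tolist())
# ['Age', 'City_Rome', 'Plan_Pro']
```

Dropping the first category keeps k-1 columns for k categories, which avoids perfectly redundant columns when the encoded data feeds a linear model.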
Encoding a Pandas DataFrame
Let's look at an example of how to use the get_dummies() method to perform one-hot encoding.
import pandas as pd
data = {
'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
'Size': ['Small', 'Large', 'Medium', 'Small', 'Large']
}
df = pd.DataFrame(data)
print('Original DataFrame')
print(df)  # display(df) also works, but only inside a Jupyter/IPython session
# Perform one-hot encoding
df_encoded = pd.get_dummies(df)
print('\nDataFrame after performing One-hot Encoding')
print(df_encoded)
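Output (encoded DataFrame, pandas 2.0 or later):
   Color_Blue  Color_Green  Color_Red  Size_Large  Size_Medium  Size_Small
0       False        False       True       False        False        True
1        True        False      False        True        False       False
2       False         True      False       False         True       False
3        True        False      False       False        False        True
4       False        False       True        True        False       False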
In the output, each unique category in the Color and Size columns has been transformed into a separate binary (True or False) column. The new columns indicate whether the respective category is present in each row.
To get the output as 0 and 1 instead of True and False, set the dtype parameter to int. Setting it to float gives 0.0 and 1.0.
# Perform one-hot encoding with integer (0/1) indicators
df_encoded = pd.get_dummies(df, dtype=int)
print('\nDataFrame after performing One-hot Encoding')
print(df_encoded)
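Output (encoded DataFrame):
   Color_Blue  Color_Green  Color_Red  Size_Large  Size_Medium  Size_Small
0           0            0          1           0            0           1
1           1            0          0           1            0           0
2           0            1          0           0            1           0
3           1            0          0           0            0           1
4           0            0          1           1            0           0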
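To make the effect of dtype concrete, here is a small sketch that compares integer and float indicators and also shows converting an already encoded frame with astype. The one-column DataFrame is a cut-down stand-in for the example above, not part of the original data.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Cut-down stand-in for the Color column used above
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})

as_int = pd.get_dummies(df, dtype=int)
as_float = pd.get_dummies(df, dtype=float)

print(as_int['Color_Red'].tolist())    # [1, 0, 0]
print(as_float['Color_Red'].tolist())  # [1.0, 0.0, 0.0]

# Every generated column carries the requested kind of dtype
print(all(is_integer_dtype(t) for t in as_int.dtypes))
print(all(is_float_dtype(t) for t in as_float.dtypes))

# An existing encoded frame can also be converted after the fact
converted = pd.get_dummies(df).astype(int)
print(converted.equals(as_int))
```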
Encoding a Pandas Series
import pandas as pd
# Series with days of the week
days = pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Monday'])
print(pd.get_dummies(days, dtype='int'))
Output:
   Friday  Monday  Thursday  Tuesday  Wednesday
0       0       1         0        0          0
1       0       0         0        1          0
2       0       0         0        0          1
3  ...
In this example, each unique day of the week is transformed into a dummy variable, where a 1 indicates the presence of that day.
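As a quick sanity check of this encoding, the sketch below verifies that every row contains exactly one 1 and that the position of that 1 identifies the original label. It reuses the days Series from the example; the variable names dummies and recovered are only illustrative.

```python
import pandas as pd

days = pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Monday'])
dummies = pd.get_dummies(days, dtype=int)

# Exactly one indicator is set in every row
print((dummies.sum(axis=1) == 1).all())

# The column that holds the 1 is the original label of that row
recovered = dummies.idxmax(axis=1)
print(recovered.tolist() == days.tolist())
```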
Converting NaN Values into a Dummy Variable
The dummy_na=True option can be used when dealing with missing values. It creates a separate column indicating whether the value is missing or not.
import pandas as pd
import numpy as np
# List with color categories and NaN
colors = ['Red', 'Blue', 'Green', np.nan, 'Red', 'Blue']
print(pd.get_dummies(colors, dummy_na=True, dtype='int'))
Output:
   Blue  Green  Red  NaN
0     0      0    1    0
1     1      0    0    0
2     0      1    0    0
3     0      0    0    1
4     0      0    1    0
5     1      0    0    0
With dummy_na=True, the result gains a NaN column that is 1 only in row 3, where the value was missing. With the default dummy_na=False, no such column is created and that row contains 0 in every column.
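The difference between the two settings is easiest to see side by side. The following minimal sketch uses a shortened, illustrative list of colors rather than the one above.

```python
import numpy as np
import pandas as pd

# Shortened list, for illustration only
colors = ['Red', 'Blue', np.nan]

without_na = pd.get_dummies(colors, dtype=int)               # default: dummy_na=False
with_na = pd.get_dummies(colors, dummy_na=True, dtype=int)

print(without_na.columns.tolist())   # ['Blue', 'Red']
print(without_na.iloc[2].tolist())   # [0, 0]  (missing row is all zeros)
print(with_na.shape[1])              # 3  (extra NaN column)
print(with_na.iloc[2].tolist())      # [0, 0, 1]
```

Without the extra column, a missing value is indistinguishable from "none of the listed categories", so dummy_na=True is worth considering whenever missingness itself may carry information.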