How to select a subset of a DataFrame?
We often work with subsets of a dataset, whether extracting specific columns, filtering rows based on conditions, or both. In this guide, we’ll explore various ways to select subsets of data using the pandas library in Python. All examples use the nba.csv dataset.
import pandas as pd
df = pd.read_csv("nba.csv")
df.head()
Output
Selecting columns
One common way to select a subset of a DataFrame is to isolate specific columns, which keeps only the data relevant to your analysis or visualization.
Example 1: In this example, we are extracting just one column, "Age", from the dataset using square bracket notation.
import pandas as pd
df = pd.read_csv("nba.csv")
a = df["Age"]
print(a.head())
Output

Explanation: df["Age"] returns the column named "Age" as a pandas.Series and head() shows the first 5 rows by default.
Example 2: In this example, we are selecting multiple columns, "Name" and "Age", by passing a list of column names.
import pandas as pd
df = pd.read_csv("nba.csv")
res = df[["Name", "Age"]]
print(res.head())
Output

Explanation: df[["Name", "Age"]] uses a list to select multiple columns and the result is still a DataFrame containing just the specified columns.
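The difference between single and double brackets can be checked directly. The sketch below uses a small made-up DataFrame in place of nba.csv (the player names and ages are hypothetical) and compares the type returned by each form.

```python
import pandas as pd

# Hypothetical stand-in for nba.csv; the values are made up for illustration
df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Age": [24, 31, 28],
})

as_series = df["Age"]    # a single column name returns a pandas.Series
as_frame = df[["Age"]]   # a list with one name returns a one-column DataFrame

print(type(as_series).__name__)          # Series
print(type(as_frame).__name__)           # DataFrame
print(as_series.shape, as_frame.shape)   # (3,) (3, 1)
```

Passing a list is useful when later code expects a DataFrame even though only one column is needed.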
Selecting rows
Another key way to select a subset of a DataFrame is to filter rows based on conditions. This lets us focus only on records that meet specific criteria, such as age or salary thresholds.
Example 1: In this example, we filter rows where the value in the "Age" column is greater than 30.
import pandas as pd
df = pd.read_csv("nba.csv")
res = df[df["Age"] > 30]
print(res.head())
Output

Explanation: df["Age"] > 30 returns a Boolean Series, and passing it to df[condition] retrieves only the rows where the condition is True.
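To make the intermediate Boolean Series visible, the following sketch stores it in a variable before filtering. The DataFrame is a made-up stand-in for nba.csv, and one age is deliberately missing to show that a comparison against a missing value evaluates to False.

```python
import pandas as pd

# Hypothetical data; None becomes NaN in the float column
df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Age": [24.0, 31.0, None, 35.0],
})

mask = df["Age"] > 30    # Boolean Series aligned with df's index
print(mask.tolist())     # [False, True, False, True]

older = df[mask]         # keeps only the rows where mask is True
print(older["Name"].tolist())   # ['Player B', 'Player D']
```

Rows with a missing "Age" are therefore dropped by this kind of filter without any error.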
Example 2: In this example, we select players who are older than 30 and have a salary greater than 10 million.
import pandas as pd
df = pd.read_csv("nba.csv")
res = df[(df["Age"] > 30) & (df["Salary"] > 10000000)]
print(res.head())
Output

Explanation: We use & to combine conditions (logical AND). Each condition must be enclosed in parentheses because & binds more tightly than comparison operators such as >. The result is a filtered DataFrame containing only the rows that meet both criteria.
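The same pattern extends to other logical operators: | for OR and ~ for NOT. The sketch below is an illustration on made-up data (the names, ages and salaries are hypothetical), not output from nba.csv.

```python
import pandas as pd

# Hypothetical stand-in for nba.csv
df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Age": [24, 31, 28, 35],
    "Salary": [2_000_000, 12_000_000, 15_000_000, 5_000_000],
})

# OR: older than 30, or paid more than 10 million
either = df[(df["Age"] > 30) | (df["Salary"] > 10_000_000)]
print(either["Name"].tolist())      # ['Player B', 'Player C', 'Player D']

# NOT: invert a condition with ~
not_older = df[~(df["Age"] > 30)]
print(not_older["Name"].tolist())   # ['Player A', 'Player C']
```

Python's own and, or and not keywords do not work element-wise on a Series, which is why the symbolic operators are used here.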
Selecting rows and columns together
To select a specific subset of a DataFrame based on both rows and columns, we use .loc[] (selection by label or Boolean condition) or .iloc[] (selection by integer position). This gives us precise control by allowing us to filter rows and extract only the relevant columns in a single step.
Example 1: In this example, we first select names of players older than 25, then select both name and age of players from the "Lakers" team.
import pandas as pd
df = pd.read_csv("nba.csv")
a = df.loc[df["Age"] > 25, "Name"]
print(a.head())
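# Substring match, so full team names such as "Los Angeles Lakers" are matched too
b = df.loc[df["Team"].str.contains("Lakers", na=False), ["Name", "Age"]]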
print(b.head())
Output

Explanation:
- df.loc[condition, column] selects the rows that match a condition and, from those rows, only the specified columns.
- First query returns a Series of names for players older than 25.
- Second query returns a DataFrame with name and age for players on the Lakers team.
Example 2: In this example, we select the first 5 rows and first 3 columns using row and column index positions.
import pandas as pd
df = pd.read_csv("nba.csv")
res = df.iloc[:5, :3]
print(res)
Output

Explanation:
- df.iloc[row_index, column_index] selects by integer position, regardless of the index labels.
- :5 selects the rows at positions 0 to 4 (the end of the slice is excluded).
- :3 selects the first three columns (index 0 to 2).
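One detail worth checking when switching between the two accessors is how slices are bounded: .iloc[] excludes the end position, while .loc[] includes the end label. The sketch below shows this on a small made-up DataFrame with the default integer index (the names, teams and ages are hypothetical).

```python
import pandas as pd

# Hypothetical stand-in for nba.csv with the default 0..3 index
df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Team": ["Team X", "Team Y", "Team X", "Team Z"],
    "Age": [24, 31, 28, 35],
})

# Position-based: rows at positions 0-1, columns at positions 0-1
by_position = df.iloc[:2, :2]

# Label-based: row labels 0 through 2 inclusive, columns "Name" through "Team"
by_label = df.loc[:2, "Name":"Team"]

print(by_position.shape)   # (2, 2)
print(by_label.shape)      # (3, 2)
```

With the default index the labels happen to equal the positions, so the extra row returned by .loc[:2] is easy to overlook.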
Selecting with .head(), .tail() and .sample()
To quickly view a subset of rows from the beginning, end, or randomly from a DataFrame, pandas provides convenient methods like .head(), .tail() and .sample().
Example 1: In this example, we fetch the first 10 rows with head(10) and the last 5 rows with tail(), which returns 5 rows by default.
import pandas as pd
df = pd.read_csv("nba.csv")
print(df.head(10))
print(df.tail())
Output


Explanation:
- df.head(10) returns the first 10 rows from the top of the DataFrame.
- df.tail() without arguments returns the last 5 rows by default.
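Both methods also accept a negative n, which returns every row except the last (for head) or first (for tail) |n| rows. A minimal sketch on made-up data, where the names and ages are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for nba.csv
df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D", "Player E"],
    "Age": [24, 31, 28, 35, 22],
})

top = df.head(2)             # first 2 rows
bottom = df.tail(2)          # last 2 rows
all_but_last = df.head(-2)   # negative n: every row except the last 2

print(top["Name"].tolist())            # ['Player A', 'Player B']
print(bottom["Name"].tolist())         # ['Player D', 'Player E']
print(all_but_last["Name"].tolist())   # ['Player A', 'Player B', 'Player C']
```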
Example 2: In this example, we select 5 random rows.
import pandas as pd
df = pd.read_csv("nba.csv")
res = df.sample(5)
print(res)
Output

Explanation: df.sample(5) randomly selects 5 rows from the DataFrame. The argument sets how many rows are drawn, and the selection changes on each run unless random_state is set.
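When a random subset needs to be repeatable, sample() accepts a random_state seed, and its frac parameter draws a fraction of the rows instead of a fixed count. The sketch below uses made-up data (the names and ages are hypothetical) and arbitrary seed values.

```python
import pandas as pd

# Hypothetical stand-in for nba.csv
df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D", "Player E", "Player F"],
    "Age": [24, 31, 28, 35, 22, 29],
})

# The same seed gives the same rows on every call
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)
print(first.equals(second))   # True

# frac draws a fraction of the rows: half of 6 rows is 3 rows
half = df.sample(frac=0.5, random_state=0)
print(len(half))              # 3
```

Sampling is done without replacement by default, so no row appears twice in the result.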