Build a spam text classifier by using PyTorch

Author

Anna Gutowska

AI Engineer, Developer Advocate

IBM

Introduction

In this tutorial, you will learn how to take control of your inbox to keep your personal information safe. The objective is to build a text classifier with PyTorch that identifies and filters out unwanted or unsolicited communications, also known as spam. With high accuracy, your machine learning (ML) classifier sifts through your messages and separates the important ones from the junk.

“CONGRATULATIONS! You’ve won a free prize! Click here to claim it.”

“Your account has been compromised. Click here to reset your password.”

How often are you flooded with emails and text messages like these? Did you win a free iPhone or is the suspicious link a way to steal your information? Modern-day email providers do a good job at filtering these emails as spam. However, these scams can reach us in many forms like text messages, phone calls or even paper mail. Let’s take spam detection into our own hands with PyTorch.

What is text classification?

Text classification is a machine learning task that involves assigning predefined labels to text data in order to automatically categorize it into groups. This technique is fundamental to natural language processing (NLP) applications such as:

  • Spam detection: Identifying unwanted or unsolicited emails or messages. This application is the one we explore in this tutorial.
  • Sentiment analysis: Determining the emotional tone of a piece of text (for example, positive, negative or neutral).
  • Topic modeling: Categorizing text into topics or themes (for example, sports, politics or entertainment).
  • Intent detection: Identifying the purpose or intention behind a piece of text (for example, booking a flight or making a complaint).

How it works

The text classification process typically involves the following steps:
 
  1. Text preprocessing: Cleaning and normalizing the text data, which might include
    1. Tokenization: Splitting text into individual words or tokens.
    2. Stopword removal: Removing common words like “the” or “and” with low semantic significance.
    3. Stemming or lemmatization: Reducing words to their base form.
  2. Feature extraction: Converting the preprocessed text into a numerical representation that can be used by a machine learning algorithm, such as:
    1. Bag of words: Representing text as a vector of word frequencies.
    2. Term frequency-inverse document frequency (TF-IDF): Multiplies term frequency (TF), which measures how often a word appears in a document, by inverse document frequency (IDF), which measures how rare a word is across all documents in a collection (see the short scikit-learn sketch after this list).
    3. Word embeddings: Representing words as vectors in a multidimensional space, where the distance and direction between vectors reflect the similarity and relationships among the corresponding words.
  3. Model training: Training a machine learning model on the labeled text data by using a classification algorithm such as:
    1. Naive Bayes
    2. Support vector machines (SVM)
    3. Logistic regression
    4. Random forest
    5. Deep neural networks (for example, convolutional neural networks (CNN) or recurrent neural networks (RNN))
    6. Transformers
  4. Model evaluation: Evaluating the performance of the trained model on a test dataset by using metrics such as:
    1. Accuracy
    2. Precision
    3. Recall
    4. F1-score
  5. Model deployment: Deploying the trained model in a real-world application, such as an API or a spam filter. Deployment isn’t covered in this tutorial as it requires various considerations, such as model serving, application programming interface (API) design, scalability and security.
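To make the feature extraction step concrete, the following is a minimal sketch (not part of the main tutorial) that applies bag of words and TF-IDF to a few toy messages by using scikit-learn, which we install later in this tutorial. The example messages are made up for illustration.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy messages for illustration only; they are not drawn from the tutorial's dataset.
messages = [
    "Congratulations you won a free prize",
    "Meeting moved to tomorrow morning",
    "Claim your free prize now",
]

# Bag of words: each message becomes a vector of raw word counts.
bow = CountVectorizer()
print(bow.fit_transform(messages).toarray())

# TF-IDF: counts are reweighted so that words shared by many messages (low IDF)
# contribute less than rare, distinctive words.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(messages).toarray().round(2))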

Why PyTorch?

PyTorch is an open source deep learning library used to build neural networks, combining the machine learning (ML) library of Torch with a Python-based high-level API. Its core data type, the tensor, serves as the fundamental unit of data used for computation, and PyTorch provides functions to operate on tensors incrementally and interactively.1 The primary benefits that have established PyTorch as the leading ML framework among academic and research communities are its ability to leverage GPUs for accelerated computation, its ease of debugging and its ability to handle large datasets.1,2 Beyond these reasons, PyTorch is particularly well suited for text classification because of its many prebuilt functions and modules for common text classification tasks, such as embedding layers, recurrent neural networks (RNNs) and transformers. Paired with PyTorch’s dynamic computation graph, it stands out as the clear choice for our use case, providing a blend of flexibility and speed.3
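To make the tensor idea concrete, here is a minimal illustrative sketch of creating a tensor, operating on it and moving it to a GPU if one is available; it is separate from the classifier we build below.

import torch

# Tensors are the fundamental unit of data in PyTorch.
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
y = x * 2 + 1  # elementwise operations run eagerly
print(y)

# The same tensor code runs unchanged on a CUDA-capable GPU when one is present.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(y.to(device).device)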

Prerequisites

To run this tutorial effectively, you need Python installed. This tutorial was tested with Python 3.13.

Steps

Step 1. Set up your environment

There are several ways in which you can run the code provided in this tutorial. Either use IBM® watsonx.ai® to follow along step-by-step or clone our GitHub repository to run the full Jupyter Notebook.

Option 1: Use watsonx.ai

Follow these steps to set up an IBM account and use a Jupyter Notebook.

  1. You need an IBM Cloud® account to create a watsonx.ai project.
  2. Create a watsonx.ai project by using your IBM Cloud account.
  3. Create a Jupyter Notebook.
    1. This step opens a Jupyter Notebook environment where you can copy the code from this tutorial. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset.       

Option 2: Run the tutorial locally

1. Several Python versions can work for this tutorial. At the time of publishing, we recommend downloading Python 3.13, the latest version.

2. In your preferred IDE, clone the GitHub repository by using https://github.com/IBM/ibmdotcom-tutorials.git  as the HTTPS URL. For detailed steps on how to clone a repository, refer to the GitHub documentation.

3. Inside a terminal, create a virtual environment to avoid Python dependency issues.

python3.13 -m venv myvenv
source myvenv/bin/activate

4. Then, navigate to this tutorial’s directory.

cd docs/tutorials/machine-learning

Step 2. Install and import relevant libraries

We need a few libraries and modules for this tutorial. Make sure to import the following ones; if they’re not installed, a quick pip installation resolves them. The -q flag suppresses the installation progress bars.

Some helpful libraries here include nltk, the Natural Language Toolkit, for text processing. This tutorial uses the nltk stopwords corpus for data preprocessing. Additionally, the fastparquet and pyarrow libraries are required to read the Parquet files hosted on Hugging Face into pandas DataFrames.

!pip install -q torch pandas pyarrow fastparquet scikit-learn nltk huggingface_hub matplotlib

Import the following modules and classes.

import torch
import torch.nn as nn
import torch.optim as optim
import nltk
import regex as re
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
from nltk.corpus import stopwords
from torch.utils.data import DataLoader, Dataset
from collections import Counter

Step 3. Text preprocessing

The Deysi/spam-detection-dataset that we use for training and testing can be found on Hugging Face. It consists of prepartitioned data containing text and label columns. Each message in the text column is assigned a label of not_spam or spam.

Let’s read the Parquet files by using the file paths of the training dataset and the test set. We can also return the first 10 rows to take a closer look.

splits = {'train': 'data/train-00000-of-00001-daf190ce720b3dbb.parquet', 'test': 'data/test-00000-of-00001-fa9b3e8ade89a333.parquet'}
df_train = pd.read_parquet("hf://datasets/Deysi/spam-detection-dataset/" + splits["train"])
df_test = pd.read_parquet("hf://datasets/Deysi/spam-detection-dataset/" + splits["test"])
df_train.head(10)

Output:

Because this tutorial explores a single model and does not require choosing between rival approaches, we do not need a validation partition of the data. However, if you would like to experiment with selecting from different models, you should concatenate the training and testing sets and repartition them, as sketched below. A suggested split is 50–25–25 for training, validation and testing.
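For completeness, the following is a minimal sketch of such a repartition by using scikit-learn’s train_test_split; the 50–25–25 proportions and random_state value are illustrative choices, not requirements.

from sklearn.model_selection import train_test_split

# Combine the prepartitioned splits, then carve out 50% for training
# and divide the remainder evenly into validation and test sets.
df_all = pd.concat([df_train, df_test], ignore_index=True)
df_tr, df_rest = train_test_split(df_all, test_size=0.5, random_state=42, stratify=df_all["label"])
df_val, df_te = train_test_split(df_rest, test_size=0.5, random_state=42, stratify=df_rest["label"])
print(len(df_tr), len(df_val), len(df_te))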

To start the data cleaning process, we first need to create a set of “stopwords.” To do so, we can use the NLTK stopwords corpus. This corpus consists of words such as “a,” “of,” “me,” “you,” “what,” and others. The purpose of this corpus is to exclude any such words that do not provide significant semantic meaning. By focusing on words with significant meaning, we can reduce the dimensionality of the text data and improve computational efficiency.

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

The clean_data function drives the data cleaning process. In it, we convert the text to lowercase, remove nonalphanumeric characters (except spaces), remove “stopwords” as defined by the stop_words set and then return the cleaned text as a string.

def clean_data(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    words = [w for w in text.split() if w not in stop_words]
    return " ".join(words)

Let’s now apply this function to the text  column and store the preprocessed text in the clean_text  column.

df_train['clean_text'] = df_train['text'].apply(clean_data)
df_test['clean_text'] = df_test['text'].apply(clean_data)

As a next step to this process, we can map categorical labels to numerical values so that spam  maps to 1 and not_spam  maps to 0. These binary labels can then be used as input for an ML model. The resulting labels are stored as float32 values in the label column of the df_train  and df_test  dataframes.

df_train["label"] = df_train["label"].map({"spam": 1, "not_spam": 0}).astype("float32")
df_test["label"] = df_test["label"].map({"spam": 1, "not_spam": 0}).astype("float32")

Step 4. Tokenization and vocabulary

As part of the next step, we will tokenize the text and create a vocabulary dictionary. This step maps the tokens to indices and is a lightweight alternative to torchtext’s prebuilt Vocab class. Just as the Vocab class does by default, we can insert some special tokens at the beginning of the vocabulary:

  • <PAD> (padding token): Used to pad shorter sequences to a uniform length, ensuring that all input sequences have the same length, which is necessary for batch processing.
  • <UNK>  (unknown token): Used to represent out-of-vocabulary (OOV) words, that is, words that are not in the training dataset, allowing the model to handle unseen words during inference.

def tokenize(text):
    return text.split()

train_texts = df_train["clean_text"]
all_tokens = [token for text in train_texts for token in tokenize(text)]
vocab = {word: i+2 for i, word in enumerate(Counter(all_tokens))} # start from 2
vocab["<PAD>"] = 0
vocab["<UNK>"] = 1

Step 5. Instantiate the dataset

Next up is the encoder function, which tokenizes the input text  into individual tokens and converts them into corresponding IDs by using the vocab  dictionary. If the length of the IDs is less than the specified max_len  of 50, the function pads the IDs with zeros to reach the maximum length. The function then returns the encoded IDs, truncated to max_len  if necessary.

def encode(text, max_len=50):
    tokens = tokenize(text)
    ids = [vocab.get(t, 1) for t in tokens]
    if len(ids) < max_len:
        ids += [0] * (max_len - len(ids))
    else:
        ids = ids[:max_len]
    return ids
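
As a quick sanity check (not part of the original notebook), you can run the encoder on the example spam message from the introduction; the exact IDs depend on the vocabulary built in the previous step.

sample = clean_data("CONGRATULATIONS! You've won a free prize! Click here to claim it.")
print(sample)               # lowercased text with punctuation and stopwords removed
print(encode(sample)[:10])  # first 10 token IDs; out-of-vocabulary words map to 1 (<UNK>)
                            # and unused positions up to max_len are filled with 0 (<PAD>)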

It’s time to create the SpamDataset class, our custom dataset class. To create a custom dataset class, we must implement three functions inherited from PyTorch’s Dataset class: __init__, __len__ and __getitem__.4 This class preprocesses the text data by using the encode function and stores it as tensors, allowing for easy access and retrieval of data samples by using the __getitem__ method. The labels are also stored as tensors.

class SpamDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = [torch.tensor(encode(t), dtype=torch.long) for t in texts]
        self.labels = torch.tensor(labels, dtype=torch.float32)

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

This code extracts the text data along with the corresponding labels from the training and testing DataFrames and stores them in separate variables. The text data is stored in train_texts and test_texts, while the labels are stored in train_labels and test_labels as NumPy arrays. This step prepares the data for further processing and modeling.

train_texts = df_train["clean_text"]
train_labels = df_train["label"].to_numpy(copy=True)

test_texts = df_test["clean_text"]
test_labels = df_test["label"].to_numpy(copy=True)

The next code block creates instances of SpamDataset  for training and testing, passing in the corresponding text data and labels. It then uses PyTorch’s DataLoader  to wrap an iterable around the datasets.4 For automatic batching, we can set our batch_size  to 32.

train_ds = SpamDataset(train_texts, train_labels)
test_ds = SpamDataset(test_texts, test_labels)

train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)
test_dl = DataLoader(test_ds, batch_size=32)

Step 6. Define the model

The SpamClassifier class defines a PyTorch neural network model that classifies text as spam or not spam by using an embedding layer, a long short-term memory (LSTM) layer and a linear layer, as well as a sigmoid activation function. The model is defined with parameters such as vocab_size, embed_dim and hidden_dim, which can be adjusted to fine-tune the model’s performance.

For a detailed view of the default LSTM hyperparameters not shown here, such as dropout, bidirectional and num_layers, refer to the PyTorch documentation.

class SpamClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)
        self.fc = nn.Linear(in_features=hidden_dim, out_features=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.embedding(x)
        _, (hidden, _) = self.lstm(x)
        out = self.fc(hidden[-1])
        return self.sigmoid(out).squeeze(1)

The following code block initializes an instance of the SpamClassifier  model, passing in the vocab_size  parameter to define the size of the vocabulary that the model will be trained on.

vocab_size = len(vocab)
model = SpamClassifier(vocab_size)
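
As an optional sanity check, assuming the data loaders from Step 5, you can pass one batch through the untrained model and confirm that it returns one probability per message.

# Fetch one batch of encoded messages and labels, then run a forward pass.
X_batch, y_batch = next(iter(train_dl))
with torch.no_grad():
    probs = model(X_batch)
print(X_batch.shape)  # for example, torch.Size([32, 50]): 32 messages, 50 token IDs each
print(probs.shape)    # torch.Size([32]): one spam probability per message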

 

Step 7. Training

The following lines define the loss function used to evaluate the model’s performance, with binary cross-entropy (BCE) loss being a suitable choice for binary classification problems like spam detection. We can then initialize the Adam optimizer, which is used to update the model’s parameters during training, with a learning rate of 0.001.

The learning rate helps ensure that the model learns enough from training to make meaningful adjustments to its parameters while also not overcorrecting. A learning rate of 0.001 is a common default because it provides a good balance between convergence speed and stability.

criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

To check whether a CUDA-compatible GPU is available and to set the device  variable to either “cuda” or “cpu,” we can run the following code. After allowing the model to use the GPU for accelerated computations, if possible, we then move the model to the specified device. This step ensures that all subsequent computations and data are processed on the chosen hardware, whether it be the GPU or CPU. Note, a GPU is not required to run this notebook as-is.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

Output:

SpamClassifier(
      (embedding): Embedding(22754, 64, padding_idx=0)
      (lstm): LSTM(64, 64, batch_first=True)
      (fc): Linear(in_features=64, out_features=1, bias=True)
      (sigmoid): Sigmoid()
)

We are now ready to train the model for 5 epochs, or complete passes over the training data. The model is set to training mode at the beginning of each epoch by using model.train(). The training data is then iterated over in batches by using the train_dl data loader, with each batch being transferred to the specified device (GPU or CPU) for processing. For each batch, the model clears accumulated gradients, makes predictions on the input and calculates the loss by using the criterion. With loss.backward(), the model computes the gradients, then updates the model parameters by using the optimizer and increments the train_loss. The average loss is calculated and printed at the end of each epoch, providing a measure of the model’s performance during training.

epochs = 5
train_losses = []
for epoch in range(epochs):
    model.train()
    train_loss = 0
    for X, y in train_dl:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        preds = model(X)
        loss = criterion(preds, y)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    train_losses.append(train_loss/len(train_dl))
    print(f"Epoch {epoch+1} | Loss: {train_loss/len(train_dl):.4f}")

Output:

Epoch 1 | Loss: 0.4563
Epoch 2 | Loss: 0.2659
Epoch 3 | Loss: 0.2312
Epoch 4 | Loss: 0.1316
Epoch 5 | Loss: 0.0936

As you can see, the loss is significantly dropping with each epoch. Let’s plot the loss over the number of epochs as a visualization of these results.

epochs = np.arange(1, len(train_losses) + 1)
plt.figure(figsize=(12, 5))
plt.plot(epochs, train_losses, label='Training Loss', color='blue')
plt.title('Training loss over epochs')
plt.xlabel('Epoch')
plt.ylabel('Training loss')
plt.legend()
plt.show()

Output:

Step 8. Evaluation

Our final step is evaluation of our trained model. We want to know how well it performs, don’t we? For this step, the model is set to evaluation mode by using model.eval()  and its performance is evaluated on the test set by using the test_dl  data loader. Accuracy is calculated by comparing the model’s predictions to the actual labels. The accuracy is then printed as a fraction of correct predictions out of the total number of test data samples, providing a measure of the model’s performance on unseen data.

For labeled data that uses supervised techniques like ours, accuracy is the simplest evaluation approach. Other approaches include the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC) and the F-score or F-measure.5

model.eval()
correct, total = 0, 0
with torch.no_grad():
    for X, y in test_dl:
        X, y = X.to(device), y.to(device)
        preds = model(X)
        preds_cls = (preds > 0.5).float()
        correct += (preds_cls == y).sum().item()
        total += y.size(0)
print(f"Final accuracy: {correct / total:.4f}")

Output:

Final accuracy: 0.9692

Amazing! With roughly 97% accuracy after only 5 epochs of training, our text classification model is performing very well on unseen data. These results are excellent given that we trained the model from scratch, without leveraging any pretrained models.

Note that your result might differ slightly but should be close to 0.96.
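
If you would like to go beyond accuracy, here is a minimal sketch, assuming the trained model, test_dl and device defined above, that computes per-class precision, recall and F1-score with scikit-learn.

from sklearn.metrics import classification_report

# Collect hard predictions and true labels across the test set.
all_preds, all_labels = [], []
model.eval()
with torch.no_grad():
    for X, y in test_dl:
        preds = (model(X.to(device)) > 0.5).float().cpu()
        all_preds.extend(preds.tolist())
        all_labels.extend(y.tolist())

# Precision, recall and F1-score per class (0 = not_spam, 1 = spam).
print(classification_report(all_labels, all_preds, target_names=["not_spam", "spam"]))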

Conclusion

In this tutorial, we have successfully built and trained a SpamClassifier model by using PyTorch, which can classify text as spam or not spam. We started by preparing the dataset, creating a data loader and defining the model architecture. We then trained the model and evaluated its performance on the test dataset. By following this tutorial, you have gained hands-on experience with building and training a neural network model for text classification tasks. You can now apply this knowledge to build more complex models and tackle real-world problems in NLP.

Footnotes

1Antiga, L. P. G., Stevens, E., & Viehmann, T. (2020). Deep learning with PyTorch. Simon and Schuster.

2B. Tóth, S. A. Laczi and V. Póser. (2024). “Spam Filter Using Artificial Intelligence: PyTorch Framework Based Approach,” 2024 IEEE 22nd Jubilee International Symposium on Intelligent Systems and Informatics (SISY) pp. 000227-000232. DOI: 10.1109/SISY62279.2024.10737559.

3PyTorch. “Pytorch/Pytorch.” (2021). GitHub. github.com/pytorch/pytorch.

4“Datasets & DataLoaders — PyTorch Tutorials 2.7.0+Cu126 Documentation.” (2024). Pytorch.org. docs.pytorch.org/tutorials/beginner/basics/data_tutorial.html.

5Hossin, M., & Sulaiman, M. N. (2015). A review on evaluation metrics for data classification evaluations. International journal of data mining & knowledge management process, 5(2), 1.