In this tutorial, you will learn how to take control of your inbox to keep your personal information safe. The objective is to build a text classifier with PyTorch that identifies and filters out unwanted or unsolicited communications, also known as spam. With high accuracy, your machine learning (ML) classifier sifts through your messages and separates the important ones from the junk.
“CONGRATULATIONS! You’ve won a free prize! Click here to claim it.”
“Your account has been compromised. Click here to reset your password.”
How often are you flooded with emails and text messages like these? Did you win a free iPhone, or is the suspicious link a way to steal your information? Modern-day email providers do a good job of filtering these emails as spam. However, these scams can reach us in many forms, such as text messages, phone calls or even paper mail. Let’s take spam detection into our own hands with PyTorch.
Text classification is a machine learning task that involves assigning predefined labels to text data in order to automatically categorize it into groups. This technique is fundamental to natural language processing (NLP) applications such as sentiment analysis, topic labeling and spam detection.
The final stage of a typical text classification workflow is model deployment: putting the trained model to work in a real-world application, such as an API or a spam filter. Deployment isn’t covered in this tutorial because it requires considerations of its own, such as model serving, application programming interface (API) design, scalability and security.
PyTorch is an open source deep learning library used to build neural networks, combining the machine learning (ML) library of Torch with a Python-based high-level API. Its fundamental unit of data for computation is the tensor, and PyTorch provides functions to operate on tensors incrementally and interactively.1 However, the primary benefits that have established PyTorch as the top ML framework among academic and research communities are its ability to leverage GPUs for accelerated computation, its ease of debugging and its ability to handle large datasets.1,2 PyTorch is particularly well suited for text classification because it ships with a large number of prebuilt functions and modules for common text classification tasks, such as embedding layers, recurrent neural networks (RNNs) and transformers. Paired with PyTorch’s dynamic computation graph, this makes it a clear choice for our use case, providing a blend of flexibility and speed.3
To run this tutorial effectively, you need Python installed. This tutorial was tested with Python 3.13.
There are several ways in which you can run the code provided in this tutorial. Either use IBM® watsonx.ai® to follow along step-by-step or clone our GitHub repository to run the full Jupyter Notebook.
To run the notebook locally, follow these steps to set up your environment.
1. Several Python versions can work for this tutorial. At the time of publishing, we recommend downloading Python 3.13, the latest version.
2. In your preferred IDE, clone the GitHub repository by using the git clone command with the repository’s URL (see the sketch after this list).
3. Inside a terminal, create a virtual environment to avoid Python dependency issues.
4. Then, navigate to this tutorial’s directory.
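Taken together, steps 2 through 4 might look like the following in a terminal. The repository URL and directory name are placeholders because they aren’t shown above; substitute the actual values.

git clone <repository-url>
python3 -m venv venv
source venv/bin/activate   # on Windows: venv\Scripts\activate
cd <tutorial-directory>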
We need a few libraries and modules for this tutorial. Make sure to import the following ones; if any are not installed, a quick pip installation resolves that. Helpful libraries here include torch, pandas, pyarrow, fastparquet, scikit-learn, nltk, huggingface_hub and matplotlib.
!pip install -q torch pandas pyarrow fastparquet scikit-learn nltk huggingface_hub matplotlib
Import the following modules and classes.
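Here is one plausible set of imports covering the libraries installed above; the exact list in the original notebook may differ slightly.

import string

import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader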
Let’s read the Parquet files by using the file paths of the training dataset and the test set. We can also return the first 10 rows to take a closer look.
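A minimal sketch of the loading step; the Parquet file paths are placeholders for the dataset files used in the notebook.

train_df = pd.read_parquet("train.parquet")  # placeholder path
test_df = pd.read_parquet("test.parquet")    # placeholder path
print(train_df.head(10))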
Output:
Because this tutorial explores a single model and doesn’t require choosing between competing approaches, we don’t need a validation partition of the data. However, if you would like to experiment with selecting from different models, you should concatenate the training and testing sets and repartition them. A suggested split is 50–25–25 for training, validation and testing, as sketched below.
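If you do want a validation set, a quick sketch with scikit-learn (installed above) could look like this; the variable names and the fixed random seed are illustrative.

from sklearn.model_selection import train_test_split

# combine the two partitions, then carve out 50% train, 25% validation, 25% test
full_df = pd.concat([train_df, test_df], ignore_index=True)
train_split, holdout = train_test_split(full_df, test_size=0.5, random_state=42)
val_split, test_split = train_test_split(holdout, test_size=0.5, random_state=42)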
To start the data cleaning process, we first need to create a set of “stopwords.” To do so, we can use the NLTK stopwords corpus. This corpus consists of words such as “a,” “of,” “me,” “you,” “what,” and others. The purpose of this corpus is to exclude any such words that do not provide significant semantic meaning. By focusing on words with significant meaning, we can reduce the dimensionality of the text data and improve computational efficiency.
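In code, building the stopword set is two lines; the quiet flag simply suppresses the download log.

nltk.download("stopwords", quiet=True)  # fetch the corpus on first run
stop_words = set(stopwords.words("english"))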
Next, we define a cleaning function that normalizes each message and removes the stopwords we just collected.
Let’s now apply this function to the text column of both the training and test DataFrames.
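A minimal sketch of such a cleaning function, assuming the messages live in a column named text; lowercasing and punctuation removal are typical choices here rather than details taken from the original notebook.

def clean_text(text):
    text = text.lower()  # normalize case
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    return " ".join(w for w in text.split() if w not in stop_words)   # drop stopwords

# "text" is an assumed column name
train_df["text"] = train_df["text"].apply(clean_text)
test_df["text"] = test_df["text"].apply(clean_text)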
As a next step, we can map the categorical labels to numerical values so that the model can work with them, for example, 0 for a legitimate message and 1 for spam.
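A sketch of the mapping, assuming the raw labels are the strings "ham" and "spam" stored in a column named label:

label_map = {"ham": 0, "spam": 1}  # assumed raw label values
train_df["label"] = train_df["label"].map(label_map)
test_df["label"] = test_df["label"].map(label_map)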
As part of the next step, we will tokenize the text and create a vocabulary dictionary. This step maps tokens to indices and is an alternative to the prebuilt vocabulary utilities found in libraries such as torchtext.
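One way to build such a vocabulary, reserving index 0 for padding (consistent with the padding_idx=0 visible in the model printout later) and index 1 for unknown tokens:

vocab = {"<pad>": 0, "<unk>": 1}
for text in train_df["text"]:
    for token in text.split():
        if token not in vocab:
            vocab[token] = len(vocab)  # assign the next free index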
Next up is the encoder function, which tokenizes the input text and converts each token to its index in the vocabulary.
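A matching encoder might look like this; unseen tokens fall back to the <unk> index.

def encode(text):
    # map each token to its vocabulary index, defaulting to <unk>
    return [vocab.get(token, vocab["<unk>"]) for token in text.split()]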
It’s time to create the custom Dataset class that serves the encoded messages and their labels to the model.
This code extracts the text data along with the corresponding labels from the training and testing DataFrames and stores them in separate variables.
The next code block creates instances of the Dataset class for the training and test data and wraps them in DataLoader objects, which handle batching and shuffling.4
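Here is one way the last three steps could be wired up. The batch size of 32, the variable names and the padding collate function are assumptions, not details from the original notebook; padding with index 0 matches the <pad> entry reserved in the vocabulary.

from torch.nn.utils.rnn import pad_sequence

class SpamDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # encode lazily so the Dataset stores plain strings
        x = torch.tensor(encode(self.texts[idx]), dtype=torch.long)
        y = torch.tensor(self.labels[idx], dtype=torch.float32)
        return x, y

def collate_batch(batch):
    # pad every sequence in the batch to the longest one, using index 0
    sequences, labels = zip(*batch)
    padded = pad_sequence(sequences, batch_first=True, padding_value=0)
    return padded, torch.stack(labels)

train_texts, train_labels = train_df["text"].tolist(), train_df["label"].tolist()
test_texts, test_labels = test_df["text"].tolist(), test_df["label"].tolist()

train_loader = DataLoader(SpamDataset(train_texts, train_labels),
                          batch_size=32, shuffle=True, collate_fn=collate_batch)
test_loader = DataLoader(SpamDataset(test_texts, test_labels),
                         batch_size=32, shuffle=False, collate_fn=collate_batch)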
The SpamClassifier model consists of an embedding layer, a long short-term memory (LSTM) layer and a fully connected output layer whose sigmoid activation squashes each prediction into a probability between 0 and 1.
For a detailed view of the default LSTM hyperparameters not shown here, such as the number of layers, dropout and bidirectionality, refer to the PyTorch documentation for the LSTM module.
The following code block initializes an instance of the SpamClassifier class.
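Here is a sketch consistent with the architecture printed in the output further down (embedding and hidden sizes of 64, a single LSTM layer and a sigmoid output). The forward pass shown, which classifies from the LSTM’s final hidden state, is an assumption.

class SpamClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        embedded = self.embedding(x)          # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)  # hidden: (1, batch, hidden_dim)
        return self.sigmoid(self.fc(hidden[-1])).squeeze(1)  # (batch,)

model = SpamClassifier(vocab_size=len(vocab))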
The following lines define the loss function used to evaluate the model’s performance, with binary cross-entropy (BCE) loss being a suitable choice for binary classification problems like spam detection. We can then initialize the Adam optimizer, which is used to update the model’s parameters during training, with a learning rate of 0.001.
The learning rate helps ensure that the model learns enough from training to make meaningful adjustments to its parameters while also not overcorrecting. A learning rate of 0.001 is a common default because it provides a good balance between convergence speed and stability.
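In code, these two definitions are short:

criterion = nn.BCELoss()  # binary cross-entropy for the two-class problem
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)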
To check whether a CUDA-compatible GPU is available and to set the device accordingly, we can use torch.cuda.is_available(). Moving the model to the resulting device lets training run on the GPU when one is present; printing the model confirms its architecture.
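A standard pattern for this check, moving the model onto the selected device and printing it:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)  # run on the GPU when one is available
print(model)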
Output:
SpamClassifier(
  (embedding): Embedding(22754, 64, padding_idx=0)
  (lstm): LSTM(64, 64, batch_first=True)
  (fc): Linear(in_features=64, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
We are now ready to train the model for 5 epochs, where each epoch is one full pass over the training data. The model is set to training mode at the beginning of each epoch by using model.train().
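A training loop along these lines would produce output in the format shown below; averaging the loss over all batches per epoch is an assumption.

num_epochs = 5
losses = []
for epoch in range(num_epochs):
    model.train()  # enable training-mode behavior
    epoch_loss = 0.0
    for sequences, labels in train_loader:
        sequences, labels = sequences.to(device), labels.to(device)
        optimizer.zero_grad()             # reset gradients from the previous batch
        outputs = model(sequences)        # forward pass
        loss = criterion(outputs, labels)
        loss.backward()                   # backpropagate
        optimizer.step()                  # update parameters
        epoch_loss += loss.item()
    losses.append(epoch_loss / len(train_loader))
    print(f"Epoch {epoch + 1} | Loss: {losses[-1]:.4f}")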
Output:
Epoch 1 | Loss: 0.4563
Epoch 2 | Loss: 0.2659
Epoch 3 | Loss: 0.2312
Epoch 4 | Loss: 0.1316
Epoch 5 | Loss: 0.0936
As you can see, the loss drops significantly with each epoch. Let’s plot the loss over the number of epochs to visualize these results.
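A minimal plot with matplotlib, assuming the per-epoch losses were collected in a list as in the training sketch above:

plt.plot(range(1, len(losses) + 1), losses, marker="o")
plt.xlabel("Epoch")
plt.ylabel("Training loss")
plt.title("Training loss per epoch")
plt.show()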
Output: a line chart showing the training loss decreasing steadily across the 5 epochs.
Our final step is evaluating the trained model. We want to know how well it performs, don’t we? For this step, the model is set to evaluation mode by using model.eval(), and gradient tracking can be switched off with torch.no_grad() because no parameters are updated during inference.
For labeled data and supervised techniques like ours, calculating accuracy is the simplest evaluation approach. Other approaches include the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC) and the F-score or F-measure.5
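A simple accuracy computation over the test loader might look like this; the 0.5 threshold on the sigmoid output is the conventional choice for binary classification.

model.eval()  # switch off training-mode behavior
correct, total = 0, 0
with torch.no_grad():  # no gradients needed during evaluation
    for sequences, labels in test_loader:
        sequences, labels = sequences.to(device), labels.to(device)
        outputs = model(sequences)
        predictions = (outputs >= 0.5).float()  # threshold the probabilities
        correct += (predictions == labels).sum().item()
        total += labels.size(0)
print(f"Final accuracy: {correct / total:.4f}")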
Output:
Final accuracy: 0.9692
Amazing! With 96.92% accuracy after only 5 epochs of training, our text classification model performs very well on unseen data. The results are excellent given that we trained the model from scratch, without leveraging any pretrained models.
Note: your result might differ slightly but should be close to 0.96.
In this tutorial, we have successfully built and trained a SpamClassifier model by using PyTorch, which can classify text as spam or not spam. We started by preparing the dataset, creating a data loader and defining the model architecture. We then trained the model and evaluated its performance on the test dataset. By following this tutorial, you have gained hands-on experience with building and training a neural network model for text classification tasks. You can now apply this knowledge to build more complex models and tackle real-world problems in NLP.
1Antiga, L. P. G., Stevens, E., & Viehmann, T. (2020). Deep learning with PyTorch. Simon and Schuster.
2Tóth, B., Laczi, S. A., & Póser, V. (2024). "Spam Filter Using Artificial Intelligence: PyTorch Framework Based Approach." 2024 IEEE 22nd Jubilee International Symposium on Intelligent Systems and Informatics (SISY), pp. 000227–000232. DOI: 10.1109/SISY62279.2024.10737559.
3PyTorch. (2021). pytorch/pytorch. GitHub. github.com/pytorch/pytorch.
4"Datasets & DataLoaders — PyTorch Tutorials 2.7.0+cu126 Documentation." (2024). Pytorch.org. docs.pytorch.org/tutorials/beginner/basics/data_tutorial.html.
5Hossin, M., & Sulaiman, M. N. (2015). A review on evaluation metrics for data classification evaluations. International Journal of Data Mining & Knowledge Management Process, 5(2), 1.