Lung Cancer Detection using Convolutional Neural Network (CNN)
Computer Vision is one of the applications of deep neural networks, and one such use case is predicting the presence of cancerous cells. In this article, we will learn how to build a classifier using a Convolutional Neural Network (CNN) that can distinguish normal lung tissue from cancerous tissue.
Below is the step-by-step process for building our CNN model.
1. Importing Libraries
We will be using:
- Pandas, NumPy, Matplotlib and Scikit-learn for data handling and analysis.
- OpenCV for image processing.
- TensorFlow to build and train machine learning models, including CNNs.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
from glob import glob
from sklearn.model_selection import train_test_split
from sklearn import metrics
from zipfile import ZipFile
import cv2
import gc
import os
import tensorflow as tf
from tensorflow import keras
from keras import layers
import warnings
warnings.filterwarnings('ignore')
2. Importing Dataset
The dataset used for this project is available on Kaggle and consists of 15,000 histopathological images, 5,000 for each of three categories of lung conditions:
- Normal Class
- Lung Adenocarcinomas
- Lung Squamous Cell Carcinomas
This dataset has already been augmented, meaning the 250 original images for each category were artificially expanded to 5,000, so we won’t need to perform data augmentation ourselves.
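For reference, if the dataset were not pre-augmented, a minimal augmentation sketch using Keras preprocessing layers (an optional illustration only, not part of this pipeline) could look like this:

import tensorflow as tf

# Random flips, rotations and zooms applied on the fly during training
augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip('horizontal'),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])

Such a block would typically be placed at the start of the model so that each batch sees slightly different versions of the images.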
- We use Python’s zipfile module to extract the contents of the dataset. This is crucial because the dataset is stored as a compressed .zip file.
- zip.extractall() extracts all the contents to the current working directory.
data_path = 'lung-and-colon-cancer-histopathological-images.zip'

with ZipFile(data_path, 'r') as zip:
    # Extract all files to the current working directory
    zip.extractall()
    print('The data set has been extracted.')
Output:
The data set has been extracted.
3. Visualizing the Data
Here we visualize a few images to understand what the data looks like and what the model will be trained on. The classes list contains the folder names 'lung_n', 'lung_aca' and 'lung_scc', corresponding to Normal, Lung Adenocarcinoma and Lung Squamous Cell Carcinoma, the three classes in this dataset.
path = '/lung_colon_image_set/lung_image_sets'
classes = os.listdir(path)  # the three class folders: 'lung_n', 'lung_aca', 'lung_scc'

for cat in classes:
    image_dir = f'{path}/{cat}'
    images = os.listdir(image_dir)

    fig, ax = plt.subplots(1, 3, figsize=(15, 5))
    fig.suptitle(f'Images for {cat} category . . . .', fontsize=20)

    # Show three randomly chosen images from this category
    for i in range(3):
        k = np.random.randint(0, len(images))
        img = np.array(Image.open(f'{path}/{cat}/{images[k]}'))
        ax[i].imshow(img)
        ax[i].axis('off')
    plt.show()
Output:



The above output will vary on each run because the code picks images at random every time it is executed.
- It selects a random sample of three images from each category and visualizes them using Matplotlib.
- PIL.Image.open() opens each image and converts it into a format that can be displayed.
4. Preparing the Dataset
Before training the model, we need to process the images into a format suitable for the CNN. This involves resizing the images and converting them into NumPy arrays for efficient computation.
- Image Resizing: Since large images are computationally expensive to process, we resize them to a standard size (256x256) with cv2.resize() and store them in a NumPy array. We also set the training hyperparameters here: 10 epochs with a batch size of 64.
- One hot encoding: Labels (Y) are converted to one-hot encoded vectors using pd.get_dummies(); a small illustration follows this list. This allows the model to output soft probabilities for each class.
- Train-Test Split: We split the dataset into training and validation sets i.e 80% for training and 20% for validation. This allows us to evaluate the model's performance on unseen data.
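As a quick illustration of what pd.get_dummies() produces (a toy example, not part of the pipeline):

import pandas as pd

# Toy labels standing in for the three classes (0 = lung_n, 1 = lung_aca, 2 = lung_scc)
toy_labels = [0, 1, 2, 0]
print(pd.get_dummies(toy_labels).values.astype(int))
# Each row contains a single 1 in the column of its class:
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [1 0 0]]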
IMG_SIZE = 256
SPLIT = 0.2
EPOCHS = 10
BATCH_SIZE = 64

X = []
Y = []

for i, cat in enumerate(classes):
    images = glob(f'{path}/{cat}/*.jpeg')
    for image in images:
        img = cv2.imread(image)
        X.append(cv2.resize(img, (IMG_SIZE, IMG_SIZE)))
        Y.append(i)

X = np.asarray(X)
one_hot_encoded_Y = pd.get_dummies(Y).values

X_train, X_val, Y_train, Y_val = train_test_split(
    X, one_hot_encoded_Y, test_size=SPLIT, random_state=2022)
print(X_train.shape, X_val.shape)
Output:
(12000, 256, 256, 3) (3000, 256, 256, 3)
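The two shapes add up to 15,000 images in total (12,000 for training and 3,000 for validation), which matches 5,000 images per class split 80/20, each resized to 256x256 with 3 colour channels.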
5. Model Development
Now we start building the CNN. Here, we use TensorFlow and Keras to define the architecture of our CNN model.
- Sequential(): Builds a linear stack of layers.
- Conv2D(): Applies convolution with specified filters, kernel size, ReLU activation and padding.
- MaxPooling2D(): Downsamples feature maps by taking max values over pool size.
- Flatten(): Converts 2D feature maps into 1D vector.
- Dense(): Fully connected layer with given units and activation.
- BatchNormalization(): Normalizes activations to speed up training.
- Dropout(): Randomly drops neurons to reduce overfitting.
- model.summary(): Displays model architecture details.
model = keras.models.Sequential([
    layers.Conv2D(filters=32,
                  kernel_size=(5, 5),
                  activation='relu',
                  input_shape=(IMG_SIZE, IMG_SIZE, 3),
                  padding='same'),
    layers.MaxPooling2D(2, 2),

    layers.Conv2D(filters=64,
                  kernel_size=(3, 3),
                  activation='relu',
                  padding='same'),
    layers.MaxPooling2D(2, 2),

    layers.Conv2D(filters=128,
                  kernel_size=(3, 3),
                  activation='relu',
                  padding='same'),
    layers.MaxPooling2D(2, 2),

    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.BatchNormalization(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(3, activation='softmax')
])

model.summary()
Output:

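As a quick sanity check on the summary: the padding='same' convolutions keep the spatial size unchanged, while each MaxPooling2D(2, 2) halves it, so the 256x256 input becomes 128x128, then 64x64, then 32x32. Flatten() therefore outputs 32 x 32 x 128 = 131,072 features, which feed the first Dense layer.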
6. Model Compilation
After defining the model architecture, we compile the model with an optimizer, loss function and evaluation metric, and then train it using the training data.
- We use the Adam optimizer which adjusts the learning rate during training to speed up convergence.
- Categorical cross entropy loss is appropriate as loss function for multi-class classification problems as it measures the difference between the predicted and actual probability distributions.
- EarlyStopping: Stops training if validation accuracy doesn’t improve for a set number of epochs (patience).
- ReduceLROnPlateau: Reduces learning rate when validation loss plateaus, controlled by patience and factor.
- Custom myCallback class: Stops training early when validation accuracy exceeds 90%.
- self.model.stop_training = True: Signals to stop training inside the callback.
from keras.callbacks import EarlyStopping, ReduceLROnPlateau

class myCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        # Stop training once validation accuracy crosses 90%
        if logs.get('val_accuracy') > 0.90:
            print('\nValidation accuracy has reached 90%, stopping further training.')
            self.model.stop_training = True

es = EarlyStopping(patience=3,
                   monitor='val_accuracy',
                   restore_best_weights=True)

lr = ReduceLROnPlateau(monitor='val_loss',
                       patience=2,
                       factor=0.5,
                       verbose=1)
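The compile call itself is not shown in the snippet above; based on the description (Adam optimizer, categorical cross-entropy loss), it would look roughly like this, assuming accuracy is the metric being tracked:

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])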
7. Model Training
Now we will train our model. The following arguments are passed to model.fit():
- model.fit() trains the model on training data X_train and Y_train.
- validation_data provides validation inputs X_val and Y_val for evaluation each epoch.
- batch_size sets the number of samples per training batch.
- epochs defines how many times the model iterates over the entire training set.
- verbose=1 displays training progress.
- callbacks includes early stopping, learning rate reduction and custom callback to control training based on validation metrics.
history = model.fit(X_train, Y_train,
                    validation_data=(X_val, Y_val),
                    batch_size=BATCH_SIZE,
                    epochs=EPOCHS,
                    verbose=1,
                    callbacks=[es, lr, myCallback()])
Output:

8. Visualizing Training Results
Let's visualize the training and validation accuracy with each epoch.
- pd.DataFrame(history.history) converts training history into a DataFrame.
- history_df.loc[:, ['accuracy', 'val_accuracy']].plot() plots training and validation accuracy.
history_df = pd.DataFrame(history.history)
history_df.loc[:,['accuracy','val_accuracy']].plot()
plt.show()
Output:

This graph shows the training and validation accuracy of the model over the epochs. The training accuracy increases steadily, reaching close to 1.0, which indicates the model is learning well from the training data.
However, the validation accuracy fluctuates, suggesting the model may be overfitting: it performs well on the training data but struggles to generalize to unseen data. This can be mitigated by further fine-tuning the model.
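To inspect overfitting more directly, the corresponding loss curves recorded by model.fit() can be plotted in the same way (a small optional addition):

# Plot training and validation loss over the epochs
history_df.loc[:, ['loss', 'val_loss']].plot()
plt.title('Training vs. validation loss')
plt.show()

A growing gap between the two curves is another sign that the model is overfitting.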
9. Model Evaluation
Now that we have our model ready, let's evaluate its performance on the validation data using different metrics. For this, we first predict the class for each validation image and then compare the predictions with the true labels.
- model.predict(X_val) generates predictions for validation data.
- np.argmax(Y_val, axis=1) converts one-hot encoded true labels to class indices.
- np.argmax(Y_pred, axis=1) converts predicted probabilities to class indices.
Y_pred = model.predict(X_val)
Y_val = np.argmax(Y_val, axis=1)
Y_pred = np.argmax(Y_pred, axis=1)
Output:
94/94 ━━━━━━━━━━━━━━━━━━━━ 80s 851ms/step
Now we will generate the classification report using the predicted labels and the true labels.
- metrics.classification_report() displays detailed evaluation metrics for each class based on true (Y_val) and predicted (Y_pred) labels, using class names from classes.
print(metrics.classification_report(Y_val, Y_pred,
                                    target_names=classes))
Output:

The classification report shows that the model performs well on normal lung tissue (lung_n), with high precision and recall resulting in a strong F1-score. However, it struggles with lung_aca (lung adenocarcinoma) and lung_scc (lung squamous cell carcinoma), particularly in terms of recall. This tells us the model could be improved at separating the two carcinoma classes and at delivering consistent performance across all categories.
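To see exactly which classes get confused with one another, a confusion matrix is a useful follow-up (a short sketch with scikit-learn, not part of the original pipeline; it reuses the class-index labels computed above):

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(Y_val, Y_pred)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm,
                                      display_labels=classes)
disp.plot(cmap='Blues')
plt.show()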