🛢 Well Log Lithology Classification using Machine Learning

An end-to-end Machine Learning pipeline for automated lithology prediction from well log measurements using the FORCE 2020 dataset.

📌 Project Overview

Lithology identification is one of the most important tasks in petroleum exploration and reservoir characterization. Traditionally, geologists and petrophysicists interpret lithology manually using multiple well log measurements, making the process time-consuming and dependent on expert knowledge.

This project demonstrates how Machine Learning can automate lithology prediction by learning patterns from well log data.

Using the FORCE 2020 Well Log Dataset, the project performs:

Data Cleaning
Missing Value Treatment
Feature Engineering
Model Training
Model Evaluation
Interactive Streamlit Deployment

The application predicts lithology from unseen well log measurements and provides an interactive dashboard for visualization and analysis.

🎯 Project Objectives

The primary objectives of this project are:

Develop a complete end-to-end Machine Learning workflow.
Predict lithology from physical well log measurements.
Compare multiple Machine Learning algorithms.
Engineer domain-specific petrophysical features.
Build an interactive Streamlit application for inference.
Demonstrate a production-style project structure suitable for Data Science and Geoscience portfolios.

🌍 Dataset

This project is built using the FORCE 2020 Machine Learning Competition Dataset, one of the most widely used benchmark datasets for machine learning applications in geoscience.

Dataset Statistics

| Property | Value |
| --- | --- |
| Dataset | FORCE 2020 Well Log Dataset |
| Wells | 98 |
| Samples | 117,511 |
| Raw Well Log Measurements | 15 |
| Engineered Features | 37 |
| Total Training Features | 52 |
| Target | Lithology Classification |

The dataset contains physical measurements recorded by downhole logging tools across multiple wells.

📊 Input Well Logs

The Machine Learning model learns relationships between several petrophysical measurements.

| Log | Description | Geological Interpretation |
| --- | --- | --- |
| GR | Gamma Ray | Indicates shale content and natural radioactivity |
| RHOB | Bulk Density | Helps distinguish sandstone, limestone and dolomite |
| NPHI | Neutron Porosity | Estimates hydrogen index and apparent porosity |
| RDEP / ILD | Deep Resistivity | Indicates fluid content and hydrocarbon potential |
| DTC | Sonic Transit Time | Provides information about rock compaction and porosity |
| PEF | Photoelectric Factor | Useful for carbonate identification |
| SP | Spontaneous Potential | Assists lithology and permeability interpretation |

🧠 Machine Learning Pipeline

The workflow implemented in this repository follows an industry-style pipeline:

Raw Well Logs
      │
      ▼
Data Cleaning
      │
      ▼
Feature Engineering
      │
      ▼
Model Training
      │
      ▼
Model Evaluation
      │
      ▼
Best Model Selection
      │
      ▼
Streamlit Dashboard
      │
      ▼
Prediction on New Well Logs

The entire workflow has been implemented using modular Python scripts, making the project easy to maintain, extend and deploy.

🛠 Machine Learning Models

Three supervised Machine Learning algorithms are trained and evaluated.

| Model | Purpose |
| --- | --- |
| Logistic Regression | Baseline Linear Model |
| Random Forest | Tree Ensemble Model |
| XGBoost | Gradient Boosted Decision Trees |

The best-performing model is automatically selected and saved for deployment in the Streamlit application.
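The selection step can be sketched roughly as follows. This is a minimal illustration, not the repository's actual `train.py`: the synthetic data, the `select_best_model` helper, the use of Macro F1 as the selection criterion and the temporary output path are all assumptions, and XGBoost is left out so the example runs with scikit-learn alone.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def select_best_model(candidates, X_train, y_train, X_val, y_val):
    """Fit every candidate and return (name, fitted model, scores) for the best macro F1."""
    scores, fitted = {}, {}
    for name, model in candidates.items():
        fitted[name] = model.fit(X_train, y_train)
        scores[name] = f1_score(y_val, model.predict(X_val), average="macro")
    best = max(scores, key=scores.get)
    return best, fitted[best], scores


# Synthetic stand-in for engineered well log features (not FORCE 2020 data).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int) + (X[:, 2] > 1).astype(int)

candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
# Plain slicing keeps the sketch short; a real run would hold out whole wells.
best_name, best_model, scores = select_best_model(
    candidates, X[:400], y[:400], X[400:], y[400:]
)

# Hypothetical output location; the repository stores models under models/.
model_path = Path(tempfile.mkdtemp()) / "best_model.joblib"
joblib.dump(best_model, model_path)
print(best_name, scores)
```

Saving with Joblib matches the persistence library listed in the technology stack; the file name itself is illustrative.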

⚡ Feature Engineering

A significant portion of this project focuses on domain-specific feature engineering rather than relying only on raw well logs.

Engineered features include:

Density-Neutron Separation
Gamma Ray × Density Interaction
Resistivity Ratios
Logarithmic Resistivity Transformations
Rolling Window Statistics
Vertical Gradients
Depth-Based Features
Petrophysical Interaction Features

These engineered variables help the models capture geological patterns that are not directly observable from the original measurements.

⚙️ System Architecture

The project follows a modular Machine Learning pipeline where each stage is isolated into independent modules. This improves maintainability, reproducibility, and scalability.

graph TD
    A[FORCE 2020 Dataset]
    A --> B[Data Loading]
    B --> C[Data Cleaning]
    C --> D[Feature Engineering]
    D --> E[Train/Test Split by Wells]
    E --> F[Model Training]
    F --> G[Model Evaluation]
    G --> H[Best Model Selection]
    H --> I[Save Model]
    I --> J[Streamlit Dashboard]
    J --> K[Upload New Well Log]
    K --> L[Lithology Prediction]
    L --> M[Visualization & Download]

🔄 Complete Workflow

The complete workflow implemented in this repository is shown below.

Raw Well Log CSV
        │
        ▼
Read Dataset
        │
        ▼
Replace Invalid Values
        │
        ▼
Handle Missing Data
        │
        ▼
Feature Engineering
        │
        ▼
Split Wells (Train/Test)
        │
        ▼
Train Multiple Models
        │
        ▼
Evaluate Performance
        │
        ▼
Select Best Model
        │
        ▼
Save Trained Model
        │
        ▼
Launch Streamlit Dashboard
        │
        ▼
Upload New Well Logs
        │
        ▼
Predict Lithology
        │
        ▼
Download Predictions

📂 Repository Structure

Well_Log_Lithology_Classification/
│
├── app/
│   └── streamlit_app.py          # Interactive Streamlit dashboard
│
├── data/
│   ├── raw/                      # Original dataset
│   └── processed/                # Processed datasets
│
├── models/                       # Saved Machine Learning models
│
├── reports/
│   └── figures/                  # Generated plots and evaluation figures
│
├── src/
│   ├── data_loader.py
│   ├── preprocessing.py
│   ├── feature_engineering.py
│   ├── train.py
│   ├── predict.py
│   ├── evaluate.py
│   ├── visualization.py
│   └── utils.py
│
├── tests/
├── main.py
├── requirements.txt
├── README.md
└── .gitignore

📦 Source Code Overview

The repository is divided into independent modules following good software engineering practices.

| Module | Responsibility |
| --- | --- |
| main.py | Executes the complete Machine Learning workflow |
| data_loader.py | Reads raw well log datasets |
| preprocessing.py | Handles invalid values, missing values and data cleaning |
| feature_engineering.py | Creates geological and statistical features |
| train.py | Trains Logistic Regression, Random Forest and XGBoost models |
| evaluate.py | Computes Accuracy, Precision, Recall and Macro F1 |
| predict.py | Loads trained models and performs inference |
| visualization.py | Generates lithology plots and evaluation figures |
| utils.py | Common helper functions and mappings |
| streamlit_app.py | Interactive dashboard for deployment |

🧹 Data Preprocessing Pipeline

Raw well log measurements often contain missing values, invalid measurements and noisy observations.

The preprocessing stage performs:

Replacement of invalid placeholder values
Missing value treatment
Numerical interpolation
Physical range validation
Data consistency checks

These preprocessing steps ensure that the Machine Learning models receive reliable and physically meaningful inputs.
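A minimal sketch of these steps with pandas is shown below. It is not the code in `preprocessing.py`: the `-999.25` placeholder, the physical limits in `VALID_RANGES`, the `WELL` and `DEPTH_MD` column names and the `clean_well_logs` helper are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Assumed conventions for this sketch: -999.25 as the null placeholder and
# illustrative physical limits. Neither is taken from the repository.
NULL_PLACEHOLDER = -999.25
VALID_RANGES = {"GR": (0.0, 500.0), "RHOB": (1.0, 3.5), "NPHI": (-0.15, 1.0)}


def clean_well_logs(df, well_col="WELL", depth_col="DEPTH_MD"):
    """Mask invalid readings, then interpolate short gaps within each well."""
    out = df.sort_values([well_col, depth_col]).reset_index(drop=True)
    logs = [c for c in VALID_RANGES if c in out.columns]
    out[logs] = out[logs].replace(NULL_PLACEHOLDER, np.nan)
    for col in logs:
        low, high = VALID_RANGES[col]
        # Readings outside the physical range become NaN.
        out[col] = out[col].where(out[col].between(low, high))
    # Interpolate per well so values never bleed across well boundaries.
    # Linear interpolation here is by row position, not weighted by depth.
    out[logs] = out.groupby(well_col)[logs].transform(
        lambda s: s.interpolate(limit=3, limit_direction="both")
    )
    return out


raw = pd.DataFrame({
    "WELL": ["A"] * 4 + ["B"] * 3,
    "DEPTH_MD": [100.0, 100.5, 101.0, 101.5, 200.0, 200.5, 201.0],
    "GR": [60.0, -999.25, 80.0, 9000.0, 40.0, 45.0, 50.0],
    "RHOB": [2.3, 2.4, np.nan, 2.6, 2.2, -999.25, 2.4],
})
clean = clean_well_logs(raw)
print(clean)
```

The `limit=3` cap keeps the interpolation from inventing values across long logging gaps; the right cap depends on the sampling interval and is a judgment call.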

⚡ Feature Engineering

Feature engineering is one of the most important components of this project.

Instead of relying only on raw well logs, additional geological features are generated to improve model performance.

Generated Features

Density-Neutron Separation
Resistivity Ratios
Gamma Ray × Density Interaction
Logarithmic Resistivity
Vertical Log Gradients
Rolling Mean
Rolling Standard Deviation
Moving Window Statistics
Depth-Based Features

These engineered variables capture geological relationships that improve lithology discrimination.
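A few of these features could be computed along the following lines. This is a sketch under stated assumptions, not the contents of `feature_engineering.py`: the `add_features` helper, the output column names, the sandstone matrix density (2.65 g/cc) and fluid density (1.0 g/cc) used for the density-neutron separation are all illustrative, and rows are assumed to be sorted by depth within each well.

```python
import numpy as np
import pandas as pd


def add_features(df, well_col="WELL", window=5):
    """Append a handful of illustrative engineered features to a copy of df."""
    out = df.copy()
    # Density-neutron separation: neutron porosity minus density porosity,
    # assuming a sandstone matrix (2.65 g/cc) and fresh-water fluid (1.0 g/cc).
    density_porosity = (2.65 - out["RHOB"]) / (2.65 - 1.0)
    out["DN_SEP"] = out["NPHI"] - density_porosity
    out["GR_X_RHOB"] = out["GR"] * out["RHOB"]
    # Resistivity spans orders of magnitude, so work in log10 space.
    out["LOG_RDEP"] = np.log10(out["RDEP"].clip(lower=1e-3))
    # Rolling statistics and vertical gradient, computed per well.
    grouped = out.groupby(well_col)["GR"]
    out["GR_ROLL_MEAN"] = grouped.transform(
        lambda s: s.rolling(window, center=True, min_periods=1).mean()
    )
    out["GR_ROLL_STD"] = grouped.transform(
        lambda s: s.rolling(window, center=True, min_periods=1).std()
    )
    out["GR_GRAD"] = grouped.diff()  # NaN on the first sample of each well
    return out


logs = pd.DataFrame({
    "WELL": ["A"] * 3 + ["B"] * 3,
    "GR": [50.0, 60.0, 70.0, 100.0, 110.0, 120.0],
    "RHOB": [2.65, 2.45, 2.32, 2.50, 2.55, 2.60],
    "NPHI": [0.05, 0.15, 0.20, 0.30, 0.28, 0.25],
    "RDEP": [10.0, 100.0, 1.0, 2.0, 5.0, 20.0],
})
features = add_features(logs, window=3)
print(features.round(3))
```

Grouping by well before rolling or differencing keeps the last samples of one well from leaking into the first samples of the next.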

🤖 Machine Learning Training

Three different Machine Learning algorithms are trained and compared.

| Algorithm | Description |
| --- | --- |
| Logistic Regression | Linear baseline classifier |
| Random Forest | Ensemble of Decision Trees |
| XGBoost | Gradient Boosted Decision Trees |

Each model is trained independently and evaluated using unseen wells to reduce spatial data leakage.

The best-performing model is automatically selected and saved inside the models/ directory for deployment.
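Evaluating on unseen wells means splitting by well rather than by row. One way to do this is scikit-learn's `GroupShuffleSplit`, sketched below on toy arrays; whether the repository uses this splitter or a hand-picked list of test wells is not stated, so treat the well names and the 25% test share as assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy arrays: 12 samples from 4 hypothetical wells.
wells = np.array(["W1"] * 3 + ["W2"] * 3 + ["W3"] * 3 + ["W4"] * 3)
X = np.arange(24, dtype=float).reshape(12, 2)
y = np.array([0, 1, 0] * 4)

# test_size is a share of the groups (wells), not of the rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=wells))

train_wells = set(wells[train_idx])
test_wells = set(wells[test_idx])
print("train wells:", sorted(train_wells))
print("test wells:", sorted(test_wells))
```

Because every sample from a well lands on the same side of the split, neighbouring depth samples, which are strongly correlated, cannot appear in both training and test data.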

📈 Model Evaluation

Model performance is evaluated using multiple metrics instead of relying only on accuracy.

Evaluation includes:

Accuracy
Precision
Recall
Macro F1 Score
Confusion Matrix
Classification Report

Using multiple metrics provides a more reliable assessment, especially when lithology classes are imbalanced.
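These metrics are all available in scikit-learn. The helper below is a hypothetical sketch of how they might be gathered in one place; the `evaluate` function name, the returned dictionary keys and the toy labels are assumptions, not the contents of `evaluate.py`.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_recall_fscore_support,
)


def evaluate(y_true, y_pred):
    """Collect accuracy, macro-averaged scores, confusion matrix and report."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_precision": precision,
        "macro_recall": recall,
        "macro_f1": f1,
        # Rows are true classes, columns are predicted classes (sorted labels).
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "report": classification_report(y_true, y_pred, zero_division=0),
    }


y_true = ["shale", "shale", "shale", "sandstone", "sandstone", "limestone"]
y_pred = ["shale", "shale", "sandstone", "sandstone", "sandstone", "shale"]
metrics = evaluate(y_true, y_pred)
print(metrics["report"])
print(metrics["confusion_matrix"])
```

Setting `zero_division=0` scores a class that is never predicted as 0 instead of raising a warning, which is a common situation for rare lithologies.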

🚀 Installation

Clone the repository and install the required dependencies.

git clone https://github.com/<YOUR_USERNAME>/Well-Log-Lithology-Classification.git

cd Well-Log-Lithology-Classification

Create a virtual environment.

python -m venv venv

Activate the environment.

Windows

venv\Scripts\activate

Linux / macOS

source venv/bin/activate

Install the dependencies.

pip install -r requirements.txt

▶ Running the Project

1️⃣ Execute the Complete Machine Learning Pipeline

The entire workflow can be executed with:

python main.py

Running the pipeline performs the following tasks:

Loads the FORCE 2020 Dataset
Cleans the raw well log data
Handles missing values
Engineers geological features
Splits wells into training and testing sets
Trains multiple Machine Learning models
Evaluates model performance
Saves the best-performing model
Generates evaluation figures

2️⃣ Launch the Streamlit Dashboard

Start the interactive dashboard with:

streamlit run app/streamlit_app.py

The dashboard provides an intuitive interface for predicting lithology from unseen well log data.

💻 Streamlit Dashboard Features

The application includes multiple interactive pages.

🏠 Dashboard

Project overview
Dataset statistics
Model summary
Performance metrics

📊 Dataset Explorer

Dataset preview
Statistical summary
Missing value inspection
Feature distributions

🔍 Lithology Prediction

Upload a new CSV file containing well log measurements.

The application then automatically:

Loads the trained model
Performs prediction
Displays predicted lithology
Allows downloading the prediction results
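The prediction flow above can be sketched independently of the user interface. Everything named here is hypothetical: the `predict_lithology` helper, the `PREDICTED_LITHOLOGY` column and the `ThresholdModel` stand-in, which replaces the saved model so the example runs on its own.

```python
import pandas as pd


def predict_lithology(model, logs, feature_cols):
    """Return a copy of logs with a PREDICTED_LITHOLOGY column appended."""
    missing = [c for c in feature_cols if c not in logs.columns]
    if missing:
        raise ValueError(f"Uploaded file is missing columns: {missing}")
    out = logs.copy()
    out["PREDICTED_LITHOLOGY"] = model.predict(out[feature_cols])
    return out


class ThresholdModel:
    """Stand-in for a trained classifier; calls high gamma ray 'shale'."""

    def predict(self, X):
        return ["shale" if gr > 75 else "sandstone" for gr in X["GR"]]


uploaded = pd.DataFrame(
    {"DEPTH_MD": [1000.0, 1000.5], "GR": [40.0, 120.0], "RHOB": [2.3, 2.5]}
)
result = predict_lithology(ThresholdModel(), uploaded, ["GR", "RHOB"])
csv_bytes = result.to_csv(index=False).encode("utf-8")  # payload for a download
print(result)
```

In the dashboard, the model would presumably come from `joblib.load` and `csv_bytes` could be handed to Streamlit's `st.download_button`. Checking for missing columns first gives the user a readable error instead of a failure inside the model.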

📈 Feature Importance

Visualizes the contribution of individual well log measurements to the final prediction.

This helps explain which physical measurements influence the model most.
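For a tree ensemble such as Random Forest, one common source of these values is the fitted model's `feature_importances_` attribute (impurity-based importance). The sketch below uses synthetic data in which only `GR` drives the label; the column names and data are assumptions, and the dashboard may compute importance differently.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["GR", "RHOB", "NOISE"])
y = (X["GR"] > 0).astype(int)  # synthetic label driven by GR only

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across all features.
importance = pd.Series(model.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

Impurity-based importance can overstate correlated or high-cardinality features, so it is best read as a ranking rather than an exact contribution.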

🌍 Well Log Viewer

Interactive visualization of the following tracks:

Gamma Ray
Resistivity
Density
Neutron Porosity
Sonic
Lithology Track

These tracks allow geological interpretation alongside model predictions.
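A static version of such a track display can be sketched with Matplotlib. The curves below are synthetic and the layout choices are assumptions; the dashboard itself may well use Plotly for interactivity.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic curves standing in for real log data.
depth = np.linspace(2000.0, 2050.0, 200)
rng = np.random.default_rng(1)
curves = {
    "GR (API)": 60 + 30 * np.sin(depth / 5) + rng.normal(0, 3, depth.size),
    "RDEP (ohm.m)": 10 ** (1 + 0.5 * np.cos(depth / 7)),
    "RHOB (g/cc)": 2.4 + 0.1 * np.sin(depth / 9),
}

# One track per curve, all sharing the depth axis.
fig, axes = plt.subplots(1, len(curves), sharey=True, figsize=(7, 6))
for ax, (label, values) in zip(axes, curves.items()):
    ax.plot(values, depth)
    ax.set_xlabel(label)
    if label.startswith("RDEP"):
        ax.set_xscale("log")  # resistivity is conventionally shown on a log scale
    ax.grid(True)
axes[0].set_ylabel("Depth (m)")
axes[0].invert_yaxis()  # depth increases downward
```

Sharing the y-axis keeps every track aligned in depth, so inverting one axis inverts them all.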

📊 Model Performance

The following Machine Learning algorithms were evaluated.

| Model | Accuracy | Macro F1 |
| --- | --- | --- |
| Logistic Regression | 82.4% | 0.241 |
| Random Forest | 84.7% | 0.327 |
| XGBoost | 84.2% | 0.315 |

Why Macro F1?

Lithology datasets are naturally imbalanced because some rock types occur much more frequently than others.

While accuracy measures overall correctness, the Macro F1 Score evaluates performance across every lithology class equally, making it a more reliable metric for this problem.
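A toy example makes the gap between the two metrics concrete. The labels below are invented for illustration: 90 shale samples, 10 limestone samples, and a "model" that always answers shale.

```python
def f1_per_class(y_true, y_pred):
    """Per-class F1 computed from true/false positives and false negatives."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


y_true = ["shale"] * 90 + ["limestone"] * 10
y_pred = ["shale"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_class = f1_per_class(y_true, y_pred)
macro_f1 = sum(per_class.values()) / len(per_class)  # unweighted mean over classes
print(f"accuracy={accuracy:.2f}, macro F1={macro_f1:.3f}")
# prints: accuracy=0.90, macro F1=0.474
```

Accuracy looks strong at 0.90 even though limestone is never detected, while Macro F1 drops to 0.474 because the missed class counts as much as the dominant one. The same effect explains how the models in the table can reach accuracy above 80% with Macro F1 scores near 0.3.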

Best Performing Model

🥇 Random Forest Classifier

The Random Forest classifier achieved the highest accuracy (84.7%) and Macro F1 (0.327) of the three models and was therefore selected as the final deployment model for the Streamlit application.

📈 Generated Figures

The pipeline automatically generates several visualization outputs.

| Figure | Description |
| --- | --- |
| Lithology Distribution | Shows the frequency of each lithology class in the dataset |
| Correlation Heatmap | Displays relationships between well log measurements |
| Feature Importance | Identifies the most influential features used by the model |
| Confusion Matrix | Visualizes prediction accuracy across lithology classes |
| Well Log Tracks | Professional petrophysical visualization of multiple well logs |

All figures are generated automatically during model evaluation and saved inside:

reports/figures/

Running

python main.py

regenerates these figures and overwrites the previous outputs.

📷 Dashboard Preview

Replace these placeholders with screenshots after uploading the project to GitHub.

Dashboard

Prediction Page

Well Log Viewer

Feature Importance

![Feature Importance](docs/feature_importance.png)

📌 Key Highlights

✔ End-to-End Machine Learning Pipeline

✔ Real Petroleum Engineering Dataset (FORCE 2020)

✔ Domain-Specific Feature Engineering

✔ Multiple Machine Learning Models

✔ Automated Model Evaluation

✔ Interactive Streamlit Dashboard

✔ Professional Data Visualization

✔ Modular Python Project Structure

🔮 Future Improvements

The project can be extended in several directions:

Deep Learning

LSTM / GRU models for sequential well log interpretation
Transformer-based lithology classification

Explainable AI

SHAP Value Analysis
LIME Explanations
Feature Attribution Dashboard

Deployment

Docker Containerization
REST API using FastAPI
Cloud Deployment (AWS / Azure / GCP)

Geological Extensions

LAS File Support
Multi-Well Comparison
3D Geological Visualization
Facies Probability Visualization
Reservoir Property Prediction

🛠 Technology Stack

| Category | Technologies |
| --- | --- |
| Programming | Python |
| Data Analysis | Pandas, NumPy |
| Machine Learning | Scikit-Learn, XGBoost |
| Visualization | Matplotlib, Plotly |
| Dashboard | Streamlit |
| Model Persistence | Joblib |
| Version Control | Git, GitHub |

📚 Learning Outcomes

Through this project, the following concepts were implemented:

Data preprocessing for real-world well log data
Geological feature engineering
Supervised Machine Learning
Model evaluation and comparison
Production-style project organization
Interactive dashboard development
Scientific visualization
Deployment-ready workflows

🤝 Contributing

Contributions, suggestions, and improvements are welcome.

If you find an issue or have an idea to improve the project, feel free to open an issue or submit a pull request.

📄 License

This project is released under the MIT License.

👨‍💻 Author

AvionicS_7

B.S. (4-Year), Exploration Geophysics

Indian Institute of Technology Kharagpur

Interested in:

Machine Learning
Geophysics
Petroleum Data Analytics
AI for Earth Sciences

⭐ If you found this repository useful...

Please consider giving it a ⭐ Star on GitHub.

It helps others discover the project and motivates future improvements.
