AvionicS-7/Well_Log_Lithology_Classification


🛢 Well Log Lithology Classification using Machine Learning

An end-to-end Machine Learning pipeline for automated lithology prediction from well log measurements using the FORCE 2020 dataset.



📌 Project Overview

Lithology identification is one of the most important tasks in petroleum exploration and reservoir characterization. Traditionally, geologists and petrophysicists interpret lithology manually using multiple well log measurements, making the process time-consuming and dependent on expert knowledge.

This project demonstrates how Machine Learning can automate lithology prediction by learning patterns from well log data.

Using the FORCE 2020 Well Log Dataset, the project performs:

  • Data Cleaning
  • Missing Value Treatment
  • Feature Engineering
  • Model Training
  • Model Evaluation
  • Interactive Streamlit Deployment

The application predicts lithology from unseen well log measurements and provides an interactive dashboard for visualization and analysis.


🎯 Project Objectives

The primary objectives of this project are:

  • Develop a complete end-to-end Machine Learning workflow.
  • Predict lithology from physical well log measurements.
  • Compare multiple Machine Learning algorithms.
  • Engineer domain-specific petrophysical features.
  • Build an interactive Streamlit application for inference.
  • Demonstrate a production-style project structure suitable for Data Science and Geoscience portfolios.

🌍 Dataset

This project is built using the FORCE 2020 Machine Learning Competition Dataset, one of the most widely used benchmark datasets for machine learning applications in geoscience.

Dataset Statistics

| Property | Value |
| --- | --- |
| Dataset | FORCE 2020 Well Log Dataset |
| Wells | 98 |
| Samples | 117,511 |
| Raw Well Log Measurements | 15 |
| Engineered Features | 37 |
| Total Training Features | 52 |
| Target | Lithology Classification |

The dataset contains physical measurements recorded by downhole logging tools across multiple wells.


📊 Input Well Logs

The Machine Learning model learns relationships between several petrophysical measurements.

| Log | Description | Geological Interpretation |
| --- | --- | --- |
| GR | Gamma Ray | Indicates shale content and natural radioactivity |
| RHOB | Bulk Density | Helps distinguish sandstone, limestone and dolomite |
| NPHI | Neutron Porosity | Estimates hydrogen index and apparent porosity |
| RDEP / ILD | Deep Resistivity | Indicates fluid content and hydrocarbon potential |
| DTC | Sonic Transit Time | Provides information about rock compaction and porosity |
| PEF | Photoelectric Factor | Useful for carbonate identification |
| SP | Spontaneous Potential | Assists lithology and permeability interpretation |

🧠 Machine Learning Pipeline

The workflow implemented in this repository follows an industry-style pipeline:

Raw Well Logs
      │
      ▼
Data Cleaning
      │
      ▼
Feature Engineering
      │
      ▼
Model Training
      │
      ▼
Model Evaluation
      │
      ▼
Best Model Selection
      │
      ▼
Streamlit Dashboard
      │
      ▼
Prediction on New Well Logs

The entire workflow has been implemented using modular Python scripts, making the project easy to maintain, extend and deploy.


🛠 Machine Learning Models

Three supervised Machine Learning algorithms are trained and evaluated.

| Model | Purpose |
| --- | --- |
| Logistic Regression | Baseline Linear Model |
| Random Forest | Tree Ensemble Model |
| XGBoost | Gradient Boosted Decision Trees |

The best-performing model is automatically selected and saved for deployment in the Streamlit application.


⚡ Feature Engineering

A significant portion of this project focuses on domain-specific feature engineering rather than relying only on raw well logs.

Engineered features include:

  • Density-Neutron Separation
  • Gamma Ray × Density Interaction
  • Resistivity Ratios
  • Logarithmic Resistivity Transformations
  • Rolling Window Statistics
  • Vertical Gradients
  • Depth-Based Features
  • Petrophysical Interaction Features

These engineered variables help the models capture geological patterns that are not directly observable from the original measurements.


⚙️ System Architecture

The project follows a modular Machine Learning pipeline where each stage is isolated into independent modules. This improves maintainability, reproducibility, and scalability.

graph TD
    A[FORCE 2020 Dataset]
    A --> B[Data Loading]
    B --> C[Data Cleaning]
    C --> D[Feature Engineering]
    D --> E[Train/Test Split by Wells]
    E --> F[Model Training]
    F --> G[Model Evaluation]
    G --> H[Best Model Selection]
    H --> I[Save Model]
    I --> J[Streamlit Dashboard]
    J --> K[Upload New Well Log]
    K --> L[Lithology Prediction]
    L --> M[Visualization & Download]

🔄 Complete Workflow

The complete workflow implemented in this repository is shown below.

Raw Well Log CSV
        │
        ▼
Read Dataset
        │
        ▼
Replace Invalid Values
        │
        ▼
Handle Missing Data
        │
        ▼
Feature Engineering
        │
        ▼
Split Wells (Train/Test)
        │
        ▼
Train Multiple Models
        │
        ▼
Evaluate Performance
        │
        ▼
Select Best Model
        │
        ▼
Save Trained Model
        │
        ▼
Launch Streamlit Dashboard
        │
        ▼
Upload New Well Logs
        │
        ▼
Predict Lithology
        │
        ▼
Download Predictions

📂 Repository Structure

Well_Log_Lithology_Classification/
│
├── app/
│   └── streamlit_app.py          # Interactive Streamlit dashboard
│
├── data/
│   ├── raw/                      # Original dataset
│   └── processed/                # Processed datasets
│
├── models/                       # Saved Machine Learning models
│
├── reports/
│   └── figures/                  # Generated plots and evaluation figures
│
├── src/
│   ├── data_loader.py
│   ├── preprocessing.py
│   ├── feature_engineering.py
│   ├── train.py
│   ├── predict.py
│   ├── evaluate.py
│   ├── visualization.py
│   └── utils.py
│
├── tests/
├── main.py
├── requirements.txt
├── README.md
└── .gitignore

📦 Source Code Overview

The repository is divided into independent modules following good software engineering practices.

| Module | Responsibility |
| --- | --- |
| main.py | Executes the complete Machine Learning workflow |
| data_loader.py | Reads raw well log datasets |
| preprocessing.py | Handles invalid values, missing values and data cleaning |
| feature_engineering.py | Creates geological and statistical features |
| train.py | Trains Logistic Regression, Random Forest and XGBoost models |
| evaluate.py | Computes Accuracy, Precision, Recall and Macro F1 |
| predict.py | Loads trained models and performs inference |
| visualization.py | Generates lithology plots and evaluation figures |
| utils.py | Common helper functions and mappings |
| streamlit_app.py | Interactive dashboard for deployment |

🧹 Data Preprocessing Pipeline

Raw well log measurements often contain missing values, invalid measurements and noisy observations.

The preprocessing stage performs:

  • Replacement of invalid placeholder values
  • Missing value treatment
  • Numerical interpolation
  • Physical range validation
  • Data consistency checks

These preprocessing steps ensure that the Machine Learning models receive reliable and physically meaningful inputs.
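
The steps listed above could look roughly like the following sketch. It is not the code in preprocessing.py. The placeholder value -999.25 (a common LAS null value), the column names and the physical ranges are all assumptions made for illustration, and the interpolation assumes rows are already sorted by depth within each well.

```python
import numpy as np
import pandas as pd

# Assumed placeholder for "no measurement"; -999.25 is a common LAS null value.
NULL_VALUE = -999.25

# Illustrative physical limits per log (assumptions, not values from this repository).
VALID_RANGES = {"GR": (0.0, 500.0), "RHOB": (1.0, 3.5)}


def clean_logs(df: pd.DataFrame, well_col: str = "WELL") -> pd.DataFrame:
    """Replace placeholders, mask out-of-range readings, interpolate gaps per well."""
    out = df.copy()
    logs = list(VALID_RANGES)

    # 1. Placeholder values become NaN so they are treated as missing.
    out[logs] = out[logs].replace(NULL_VALUE, np.nan)

    # 2. Readings outside the physical range are also treated as missing.
    for col, (low, high) in VALID_RANGES.items():
        out[col] = out[col].where(out[col].between(low, high))

    # 3. Linear interpolation inside each well, so values never leak across wells.
    out[logs] = out.groupby(well_col)[logs].transform(
        lambda s: s.interpolate(limit_direction="both")
    )
    return out
```

Grouping by well before interpolating keeps a gap at the top of one well from being filled with readings from the bottom of the previous one.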


⚡ Feature Engineering

Feature engineering is one of the most important components of this project.

Instead of relying only on raw well logs, additional geological features are generated to improve model performance.

Generated Features

  • Density-Neutron Separation
  • Resistivity Ratios
  • Gamma Ray × Density Interaction
  • Logarithmic Resistivity
  • Vertical Log Gradients
  • Rolling Mean
  • Rolling Standard Deviation
  • Moving Window Statistics
  • Depth-Based Features

These engineered variables capture geological relationships that improve lithology discrimination.
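
A few of the feature families listed above can be sketched with pandas as follows. This is an illustration under stated assumptions, not the contents of feature_engineering.py: the column names (WELL, DEPTH_MD, GR, RHOB, NPHI, RDEP, RMED) follow FORCE 2020 naming, while the output column names, the sandstone density scale and the window length are hypothetical choices.

```python
import numpy as np
import pandas as pd

# Assumed matrix and fluid densities (g/cm3) for a sandstone density-porosity scale.
RHO_MATRIX, RHO_FLUID = 2.65, 1.0


def add_features(df: pd.DataFrame, well_col: str = "WELL", window: int = 5) -> pd.DataFrame:
    """Add a few illustrative petrophysical and depth-window features."""
    out = df.copy()

    # Density-neutron separation: neutron porosity minus density porosity.
    density_porosity = (RHO_MATRIX - out["RHOB"]) / (RHO_MATRIX - RHO_FLUID)
    out["DN_SEP"] = out["NPHI"] - density_porosity

    # Interaction, ratio and logarithmic features.
    out["GR_x_RHOB"] = out["GR"] * out["RHOB"]
    out["RES_RATIO"] = out["RDEP"] / out["RMED"].clip(lower=1e-3)
    out["LOG_RDEP"] = np.log10(out["RDEP"].clip(lower=1e-3))

    # Rolling statistics and vertical gradient, computed well by well so that
    # a window never spans two different wells.
    gr = out.groupby(well_col)["GR"]
    out["GR_ROLL_MEAN"] = gr.transform(
        lambda s: s.rolling(window, center=True, min_periods=1).mean()
    )
    out["GR_ROLL_STD"] = gr.transform(
        lambda s: s.rolling(window, center=True, min_periods=1).std()
    )
    out["GR_GRAD"] = gr.diff() / out.groupby(well_col)["DEPTH_MD"].diff()
    return out
```

The first sample of every well has no gradient (NaN), so a downstream step would need to fill or drop those rows before training.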


🤖 Machine Learning Training

Three different Machine Learning algorithms are trained and compared.

| Algorithm | Description |
| --- | --- |
| Logistic Regression | Linear baseline classifier |
| Random Forest | Ensemble of Decision Trees |
| XGBoost | Gradient Boosted Decision Trees |

Each model is trained independently and evaluated using unseen wells to reduce spatial data leakage.

The best-performing model is automatically selected and saved inside the models/ directory for deployment.
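
The idea of holding out whole wells, comparing models and saving the winner can be sketched with scikit-learn as below. This is a minimal sketch on synthetic data, not the logic in train.py: the selection criterion (Macro F1), the hyperparameters and the file name best_model.joblib are assumptions. XGBoost is left out to keep the example dependency-light; an `XGBClassifier` could be added to the `candidates` dictionary in the same way.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 6 wells, 50 samples each, 3 features.
wells = np.repeat(np.arange(6), 50)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold out whole wells so no well contributes to both training and testing.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=wells))

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Score every candidate on the held-out wells (Macro F1 is an assumed criterion).
scores = {}
for name, model in candidates.items():
    model.fit(X[train_idx], y[train_idx])
    scores[name] = f1_score(y[test_idx], model.predict(X[test_idx]), average="macro")

best_name = max(scores, key=scores.get)

# Persist the winner with joblib; the file name here is hypothetical.
model_path = Path(tempfile.mkdtemp()) / "best_model.joblib"
joblib.dump(candidates[best_name], model_path)
```

A random row-wise split would place neighbouring depth samples from the same well in both sets, which tends to inflate test scores; grouping by well avoids that.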


📈 Model Evaluation

Model performance is evaluated using multiple metrics instead of relying only on accuracy.

Evaluation includes:

  • Accuracy
  • Precision
  • Recall
  • Macro F1 Score
  • Confusion Matrix
  • Classification Report

Using multiple metrics provides a more reliable assessment, especially when lithology classes are imbalanced.
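
For reference, all of the metrics listed above are available in scikit-learn. The snippet below is a small sketch on hand-made labels (the label values are hypothetical and not taken from the dataset), not the code in evaluate.py.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Hypothetical true and predicted lithologies for six samples.
y_true = ["shale", "shale", "shale", "sandstone", "sandstone", "limestone"]
y_pred = ["shale", "shale", "sandstone", "sandstone", "sandstone", "shale"]
labels = ["shale", "sandstone", "limestone"]

accuracy = accuracy_score(y_true, y_pred)

# Macro averaging gives every class the same weight regardless of its frequency.
precision, recall, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average="macro", zero_division=0
)

# Rows are true classes, columns are predicted classes, both in `labels` order.
matrix = confusion_matrix(y_true, y_pred, labels=labels)

report = classification_report(y_true, y_pred, labels=labels, zero_division=0)
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f}")
print(matrix)
print(report)
```

Setting `zero_division=0` scores a class that is never predicted (limestone here) as 0 instead of raising a warning.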


🚀 Installation

Clone the repository and install the required dependencies.

git clone https://github.com/AvionicS-7/Well_Log_Lithology_Classification.git

cd Well_Log_Lithology_Classification

Create a virtual environment.

python -m venv venv

Activate the environment.

Windows

venv\Scripts\activate

Linux / macOS

source venv/bin/activate

Install the dependencies.

pip install -r requirements.txt

▶ Running the Project

1️⃣ Execute the Complete Machine Learning Pipeline

The entire workflow can be executed with a single command:

python main.py

Running the pipeline performs the following tasks:

  • Loads the FORCE 2020 Dataset
  • Cleans the raw well log data
  • Handles missing values
  • Engineers geological features
  • Splits wells into training and testing sets
  • Trains multiple Machine Learning models
  • Evaluates model performance
  • Saves the best-performing model
  • Generates evaluation figures

2️⃣ Launch the Streamlit Dashboard

Start the interactive dashboard using

streamlit run app/streamlit_app.py

The dashboard provides an intuitive interface for predicting lithology from unseen well log data.


💻 Streamlit Dashboard Features

The application includes multiple interactive pages.

🏠 Dashboard

  • Project overview
  • Dataset statistics
  • Model summary
  • Performance metrics

📊 Dataset Explorer

  • Dataset preview
  • Statistical summary
  • Missing value inspection
  • Feature distributions

🔍 Lithology Prediction

Upload a new CSV file containing well log measurements.

The application automatically

  • Loads the trained model
  • Performs prediction
  • Displays predicted lithology
  • Allows downloading the prediction results
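
Before predicting on an uploaded file, it is usually worth checking that the expected log columns are present. The helper below is a hypothetical sketch using only the standard library. The function name and the required column list are assumptions, and the real list depends on the features the saved model was trained on.

```python
import csv
import io

# Assumed required columns; the real list depends on the saved model's features.
REQUIRED_LOGS = ("GR", "RHOB", "NPHI", "RDEP", "DTC")


def missing_logs(csv_text: str, required=REQUIRED_LOGS) -> list[str]:
    """Return the required log names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Normalise header names so " gr " and "GR" are treated the same.
    header = {name.strip().upper() for name in (reader.fieldnames or [])}
    return [name for name in required if name not in header]


uploaded = "DEPTH_MD,GR,RHOB,NPHI\n1000.0,55.2,2.41,0.21\n"
print(missing_logs(uploaded))  # ['RDEP', 'DTC']
```

A dashboard could show this list to the user instead of failing inside the model call.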

📈 Feature Importance

Visualizes the contribution of individual well log measurements to the final prediction.

This helps explain which physical measurements influence the model most.
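
One common way to obtain such a ranking is the impurity-based importance exposed by tree ensembles in scikit-learn. The sketch below uses synthetic data and assumes this approach; the README does not state which importance measure the dashboard actually uses.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["GR", "RHOB", "NPHI"]

# Synthetic data in which only the first column (labelled GR) drives the class.
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are non-negative and sum to 1 across features.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can favour features with many distinct values, so permutation importance on held-out wells is a reasonable cross-check.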


🌍 Well Log Viewer

Interactive visualization of

  • Gamma Ray
  • Resistivity
  • Density
  • Neutron Porosity
  • Sonic
  • Lithology Track

allowing geological interpretation alongside model predictions.


📊 Model Performance

The following Machine Learning algorithms were evaluated.

| Model | Accuracy | Macro F1 |
| --- | --- | --- |
| Logistic Regression | 82.4% | 0.241 |
| Random Forest | 84.7% | 0.327 |
| XGBoost | 84.2% | 0.315 |

Why Macro F1?

Lithology datasets are naturally imbalanced because some rock types occur much more frequently than others.

While accuracy measures overall correctness, the Macro F1 Score evaluates performance across every lithology class equally, making it a more reliable metric for this problem.
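
A small worked example makes the difference concrete. The numbers below are hypothetical and unrelated to the results table: a model that always predicts the majority class on a 90/10 split reaches 90% accuracy but a Macro F1 below 0.5.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 for one class, computed from true/false positives and false negatives."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical imbalanced interval: 90 shale samples, 10 sandstone samples.
y_true = ["shale"] * 90 + ["sandstone"] * 10
# A model that always predicts the majority class.
y_pred = ["shale"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_class = {c: f1_for_class(y_true, y_pred, c) for c in ("shale", "sandstone")}
macro_f1 = sum(per_class.values()) / len(per_class)

print(f"accuracy={accuracy:.3f}, macro_f1={macro_f1:.3f}")
# accuracy=0.900, macro_f1=0.474
```

The shale F1 is 180/190 ≈ 0.947 and the sandstone F1 is 0, so their unweighted mean is about 0.474 even though nine out of ten samples are labelled correctly.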

Best Performing Model

🥇 Random Forest Classifier

The Random Forest classifier achieved the highest validation performance (84.7% accuracy, 0.327 Macro F1) and was therefore selected as the final deployment model for the Streamlit application.


📈 Generated Figures

The pipeline automatically generates several visualization outputs.

| Figure | Description |
| --- | --- |
| Lithology Distribution | Shows the frequency of each lithology class in the dataset |
| Correlation Heatmap | Displays relationships between well log measurements |
| Feature Importance | Identifies the most influential features used by the model |
| Confusion Matrix | Visualizes prediction accuracy across lithology classes |
| Well Log Tracks | Professional petrophysical visualization of multiple well logs |

All figures are automatically generated during model evaluation and are saved inside

reports/figures/

Running

python main.py

will regenerate these figures and overwrite the previous outputs.


📷 Dashboard Preview

Screenshots to be added for the following pages:

  • Dashboard
  • Prediction Page
  • Well Log Viewer
  • Feature Importance (docs/feature_importance.png)

📌 Key Highlights

✔ End-to-End Machine Learning Pipeline

✔ Real Petroleum Engineering Dataset (FORCE 2020)

✔ Domain-Specific Feature Engineering

✔ Multiple Machine Learning Models

✔ Automated Model Evaluation

✔ Interactive Streamlit Dashboard

✔ Professional Data Visualization

✔ Modular Python Project Structure


🔮 Future Improvements

The project can be extended in several directions:

Deep Learning

  • LSTM / GRU models for sequential well log interpretation
  • Transformer-based lithology classification

Explainable AI

  • SHAP Value Analysis
  • LIME Explanations
  • Feature Attribution Dashboard

Deployment

  • Docker Containerization
  • REST API using FastAPI
  • Cloud Deployment (AWS / Azure / GCP)

Geological Extensions

  • LAS File Support
  • Multi-Well Comparison
  • 3D Geological Visualization
  • Facies Probability Visualization
  • Reservoir Property Prediction

🛠 Technology Stack

| Category | Technologies |
| --- | --- |
| Programming | Python |
| Data Analysis | Pandas, NumPy |
| Machine Learning | Scikit-Learn, XGBoost |
| Visualization | Matplotlib, Plotly |
| Dashboard | Streamlit |
| Model Persistence | Joblib |
| Version Control | Git, GitHub |

📚 Learning Outcomes

Through this project, the following concepts were implemented:

  • Data preprocessing for real-world well log data
  • Geological feature engineering
  • Supervised Machine Learning
  • Model evaluation and comparison
  • Production-style project organization
  • Interactive dashboard development
  • Scientific visualization
  • Deployment-ready workflows

🤝 Contributing

Contributions, suggestions, and improvements are welcome.

If you find an issue or have an idea to improve the project, feel free to open an issue or submit a pull request.


📄 License

This project is released under the MIT License.


👨‍💻 Author

AvionicS_7

B.S. (4-Year), Exploration Geophysics

Indian Institute of Technology Kharagpur

Interested in:

  • Machine Learning
  • Geophysics
  • Petroleum Data Analytics
  • AI for Earth Sciences

⭐ If you found this repository useful...

Please consider giving it a ⭐ Star on GitHub.

It helps others discover the project and motivates future improvements.


About

Machine Learning pipeline for automated well log lithology classification using the FORCE 2020 dataset with an interactive Streamlit dashboard (upcoming).
