An end-to-end Machine Learning pipeline for automated lithology prediction from well log measurements using the FORCE 2020 dataset.
Lithology identification is one of the most important tasks in petroleum exploration and reservoir characterization. Traditionally, geologists and petrophysicists interpret lithology manually using multiple well log measurements, making the process time-consuming and dependent on expert knowledge.
This project demonstrates how Machine Learning can automate lithology prediction by learning patterns from well log data.
Using the FORCE 2020 Well Log Dataset, the project performs:
- Data Cleaning
- Missing Value Treatment
- Feature Engineering
- Model Training
- Model Evaluation
- Interactive Streamlit Deployment
The application predicts lithology from unseen well log measurements and provides an interactive dashboard for visualization and analysis.
The primary objectives of this project are:
- Develop a complete end-to-end Machine Learning workflow.
- Predict lithology from physical well log measurements.
- Compare multiple Machine Learning algorithms.
- Engineer domain-specific petrophysical features.
- Build an interactive Streamlit application for inference.
- Demonstrate a production-style project structure suitable for Data Science and Geoscience portfolios.
This project is built using the FORCE 2020 Machine Learning Competition Dataset, one of the most widely used benchmark datasets for machine learning applications in geoscience.
| Property | Value |
|---|---|
| Dataset | FORCE 2020 Well Log Dataset |
| Wells | 98 |
| Samples | 117,511 |
| Raw Well Log Measurements | 15 |
| Engineered Features | 37 |
| Total Training Features | 52 |
| Target | Lithology Classification |
The dataset contains physical measurements recorded by downhole logging tools across multiple wells.
The Machine Learning model learns relationships between several petrophysical measurements.
| Log | Description | Geological Interpretation |
|---|---|---|
| GR | Gamma Ray | Indicates shale content and natural radioactivity |
| RHOB | Bulk Density | Helps distinguish sandstone, limestone and dolomite |
| NPHI | Neutron Porosity | Estimates hydrogen index and apparent porosity |
| RDEP / ILD | Deep Resistivity | Indicates fluid content and hydrocarbon potential |
| DTC | Sonic Transit Time | Provides information about rock compaction and porosity |
| PEF | Photoelectric Factor | Useful for carbonate identification |
| SP | Spontaneous Potential | Assists lithology and permeability interpretation |
The workflow implemented in this repository follows an industry-style pipeline:
Raw Well Logs
│
▼
Data Cleaning
│
▼
Feature Engineering
│
▼
Model Training
│
▼
Model Evaluation
│
▼
Best Model Selection
│
▼
Streamlit Dashboard
│
▼
Prediction on New Well Logs
The entire workflow has been implemented using modular Python scripts, making the project easy to maintain, extend and deploy.
Three supervised Machine Learning algorithms are trained and evaluated.
| Model | Purpose |
|---|---|
| Logistic Regression | Baseline Linear Model |
| Random Forest | Tree Ensemble Model |
| XGBoost | Gradient Boosted Decision Trees |
The best-performing model is automatically selected and saved for deployment in the Streamlit application.
A significant portion of this project focuses on domain-specific feature engineering rather than relying only on raw well logs.
Engineered features include:
- Density-Neutron Separation
- Gamma Ray × Density Interaction
- Resistivity Ratios
- Logarithmic Resistivity Transformations
- Rolling Window Statistics
- Vertical Gradients
- Depth-Based Features
- Petrophysical Interaction Features
These engineered variables help the models capture geological patterns that are not directly observable from the original measurements.
The project follows a modular Machine Learning pipeline where each stage is isolated into independent modules. This improves maintainability, reproducibility, and scalability.
graph TD
A[FORCE 2020 Dataset]
A --> B[Data Loading]
B --> C[Data Cleaning]
C --> D[Feature Engineering]
D --> E[Train/Test Split by Wells]
E --> F[Model Training]
F --> G[Model Evaluation]
G --> H[Best Model Selection]
H --> I[Save Model]
I --> J[Streamlit Dashboard]
J --> K[Upload New Well Log]
K --> L[Lithology Prediction]
L --> M[Visualization & Download]
The complete workflow implemented in this repository is shown below.
Raw Well Log CSV
│
▼
Read Dataset
│
▼
Replace Invalid Values
│
▼
Handle Missing Data
│
▼
Feature Engineering
│
▼
Split Wells (Train/Test)
│
▼
Train Multiple Models
│
▼
Evaluate Performance
│
▼
Select Best Model
│
▼
Save Trained Model
│
▼
Launch Streamlit Dashboard
│
▼
Upload New Well Logs
│
▼
Predict Lithology
│
▼
Download Predictions
Well_Log_Lithology_Classification/
│
├── app/
│   └── streamlit_app.py        # Interactive Streamlit dashboard
│
├── data/
│   ├── raw/                    # Original dataset
│   └── processed/              # Processed datasets
│
├── models/                     # Saved Machine Learning models
│
├── reports/
│   └── figures/                # Generated plots and evaluation figures
│
├── src/
│   ├── data_loader.py
│   ├── preprocessing.py
│   ├── feature_engineering.py
│   ├── train.py
│   ├── predict.py
│   ├── evaluate.py
│   ├── visualization.py
│   └── utils.py
│
├── tests/
├── main.py
├── requirements.txt
├── README.md
└── .gitignore
The repository is divided into independent modules following good software engineering practices.
| Module | Responsibility |
|---|---|
| main.py | Executes the complete Machine Learning workflow |
| data_loader.py | Reads raw well log datasets |
| preprocessing.py | Handles invalid values, missing values and data cleaning |
| feature_engineering.py | Creates geological and statistical features |
| train.py | Trains Logistic Regression, Random Forest and XGBoost models |
| evaluate.py | Computes Accuracy, Precision, Recall and Macro F1 |
| predict.py | Loads trained models and performs inference |
| visualization.py | Generates lithology plots and evaluation figures |
| utils.py | Common helper functions and mappings |
| streamlit_app.py | Interactive dashboard for deployment |
Raw well log measurements often contain missing values, invalid measurements and noisy observations.
The preprocessing stage performs:
- Replacement of invalid placeholder values
- Missing value treatment
- Numerical interpolation
- Physical range validation
- Data consistency checks
These preprocessing steps ensure that the Machine Learning models receive reliable and physically meaningful inputs.
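The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `preprocessing.py`: the placeholder value `-999.25`, the `VALID_RANGES` limits, the `WELL` column name and the `clean_logs` function are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Assumed placeholder for missing readings. -999.25 is a common null value in
# well log files, but the value(s) handled by this repository are not stated.
NULL_VALUE = -999.25

# Hypothetical physical limits per log; adjust them to the tools and the basin.
VALID_RANGES = {"GR": (0.0, 500.0), "RHOB": (1.0, 3.5), "NPHI": (-0.05, 1.0)}


def clean_logs(df: pd.DataFrame, well_col: str = "WELL") -> pd.DataFrame:
    """Replace placeholders, mask out-of-range values, then interpolate per well."""
    out = df.copy()
    logs = [c for c in VALID_RANGES if c in out.columns]

    # 1. Turn placeholder values into proper missing values.
    out[logs] = out[logs].replace(NULL_VALUE, np.nan)

    # 2. Physical range validation: anything outside the limits becomes missing.
    for col in logs:
        low, high = VALID_RANGES[col]
        out[col] = out[col].where(out[col].between(low, high))

    # 3. Linear interpolation inside each well, so values never cross wells.
    out[logs] = out.groupby(well_col)[logs].transform(
        lambda s: s.interpolate(limit_direction="both")
    )
    return out


raw = pd.DataFrame(
    {
        "WELL": ["A", "A", "A", "B", "B", "B"],
        "GR": [50.0, -999.25, 70.0, 900.0, 40.0, 60.0],
        "RHOB": [2.3, 2.4, 2.5, 2.6, -999.25, 2.2],
    }
)
cleaned = clean_logs(raw)
```

Interpolating per well rather than over the whole table keeps the last samples of one well from influencing the first samples of the next.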
Feature engineering is one of the most important components of this project.
Instead of relying only on raw well logs, additional geological features are generated to improve model performance.
- Density-Neutron Separation
- Resistivity Ratios
- Gamma Ray × Density Interaction
- Logarithmic Resistivity
- Vertical Log Gradients
- Rolling Mean
- Rolling Standard Deviation
- Moving Window Statistics
- Depth-Based Features
These engineered variables capture geological relationships that improve lithology discrimination.
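A few of the features listed above could be computed along these lines. This is a sketch under stated assumptions, not the contents of `feature_engineering.py`: the column names `WELL` and `DEPTH_MD`, the output feature names, the sandstone matrix density of 2.65 g/cm3 and the fluid density of 1.0 g/cm3 are all illustrative choices.

```python
import numpy as np
import pandas as pd

# Assumed matrix and fluid densities (g/cm3): quartz sandstone and fresh water.
RHO_MATRIX, RHO_FLUID = 2.65, 1.0


def add_features(df, well_col="WELL", depth_col="DEPTH_MD", window=3):
    """Add a handful of illustrative petrophysical and statistical features."""
    out = df.copy()
    wells = out[well_col]

    # Density-neutron separation: neutron porosity minus density porosity.
    density_porosity = (RHO_MATRIX - out["RHOB"]) / (RHO_MATRIX - RHO_FLUID)
    out["DN_SEP"] = out["NPHI"] - density_porosity

    # Gamma ray x density interaction.
    out["GR_X_RHOB"] = out["GR"] * out["RHOB"]

    # Logarithmic resistivity; clip first so zero readings do not give -inf.
    out["LOG_RDEP"] = np.log10(out["RDEP"].clip(lower=1e-3))

    # Rolling statistics, computed well by well on a centered window.
    grouped_gr = out["GR"].groupby(wells)
    out["GR_ROLL_MEAN"] = grouped_gr.transform(
        lambda s: s.rolling(window, center=True, min_periods=1).mean()
    )
    out["GR_ROLL_STD"] = grouped_gr.transform(
        lambda s: s.rolling(window, center=True, min_periods=1).std()
    )

    # Vertical gradient: change in GR per unit depth within each well.
    out["GR_GRAD"] = grouped_gr.diff() / out[depth_col].groupby(wells).diff()
    return out


logs = pd.DataFrame(
    {
        "WELL": ["A", "A", "A", "B", "B"],
        "DEPTH_MD": [100.0, 101.0, 102.0, 200.0, 201.0],
        "GR": [50.0, 60.0, 80.0, 30.0, 40.0],
        "RHOB": [2.65, 2.485, 2.32, 2.65, 2.65],
        "NPHI": [0.0, 0.1, 0.25, 0.05, 0.05],
        "RDEP": [1.0, 10.0, 100.0, 0.0, 10.0],
    }
)
features = add_features(logs)
```

The first sample of each well has no gradient (`NaN`), because there is no shallower sample in the same well to difference against.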
Three different Machine Learning algorithms are trained and compared.
| Algorithm | Description |
|---|---|
| Logistic Regression | Linear baseline classifier |
| Random Forest | Ensemble of Decision Trees |
| XGBoost | Gradient Boosted Decision Trees |
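Each model is trained independently and evaluated on wells held out from training. Splitting by well, rather than by row, reduces spatial data leakage between neighboring depth samples of the same well.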
The best-performing model is automatically selected and saved inside the models/ directory for deployment.
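A well-based split and automatic model selection might look like the sketch below. It is not the code in `train.py`: the data is synthetic, `select_best_model` is a hypothetical helper, Macro F1 is assumed as the selection criterion, and XGBoost is left out so the example depends only on scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GroupShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def select_best_model(X, y, wells, seed=0):
    """Hold out whole wells, fit each candidate, keep the best Macro F1."""
    # GroupShuffleSplit never places samples of one well on both sides.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=seed)
    train_idx, test_idx = next(splitter.split(X, y, groups=wells))

    candidates = {
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    scores = {}
    for name, model in candidates.items():
        model.fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[test_idx])
        scores[name] = f1_score(y[test_idx], predictions, average="macro")

    best = max(scores, key=scores.get)
    return best, candidates[best], scores, set(wells[train_idx]), set(wells[test_idx])


# Synthetic stand-in data: 8 wells with 50 samples each and 4 features.
rng = np.random.default_rng(0)
wells = np.repeat(np.arange(8), 50)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

best_name, best_model, scores, train_wells, test_wells = select_best_model(X, y, wells)
```

The returned `best_model` could then be persisted with `joblib.dump` to a file under `models/`; the exact file name used by the repository is not stated here.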
Model performance is evaluated using multiple metrics instead of relying only on accuracy.
Evaluation includes:
- Accuracy
- Precision
- Recall
- Macro F1 Score
- Confusion Matrix
- Classification Report
Using multiple metrics provides a more reliable assessment, especially when lithology classes are imbalanced.
Clone the repository and install the required dependencies.
git clone https://github.com/<YOUR_USERNAME>/Well-Log-Lithology-Classification.git
cd Well-Log-Lithology-Classification
Create a virtual environment.
python -m venv venv
Activate the environment.
On Windows:
venv\Scripts\activate
On macOS or Linux:
source venv/bin/activate
Install the dependencies.
pip install -r requirements.txt
The entire workflow can be executed using
python main.py
Running the pipeline performs the following tasks:
- Loads the FORCE 2020 Dataset
- Cleans the raw well log data
- Handles missing values
- Engineers geological features
- Splits wells into training and testing sets
- Trains multiple Machine Learning models
- Evaluates model performance
- Saves the best-performing model
- Generates evaluation figures
Start the interactive dashboard using
streamlit run app/streamlit_app.py
The dashboard provides an intuitive interface for predicting lithology from unseen well log data.
The application includes multiple interactive pages.
- Project overview
- Dataset statistics
- Model summary
- Performance metrics
- Dataset preview
- Statistical summary
- Missing value inspection
- Feature distributions
Upload a new CSV file containing well log measurements.
The application automatically:
- Loads the trained model
- Performs prediction
- Displays predicted lithology
- Allows downloading the prediction results
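The prediction steps above could be wired together roughly as follows. This is a sketch only: `predict_lithology`, the `PREDICTED_LITHOLOGY` column and the `GammaRayCutoff` stand-in model are hypothetical names for illustration. In the real application the model would presumably be loaded from the `models/` directory (for example with `joblib.load`) and the CSV text handed to a Streamlit download widget.

```python
import pandas as pd


def predict_lithology(model, logs: pd.DataFrame, feature_columns: list) -> pd.DataFrame:
    """Validate the uploaded table, run the model, and append its predictions."""
    missing = [c for c in feature_columns if c not in logs.columns]
    if missing:
        raise ValueError(f"Uploaded file is missing columns: {missing}")
    result = logs.copy()
    result["PREDICTED_LITHOLOGY"] = model.predict(result[feature_columns])
    return result


class GammaRayCutoff:
    """Stand-in for the trained model: labels GR above 75 API as shale."""

    def predict(self, X):
        return ["Shale" if gr > 75 else "Sandstone" for gr in X["GR"]]


# Stand-in for an uploaded CSV that has already been read with pandas.
logs = pd.DataFrame(
    {"DEPTH_MD": [1000.0, 1000.5], "GR": [40.0, 110.0], "RHOB": [2.3, 2.5]}
)
predictions = predict_lithology(GammaRayCutoff(), logs, ["GR", "RHOB"])

# Text that a download button could serve as the results file.
csv_text = predictions.to_csv(index=False)
```

Checking for missing columns before calling the model gives the user a readable error instead of a stack trace from inside the estimator.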
Visualizes the contribution of individual well log measurements to the final prediction.
This helps explain which physical measurements influence the model most.
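One common way to obtain such a ranking is the impurity-based importance exposed by tree ensembles in scikit-learn. The sketch below assumes that approach, which may differ from what the dashboard actually computes. The data is synthetic, and the label is built from `GR` alone so the expected ranking is known in advance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
feature_names = ["GR", "RHOB", "NPHI"]
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=feature_names)

# Synthetic label driven only by GR; RHOB and NPHI are pure noise here.
y = (X["GR"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ sums to 1; larger values mean more impurity reduction.
importance = pd.Series(model.feature_importances_, index=feature_names)
importance = importance.sort_values(ascending=False)
```

Impurity-based importances describe the model as a whole rather than a single prediction, and they can be biased toward high-cardinality features. Permutation importance is a frequently used cross-check.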
Interactive visualization of the following tracks, allowing geological interpretation alongside model predictions:
- Gamma Ray
- Resistivity
- Density
- Neutron Porosity
- Sonic
- Lithology Track
The following Machine Learning algorithms were evaluated.
| Model | Accuracy | Macro F1 |
|---|---|---|
| Logistic Regression | 82.4% | 0.241 |
| Random Forest | 84.7% | 0.327 |
| XGBoost | 84.2% | 0.315 |
Lithology datasets are naturally imbalanced because some rock types occur much more frequently than others.
While accuracy measures overall correctness, the Macro F1 Score evaluates performance across every lithology class equally, making it a more reliable metric for this problem.
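The gap between the two metrics can be seen in a small worked example. The labels below are made up for illustration and are not results from this project: a classifier that always predicts the majority class still reaches 90% accuracy, while its Macro F1 stays low because two of the three classes are never predicted correctly.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Imbalanced toy data: 90 shale, 8 sandstone and 2 coal samples.
y_true = ["shale"] * 90 + ["sandstone"] * 8 + ["coal"] * 2
# A trivial classifier that predicts shale for every sample.
y_pred = ["shale"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)

print(f"accuracy = {accuracy:.3f}, macro F1 = {score:.3f}")
```

Here shale has an F1 of 180/190 and the other two classes score 0, so the macro average is about 0.316 despite the 0.900 accuracy.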
🥇 Random Forest Classifier
The Random Forest classifier achieved the highest validation performance and was therefore selected as the final deployment model for the Streamlit application.
The pipeline automatically generates several visualization outputs.
| Figure | Description |
|---|---|
| Lithology Distribution | Shows the frequency of each lithology class in the dataset |
| Correlation Heatmap | Displays relationships between well log measurements |
| Feature Importance | Identifies the most influential features used by the model |
| Confusion Matrix | Visualizes prediction accuracy across lithology classes |
| Well Log Tracks | Professional petrophysical visualization of multiple well logs |
All figures are automatically generated during model evaluation and are saved inside
reports/figures/
Running
python main.py
will regenerate these figures and overwrite the previous outputs.

✔ End-to-End Machine Learning Pipeline
✔ Real Petroleum Engineering Dataset (FORCE 2020)
✔ Domain-Specific Feature Engineering
✔ Multiple Machine Learning Models
✔ Automated Model Evaluation
✔ Interactive Streamlit Dashboard
✔ Professional Data Visualization
✔ Modular Python Project Structure
The project can be extended in several directions:
- LSTM / GRU models for sequential well log interpretation
- Transformer-based lithology classification
- SHAP Value Analysis
- LIME Explanations
- Feature Attribution Dashboard
- Docker Containerization
- REST API using FastAPI
- Cloud Deployment (AWS / Azure / GCP)
- LAS File Support
- Multi-Well Comparison
- 3D Geological Visualization
- Facies Probability Visualization
- Reservoir Property Prediction
| Category | Technologies |
|---|---|
| Programming | Python |
| Data Analysis | Pandas, NumPy |
| Machine Learning | Scikit-Learn, XGBoost |
| Visualization | Matplotlib, Plotly |
| Dashboard | Streamlit |
| Model Persistence | Joblib |
| Version Control | Git, GitHub |
Through this project, the following concepts were implemented:
- Data preprocessing for real-world well log data
- Geological feature engineering
- Supervised Machine Learning
- Model evaluation and comparison
- Production-style project organization
- Interactive dashboard development
- Scientific visualization
- Deployment-ready workflows
Contributions, suggestions, and improvements are welcome.
If you find an issue or have an idea to improve the project, feel free to open an issue or submit a pull request.
This project is released under the MIT License.
AvionicS_7
B.S 4 Year, Exploration Geophysics
Indian Institute of Technology Kharagpur
Interested in:
- Machine Learning
- Geophysics
- Petroleum Data Analytics
- AI for Earth Sciences
If you find this project useful, please consider giving it a ⭐ on GitHub.
It helps others discover the project and motivates future improvements.


