ID-Document Recognition


An advanced OCR + NER pipeline for structured identity document extraction, leveraging YOLOv11 for detection, PaddleOCR for recognition, and LLaMA-3.1 for semantic parsing.





Technical Architecture

The pipeline is designed for high accuracy and data privacy, and it runs entirely offline. It consists of four modular stages:

  1. Object Detection (YOLOv11): Real-time identification and precision localization of the document within the source image.
  2. Image Processing & Enhancement: Adaptive normalization, CLAHE enhancement, and noise reduction kernels to maximize text legibility.
  3. Optical Character Recognition (PaddleOCR): Robust text detection and recognition across diverse orientations and multilingual scripts.
  4. Structured Inference (LLaMA-3.1): Local LLM-driven semantic analysis to transform raw OCR text into validated, high-fidelity JSON.


Project Structure

ID-Document_Recognition/
├── app.py                                    # Flask backend for web interface
├── main_pipeline.py                          # Main orchestrator for full document processing
├── environment.yaml                          # Conda environment configuration
├── requirements.txt                          # Python dependencies
├── LICENSE                                   # MIT License
├── README.md                                 # Project documentation
│
├── Codes/                                    # Modular implementation of pipeline stages
│   ├── __init__.py                           # Marks this directory as a Python package
│   ├── check_ocr_validity.py                 # Post-processing and validation logic
│   ├── crop.py                               # YOLOv11 detection and image enhancement
│   ├── LLaMAv3.py                            # LLaMA v3.1 extraction via Ollama
│   └── paddle_ocr.py                         # PaddleOCR inference logic
│
├── Models/                                   # Static model weights and detection assets
│   ├── YOLOv11.pt                            # Trained document detection weights
│   ├── haarcascade_frontalface_default.xml   # Face detection descriptor
│   └── PaddleOCR/                            # Local PaddleOCR inference engines
│
├── Static/                                   # Frontend assets
│   ├── script.js                             # Client-side interaction logic
│   └── style.css                             # UI styling
│
├── templates/                                # HTML templates
│   └── index.html                            # Primary interface
│
└── Outputs/                                  # Transient output directory (cleared per run)
    ├── cropped_color.jpg                     # Original cropped document
    ├── cropped_enhanced.jpg                  # Enhanced document version for OCR
    ├── face_color.jpg                        # Isolated face image
    ├── ocr_output.json                       # Raw and partially structured OCR results
    └── llamav3.json                          # Final structured entities from LLaMA


Processing Pipeline

1. Initialization

The system ensures a clean runtime environment by removing the existing Outputs/ directory and re-initializing it. This prevents data contamination between sequential processing requests.
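
A minimal sketch of this reset step (the directory constant and function name are illustrative, not taken verbatim from the repository):

import os
import shutil

OUTPUT_DIR = "Outputs"

def reset_output_dir():
    # Remove artifacts from the previous run, then recreate an empty directory
    if os.path.exists(OUTPUT_DIR):
        shutil.rmtree(OUTPUT_DIR)
    os.makedirs(OUTPUT_DIR)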

2. Document Detection & Cropping

The YOLOv11 model performs inference to localize the ID card. A Haar cascade classifier then detects the face within the detected region, which is isolated for the final output.
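
The following condensed sketch shows how this stage could look using the ultralytics API and the assets under Models/; the function name and the single-detection assumption are illustrative:

import cv2
from ultralytics import YOLO

def detect_document_and_face(image_path):
    # Localize the ID card with YOLOv11 and crop the top detection
    model = YOLO("Models/YOLOv11.pt")
    boxes = model(image_path)[0].boxes
    x1, y1, x2, y2 = map(int, boxes.xyxy[0].tolist())
    card = cv2.imread(image_path)[y1:y2, x1:x2]

    # Locate the portrait inside the cropped card with the Haar cascade
    cascade = cv2.CascadeClassifier("Models/haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(
        cv2.cvtColor(card, cv2.COLOR_BGR2GRAY), scaleFactor=1.1, minNeighbors=5
    )
    face = None
    if len(faces):
        x, y, w, h = faces[0]
        face = card[y:y + h, x:x + w]
    return card, face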

3. Image Enhancement

Prior to OCR execution, the system applies a three-stage enhancement pipeline to the cropped document image to maximize the character recognition accuracy:

import cv2
import numpy as np

def enhance_image(img):
    # 1. Grayscale conversion for luminance-based processing
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # 2. CLAHE (Contrast Limited Adaptive Histogram Equalization)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    contrast_enhanced = clahe.apply(gray)

    # 3. Sharpening via custom convolutional kernel
    kernel = np.array([[0, -0.5, 0],
                       [-0.5, 3, -0.5],
                       [0, -0.5, 0]])
    sharpened = cv2.filter2D(contrast_enhanced, -1, kernel)

    return sharpened

4. OCR with PaddleOCR

The enhanced image is processed through the PaddleOCR engine. The system performs text detection and recognition, filtering results based on a predefined confidence threshold (default 0.6) to ensure data reliability.
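
A minimal sketch of this stage, assuming the classic PaddleOCR Python API (constructor arguments vary across versions):

from paddleocr import PaddleOCR

CONF_THRESHOLD = 0.6  # the default confidence threshold described above

def run_ocr(image_path):
    # Angle classification helps with rotated or skewed text lines
    ocr = PaddleOCR(use_angle_cls=True, lang="en")
    result = ocr.ocr(image_path, cls=True)

    # Keep only lines whose recognition confidence clears the threshold
    lines = []
    for box, (text, score) in result[0]:
        if score >= CONF_THRESHOLD:
            lines.append({"text": text, "confidence": score, "box": box})
    return lines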

5. OCR Validation

The check_ocr_validity.py module performs heuristic validation: it checks minimum character count and language consistency (Latin-based scripts), and screens for gibberish, to ensure the text blob is suitable for LLM processing.
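
The exact heuristics live in check_ocr_validity.py; the sketch below illustrates the general idea with assumed thresholds and a Turkish-extended Latin character set:

import re

LATIN = r"A-Za-z0-9ÇĞİÖŞÜçğıöşü"

def is_ocr_output_valid(text, min_chars=20, min_latin_ratio=0.7):
    stripped = text.replace(" ", "").replace("\n", "")
    # 1. Character count: too little text cannot hold identity fields
    if len(stripped) < min_chars:
        return False
    # 2. Language consistency: most characters should be Latin-based
    if len(re.findall(f"[{LATIN}]", stripped)) / len(stripped) < min_latin_ratio:
        return False
    # 3. Gibberish screen: require at least one plausible word
    return re.search(f"[{LATIN}]{{3,}}", text) is not None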

6. Entity Extraction with LLaMA-3.1

The LLaMA-3.1 8B model receives the cleaned raw text and performs semantic extraction into a structured JSON format (see the invocation sketch after this list). It handles:

  • Cross-lingual field mapping.
  • Date normalization (ISO 8601).
  • Disambiguation of names and identity numbers.
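
A minimal invocation sketch against the local Ollama REST API; the prompt text is illustrative, while the endpoint and payload fields follow Ollama's documented /api/generate interface:

import json
import requests

PROMPT = ("Extract the identity fields from the OCR text below. "
          "Respond with strict JSON only.\n\nOCR TEXT:\n{ocr_text}")

def extract_entities(ocr_text):
    # All inference happens on localhost; no document data leaves the machine
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.1:8b",
              "prompt": PROMPT.format(ocr_text=ocr_text),
              "stream": False},
        timeout=120,
    )
    response.raise_for_status()
    # Ollama returns the generated text in the "response" field
    return json.loads(response.json()["response"])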

7. LLM Validation

The system validates the structural integrity of the LLM output, ensuring it adheres to the expected schema and contains meaningful data before returning it to the user.
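
A sketch of such a structural check, assuming a required-key subset taken from the output example further below:

REQUIRED_KEYS = {"Name", "Surname", "DOB", "Identity Number"}

def validate_llm_output(parsed):
    # Must be a dict covering the required schema keys...
    if not isinstance(parsed, dict) or not REQUIRED_KEYS.issubset(parsed):
        return False
    # ...and must carry at least one non-empty value
    return any(str(v).strip() for v in parsed.values())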

8. Flask Web Integration

The backend serves the application via a REST API, handling images as base64-encoded strings and returning both structured JSON and the isolated face image for frontend rendering.
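
A minimal sketch of such an endpoint (the route name, payload keys, and the run_full_pipeline orchestrator call are hypothetical):

import base64
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/process", methods=["POST"])
def process():
    # Decode the base64-encoded image submitted by the frontend
    img_bytes = base64.b64decode(request.json["image"])
    with open("Outputs/input.jpg", "wb") as f:
        f.write(img_bytes)

    entities = run_full_pipeline("Outputs/input.jpg")  # hypothetical orchestrator

    # Re-encode the isolated face for frontend rendering
    with open("Outputs/face_color.jpg", "rb") as f:
        face_b64 = base64.b64encode(f.read()).decode("ascii")
    return jsonify({"entities": entities, "face": face_b64})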



Local Execution Environment

The pipeline is designed for local deployment to preserve data privacy and minimize network latency.

Local LLM Setup (Ollama)

  1. Installation: Download and install Ollama from ollama.com.
  2. Service Initialization: Ensure the daemon is running in the background:
    ollama serve
  3. Model Acquisition: Pull the specific model used in the pipeline:
    ollama pull llama3.1:8b
  4. Verification: Confirm the model is responsive via the CLI:
    ollama run llama3.1:8b "Test"


Detailed Module Specifications

1. Document Detection and Normalization (crop.py)

This module employs the YOLOv11 (You Only Look Once) architecture for real-time document localization. Upon detection, the script crops the detected region to isolate the document.

  • Face Detection: A Haar cascade classifier detects faces within the document for secondary extraction.
  • Enhancement Pipeline: Applies CLAHE and sharpening kernels to maximize OCR readability.

2. Optical Character Recognition (paddle_ocr.py)

Utilizes the PaddleOCR engine for multilingual text recognition.

  • Logic: Performs multi-stage detection of text lines followed by recognition.
  • Output: Returns a raw text blob and a structured JSON containing bounding boxes and confidence scores.

3. Semantic Extraction (LLaMAv3.py)

Integrates with the Ollama API to process raw OCR text through a carefully engineered prompt.

  • Prompt Logic: Forces the model to output a strict JSON schema, handling cross-lingual dates and identity-specific field normalization.
  • Validation: Implements regex-based fallback mechanisms if the LLM output deviates from the expected JSON format (see the sketch below).
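
A minimal sketch of such a fallback (the function name is illustrative): first attempt strict parsing, then recover the first JSON object embedded in a chatty response:

import json
import re

def parse_llm_json(raw_output):
    # Strict parse first: the prompt asks for JSON-only output
    try:
        return json.loads(raw_output)
    except json.JSONDecodeError:
        pass
    # Fallback: extract the first {...} span and retry
    match = re.search(r"\{.*\}", raw_output, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return None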


Technical Specifications

Stage               Model / Method
------------------  ------------------------
Document Detection  YOLOv11 (Segment/Detect)
Face Detection      Haar Cascade (OpenCV)
OCR Engine          PaddleOCR (PP-OCRv4)
Entity Extraction   LLaMA 3.1 (8B, Offline)
Scripting Language  Python 3.10


Input / Output Context

  • Input: Document images in .jpg or .png format, submitted via base64 encoding.
  • Output: Structured JSON schema and base64-encoded face image.

Output Example:

{
  "Name": "Ahmet",
  "Surname": "Yilmaz",
  "DOB": "1995-01-04",
  "Nationality": "TUR",
  "Identity Number": "12345678901",
  "Gender": "M",
  "Date of Issue": "2021-03-10",
  "Expiry Date": "2031-02-17",
  "Place of Birth": "Ankara"
}


Deployment & Installation

Repository Acquisition

To obtain a local copy of the repository, run:

git clone https://github.com/Zer0-Bug/ID-Document_Recognition.git
cd ID-Document_Recognition

Environment Configuration

Project dependencies are managed via pip; using a virtual environment is recommended:

pip install -r requirements.txt

Application Execution

To launch the integrated pipeline for a specific file via CLI:

python main_pipeline.py path_to_image.jpg

To initialize the web-based interactive interface:

python app.py

Default local access: http://127.0.0.1:5000



Contribution

Contributions are always appreciated. Open-source projects grow through collaboration, and any improvement—whether a bug fix, new feature, documentation update, or suggestion—is valuable.

To contribute, please follow the steps below:

  1. Fork the repository.
  2. Create a new branch for your change:
    git checkout -b feature/your-feature-name
  3. Commit your changes with a clear and descriptive message:
    git commit -m "Add: brief description of the change"
  4. Push your branch to your fork:
    git push origin feature/your-feature-name
  5. Open a Pull Request describing the changes made.

All contributions are reviewed before being merged. Please ensure that your changes follow the existing code style and include relevant documentation or tests where applicable.

Security & Privacy Disclaimer

Important

  • This repository presents a research-oriented, proof-of-concept implementation of an offline identity document recognition pipeline combining object detection, OCR, and large language model–based semantic parsing. The primary objective is to explore architectural design choices and model interoperability rather than to deliver a production-ready identity verification system.

  • While the pipeline is designed to operate entirely offline to reduce data exposure risks, it has not undergone formal security testing, threat modeling, or adversarial robustness evaluation. As such, the system may be vulnerable to manipulated inputs, adversarial images, prompt injection effects at the LLM layer, or OCR spoofing attacks.

  • This implementation must not be used in security-critical, legal, or regulatory contexts, including but not limited to identity verification (KYC), access control, onboarding, or governmental workflows, without comprehensive audits, model retraining on representative datasets, and strict compliance validation (e.g., GDPR, KVKK, ISO/IEC 27001).

  • All sample images, model weights, prompts, and generated outputs included in this repository are provided solely for academic evaluation and experimental purposes. Any deployment beyond controlled research settings requires replacement with properly licensed data, secure data-handling policies, and explicit user consent mechanisms.


