OCRack

Advanced PDF translation engine with OCR capabilities, intelligent chunking, and automated image extraction/reinsertion.

Features

Default Behavior: Automatic image extraction + translation + PDF generation
Smart Chunking: Intelligent text segmentation for optimal translation context
Image Processing: Automatic extraction and reinsertion with precise positioning
Cost Optimization: Real-time cost tracking with OpenAI API caching support
Rich UI: Professional terminal interface with detailed progress tracking
Robust Pipeline: Error handling, retries, and checkpoint recovery

Installation

git clone https://github.com/marcostolosa/OCRack.git
cd OCRack
pip install -e .

Dependencies

# Core dependencies (installed automatically)
pip install -r requirements.txt

# Playwright browser engine
playwright install chromium

# External programs (install separately)
# - Pandoc: https://pandoc.org/installing.html (fallback only)
# - Tesseract OCR: https://github.com/tesseract-ocr/tesseract (fallback only)

Demo

Original	Translated

Usage

Basic Commands

# Default: Extract images + translate + generate PDF
ocrack document.pdf -p "234-235"

# Skip image extraction
ocrack document.pdf -p "234-235" --no-ocr

# Terminal output only (no PDF)
ocrack document.pdf -p "234-235" --cli

Advanced Usage

# Page range translation
ocrack input.pdf --pages "10-28"

# Chapter-based translation
ocrack input.pdf -c "1,3-5"

# High-quality model
ocrack input.pdf --pages "10-28" -m gpt-4o

# Cost control
ocrack input.pdf -c "1-10" --max-chunks 50

Command Reference

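| Flag | Description |
|------|-------------|
| `-p, --pages "X-Y"` | Translate a specific page range |
| `--no-ocr` | Disable image extraction |
| `--cli` | Terminal output (skip PDF generation) |
| `-c "1,3-5"` | Translate specific chapters |
| `-m MODEL` | OpenAI model (default: gpt-4o-mini) |
| `-o DIR` | Output directory |
| `--max-chunks N` | Chunk limit for cost control |
| `--debug` | Enable verbose logging |
| `--keep-html` | Keep HTML output for inspection |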
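
The range syntax used by `-p` and `-c` ("10-28", "1,3-5") can be expanded into individual numbers along the lines of the sketch below. It assumes comma-separated items and inclusive ranges; `parse_ranges` is a hypothetical helper for illustration, not a function taken from OCRack.

```python
def parse_ranges(spec: str) -> list[int]:
    """Expand a spec such as "1,3-5" into a sorted list of numbers.

    Hypothetical helper: assumes comma-separated items and inclusive ranges.
    """
    numbers: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"range start exceeds end: {part!r}")
            numbers.update(range(start, end + 1))
        else:
            numbers.add(int(part))
    return sorted(numbers)


print(parse_ranges("1,3-5"))  # prints [1, 3, 4, 5]
```

Collecting into a set first means overlapping items such as "3-5,4" do not produce duplicates.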

Architecture

Processing Pipeline

PDF Analysis - Document structure and metadata extraction
Image Extraction - Page-specific image harvesting with coordinates
Text Extraction - Layout-aware text processing with OCR fallback
Smart Chunking - Context-preserving text segmentation
AI Translation - OpenAI GPT-powered EN→PT-BR translation
Document Assembly - HTML compilation with embedded images
PDF Generation - Playwright-based PDF creation with syntax highlighting
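
The Smart Chunking step is described only as context-preserving segmentation. One simple way to get that property, shown here as a sketch rather than as OCRack's actual algorithm, is to pack whole paragraphs greedily into chunks under a size budget so that no paragraph is split mid-thought. The name `chunk_paragraphs` and the `max_chars` budget are assumptions for illustration.

```python
def chunk_paragraphs(text: str, max_chars: int = 2000) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Hypothetical sketch: a paragraph is never split, so a single paragraph
    longer than max_chars becomes one oversized chunk on its own.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for paragraph in paragraphs:
        # The extra 2 characters are the blank line joining two paragraphs.
        added = len(paragraph) + (2 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            added = len(paragraph)
        current.append(paragraph)
        current_len += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks


for chunk in chunk_paragraphs("First paragraph.\n\nSecond one.\n\nThird.", max_chars=30):
    print(repr(chunk))
```

A production chunker would more likely budget by model tokens than by characters; characters keep the sketch dependency-free.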

Core Components

OCR Engine: PaddleOCR (primary), Tesseract (fallback)
PDF Renderer: Playwright (primary), markdown-pdf (fallback)
Translation: OpenAI GPT models with technical prompt engineering
Image Processing: Base64 embedding for reliable rendering
Syntax Highlighting: Automatic language detection with Pygments
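
Base64 embedding means the image bytes travel inside the HTML itself as a data URI, so the renderer never has to resolve a file path. A minimal sketch using only the standard library follows; `png_to_img_tag` is a hypothetical name, and the exact markup OCRack emits is not documented here.

```python
import base64
import html


def png_to_img_tag(png_bytes: bytes, alt: str = "") -> str:
    """Return an <img> tag with the PNG inlined as a base64 data URI.

    Hypothetical helper for illustration only.
    """
    encoded = base64.b64encode(png_bytes).decode("ascii")
    safe_alt = html.escape(alt, quote=True)
    return f'<img src="data:image/png;base64,{encoded}" alt="{safe_alt}">'


# Any bytes work for the demonstration; a real call would pass file contents.
print(png_to_img_tag(b"\x89PNG\r\n\x1a\n", alt="Figure 1"))
```

The trade-off is size: base64 inflates the payload by about a third, which is usually acceptable for a self-contained intermediate HTML file.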

Configuration

Environment Variables

export OPENAI_API_KEY="sk-..."
export OPENAI_MODEL="gpt-4o-mini"  # optional
export OPENAI_BASE_URL="..."       # optional
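
A sketch of how these variables might be read on the Python side: the key is required, the model falls back to the documented default of gpt-4o-mini, and the base URL stays unset unless provided. `load_settings` is a hypothetical function, and how the `-m` flag interacts with `OPENAI_MODEL` is not covered here.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, Optional[str]]:
    """Collect OpenAI settings from the environment (hypothetical helper)."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {
        "api_key": api_key,
        # Default model as stated in the command reference.
        "model": env.get("OPENAI_MODEL", "gpt-4o-mini"),
        # None means "use the client's default endpoint".
        "base_url": env.get("OPENAI_BASE_URL"),
    }


print(load_settings({"OPENAI_API_KEY": "sk-example"})["model"])  # prints gpt-4o-mini
```

Passing the mapping as a parameter keeps the function easy to test without touching the real environment.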

Output Structure

project/
├── out/                       # Default output directory
│   ├── document_translated_YYYYMMDD_HHMMSS.pdf
│   ├── document_translated_YYYYMMDD_HHMMSS.html  # debug
│   └── img_manifest/          # Extracted images
│       ├── manifest.json      # Image metadata
│       └── page_XXX_img_XX.png
└── logs/
    └── translate_YYYYMMDD_HHMMSS.log
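
The timestamped names in the tree follow a `YYYYMMDD_HHMMSS` pattern, which corresponds to the `strftime` format `%Y%m%d_%H%M%S`. The sketch below reconstructs the paths under the assumption that the `document` prefix is the input file's stem; `output_paths` is a hypothetical helper.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def output_paths(source_pdf: str, out_dir: str = "out",
                 now: Optional[datetime] = None) -> dict[str, Path]:
    """Build the output file names shown in the tree (hypothetical helper).

    Assumes the file prefix is the stem of the input PDF.
    """
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    stem = Path(source_pdf).stem
    base = Path(out_dir)
    return {
        "pdf": base / f"{stem}_translated_{stamp}.pdf",
        "html": base / f"{stem}_translated_{stamp}.html",
        "log": Path("logs") / f"translate_{stamp}.log",
    }


for kind, path in output_paths("document.pdf").items():
    print(kind, path)
```

Because the stamp sorts lexicographically in chronological order, a plain sorted directory listing puts the newest run last.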

Technical Specifications

Dependencies

Core Libraries:

openai>=1.0.0 - API client
rich>=13.0.0 - Terminal UI
PyMuPDF>=1.23.0 - PDF processing

PDF Processing:

pdfplumber>=0.7.0 - Text extraction
playwright>=1.40.0 - HTML to PDF rendering
markdown>=3.4.0 - HTML generation

OCR Stack:

paddleocr>=2.8.0 - Primary OCR engine
paddlepaddle>=2.6.0 - ML framework
pytesseract>=0.3.10 - Fallback OCR

Image Processing:

Pillow>=9.0.0 - Image manipulation
opencv-python>=4.8.0 - Computer vision

Performance Metrics

Typical Speed: 1-5 pages/minute (depends on content complexity)
Cost (gpt-4o-mini): $0.001-0.01 per page
Memory Usage: ~500MB baseline + 50MB per concurrent translation
Cache Efficiency: Up to 50% cost reduction on repeated content
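
The ranges above translate directly into a rough budget for a job. The sketch below just multiplies them out; `estimate` is a hypothetical helper, and the constants are the gpt-4o-mini cost and typical speed figures quoted above, not measurements.

```python
def estimate(pages: int) -> dict[str, tuple[float, float]]:
    """Return (low, high) bounds for cost in USD and duration in minutes.

    Hypothetical helper using the per-page figures quoted in this README.
    """
    cost_low, cost_high = 0.001, 0.01  # USD per page (gpt-4o-mini)
    speed_slow, speed_fast = 1, 5      # pages per minute
    return {
        "cost_usd": (pages * cost_low, pages * cost_high),
        "minutes": (pages / speed_fast, pages / speed_slow),
    }


result = estimate(150)
print(f"cost: ${result['cost_usd'][0]:.2f} to ${result['cost_usd'][1]:.2f}")
print(f"time: {result['minutes'][0]:.0f} to {result['minutes'][1]:.0f} minutes")
```

For a 150-page document this gives roughly $0.15 to $1.50 and 30 to 150 minutes, before any savings from caching.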

Batch Processing

# Multiple page ranges
for pages in "1-50" "51-100" "101-150"; do
    ocrack document.pdf --pages "$pages" -o "batch_$pages"
done

# Chapter processing with cost limits
ocrack technical_manual.pdf -c "1-20" --max-chunks 100 -m gpt-4o
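
The page-range loop can also be generated programmatically when the ranges are not known in advance. This sketch only builds and prints the command lines, using the `--pages` and `-o` flags shown above; `page_batches` and `batch_commands` are hypothetical helpers.

```python
import shlex


def page_batches(first: int, last: int, size: int) -> list[str]:
    """Split [first, last] into consecutive "start-end" ranges of at most size pages."""
    return [
        f"{start}-{min(start + size - 1, last)}"
        for start in range(first, last + 1, size)
    ]


def batch_commands(pdf: str, first: int, last: int, size: int) -> list[list[str]]:
    """Build one ocrack argument list per page range (hypothetical helper)."""
    return [
        ["ocrack", pdf, "--pages", pages, "-o", f"batch_{pages}"]
        for pages in page_batches(first, last, size)
    ]


for command in batch_commands("document.pdf", 1, 150, 50):
    print(shlex.join(command))
```

Each argument list can be passed to `subprocess.run(command, check=True)` to execute the batches one after another.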

Troubleshooting

Common Issues

Images not rendering in PDF:

Verify Playwright installation: playwright install chromium
Check image extraction: ls out/img_manifest/

OCR quality issues:

PaddleOCR models: downloaded automatically on first run
Tesseract fallback: requires a separate installation

Translation failures:

Verify OpenAI API key: echo $OPENAI_API_KEY
Check API limits and billing
Lower the chunk limit with --max-chunks
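
Several of the checks above can be run in one pass before starting a long job. The following is a sketch of such a preflight, not part of OCRack: it looks for the API key and for the external executables on `PATH`. The function name `preflight` and the tool list are assumptions, and Tesseract and Pandoc are only needed for the fallback paths.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def preflight(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return a list of human-readable problems; empty means all checks passed.

    Hypothetical helper. It does not verify that the Chromium browser
    itself has been downloaded by Playwright.
    """
    problems: list[str] = []
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    for tool in ("playwright", "tesseract", "pandoc"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


for problem in preflight():
    print("warning:", problem)
```

The lookups are injected as parameters so the function can be exercised without changing the real environment.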

Debug Mode

# Enable verbose logging
ocrack document.pdf -p "1-10" --debug

# Keep HTML output for inspection
ocrack document.pdf -p "1-10" --keep-html

License

MIT License

Warning: This tool processes potentially sensitive documents. Ensure compliance with data protection regulations when using cloud-based translation services.
