

@aka964 aka964 commented May 17, 2025

Purpose

This pull request adds a new OCR engine to OCRmyPDF that uses the Google Cloud Vision API (GCV). It provides a cloud-based, high-accuracy alternative to Tesseract and is especially useful for multilingual documents and complex image layouts.

Implementation Details

  • Added GVisionOcrEngine in ocrmypdf.builtin_plugins.gvision
  • Checks availability using the GOOGLE_APPLICATION_CREDENTIALS environment variable
  • Supports HOCR and plain text output
  • Includes unit tests for:
    • Engine initialization
    • HOCR generation
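
For readers who have not seen the diff, an availability check based on the GOOGLE_APPLICATION_CREDENTIALS variable could look roughly like the sketch below. This is not the code from the PR: the function name `gcv_credentials_available` and the optional `environ` parameter are assumptions made up for illustration, and the check only confirms that the variable points at an existing file, not that the credentials are valid.

```python
import os
from pathlib import Path


def gcv_credentials_available(environ=None) -> bool:
    """Return True if GOOGLE_APPLICATION_CREDENTIALS names an existing file.

    Hypothetical helper: `environ` defaults to os.environ and exists only
    so the check can be exercised without touching the real environment.
    """
    env = os.environ if environ is None else environ
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS", "").strip()
    # An unset or empty variable, or a path that is not a regular file,
    # is treated as "engine not available".
    return bool(path) and Path(path).is_file()
```

A plugin could call a helper like this before registering the engine and fall back to Tesseract when it returns False.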

Requirements

  • Added google-cloud-vision dependency in pyproject.toml

  • Google Cloud credentials JSON file

  • Set environment variable:

    • On Windows (PowerShell):

      $env:GOOGLE_APPLICATION_CREDENTIALS="C:\path\to\your\credentials.json"
    • On Linux/macOS (bash):

      export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/credentials.json
@jbarlow83
Collaborator

jbarlow83 commented May 27, 2025

Thank you for this. The code looks good at a glance.

There is one serious problem you will need to address.

It seems that you copied code from this repository
https://github.com/grantbarrett/son-of-ocrmypdf_plugin_GoogleVision
which in turn forked this repository
https://github.com/kkrell2016/ocrmypdf_plugin_GoogleVision

Neither of these repositories contains a license file, so the code must be considered "all rights reserved" by its creators and cannot be copied. We will need these two people, kkrell2016 and grantbarrett, to license their code under an open source license that is compatible with OCRmyPDF's MPL 2.0 license. MIT, MPL 2.0, Apache 2.0, or BSD would all be fine. It would be best if they could add a license file to their repositories indicating this.

Otherwise, the code you are contributing is not something you own, so it cannot become part of OCRmyPDF or any other open source project. I will also need you to license your changes to their work under a compatible license.

I do look forward to being able to accept this code, but I cannot accept it until the licensing is clarified.

On a more minor note, I would prefer to make gvision an optional dependency so that users are not required to install all of Google Cloud if they are not going to use it. That's fairly minor compared to the licensing problem.
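
One common way to keep a heavy package optional is to probe for it at runtime and fail with a clear message only when the feature is actually requested. The sketch below illustrates that pattern under stated assumptions: `module_available` and `require_optional` are hypothetical helper names, not part of OCRmyPDF or of this PR, and the packaging side (declaring the dependency as an optional extra rather than a hard requirement) is not shown.

```python
import importlib.util


def module_available(dotted_name: str) -> bool:
    """Return True if `dotted_name` looks importable (hypothetical helper)."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        # find_spec imports parent packages, so a missing parent such as
        # `google` raises ModuleNotFoundError instead of returning None.
        return False


def require_optional(dotted_name: str, pip_name: str) -> None:
    """Raise a readable error if an optional dependency is missing."""
    if not module_available(dotted_name):
        raise RuntimeError(
            f"This feature needs the optional package '{pip_name}'. "
            f"Install it with: pip install {pip_name}"
        )
```

With this in place, the engine could call `require_optional("google.cloud.vision", "google-cloud-vision")` when it is selected, so users who only use Tesseract never need the Google Cloud packages.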

@jbarlow83 jbarlow83 added the third party issue Problem with a third party dependency label May 27, 2025
@jbarlow83 jbarlow83 changed the title Add GVision OCR engine support May 27, 2025

Comparing this plugin to my homegrown one: I'm simply using

from google.cloud import documentai
from google.cloud.documentai_toolbox import document
....
    wrapped_document = document.Document.from_documentai_document(doc)
    hocr_string = wrapped_document.export_hocr_str(title='brieftech ocr')

Is there any difference between this custom gcv2hocr2 and what the Google team maintains themselves?


glorat commented Jan 5, 2026

Just to add some additional commentary: firstly, thank you for this PR, as it helped identify an important bug in my own plug-in (DPI detection specifically).

Licensing aside, the other issue is that this is using the Google Cloud Vision API, whereas the Google Document AI API is considered to be superior for OCR purposes.

Also, if I'm reading the code right, this plugin is trying to compose the target PDF itself, rather than using ocrmypdf's hocrtransform.HocrTransform.

I do have my own gvision.py plug-in that I've been using in production for a long time on thousands of documents. I'm happy to share it, but it doesn't come with tests and it has some hardcoded GCP resources that I would need to clean up. If there is interest, I'm willing to tidy it up and contribute it back either as a gist or as a single-file PR. Just ping me.

@jbarlow83
Collaborator

@glorat
I do have a better solution that will allow alternate OCR engines to generate some simple Python dataclasses that express the OCR to render, rather than having to compose hOCR directly, but it's not published yet because other dependencies are not ready. (Essentially, hocrtransform is going to be refactored into "hocr -> element tree" and then "element tree -> PDF".)

When those changes are merged I'll make a note to ping you because it should simplify integration.
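
To make the idea of "dataclasses that express the OCR to render" more concrete, here is a purely hypothetical sketch of what such an element tree might look like. None of these names (`BoundingBox`, `OcrWord`, `OcrLine`, `OcrPage`) come from OCRmyPDF; the real, unpublished design may differ in every detail. The point is only that an engine would fill in plain Python objects and leave hOCR and PDF generation to the library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in image pixels (hypothetical convention)."""
    left: float
    top: float
    right: float
    bottom: float


@dataclass
class OcrWord:
    text: str
    bbox: BoundingBox


@dataclass
class OcrLine:
    bbox: BoundingBox
    words: List[OcrWord] = field(default_factory=list)

    def text(self) -> str:
        # Join the recognized words of this line with single spaces.
        return " ".join(word.text for word in self.words)


@dataclass
class OcrPage:
    width: int
    height: int
    dpi: float
    lines: List[OcrLine] = field(default_factory=list)

    def text(self) -> str:
        # One output line per recognized text line.
        return "\n".join(line.text() for line in self.lines)
```

An OCR engine plugin would then only need to map its own response format (for example, word boxes returned by a cloud API) into such objects.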
