Adding support for a new language to Dedoc
By default, dedoc supports handling Russian and English languages. The most important part of language support is OCR (for images, PDF). If you don’t need parse images and PDF files, you don’t need to do anything.
To parse images with a new language, additional Tesseract language packages should be installed. The list of languages supported by Tesseract are enlisted here (see Languages section).
See also
The instruction with Tesseract installation can be found here.
Warning
Not all languages are fully supported by dedoc even with installed Tesseract packages. The more detailed information will appear soon.
Add new language in docker
Similar to the installation tutorial, beforehand one should clone the dedoc repository and go to the dedoc directory:
git clone https://github.com/ispras/dedoc
cd dedoc
Then one should decide, which languages should be supported, and look for them in the
list of supported languages (Languages section).
For each language, LangCode
is used to configure it.
For example, if we need to add French and Spanish, we should use fra
and spa
language codes.
Using docker build
For passing the list of languages while building docker image, the LANGUAGES
argument is used.
Languages should be enlisted in string and separated by spaces.
For example, for adding French and Spanish we should use the following command:
docker build --build-arg LANGUAGES="fra spa" .
One may also choose a tag for an image, e.g. dedocproject/dedoc_multilang:latest
, and run the container:
docker build -t dedocproject/dedoc_multilang:latest --build-arg LANGUAGES="fra spa" .
docker run -p 1231:1231 --rm dedocproject/dedoc_multilang python3 /dedoc_root/dedoc/main.py
Using docker-compose
For passing the list of languages while building docker image, the LANGUAGES
argument is used in the docker-compose.yml
file.
Languages should be enlisted in string and separated by spaces.
For example, for adding French and Spanish we should add the following lines to the docker-compose.yml
file:
version: '2.4'
services:
dedoc:
mem_limit: 16G
build:
context: .
args:
LANGUAGES: "fra spa"
restart: always
tty: true
ports:
- 1231:1231
environment:
DOCREADER_PORT: 1231
GROBID_HOST: "grobid"
GROBID_PORT: 8070
Then, the service can be run with the following command:
docker-compose up --build
Add new language locally
Suppose Tesseract OCR 5 is already installed on the computer (or see instruction).
For each language, the following command should be executed (lang
is one language code):
apt install -y tesseract-ocr-$lang
For example, for adding French and Spanish we should use the following commands:
apt install -y tesseract-ocr-fra
apt install -y tesseract-ocr-spa
Or we can install all packages with one command using LANGUAGES
variable:
export LANGUAGES="fra spa"
for lang in $LANGUAGES; do apt install -y tesseract-ocr-$lang; done
Then the dedoc library can be used with new languages or dedoc API can be run locally (see instruction) for more details.