Have some question regarding finetuning for model to add characters #1943

Vishnu280412 · 2025-05-29T09:42:48Z

docTR v0.10.0 detects all the text I need without problems. However, it fails to recognize some currency symbols such as "₹". To fix this I created a synthetic dataset in which the words combine letters, digits, currency symbols and other valid characters. After I trained the recognition model on it, the model failed even on text it had recognized correctly before.
How should I approach this? I want the OCR to keep working as it does now, and additionally to recognize the "₹" symbol (and a few more symbols, but first I want to understand how to train the model for this). If anyone has managed to do this or knows how to do it, please help.

felixdittrich92 · 2025-05-30T08:51:52Z

Maintainer

Hi @Vishnu280412 👋,

Could you please share the command you used to fine-tune the model? :)
How did you generate the synthetic data? Did you use a tool? :)

In such cases I would suggest freezing at least the backbone with --pretrained --freeze-backbone. For simple cases like this it would normally be fine to freeze everything up to the head, but that requires a small modification to the training script, depending on the model you want to fine-tune.

Best regards,
Felix

1 reply

Vishnu280412 May 30, 2025
Author

I added this line to doctr/datasets/vocabs.py, just after line 58, to add the "₹" symbol to the docTR vocabs:

VOCABS["custom"] = VOCABS["french"] + "₹"
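Before training with a vocab extended like this, it may be worth checking that every character in the training labels is actually covered by it. The following is only a minimal sketch: `missing_chars` is a hypothetical helper (not part of docTR), and the demo vocab is a short stand-in string rather than the real VOCABS["custom"].

```python
from collections import Counter


def missing_chars(labels, vocab):
    """Count characters that occur in the labels but are absent from the vocab."""
    allowed = set(vocab)
    return Counter(ch for label in labels for ch in label if ch not in allowed)


# Stand-in vocab for illustration only; in practice pass the real vocab string,
# for example VOCABS["custom"] from doctr/datasets/vocabs.py.
demo_vocab = "0123456789abcdefghijklmnopqrstuvwxyz₹"
demo_labels = ["₹500", "abc12", "€20"]

# Any character reported here would need to be added to the vocab
# or removed from the dataset before training.
print(missing_chars(demo_labels, demo_vocab))
```

An empty result means the labels only use characters the vocab contains.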

I used the following command to train the model. I chose the --max-chars value after generating the synthetic data, by taking the maximum length of the generated words:

python doctr/references/recognition/train_pytorch.py crnn_vgg16_bn --train_path doctrTrainingData/train --val_path doctrTrainingData/val --max-chars 24 --epochs 5 --freeze-backbone --pretrained --output_dir OUTPUT_model --vocab custom

I generated the synthetic data using TextRecognitionDataGenerator. Link: github
I included all of ascii_letters, digits, punctuation and "₹"; characters were selected and grouped at random to generate images containing one word each. As a first step I generated around 5 lakh (500,000) images and trained the recognition model on them.

Am I on the right track, or am I doing something irrelevant? Should I add more variation to the images and increase the dataset size?

felixdittrich92 · 2025-05-30T12:32:07Z

Maintainer

--max-chars 24 has no effect if you provide your own datasets. It is only used by the internal word generator (which is good for debugging but doesn't work well for fine-tuning yet; it needs some adjustments). The rest looks good 👍

For fine-tuning you should generate a more diverse dataset (start with 200++ images). Additionally, I can't recommend the linked repo; SynthTiger works much better. I have modified the code a bit (not really clean yet: https://github.com/felixdittrich92/synthtiger/tree/doctr-modified). You can configure it to your own needs.

23 replies

Vishnu280412 Jun 3, 2025
Author

Hello Felix,
I was able to train the model, but there were some issues. Here are the steps I followed in detail, along with the issues I ran into:

Data Generation Part:

I generated synthetic data using SynthTIGER (doctr-modified branch) with these modifications:

The vocab used was:
vocab = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~°£€¥¢₹"
I created a corpus of almost 1 million different strings.
I added my own set of fonts at /resources/font
I overwrote /resources/charset/alphanum_special.txt by writing this charset into the file:
charset = r"""0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢₹"""
I updated the config_horizontal.yaml file:

# Patch config_horizontal.yaml in place by overwriting specific lines.
# Note: the indices are 0-based and tied to this exact revision of the file.
config_path = '/content/synthtiger/examples/synthtiger/config_horizontal.yaml'

with open(config_path, 'r') as f:
    lines = f.readlines()

lines[20] = '    - paths: [resources/corpus/addedFilter_1M_ocr_corpus.txt]\n'
lines[21] = '      weights: [1]\n'
lines[28] = '    - paths: [resources/corpus/addedFilter_1M_ocr_corpus.txt]\n'
lines[29] = '      weights: [1]\n'
lines[34] = '      augmentation_charset: resources/charset/alphanum_special.txt\n'
lines[35] = '      # augmentation_charset: resources/charset/alphanum_special_extended.txt\n'
lines[44] = '  prob: 0.4\n'

with open(config_path, 'w') as f:
    f.writelines(lines)

Then I created around 30,000 images using this command:
!python -m synthtiger -o results -w 4 -v -c 30000 examples/synthtiger/template.py SynthTiger examples/synthtiger/config_horizontal.yaml
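The generated images still have to be arranged in the layout the docTR recognition training script reads: an images/ folder next to a labels.json file that maps each image file name to its label. The sketch below shows one way this conversion could look. It assumes that SynthTIGER wrote a gt.txt file with one tab-separated "relative/image/path<TAB>label" pair per line; that output layout is an assumption and should be checked against the actual results folder. `convert_synthtiger_output` is a hypothetical helper name.

```python
import json
import shutil
from pathlib import Path


def convert_synthtiger_output(synth_dir, out_dir):
    """Copy images into out_dir/images and write out_dir/labels.json.

    Assumes synth_dir/gt.txt holds "relative/path<TAB>label" lines (an assumption).
    """
    synth_dir, out_dir = Path(synth_dir), Path(out_dir)
    img_out = out_dir / "images"
    img_out.mkdir(parents=True, exist_ok=True)

    labels = {}
    with open(synth_dir / "gt.txt", encoding="utf-8") as f:
        for line in f:
            rel_path, sep, label = line.rstrip("\n").partition("\t")
            if not sep:
                continue  # skip empty or malformed lines
            # Flatten nested paths such as images/0/12.jpg into one unique file name.
            flat_name = "_".join(Path(rel_path).parts)
            shutil.copy(synth_dir / rel_path, img_out / flat_name)
            labels[flat_name] = label

    # ensure_ascii=False keeps symbols such as "₹" readable in the JSON file.
    with open(out_dir / "labels.json", "w", encoding="utf-8") as f:
        json.dump(labels, f, ensure_ascii=False, indent=2)
    return labels
```

A call such as convert_synthtiger_output("results", "SynthTIGER_Data/train") would then produce a folder that can be passed as --train_path; a separate split step for the validation set is not shown here.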

Doctr Training part:

Steps I followed to train the crnn_vgg16_bn recognition model:

I added this to the vocabs.py file:
VOCABS["rupeeAdded"] = (VOCABS["english"] + "₹").replace("฿", "")
The command I used to train the docTR model:
!python doctr/references/recognition/train_pytorch.py crnn_vgg16_bn --train_path SynthTIGER_Data/train --val_path SynthTIGER_Data/val --device 0 --epochs 50 --freeze-backbone --pretrained --output_dir OUTPUT_model --vocab rupeeAdded

After Training (inference):

At around epoch 14 the training loss was 2.08495 and the validation loss was 1.35665; after that the validation loss fluctuated up and down.

So I used the checkpoint saved at epoch 14. When I tested the fine-tuned recognition model, I ran into several issues:

The characters "!", "|" (pipe), "I" (uppercase i) and "l" (lowercase L) are frequently misclassified.
"0" and "O" are misclassified.
"-" is sometimes recognized as "~" or "."

Training the model to recognize the "₹" symbol worked as intended, but these other issues appeared after training.
What do you suggest I do to fix this?

Also, when it comes to long strings, even the pretrained model has some trouble recognizing the text.
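Confusions like the ones listed above can be counted rather than spotted by eye, which makes it easier to compare checkpoints (for example the pretrained model against the fine-tuned one). This is a rough sketch, not a docTR utility: `char_confusions` is a hypothetical helper, and it deliberately only compares pairs of equal length, so errors that insert or drop characters are skipped and would need a proper edit-distance alignment.

```python
from collections import Counter


def char_confusions(pairs):
    """Count (expected, predicted) character mismatches in equal-length pairs.

    Returns the confusion counts and the number of pairs skipped
    because truth and prediction differ in length.
    """
    counts = Counter()
    skipped = 0
    for truth, pred in pairs:
        if len(truth) != len(pred):
            skipped += 1  # position-wise comparison is not meaningful here
            continue
        counts.update((t, p) for t, p in zip(truth, pred) if t != p)
    return counts, skipped


# Made-up (ground truth, prediction) pairs for illustration only.
demo_pairs = [("INV-100", "lNV~1O0"), ("Total|", "Total!"), ("abc", "ab")]
confusions, skipped = char_confusions(demo_pairs)
for (expected, predicted), n in confusions.most_common():
    print(f"{expected!r} read as {predicted!r}: {n}")
print("skipped pairs:", skipped)
```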

felixdittrich92 Jun 4, 2025
Maintainer

Hi @Vishnu280412 👋,

Yes, that's a well-known problem.

What could help:

Fine-tune the model for only 1-2 epochs and check the results (independently of the train/val loss). This should reduce catastrophic forgetting.
Modify the training script slightly:

doctr/references/recognition/train_pytorch.py

Line 320 in 0118508

# Backbone freezing
if args.freeze_backbone:
    for p in model.feat_extractor.parameters():
        p.requires_grad = False

to

# Backbone freezing
if args.freeze_backbone:
    for name, m in model.named_modules():
        if name == "linear":
            m.requires_grad_(True)
        else:
            m.requires_grad_(False)

This will freeze everything up to the head for the crnn_vgg16_bn arch (other models will need different modifications, depending on the model you want to train).
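The same idea can also be expressed at the parameter level, which makes it easy to print exactly what stays trainable. The sketch below is an illustration under assumptions, not the training script's code: `freeze_all_but` is a hypothetical helper, the head name "linear" is taken from the snippet above, and the other parameter names in the demo are made up. The demo uses stand-in objects so that it runs without PyTorch; with a real model, model.named_parameters() would be passed instead.

```python
from types import SimpleNamespace


def freeze_all_but(named_params, trainable_prefixes):
    """Set requires_grad=True only for parameters under the given module names.

    named_params: iterable of (name, param) pairs, e.g. model.named_parameters().
    Returns the names of the parameters left trainable.
    """
    trainable = []
    for name, param in named_params:
        keep = any(name == p or name.startswith(p + ".") for p in trainable_prefixes)
        param.requires_grad = keep
        if keep:
            trainable.append(name)
    return trainable


# Stand-in parameters; the names only mimic a CRNN-like layout (illustrative).
demo_params = [
    (name, SimpleNamespace(requires_grad=True))
    for name in ["feat_extractor.0.weight", "decoder.weight_ih_l0",
                 "linear.weight", "linear.bias"]
]
print(freeze_all_but(demo_params, ["linear"]))
```

Printing the returned names before starting a run is a cheap way to confirm that only the head is being updated.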

The best way would be to combine a subset of the dataset used for pretraining with the new data. Unfortunately that's not possible, because it's Mindee-internal data (which contains private data).

Hope this helps 👍
Best,
Felix

Vishnu280412 Jun 5, 2025
Author

Hi @felixdittrich92,

It worked exceptionally well after training for 2 epochs. However, recognition of longer words remains an issue. Some fields, such as the IRN, are unique 64-character alphanumeric strings (characters 0-9 and a-f), so I also included some instances of these in the training data. Training then failed with this error:

Namespace(arch='crnn_vgg16_bn', output_dir='OUTPUT_model', train_path='SynthTIGER_Data/train', val_path='SynthTIGER_Data/val', train_samples=1000, val_samples=20, font='FreeMono.ttf,FreeSans.ttf,FreeSerif.ttf', min_chars=1, max_chars=12, name=None, epochs=5, batch_size=64, device=0, input_size=32, lr=0.001, weight_decay=0, workers=None, resume=None, vocab='rupeeAdded', test_only=False, freeze_backbone=True, show_samples=False, wb=False, clearml=False, push_to_hub=False, pretrained=True, optim='adam', sched='cosine', amp=False, find_lr=False, early_stop=False, early_stop_epochs=5, early_stop_delta=0.01)
Validation set loaded in 0.2823s (31053 samples in 486 batches)
Train set loaded in 0.6533s (124212 samples in 1940 batches)
  0% 0/1940 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/content/doctr/references/recognition/train_pytorch.py", line 600, in <module>
    main(args)
  File "/content/doctr/references/recognition/train_pytorch.py", line 480, in main
    train_loss, actual_lr = fit_one_epoch(
                            ^^^^^^^^^^^^^^
  File "/content/doctr/references/recognition/train_pytorch.py", line 125, in fit_one_epoch
    train_loss = model(images, targets)["loss"]
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/doctr/models/recognition/crnn/pytorch.py", line 225, in forward
    out["loss"] = self.compute_loss(logits, target)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/doctr/models/recognition/crnn/pytorch.py", line 178, in compute_loss
    ctc_loss = F.ctc_loss(
               ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/functional.py", line 3079, in ctc_loss
    return torch.ctc_loss(
           ^^^^^^^^^^^^^^^
RuntimeError: Expected tensor to have size at least 64 at dimension 1, but got size 32 for argument #2 'targets' (while checking arguments for ctc_loss_gpu)

Is this because of the larger width of the images?

felixdittrich92 Jun 5, 2025
Maintainer

Hey, glad to hear 👍

Yes, every model is limited to a fixed character length: most models to 29/30 chars plus possible EOS / SOS / PAD tokens (32 chars overall), except the master architecture, which is limited to 47 chars + EOS / SOS / PAD (50).
Additionally, each crop is resized to 32x128, so there is no space to fit more chars. Under the hood, however, we use a split & merge logic for larger crops.

2 days ago we pushed an improvement to this logic into the main branch with #1939, so I would suggest pulling the latest changes from the main branch and testing again :)
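Given this limit, training labels that are too long (such as the 64-character IRN strings) can be located before they reach the loss function. The following is a minimal sketch under assumptions: `split_by_length` is a hypothetical helper, and the default limit of 32 is taken from the numbers in this reply and from the "got size 32" in the traceback, so it should be adjusted for other architectures.

```python
def split_by_length(labels, max_len=32):
    """Split a {file name: label} mapping into labels that fit and labels that do not."""
    ok = {name: text for name, text in labels.items() if len(text) <= max_len}
    too_long = {name: text for name, text in labels.items() if len(text) > max_len}
    return ok, too_long


# Made-up entries in the shape of a labels.json file, for illustration only.
demo_labels = {
    "img_1.jpg": "₹1,250.00",
    "img_2.jpg": "a3f9" * 16,  # a 64-character string, like an IRN
}
ok, too_long = split_by_length(demo_labels)
print(len(ok), "labels fit;", len(too_long), "exceed the limit:", list(too_long))
```

What to do with the over-long samples (drop them, or cut the crops into shorter pieces) is left open here; the sketch only identifies them.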

Answer selected by Vishnu280412

Vishnu280412 Jun 6, 2025
Author

I pulled the latest changes and tried them, but the model still makes some mistakes on longer words. I'll try fine-tuning the model further to see if this can be resolved. I'll also review the new logic and see if anything can be done.

felixdittrich92 Jun 6, 2025
Maintainer

Sounds good :)

Feel free to share your findings / open a PR for further discussions 👍
