Linux: How to use tesseract with a language other than English

This command shows which languages are installed for tesseract:

tesseract --list-langs

Result

List of available languages (3):
eng
osd
pol
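
If you want to check for a single language in a script, you can filter that list. This is only a sketch: has_lang is my own helper name (not part of tesseract), and it assumes every language code is printed on its own line, as in the result above.

```shell
# Hypothetical helper: succeeds if language code $1 appears as a whole line
# in the `tesseract --list-langs` output passed on stdin.
has_lang() {
    # -x matches whole lines only, -F treats the code as plain text, -q prints nothing
    grep -qxF -- "$1"
}

# Run the check only when tesseract is actually installed
if command -v tesseract >/dev/null 2>&1; then
    if tesseract --list-langs 2>&1 | has_lang pol; then
        echo "pol is installed"
    else
        echo "pol is missing"
    fi
fi
```

The header line "List of available languages (3):" never matches, because a language code has to be equal to the whole line.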

On Linux Mint/Ubuntu/Debian you can use apt to install new languages. The package name ends with the language code, e.g. Polish needs pol at the end:

sudo apt-get install tesseract-ocr-pol
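
The naming rule can be sketched as a small function. lang_to_package is a hypothetical helper, and replacing _ with - (e.g. chi_sim) is my assumption about how the packages are named, so confirm the real name with apt-cache search tesseract-ocr- before installing.

```shell
# Hypothetical helper: build an apt package name from a tesseract language code.
lang_to_package() {
    # Assumption: underscores in the code become hyphens in the package name
    printf 'tesseract-ocr-%s\n' "$(printf '%s' "$1" | tr '_' '-')"
}

lang_to_package pol   # prints tesseract-ocr-pol
```

You could then install it with sudo apt-get install "$(lang_to_package pol)".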

For other languages you can search for packages with apt, or use the language codes from the traineddata link below.

To install all languages at once you can use the package tesseract-ocr-all

They will be installed in /usr/share/tesseract-ocr/4.00/tessdata/ (at least on Linux Mint 20)

You can also download traineddata files manually from the page Traineddata Files for Version 4.00 + or from the tesseract repo.

You can keep them in any folder and use --tessdata-dir to point tesseract at that folder.

You can check if tesseract recognizes these files using

tesseract --list-langs --tessdata-dir /folder/with/traineddata/

It can be an absolute or a relative path. For files in the current folder you can use .

tesseract --list-langs --tessdata-dir .

It should display only the languages in this folder (it will skip the system-wide languages shown by the first command).

You can then use this folder to recognize text in a file:
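
To compare that with what is really in the folder, you can list the files yourself. This is a rough sketch: list_traineddata is a made-up helper, and it only assumes that a language code is the file name without the .traineddata extension (it does not look into subfolders).

```shell
# Hypothetical helper: print the language codes of the *.traineddata files
# in folder $1 (defaults to the current folder).
list_traineddata() {
    td_dir=${1:-.}
    for td_file in "$td_dir"/*.traineddata; do
        # If nothing matched, the pattern stays unexpanded, so skip it
        [ -e "$td_file" ] || continue
        basename "$td_file" .traineddata
    done
}

list_traineddata .
```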

tesseract image.png output_file --tessdata-dir /folder/with/traineddata/ -l pol

Options have to come after the image and the output file.


Normally tesseract generates a file with the text, but if you use stdout as the output file it prints the text to the console (and you can pass it to another program using a pipe |)

tesseract image.png stdout --tessdata-dir /folder/with/traineddata/ -l pol
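
As a small example of piping, this sketch counts the recognized words with wc. ocr_word_count is a hypothetical helper name and image.png is only a placeholder. Add --tessdata-dir after stdout if your traineddata files are in a custom folder.

```shell
# Hypothetical helper: OCR an image and count the recognized words.
ocr_word_count() {
    # $1 = image file, $2 = language code (defaults to eng)
    tesseract "$1" stdout -l "${2:-eng}" | wc -w
}

# Run it only when tesseract and the example image both exist
if command -v tesseract >/dev/null 2>&1 && [ -f image.png ]; then
    ocr_word_count image.png pol
fi
```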

You can also use a file with extra options to save the result as pdf, tsv, etc.

You can see all options with

tesseract --print-parameters

To see only the options that create extra files, you can filter the output

tesseract --print-parameters | grep create

Result

devanagari_split_debugimage 0  Whether to create a debug image for split shiro-rekha process.
tessedit_create_txt         0  Write .txt output file
tessedit_create_hocr        0  Write .html hOCR output file
tessedit_create_alto        0  Write .xml ALTO file
tessedit_create_lstmbox     0  Write .box file for LSTM training
tessedit_create_tsv         0  Write .tsv output file
tessedit_create_wordstrbox  0  Write WordStr format .box output file
tessedit_create_pdf         0  Write .pdf output file
tessedit_create_boxfile     0  Output text with boxes

Then you can create a file, e.g. my_options, with

tessedit_create_pdf 1

and use it as the last argument

tesseract image.png output_file --tessdata-dir /folder/with/traineddata/ -l pol my_options
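
The options file can also be generated from the shell. This is a sketch: make_config is my own helper name, and it simply writes one "name 1" line for every parameter you give it. The paths are the same placeholders as above.

```shell
# Hypothetical helper: write a tesseract options file that sets each given parameter to 1.
make_config() {
    # $1 = options file to write, remaining arguments = parameter names
    cfg_file=$1
    shift
    : > "$cfg_file"                      # create or empty the file
    for cfg_param in "$@"; do
        printf '%s 1\n' "$cfg_param" >> "$cfg_file"
    done
}

make_config my_options tessedit_create_pdf tessedit_create_tsv
cat my_options

# Use the file as the last argument, only when tesseract and the image exist
if command -v tesseract >/dev/null 2>&1 && [ -f image.png ]; then
    tesseract image.png output_file --tessdata-dir /folder/with/traineddata/ -l pol my_options
fi
```

With these two parameters tesseract should write output_file.pdf and output_file.tsv.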
