Linux: How to use tesseract with a language other than English
This command shows which languages are installed for tesseract:
tesseract --list-langs
Result
List of available languages (3):
eng
osd
pol
On Linux Mint/Ubuntu/Debian you can use apt to install new languages. The package name ends with the language code, e.g. Polish needs pol at the end:
sudo apt-get install tesseract-ocr-pol
For other languages you can search the package names with apt, or use the language codes listed on the traineddata pages mentioned below.
To install all languages you can use the package tesseract-ocr-all
They will be installed in /usr/share/tesseract-ocr/4.00/tessdata/ (at least on Linux Mint 20)
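A small sketch of how the package name can be built from a language code. The helper name tesseract_pkg is made up for illustration. It assumes the Debian/Ubuntu naming scheme tesseract-ocr-<code>, with underscores in the code written as hyphens (e.g. chi_sim), so it is worth confirming the name with apt-cache search before installing.

```shell
#!/bin/sh
# Hypothetical helper: build the apt package name for a tesseract language code.
# Assumes the Debian/Ubuntu scheme "tesseract-ocr-<code>" with "_" written as "-".
tesseract_pkg() {
    printf 'tesseract-ocr-%s\n' "$(printf '%s' "$1" | tr '_' '-')"
}

# List the available language packages to confirm the name first:
# apt-cache search tesseract-ocr-

# Then install, for example, Polish:
# sudo apt-get install "$(tesseract_pkg pol)"
```

For example, tesseract_pkg pol prints tesseract-ocr-pol.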
You can also download the traineddata files manually from the page
Traineddata Files for Version 4.00 + or from the tesseract repo.
You can keep them in any folder and use --tessdata-dir to work with this folder.
You can check if tesseract recognizes these files using
tesseract --list-langs --tessdata-dir /folder/with/traineddata/
The path can be absolute or relative. For files in the current folder you can use .
tesseract --list-langs --tessdata-dir .
It should display only the languages found in this folder (it will not show the languages from the default folder listed by the first command).
You can then use this folder to recognize text in an image file:
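A script can do a similar check before running OCR. This is only a minimal sketch: it assumes each language is stored as a single <code>.traineddata file directly in the folder, and the function name has_lang is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if <dir>/<lang>.traineddata exists.
# Assumes each language is one <code>.traineddata file directly in <dir>.
# Usage: has_lang DIR LANG
has_lang() {
    [ -f "$1/$2.traineddata" ]
}

# Example: report whether Polish data is present in the current folder.
if has_lang . pol; then
    echo "pol is available in the current folder"
else
    echo "pol.traineddata not found in the current folder" >&2
fi
```

This only checks that the file exists; tesseract --list-langs --tessdata-dir remains the way to see what tesseract itself recognizes.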
tesseract image.png output_file --tessdata-dir /folder/with/traineddata/ -l pol
Options have to come after the image and the output file.
Normally tesseract generates a file with the text, but if you use stdout as the output name it prints the text in the console (and you can send it to another program using a pipe |)
tesseract image.png stdout --tessdata-dir /folder/with/traineddata/ -l pol
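As a sketch of the piping mentioned above, the function below sends the recognized text to grep. The function name ocr_grep, the file name and the search word are placeholders; the tesseract call uses only stdout and -l as shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: OCR an image with the given language and print
# only the lines that contain a search word (case-insensitive).
# Usage: ocr_grep IMAGE LANG WORD
ocr_grep() {
    tesseract "$1" stdout -l "$2" | grep -i -- "$3"
}

# Example (needs tesseract and the pol language data installed):
# ocr_grep image.png pol faktura
```

If the traineddata files are in a custom folder, --tessdata-dir can be added to the tesseract call inside the function in the same way as in the commands above.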
You can also use a file with extra options to save the output as pdf, tsv, etc.
You can see all options with
tesseract --print-parameters
To see the options that create extra files you can filter the output
tesseract --print-parameters | grep create
Result
devanagari_split_debugimage 0 Whether to create a debug image for split shiro-rekha process.
tessedit_create_txt 0 Write .txt output file
tessedit_create_hocr 0 Write .html hOCR output file
tessedit_create_alto 0 Write .xml ALTO file
tessedit_create_lstmbox 0 Write .box file for LSTM training
tessedit_create_tsv 0 Write .tsv output file
tessedit_create_wordstrbox 0 Write WordStr format .box output file
tessedit_create_pdf 0 Write .pdf output file
tessedit_create_boxfile 0 Output text with boxes
Then you can create a file, e.g. my_options, with the line
tessedit_create_pdf 1
and use it as the last argument
tesseract image.png output_file --tessdata-dir /folder/with/traineddata/ -l pol my_options
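The options file can also be created from a script. The sketch below writes my_options with two of the parameters from the list above, which should make tesseract write both a .pdf and a .txt file. The file names follow the earlier examples, and the OCR step runs only if tesseract and image.png are actually present.

```shell
#!/bin/sh
# Write an options file that turns on PDF and plain-text output.
# The name my_options follows the example above; any file name works.
printf '%s\n' 'tessedit_create_pdf 1' 'tessedit_create_txt 1' > my_options

# Run OCR only if tesseract is installed and the input image exists.
# Assumes the pol language data is in the default tessdata folder.
if command -v tesseract >/dev/null 2>&1 && [ -f image.png ]; then
    tesseract image.png output_file -l pol my_options
fi
```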
Notes:
