Using tesseract to OCR Meroitic text

A tesseract is, in geometry, the four-dimensional analog of the cube.

But this post won’t deal with geometric adventures, as some of my previous ones did.

Tesseract is also OCR software, devoted to extracting text from printed (scanned) material.

Meroitic was a language and script used in Meroë and the Sudan during the Meroitic period (attested from 300 BCE), which went extinct about 400 CE. For purposes beyond this discussion, I needed to OCR some Meroitic text in hieroglyphic form. Btw, maybe (or maybe not) these purposes were related to some derivative work from the Cthulhu Mythos.

So, to begin with, I had some pages written in Meroitic which I wanted to transliterate into the Latin alphabet. The Meroitic alphabet is pretty small:

[Image: Meroitic alphabet (from Wikimedia)]

As there is no language data for Meroitic on tesseract’s site, we’ll have to “train” tesseract to recognize it. Fortunately it’s an easy task, thanks to the excellent help on its site.

First of all, in order to use the “short” training process, we’ll need a Meroitic font (TrueType, for example) and a version of tesseract >= 3.03. It’s also useful to know the ISO 639 identifier for Meroitic (xmr), which we’ll use later to name our result files.

I downloaded the latest version (3.04) from the Google repository and tried to compile it on Ubuntu 12.04. Fresh versions (nightly builds, etc.) have the disadvantage that they may contain some little glitches… as happened here :-)

In order to train tesseract, the training tools must be compiled. They require some packages (libpng12-dev, libjpeg62-dev, libtiff4-dev, zlib1g-dev, libicu-dev, libpango1.0-dev, libcairo2-dev). They also require C++11, a feature only found in newer compilers than the one I had (g++ >= 4.8 is needed)… so I had to update mine.

configure: WARNING: Training tools WILL NOT be built because of missing c++11 support.

Also, a recent version of leptonica is needed (>= 1.71), so I had to compile it myself, because the packages for my platform were older.

configure: error: leptonica 1.71 or higher is required

So:

# apt-get update

# apt-get install libpng12-dev
# apt-get install libjpeg62-dev
# apt-get install libtiff4-dev
# apt-get install zlib1g-dev
# apt-get install libicu-dev
# apt-get install libpango1.0-dev
# apt-get install libcairo2-dev
# apt-get install autoconf automake libtool

# add-apt-repository ppa:ubuntu-toolchain-r/test
# apt-get update
# apt-get install g++-4.8
# update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.8 50

# wget http://www.leptonica.org/source/leptonica-1.71.tar.gz
# tar -xzvf leptonica-1.71.tar.gz
# cd leptonica-1.71
# ./configure
# make
# make install

# cd ..
# wget https://tesseract-ocr.googlecode.com/archive/473141c1dedf8ae30a7f1b25fb38b619012d7184.tar.gz
# tar -xzvf 473141c1dedf8ae30a7f1b25fb38b619012d7184.tar.gz
# cd tesseract-ocr-473141c1dedf/
# ./autogen.sh
# ./configure
# make
# make install
# make training
# make training install
# ldconfig
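
Before running ./configure, it may save some time to check that the compiler is new enough. The following is just a sketch of such a check, not part of the tesseract build: the helper name version_ge is made up here, and it assumes a "sort" that supports -V (GNU coreutils).

```shell
#!/bin/sh
# version_ge A B: succeed when dotted version A is >= dotted version B.
# Hypothetical helper; relies on "sort -V" (GNU coreutils).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Report whether the installed g++ reaches the 4.8 needed for C++11.
if command -v g++ >/dev/null 2>&1; then
    gxx_version=$(g++ -dumpversion)
    if version_ge "$gxx_version" 4.8; then
        echo "g++ $gxx_version: OK"
    else
        echo "g++ $gxx_version: too old, the training tools will not be built"
    fi
else
    echo "g++ not found"
fi
```

The same helper can compare any other dotted version, for example "version_ge 1.69 1.71" for a packaged leptonica that is too old.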

OK, the tesseract code I used (the latest available) had a bug in “api/capi.cpp”, introduced by the latest code updates… so “make” failed miserably:


libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I.. -O2 -DNDEBUG -DLOCALEDIR=\"/usr/local/share/locale\" -DUSE_STD_NAMESPACE -I../ccutil -I../ccstruct -I../cube -I../viewer -I../textord -I../dict -I../classify -I../ccmain -I../wordrec -I../cutil -I../opencl -I/usr/local/include/leptonica -pthread -I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib/x86_64-linux-gnu/glib-2.0/include -I/usr/include/cairo -I/usr/include/glib-2.0 -I/usr/lib/x86_64-linux-gnu/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng12 -MT libtesseract_api_la-capi.lo -MD -MP -MF .deps/libtesseract_api_la-capi.Tpo -c capi.cpp  -fPIC -DPIC -o .libs/libtesseract_api_la-capi.o
capi.cpp:672:1: error: expected unqualified-id before ‘{’ token

Fortunately, it was easy to fix: find this line in “api/capi.cpp” (line 671) and delete the trailing “;”:

BOOL *is_list_item, BOOL *is_crown, int *first_line_indent);

and just run “make” again.

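I ran “make training” and “make training install” as per the manual, but in this version the training tools (under ./training) had already been compiled, and they weren’t installed anywhere outside the directory where they were built. (The install target is usually spelled “training-install”, with a hyphen; typed with a space, make just runs the “training” and “install” targets, which may be why the tools stayed put.)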

OK, now that we have the “text2image” command (./training/text2image) and a TrueType font, we can write a text file containing just the ASCII characters that correspond to keys, as if we were typing them into a full-fledged word processor using that font. (In this case, only some characters produce hieroglyphs.) I used this text file (meroitic.txt):

qwertyuiopasd.g,jklmnbvcxz

Note that, with the font I used, “.” and “,” didn’t produce those Latin punctuation signs but the same hieroglyph as the “x” key… We’ll fix this later: I needed these punctuation marks because they were present in my printed Meroitic text, but the font didn’t provide them.

Note also, *and most importantly*, that this transliteration is not correct from a historical point of view: for example, with the “v” key we’re transliterating the symbol for the phonemes “t” or “te”. Not to mention that Meroitic was written from right to left, and we won’t instruct tesseract to do so. Well, I wasn’t looking for a correct transliteration for, again, reasons beyond this discussion. Anyway, a correct transliteration, or even a perfect copy of the characters, is possible with tesseract: just change the characters at the start of each line of the “.box” file that we’ll see later. They can perfectly well be UTF-8 characters (Meroitic has its own Unicode blocks) or groups of Latin characters (“te”, “ch”…). A right-to-left scanning direction can also be indicated.

From now on, we’re still inside the compilation directory. I copied the TrueType Meroitic font file here.

# training/text2image --text=./meroitic.txt --outputbase=xmr.Meroitic-Hieroglyphics.exp0 --font='Meroitic - Hieroglyphics' --fonts_dir=./

This creates two files: “xmr.Meroitic-Hieroglyphics.exp0.tif” is the rendered image, and “xmr.Meroitic-Hieroglyphics.exp0.box” contains, for each character printed on it, the character it translates to and the coordinates of its box.
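
For the curious, here is a small sketch of how the labels at the start of each “.box” line could be swapped for a more faithful transliteration. It assumes the usual box line layout (glyph, left, bottom, right, top, page); the sample coordinates, the file names and the relabel_box helper are all made up for illustration, and the “v” to “te” pair is only the example mentioned earlier.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical sample in the box layout: <glyph> <left> <bottom> <right> <top> <page>
cat > "$workdir/sample.box" <<'EOF'
q 10 20 40 60 0
v 50 20 80 60 0
x 90 20 120 60 0
EOF

# relabel_box MAP BOX: replace the first field of each BOX line using the
# "old new" pairs listed in MAP (which must not be empty); glyphs without
# a mapping are printed unchanged.
relabel_box() {
    awk 'NR == FNR { map[$1] = $2; next }
         ($1 in map) { $1 = map[$1] }
         { print }' "$1" "$2"
}

# Illustrative mapping only: one "old new" pair per line.
printf 'v te\n' > "$workdir/relabel.map"
relabel_box "$workdir/relabel.map" "$workdir/sample.box" > "$workdir/sample.relabelled.box"
cat "$workdir/sample.relabelled.box"
```

Only the first field changes; the coordinates are passed through untouched, so the boxes still match the image.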

Now I modified the xmr.Meroitic-Hieroglyphics.exp0.tif file (with gimp) at the places where the “.” and “,” chars should be, painting them by hand as similar as possible to how they appeared in my printed text.

[Image: xmr.Meroitic-Hieroglyphics.exp0.sample]

Now the “.box” file must be modified to (approximately) match the new coordinates for these two chars.

This was a funny trick, wasn’t it?
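
One thing to watch when editing the “.box” file by hand: box coordinates are counted from the bottom-left corner of the image, while gimp reports selections from the top-left corner. A minimal sketch of the conversion follows; the gimp_to_box helper and all the numbers (including the image height of 100 pixels) are hypothetical.

```shell
#!/bin/sh
# gimp_to_box GLYPH X Y W H IMAGE_HEIGHT [PAGE]
# Turn a top-left based selection (X, Y, width W, height H) into a box line
# "<glyph> <left> <bottom> <right> <top> <page>" with a bottom-left origin.
gimp_to_box() {
    glyph=$1 x=$2 y=$3 w=$4 h=$5 img_h=$6 page=${7:-0}
    echo "$glyph $x $((img_h - y - h)) $((x + w)) $((img_h - y)) $page"
}

# Hypothetical hand-painted "." selected at x=130, y=70, 6x8 pixels,
# on an image that is 100 pixels high.
gimp_to_box . 130 70 6 8 100
# prints: . 130 22 136 30 0
```

The printed line can then replace the existing “.” line in the “.box” file, and likewise for “,”.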

# tesseract xmr.Meroitic-Hieroglyphics.exp0.tif xmr.Meroitic-Hieroglyphics.exp0 box.train

Now, for the next command I downloaded some English language files… the manual says it doesn’t matter which files they are, so I trusted it:

# wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.eng.tar.gz
# tar -xzvf tesseract-ocr-3.02.eng.tar.gz
# cp tesseract-ocr/tessdata/* /usr/local/share/tessdata/

# tesseract xmr.Meroitic-Hieroglyphics.exp0.tif xmr.Meroitic-Hieroglyphics.exp0 box.train

# training/unicharset_extractor xmr.Meroitic-Hieroglyphics.exp0.box

Now a file named “font_properties” must be created with this text content:

Meroitic-Hieroglyphics 1 0 0 0 0
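
As far as I can tell from the training documentation, the fields in that line are the font name followed by five 0/1 flags (italic, bold, fixed, serif, fraktur), and the font name has to match the middle part of the training file names. The following sketch writes the file and checks that match; the helper names font_of_base and has_font_entry are made up for this example.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Same line as in the post: <fontname> <italic> <bold> <fixed> <serif> <fraktur>
printf '%s\n' 'Meroitic-Hieroglyphics 1 0 0 0 0' > "$workdir/font_properties"

# font_of_base BASE: print the font part of a "lang.fontname.expN" base name.
font_of_base() {
    printf '%s\n' "$1" | cut -d. -f2
}

# has_font_entry FILE FONT: succeed when FILE has a six-field line for FONT.
has_font_entry() {
    awk -v font="$2" '$1 == font && NF == 6 { found = 1 } END { exit !found }' "$1"
}

font=$(font_of_base xmr.Meroitic-Hieroglyphics.exp0)
if has_font_entry "$workdir/font_properties" "$font"; then
    echo "font_properties has an entry for $font"
else
    echo "no entry for $font in font_properties" >&2
fi
```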

And the rest of the commands:

# training/shapeclustering -F ./font_properties -U unicharset xmr.Meroitic-Hieroglyphics.exp0.tr

# training/mftraining -F font_properties -U unicharset -O xmr.unicharset xmr.Meroitic-Hieroglyphics.exp0.tr

# training/cntraining xmr.Meroitic-Hieroglyphics.exp0.tr

# mv shapetable xmr.shapetable
# mv normproto xmr.normproto
# mv inttemp xmr.inttemp
# mv pffmtable xmr.pffmtable

# training/combine_tessdata xmr.

# cp xmr.traineddata /usr/local/share/tessdata/xmr.traineddata
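
Since combine_tessdata only picks up files that carry the “xmr.” prefix, a quick check before running it can catch a forgotten rename. This is only a sketch: the missing_parts helper is hypothetical, and the list of parts is simply the five files produced by the steps above, not an authoritative list of what combine_tessdata requires.

```shell
#!/bin/sh
# missing_parts PREFIX: print every expected PREFIX.<part> file that is absent.
missing_parts() {
    prefix=$1
    for part in unicharset shapetable normproto inttemp pffmtable; do
        [ -f "$prefix.$part" ] || echo "$prefix.$part"
    done
}

# Demonstration in a scratch directory with one file deliberately left out.
workdir=$(mktemp -d)
for name in unicharset shapetable normproto inttemp; do
    : > "$workdir/xmr.$name"
done

missing=$(missing_parts "$workdir/xmr")
if [ -n "$missing" ]; then
    echo "still missing:"
    echo "$missing"
fi
```

In the real working directory the call would just be “missing_parts xmr”, which prints nothing when all five files are in place.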

And that’s all! (finally).

The xmr.traineddata file is the reduced and collated quintessence of the process. You can download it from here in order to OCR Meroitic text with the transliteration previously discussed.

Now we can feed tesseract a well and evenly lit TIFF image containing perfectly straight lines of Meroitic text, and it’ll convert it to the chars we put in our first “.box” file… In this case, Latin chars.

# tesseract meroitic_sample.tif transliteration -l xmr

Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

That last warning doesn’t seem to be critical, as the “transliteration.txt” file is created with the long-awaited contents.
