Posted to user@tika.apache.org by Robert Neal Clayton <ro...@gmail.com> on 2018/06/18 17:13:15 UTC

Text extraction: locale handling?

Hello,

I’ve been getting started with Tika over the past few days. I’m running the latest (1.18) server jar in a virtual machine and sending some test PDFs through it with curl for text extraction.
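For context, a request of this kind might look roughly like the Python sketch below. It is only an illustration: it assumes a tika-server listening on its default port 9998 on localhost, and the helper names are made up for this example.

```python
import urllib.request

# Assumption: tika-server running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_extract_request(pdf_bytes: bytes) -> urllib.request.Request:
    """Build (but do not send) a PUT request asking for plain-text output."""
    return urllib.request.Request(
        TIKA_URL,
        data=pdf_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/pdf",
            "Accept": "text/plain",  # ask for extracted text, not XHTML
        },
    )


def extract_text(path: str) -> str:
    """Send one PDF to the server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_extract_request(f.read())
    with urllib.request.urlopen(request) as response:
        # Assumption: the server returns the text as UTF-8 bytes.
        return response.read().decode("utf-8")
```

Decoding the response explicitly as UTF-8, rather than relying on the client machine's locale, keeps the result independent of where the script runs.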

Consider the sample page here…

https://www.scribd.com/document/382021926/Extract

I OCR’d this text myself with Tesseract 4.0 under an en_US.UTF-8 locale on FreeBSD 11.1-RELEASE.

Standard letters come through fine, but if I extract the text on a machine that is not using the same English UTF-8 locale, I get output like the following, for example on the line containing the word “triangulating”:

We have also
made quite a few selections with an eye to pairing or triangulating?^`^tfor exam-
ple, we chose the famous closing section on writing from Plato?^`^ys Phaedrus,

…because the default ASCII character set has no curly apostrophe or em dash.
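One observation that may help narrow this down: the garbage above is consistent with the intact UTF-8 bytes of an em dash (E2 80 94) and a right single quotation mark (E2 80 99) being shown by a terminal or pager that is not in a UTF-8 locale. The Python sketch below reproduces the pattern under an assumed rendering rule (high printable bytes shown as "?", high control bytes shown in lowercase caret notation). That rule is a guess that happens to match the sample, not confirmed behaviour of any particular tool, and the function name is made up for illustration.

```python
def render_without_utf8(data: bytes) -> str:
    """Hypothetical rendering of raw bytes by a non-UTF-8 terminal or pager."""
    out = []
    for b in data:
        if b < 0x80:
            # Plain ASCII passes through unchanged.
            out.append(chr(b))
        elif b < 0xA0:
            # Assumed rule: bytes 0x80-0x9F shown as "^" plus a lowercase-range
            # character, e.g. 0x94 -> "^t", 0x80 -> "^`".
            out.append("^" + chr((b & 0x7F) | 0x60))
        else:
            # Assumed rule: other high bytes shown as "?".
            out.append("?")
    return "".join(out)


em_dash_line = "triangulating\u2014for".encode("utf-8")
apostrophe_line = "Plato\u2019s".encode("utf-8")

print(render_without_utf8(em_dash_line))     # triangulating?^`^tfor
print(render_without_utf8(apostrophe_line))  # Plato?^`^ys

# The underlying bytes still decode cleanly as UTF-8.
print(em_dash_line.decode("utf-8"))
```

If that reading is right, the extracted text itself may be fine and only the display differs between machines, which would be worth ruling out before changing anything on the extraction side.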

Which makes me consider possibilities and conundrums:

How do people handle bulk or automated extraction when the documents involve multiple languages?