Posted to solr-user@lucene.apache.org by Абрашин, Игорь Олегович <Ig...@novatek.ru> on 2017/02/10 07:50:44 UTC

Problem with Cyrillic letters through Tika OCR indexing

Hello everyone, I have encountered the error mentioned in the title.
The original image is attached, and the recognized text is below:
3ApaBCTyI7ITe 9| )KVIBy xopomo

Has anyone faced something similar?
I should mention that tesseract recognizes it more correctly with the -l rus option.
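
For illustration, the command-line behaviour described above could be reproduced with a Python sketch like the one below. It only wraps the tesseract CLI with an explicit -l language and says nothing about how Tika or Solr pass that option. It assumes the tesseract binary and the rus language data are installed; the function names and the scan.png file name are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, language="rus"):
    """Return the argument list for a tesseract CLI call with an explicit language."""
    # "stdout" as the output base makes tesseract print the text
    # instead of writing it to a file.
    return ["tesseract", image_path, "stdout", "-l", language]


def run_ocr(image_path, language="rus"):
    """Run tesseract if it is installed.

    Returns the recognized text, or None when the binary is not on PATH.
    """
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        build_tesseract_command(image_path, language),
        capture_output=True,
        text=True,
        encoding="utf-8",  # Cyrillic output is assumed to be UTF-8
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # "scan.png" is a placeholder file name, not a file from this thread.
    print(build_tesseract_command("scan.png"))
```

Comparing the output of run_ocr with language="eng" and language="rus" on the same image is one way to check whether the language setting alone accounts for the Latin-only result.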

Thanks in advance!


Best regards,
Игорь Абрашин
ООО «НОВАТЭК НТЦ»
Work tel.: +7 (3452) 680-386
Internal corporate tel.: 22-586

Re: Problem with Cyrillic letters through Tika OCR indexing

Posted by Игорь Абрашин <vj...@gmail.com>.

I have the same problem. So it is probably the first case: how do I force
the Tika parser to recognize Cyrillic characters as required? For me, it
tries to recognize Russian text as English transliteration, so the Russian
text in the result uses only the Latin alphabet.

On 10 Feb 2017 at 17:55, "Alexandre Rafalovitch" <
arafalov@gmail.com> wrote:

> At what level exactly is this a problem? Are you looking for a way for
> Solr to pass the -l rus flag to Tika?
>
> Or are you saying that whatever OCR is used here is bad? In the second
> case, this is probably not a question for Solr or even Tika but for
> whatever the underlying OCR library is.
>
> The stack is deep here; more precision is required.
>
> Good luck,
>     Alex

Re: Problem with Cyrillic letters through Tika OCR indexing

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

At what level exactly is this a problem? Are you looking for a way for Solr
to pass the -l rus flag to Tika?

Or are you saying that whatever OCR is used here is bad? In the second
case, this is probably not a question for Solr or even Tika but for
whatever the underlying OCR library is.

The stack is deep here; more precision is required.

Good luck,
    Alex
