Posted to solr-user@lucene.apache.org by Игорь Абрашин <vj...@gmail.com> on 2017/02/10 10:03:41 UTC

OCR image contains cyrillic characters

Hello, community!
Has anyone managed to OCR JPG, TIFF, or other images that contain Cyrillic
text? So far I get only Latin letters in the result (it looks like ugly
transliterated text). For images that contain only Latin letters it works
fine. Does anyone have suggestions, best practices, or case studies for
this situation?

Re: OCR image contains cyrillic characters

Posted by Rick Leir <rl...@leirtech.com>.
No offense taken. 

More on this topic (opinion only): even the best OCR has a limited accuracy rate, say 95% or 98% correct. OCR is also slow, maybe a minute per image. So it is best to OCR into a filesystem or DB, assess the quality, and then index from the DB.
Cheers -- Rick
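
One possible shape for the staging step described above, as a minimal sketch rather than a tested pipeline: OCR text is stored in a SQLite table together with a crude quality score (the share of letters that are Cyrillic), and only rows above a threshold are later selected for indexing. The table name, the column names, the 0.8 threshold, and the scoring heuristic are all assumptions made for illustration; none of them come from Solr, Tika, or Tesseract.

```python
import sqlite3


def cyrillic_ratio(text):
    """Share of alphabetic characters that fall in the Cyrillic block U+0400..U+04FF."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if "\u0400" <= ch <= "\u04ff")
    return cyrillic / len(letters)


def open_store(path=":memory:"):
    """Open (or create) the staging database; the schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ocr_result ("
        "image_path TEXT PRIMARY KEY, text TEXT NOT NULL, quality REAL NOT NULL)"
    )
    return conn


def store_result(conn, image_path, text):
    """Save the OCR output for one image together with its quality score."""
    conn.execute(
        "INSERT OR REPLACE INTO ocr_result (image_path, text, quality) "
        "VALUES (?, ?, ?)",
        (image_path, text, cyrillic_ratio(text)),
    )
    conn.commit()


def rows_to_index(conn, min_quality=0.8):
    """Return (image_path, text) pairs that look good enough to index."""
    cur = conn.execute(
        "SELECT image_path, text FROM ocr_result "
        "WHERE quality >= ? ORDER BY image_path",
        (min_quality,),
    )
    return cur.fetchall()
```

A Latin-only result for a page known to be Russian scores 0.0 with this heuristic and is held back, which would catch the transliteration-like output described in this thread before it reaches the index.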

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: OCR image contains cyrillic characters

Posted by Игорь Абрашин <vj...@gmail.com>.
Actually, I don't know how to do it :( For now I've just created a request
handler and an update chain processor for it that can detect the language
during recognition (LanguageDetect or something like that). I would really
appreciate any instructions.
Sorry if I was rude: bad English skills for a good Russian guy :)

Re: OCR image contains cyrillic characters

Posted by Rick Leir <rl...@leirtech.com>.
Yes, you are right. I was just trying to help, and did not have time to dig out the details. So the question is: how do you tell Solr to pass the language arg to Tika and Tesseract? 
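
One way to side-step the question, sketched here under assumptions: instead of going through the Tika bundled with Solr, send the image to a separately running tika-server and name the OCR language in a request header, then index the returned text into Solr as an ordinary field. As far as I know, tika-server accepts an `X-Tika-OCRLanguage` header for this, but whether it is honored depends on the Tika version in use. The server URL `http://localhost:9998/tika` and the function names are placeholders, not anything confirmed in this thread.

```python
import urllib.request


def build_ocr_request(image_bytes, language="rus",
                      url="http://localhost:9998/tika"):
    """Build a PUT request asking tika-server to OCR with the given language.

    The URL is an assumed local tika-server endpoint; adjust it to your setup.
    """
    return urllib.request.Request(
        url,
        data=image_bytes,
        method="PUT",
        headers={
            "Accept": "text/plain",
            # Language code(s) handed through to Tesseract, e.g. "rus" or "rus+eng".
            "X-Tika-OCRLanguage": language,
        },
    )


def ocr_via_tika_server(image_path, language="rus"):
    """Send one image to tika-server and return the extracted text."""
    with open(image_path, "rb") as f:
        request = build_ocr_request(f.read(), language)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```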

Re: OCR image contains cyrillic characters

Posted by Игорь Абрашин <vj...@gmail.com>.
Hi, Rick.
I didn't mean that it needs to be trained, because Tesseract works well on
its own. The problem is that the Tika included in Solr doesn't try to use
the Russian dictionary to recognize Cyrillic text, so the result uses only
the English alphabet.
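
For comparison, running Tesseract on its own with an explicit language looks roughly like the sketch below. The `-l` option and the `stdout` output base are standard Tesseract command-line usage, and `rus` is the language code of the Russian traineddata. The helper names and the file name `scan.tif` are made up for illustration, and the sketch assumes the `tesseract` binary and the Russian data file are installed.

```python
import subprocess


def tesseract_command(image_path, languages=("rus",)):
    """Build the argv for: tesseract <image> stdout -l <lang[+lang...]>"""
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def ocr_with_tesseract(image_path, languages=("rus",)):
    """Run Tesseract and return the recognized text.

    Requires the tesseract binary on PATH; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        tesseract_command(image_path, languages),
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8")


if __name__ == "__main__":
    # Hypothetical input file; replace with a real scan.
    print(ocr_with_tesseract("scan.tif", ("rus", "eng")))
```

If this prints Cyrillic text while the Solr result is Latin-only, the difference is most likely in the language setting that reaches Tesseract through Tika, not in Tesseract itself.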

Re: OCR image contains cyrillic characters

Posted by Rick Leir <rl...@leirtech.com>.
My guess is that you are using Tika and Tesseract. The latter is
complex, and you can start learning at

https://wiki.apache.org/tika/TikaOCR   <--shows you how to work with TIFF

The traineddata for Cyrillic is here:

https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

https://github.com/tesseract-ocr/tesseract/issues/147

You likely need to enhance the images before running Tesseract.

cheers -- Rick
