Posted to dev@tika.apache.org by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov> on 2011/11/30 07:59:49 UTC

Tesseract OCR engine

Hey Guys,

FYI: http://code.google.com/p/tesseract-ocr/

I was pointed at this library by someone recently asking me if Tika 
was interested in integrating with this library. It's ALv2 licensed, and 
seems pretty interesting. I'm going to check it out, but just
wanted to give everyone a heads up.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Tesseract OCR engine

Posted by Alex Ott <al...@gmail.com>.
You can also look at Cuneiform OCR. I think the easiest way to
integrate them into Tika is to allow the user to specify an external script
that Tika calls and that returns the recognized text.
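
To make the external-script idea concrete, here is a minimal Java sketch of how a caller could run a user-specified OCR command and collect its output. Everything in it is an assumption for illustration: the class name `ExternalOcr`, the method `recognize`, the convention that the command receives the image path as its last argument and prints the recognized text to stdout, and the timeout handling. None of it is an existing Tika API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika: runs an external OCR command
// and returns whatever the command wrote to standard output.
public class ExternalOcr {

    // Assumed convention: the image path is appended as the last argument
    // and the command prints the recognized text to stdout as UTF-8.
    public static String recognize(List<String> command, Path image, long timeoutSeconds)
            throws IOException, InterruptedException {
        List<String> fullCommand = new ArrayList<>(command);
        fullCommand.add(image.toAbsolutePath().toString());

        // Send stdout to a temp file so a stalled command cannot block us on a pipe.
        Path output = Files.createTempFile("ocr-output", ".txt");
        try {
            Process process = new ProcessBuilder(fullCommand)
                    .redirectOutput(output.toFile())
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .start();

            // Give up if the command does not finish within the timeout.
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("OCR command timed out: " + fullCommand);
            }
            if (process.exitValue() != 0) {
                throw new IOException("OCR command failed with exit code " + process.exitValue());
            }
            return new String(Files.readAllBytes(output), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(output);
        }
    }
}
```

A caller would pass the user-configured command, for example `ExternalOcr.recognize(List.of("/path/to/ocr-script.sh"), image, 60)`, where the script path is a placeholder. Keeping the OCR engine behind a command line like this would let users swap Tesseract, Cuneiform, or anything else without Tika depending on a particular engine.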

On Wed, Nov 30, 2011 at 10:48 PM, Albert Law (Logik) <al...@logik.com> wrote:
> Hi Chris,
>
> I agree with Oleg.  Tesseract is free but requires training to get any
> respectable OCR output.  I also found that Tesseract had memory
> leaks (circa Sept. 2010).
>
> Aside: I noticed Tesseract has neither pre-compiled builds nor a Java API.
>
> On Wed, Nov 30, 2011 at 9:51 AM, Mattmann, Chris A (388J)
> <ch...@jpl.nasa.gov> wrote:
>> Hi Oleg,
>>
>> Thanks for the FYI, Oleg, and for the heads up on what needs to improve
>> here.
>>
>> Cheers,
>> Chris
>>
>> On Nov 29, 2011, at 11:10 PM, Oleg Tikhonov wrote:
>>
>>> Hi Chris,
>>> I was playing with it recently.
>>> One of the big issues with Tesseract is the tough process of preparing
>>> a training set for multiple fonts and languages.
>>> In addition, we would also have to add an option for image preprocessing
>>> (skew correction + filtering, etc.).
>>>
>>>
>>> BR,
>>> Oleg
>>>
>>> On Wed, Nov 30, 2011 at 8:59 AM, Mattmann, Chris A (388J) <
>>> chris.a.mattmann@jpl.nasa.gov> wrote:
>>>
>>>> Hey Guys,
>>>>
>>>> FYI: http://code.google.com/p/tesseract-ocr/
>>>>
>>>> I was pointed at this library by someone recently asking me if Tika
>>>> was interested in integrating with this library. It's ALv2 licensed, and
>>>> seems pretty interesting. I'm going to check it out, but just
>>>> wanted to give everyone a heads up.
>>>>
>>>> Cheers,
>>>> Chris
>
>
>
> --
>
> Sincerely,
> Albert Law
> Senior Software Engineer
> Logik.com



-- 
With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Skype: alex.ott

Re: Tesseract OCR engine

Posted by "Albert Law (Logik)" <al...@logik.com>.
Hi Chris,

I agree with Oleg.  Tesseract is free but requires training to get any
respectable OCR output.  I also found that Tesseract had memory
leaks (circa Sept. 2010).

Aside: I noticed Tesseract has neither pre-compiled builds nor a Java API.

On Wed, Nov 30, 2011 at 9:51 AM, Mattmann, Chris A (388J)
<ch...@jpl.nasa.gov> wrote:
> Hi Oleg,
>
> Thanks for the FYI, Oleg, and for the heads up on what needs to improve
> here.
>
> Cheers,
> Chris
>
> On Nov 29, 2011, at 11:10 PM, Oleg Tikhonov wrote:
>
>> Hi Chris,
>> I was playing with it recently.
>> One of the big issues with Tesseract is the tough process of preparing
>> a training set for multiple fonts and languages.
>> In addition, we would also have to add an option for image preprocessing
>> (skew correction + filtering, etc.).
>>
>>
>> BR,
>> Oleg
>>
>> On Wed, Nov 30, 2011 at 8:59 AM, Mattmann, Chris A (388J) <
>> chris.a.mattmann@jpl.nasa.gov> wrote:
>>
>>> Hey Guys,
>>>
>>> FYI: http://code.google.com/p/tesseract-ocr/
>>>
>>> I was pointed at this library by someone recently asking me if Tika
>>> was interested in integrating with this library. It's ALv2 licensed, and
>>> seems pretty interesting. I'm going to check it out, but just
>>> wanted to give everyone a heads up.
>>>
>>> Cheers,
>>> Chris



-- 

Sincerely,
Albert Law
Senior Software Engineer
Logik.com

Re: Tesseract OCR engine

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Oleg,

Thanks for the FYI, Oleg, and for the heads up on what needs to improve
here.

Cheers,
Chris

On Nov 29, 2011, at 11:10 PM, Oleg Tikhonov wrote:

> Hi Chris,
> I was playing with it recently.
> One of the big issues with Tesseract is the tough process of preparing
> a training set for multiple fonts and languages.
> In addition, we would also have to add an option for image preprocessing
> (skew correction + filtering, etc.).
> 
> 
> BR,
> Oleg
> 
> On Wed, Nov 30, 2011 at 8:59 AM, Mattmann, Chris A (388J) <
> chris.a.mattmann@jpl.nasa.gov> wrote:
> 
>> Hey Guys,
>> 
>> FYI: http://code.google.com/p/tesseract-ocr/
>> 
>> I was pointed at this library by someone recently asking me if Tika
>> was interested in integrating with this library. It's ALv2 licensed, and
>> seems pretty interesting. I'm going to check it out, but just
>> wanted to give everyone a heads up.
>> 
>> Cheers,
>> Chris


Re: Tesseract OCR engine

Posted by Oleg Tikhonov <ol...@apache.org>.
Hi Chris,
I was playing with it recently.
One of the big issues with Tesseract is the tough process of preparing
a training set for multiple fonts and languages.
In addition, we would also have to add an option for image preprocessing
(skew correction + filtering, etc.).
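
As a rough idea of what the filtering side of such a preprocessing option might look like, here is a minimal Java sketch that converts an image to black and white with a fixed luminance threshold before it is handed to an OCR engine. The class name `OcrPreprocess`, the method `binarize`, and the choice of simple fixed-threshold binarization are assumptions for illustration; a real option would likely need adaptive thresholding and skew correction, which this sketch does not attempt.

```java
import java.awt.image.BufferedImage;

// Hypothetical preprocessing helper, not part of Tika or Tesseract.
public class OcrPreprocess {

    // Returns a black-and-white copy of the source image. Pixels whose
    // luminance is at or above the threshold (0-255) become white,
    // all others become black.
    public static BufferedImage binarize(BufferedImage source, int threshold) {
        BufferedImage result = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_BINARY);

        for (int y = 0; y < source.getHeight(); y++) {
            for (int x = 0; x < source.getWidth(); x++) {
                int rgb = source.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;

                // Integer approximation of the common luma weighting.
                int luminance = (299 * r + 587 * g + 114 * b) / 1000;

                // Use fully opaque values so the two-color palette maps them as intended.
                result.setRGB(x, y, luminance >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return result;
    }
}
```

The threshold would have to be exposed as a user setting, since a value that works for clean scans can wipe out text on dark or unevenly lit pages.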


BR,
Oleg

On Wed, Nov 30, 2011 at 8:59 AM, Mattmann, Chris A (388J) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Hey Guys,
>
> FYI: http://code.google.com/p/tesseract-ocr/
>
> I was pointed at this library by someone recently asking me if Tika
> was interested in integrating with this library. It's ALv2 licensed, and
> seems pretty interesting. I'm going to check it out, but just
> wanted to give everyone a heads up.
>
> Cheers,
> Chris