Posted to dev@tika.apache.org by kevin slote <ks...@gmail.com> on 2014/09/30 17:52:07 UTC

OCR with tika-server

Hello all,

I have been testing out the integration of tika with tesseract.
I was wondering if there is a way to get tika-server to run with
tesseract's OCR capabilities?

Best

Kevin Slote

Re: OCR with tika-server

Posted by kevin slote <ks...@gmail.com>.

Ok, I am signed up.

https://wiki.apache.org/tika/Kevin%20Slote

On Fri, Oct 3, 2014 at 11:02 PM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Kevin glad it is now fixed with you!
>
> If you get a chance, please feel free to document
> this on the wiki:
>
> https://wiki.apache.org/tika/TikaOCR
>
>
> You can sign up for an account, and then I can grant
> you permissions to edit the file. Let me know!
>
> Cheers,
> Chris
>
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Re: OCR with tika-server

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

Kevin glad it is now fixed with you!

If you get a chance, please feel free to document
this on the wiki:

https://wiki.apache.org/tika/TikaOCR


You can sign up for an account, and then I can grant
you permissions to edit the file. Let me know!

Cheers,
Chris



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: kevin slote <ks...@gmail.com>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Friday, October 3, 2014 at 4:10 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: OCR with tika-server

>Hi all,
>
>I just confirmed that the problem was that my version of tesseract was too
>old.
>Maybe it would be a good idea to put something in the canRun method at the
>top of the tesseract unit test to also check that the version of tesseract
>is relevant?
>
>Older versions of tesseract do not have a "-v" or "--version" flag.  So
>maybe use ProcessBuilder to run that command and parse the string to see if
>it returned an error?
>
>Thanks for everyone's help.
>
>On Fri, Oct 3, 2014 at 2:30 PM, kevin slote <ks...@gmail.com> wrote:
>
>> Thanks for following up!
>>
>> I was trying to dig deeper before I responded.
>>
>> Tyler,
>>
>> I followed those instructions.  My version of Tesseract does not ocr the
>> google logo because it is not a tiff.  I used imagemagick to convert it to
>> a tif and tesseract returned "check_legal_image_size:Error:Only 1,2,4,5,6,8
>> bpp are supported:32" error which usually means it needs to be re-sized
>> with imagemagick.
>>
>>
>> Chris,
>>
>> I wrote a python wrapper for tesseract that can parse the documents that
>> were in your test-document repository concerning OCR (testOCR.pdf, etc.). It
>> looks like right now, in TesseractOCRParser.java, the command line argument
>> that is passed to the os points to a .tmp file in /tmp/.
>>
>> So the command that is executed is
>>
>>    "tesseract /tmp/apache-tika-2409864150710514587.tmp
>> /tmp/apache-tika-1277985370508249503.tmp -l eng -psm 1"
>>
>> This is not working for me.  When I grab those .tmp files and try to ocr
>> them from the command line, tesseract gets thrown for a loop.
>>
>> From what I can tell, the tesseract I have installed can only handle
>> .tif files.
>> I can back this up by citing the tesseract page:
>> https://code.google.com/p/tesseract-ocr/wiki/ReadMe
>>
>>  If Tesseract isn't available for your distribution, or you want to use a
>> newer version than they offer, you can compile your own
>> <https://code.google.com/p/tesseract-ocr/wiki/Compiling>. Note that older
>> versions of Tesseract only supported processing .tiff files.
>>
>> So, I think that upgrading tesseract or moving to ubuntu 12 or higher will
>> solve my problems.
>>
>> I will let the listserv know if that fixes it.
>>
>>
>> Kevin Slote
>>
>>
>>
>> On Wed, Oct 1, 2014 at 5:13 PM, Mattmann, Chris A (3980) <
>> chris.a.mattmann@jpl.nasa.gov> wrote:
>>
>>> What type of image is it, Kevin?
>>>
>>> If it’s a TIFF, you need to install tesseract with special lib tiff
>>> parameters. See:
>>>
>>> https://gist.github.com/henrik/1967035
>>>
>>>
>>> Can you parse the image file with tesseract by itself, without
>>> Tika’s tmp image?
>>>
>>> -----Original Message-----
>>> From: <Ramirez>, "Paul M   (398J)" <pa...@jpl.nasa.gov>
>>> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>>> Date: Wednesday, October 1, 2014 at 1:47 PM
>>> To: "<de...@tika.apache.org>" <de...@tika.apache.org>
>>> Subject: Re: OCR with tika-server
>>>
>>> >Nothing to be embarrassed about at all, Kevin. I actually thought maybe it
>>> >was just a typo issue and I randomly happened to catch that. I've
>>> >definitely done that one before myself.
>>> >
>>> >Bummed that was not the problem.
>>> >
>>> >--Paul
>>> >
>>> >On Oct 1, 2014, at 1:00 PM, kevin slote <ks...@gmail.com>
>>> > wrote:
>>> >
>>> >> What I wrote there did have a typo in it. (It's not every day you get to
>>> >> embarrass yourself in front of a bunch of guys from NASA)
>>> >>
>>> >> But that was not what I had in my terminal when I tested it.
>>> >>
>>> >>
>>> >>
>>> >> The actual PATH was:
>>> >>
>>> >> "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"
>>> >>
>>> >> I think what was actually wrong with the path is that I added the entire
>>> >> path to the tesseract executable, which was in my /usr/bin/ directory,
>>> >> instead of just the directory where tesseract lives.  Is this true?
>>> >>
>>> >>
>>> >>
>>> >> I deleted the hard coding from TesseractOCRConfig.java and then printed
>>> >> config.getTesseractPath() to stdout.  This field was empty.
>>> >>
>>> >> However, I have tesseract installed system wide on this ubuntu vm.
>>> >>
>>> >> So the canRun method evaluated as true whether or not the tesseractPath was
>>> >> configured correctly.
>>> >>
>>> >>
>>> >>
>>> >> I have been slowly trying to debug this all day.  It looks like tika is
>>> >> making a tmp file with the .tmp suffix.
>>> >>
>>> >> I commented out some of the code so that the files remained in /tmp/.
>>> >>
>>> >>
>>> >>
>>> >> It looks like tesseract doesn't like that.
>>> >>
>>> >> I tried to ocr these .tmp files to see if I could isolate what was going
>>> >> wrong for me.
>>> >>
>>> >> kslote@ubuntu:~/tika/tika$ tesseract /tmp/apache-tika-7112319184053570698.tmp out
>>> >> Tesseract Open Source OCR Engine
>>> >> name_to_image_type:Error:Unrecognized image type:/tmp/apache-tika-7112319184053570698.tmp
>>> >> IMAGE::read_header:Error:Can't read this image type:/tmp/apache-tika-7112319184053570698.tmp
>>> >> tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp
>>> >> Segmentation fault
>>> >>
>>> >>
>>> >>
>>> >> On the wiki it mentions something about getting tesseract to work with
>>> >> .tiff files.  For whatever reason, the tesseract I have installed only
>>> >> works for .tiff files.  Would you recommend that I reinstall tesseract
>>> >> from source?
>>> >>
>>> >> On Tue, Sep 30, 2014 at 7:28 PM, Ramirez, Paul M (398J) <
>>> >> paul.m.ramirez@jpl.nasa.gov> wrote:
>>> >>
>>> >>> Is that a typo in your path to tesseract?
>>> >>>
>>> >>> /urs/bin/tesseract => /usr/bin/tesseract
>>> >>>
>>> >>> --Paul
>>> >>>
>>> >>>> On Sep 30, 2014, at 1:48 PM, "kevin slote" <ks...@gmail.com> wrote:
>>> >>>>
>>> >>>> Unfortunately, that did not do it either.
>>> >>>>
>>> >>>> I did:
>>> >>>>
>>> >>>>  $ export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>> >>>>
>>> >>>> Here is the output from printenv
>>> >>>>
>>> >>>> kslote@ubuntu:~/tika/tika$ printenv
>>> >>>> SHELL=/bin/bash
>>> >>>> USERNAME=kslote
>>> >>>> XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
>>> >>>> DESKTOP_SESSION=gnome
>>> >>>> PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>> >>>> PWD=/home/kslote/tika/tika
>>> >>>> HOME=/home/kslote
>>> >>>> LOGNAME=kslote
>>> >>>> _=/usr/bin/printenv
>>> >>>>
>>> >>>>
>>> >>>> On Tue, Sep 30, 2014 at 4:13 PM, Tyler Palsulich <tp...@gmail.com> wrote:
>>> >>>>
>>> >>>>> Hi,
>>> >>>>>
>>> >>>>> Hmm. Could you try adding tesseract to your PATH? How did you install
>>> >>>>> Tesseract? You should be able to do a straightforward
>>> >>>>> `sudo apt-get install tesseract-ocr`. After that, the OCR tests should
>>> >>>>> pass. We're still running into TIKA-1422, where a mail test fails. But
>>> >>>>> you can run just the OCR tests with
>>> >>>>> `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest -DfailIfNoTests=false`.
>>> >>>>>
>>> >>>>> Let me know if that works for you!
>>> >>>>> Tyler
>>> >>>>>
>>> >>>>>> On Tue, Sep 30, 2014 at 4:00 PM, kevin slote <ks...@gmail.com> wrote:
>>> >>>>>>
>>> >>>>>> I am working on ubuntu 10.04 and I am having some trouble.
>>> >>>>>> Tesseract is installed correctly, but just doing a clone from the repo and
>>> >>>>>> installing with maven, I am getting some errors.
>>> >>>>>>
>>> >>>>>> This is before I did anything with tesseract installed.
>>> >>>>>>
>>> >>>>>> Failed tests:
>>> >>>>>>   testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest): Check for the image's text.
>>> >>>>>>   testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>>> >>>>>>   testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>>> >>>>>>
>>> >>>>>> Next I hard coded the tesseractPath:
>>> >>>>>>
>>> >>>>>> I went into TesseractOCRConfig.java and hard coded 'tesseractPath'.
>>> >>>>>> Then all the tests passed and it built successfully, but then I went to
>>> >>>>>> post some tiffs to the server.
>>> >>>>>> That didn't work. So I tried adding some System.out.println("hello world")
>>> >>>>>> (a little crude I know) inside the unit tests to confirm that tesseract
>>> >>>>>> was working correctly.  It looks like something happens in the unit test
>>> >>>>>> in TesseractOCRTest.java on the line that says
>>> >>>>>> TesseractOCRConfig config = new TesseractOCRConfig();
>>> >>>>>> Printing to stdout before that line works, but I get nothing after it.
>>> >>>>>> That happens before the assumeTrue(canRun(config)); line, so an exception
>>> >>>>>> is not getting raised.
>>> >>>>>>
>>> >>>>>> Then once everything is built, ocr does not work.  That was why I figured I
>>> >>>>>> would ask to see if I missed some sort of configuration step in building
>>> >>>>>> it.
>>> >>>>>>
>>> >>>>>> Thanks a ton.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <
>>> >>>>>> chris.a.mattmann@jpl.nasa.gov> wrote:
>>> >>>>>>
>>> >>>>>>> Dear Kevin,
>>> >>>>>>>
>>> >>>>>>> Sure, it already works :) 1.7-SNAPSHOT.
>>> >>>>>>>
>>> >>>>>>> See this wiki page:
>>> >>>>>>>
>>> >>>>>>> https://wiki.apache.org/tika/TikaOCR
>>> >>>>>>>
>>> >>>>>>> I'd be happy to discuss more.
>>> >>>>>>>
>>> >>>>>>> Thanks!
>>> >>>>>>>
>>> >>>>>>> Cheers,
>>> >>>>>>> Chris
>>> >>>>>>>
>>> >>>>>>> -----Original Message-----
>>> >>>>>>> From: kevin slote <ks...@gmail.com>
>>> >>>>>>> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>>> >>>>>>> Date: Tuesday, September 30, 2014 at 8:52 AM
>>> >>>>>>> To: "dev@tika.apache.org" <de...@tika.apache.org>
>>> >>>>>>> Subject: OCR with tika-server
>>> >>>>>>>
>>> >>>>>>>> Hello all,
>>> >>>>>>>>
>>> >>>>>>>> I have been testing out the integration of tika with tesseract.
>>> >>>>>>>> I was wondering if there is a way to get tika-server to run with
>>> >>>>>>>> tesseract's OCR capabilities?
>>> >>>>>>>>
>>> >>>>>>>> Best
>>> >>>>>>>>
>>> >>>>>>>> Kevin Slote
>>> >>>>>
>>> >>>
>>> >
>>>
>>>
>>
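On the PATH question in the quoted thread: entries in PATH are directories, so appending /usr/bin/tesseract (the executable itself) would not help, whereas /usr/bin would. A minimal Java sketch of that lookup follows. The class and method names are made up for illustration and are not Tika code; it only mimics how a shell-style search treats each PATH entry as a directory.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Tika.
class PathLookupSketch {

    // Treats every PATH entry as a directory and looks for an executable
    // file with the given name inside it.
    static Optional<Path> findOnPath(String pathValue, String executableName) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String entry : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (entry.isEmpty()) {
                continue;
            }
            Path candidate = Paths.get(entry).resolve(executableName);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Prints where "tesseract" would be picked up from, if anywhere.
        Optional<Path> found = findOnPath(System.getenv("PATH"), "tesseract");
        System.out.println(found.map(Path::toString).orElse("tesseract not found on PATH"));
    }
}
```

With an entry like /usr/bin/tesseract the candidate becomes /usr/bin/tesseract/tesseract, which does not exist, so that entry contributes nothing. That would also be consistent with canRun passing anyway when /usr/bin is already on the PATH.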

Re: OCR with tika-server

Posted by kevin slote <ks...@gmail.com>.

Hi all,

I just confirmed that the problem was that my version of tesseract was too
old.
Maybe it would be a good idea to put something in the canRun method at the
top of the tesseract unit test to also check that the version of tesseract
is relevant?

Older versions of tesseract do not have a "-v" or "--version" flag.  So
maybe use ProcessBuilder to run that command and parse the string to see if
it returned an error?
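
A rough sketch of what such a check might look like, using only java.lang.ProcessBuilder. The class and method names here are hypothetical and this is not the actual canRun code in Tika. It also assumes that a build which understands "-v" prints a line beginning with "tesseract" followed by a version number, which would need checking against the versions people actually have installed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not part of Tika.
class TesseractVersionProbe {

    // Assumption: a version banner contains a line such as "tesseract 3.02.02".
    static boolean looksLikeVersionBanner(String output) {
        if (output == null) {
            return false;
        }
        for (String line : output.split("\\R")) {
            String normalized = line.trim().toLowerCase(Locale.ROOT);
            if (normalized.matches("tesseract\\s+v?\\d+(\\.\\d+)*.*")) {
                return true;
            }
        }
        return false;
    }

    // Runs "<executable> -v" and reports whether the output looks like a
    // version banner. Returns false if the process cannot be started.
    static boolean supportsVersionFlag(String executable) {
        ProcessBuilder pb = new ProcessBuilder(executable, "-v");
        pb.redirectErrorStream(true); // read stdout and stderr as one stream
        try {
            Process process = pb.start();
            StringBuilder out = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    out.append(line).append('\n');
                }
            }
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return false;
            }
            return looksLikeVersionBanner(out.toString());
        } catch (IOException e) {
            return false; // executable missing or not runnable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("version flag supported: " + supportsVersionFlag("tesseract"));
    }
}
```

The exit code is deliberately ignored, since I am not sure what the older builds return when handed an argument they do not understand; matching on the output text seemed the safer guess.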

Thanks for everyone's help.

On Fri, Oct 3, 2014 at 2:30 PM, kevin slote <ks...@gmail.com> wrote:

> Thanks for following up!
>
> I was trying to dig deeper before I responded.
>
> Tyler,
>
> I followed those instructions.  My version of Tesseract does not ocr the
> google logo because it is not a tiff.  I used imagemagick to convert it to
> a tif and tesseract returned "check_legal_image_size:Error:Only 1,2,4,5,6,8
> bpp are supported:32" error which usually means it needs to be re-sized
> with imagemagick.
>
>
> Chris,
>
> I wrote a python wrapper for tesseract that can parse the documents that
> were in your test-document repository concerning OCR (testOCR.pdf, etc.) It
> looks like right now, in TesseractOCRParser.java, the command line argument
> that is passed to the os points to a .tmp file in /tmp/.
>
> So the command that is executed is
>
>    "tesseract /tmp/apache-tika-2409864150710514587.tmp
> /tmp/apache-tika-1277985370508249503.tmp -l eng -psm 1"
>
> This is not working for me.  When I grab those .tmp files and try to ocr
> them from the command line, tesseract gets thrown for a loop.
>
> From what I can tell, the tesseract I have installed can only handle
> .tif files.
> I can back this up by citing the tesseract page:
> https://code.google.com/p/tesseract-ocr/wiki/ReadMe
>
>  If Tesseract isn't available for your distribution, or you want to use a
> newer version than they offer, you can compile your own
> <https://code.google.com/p/tesseract-ocr/wiki/Compiling>. Note that  older
> versions of Tesseract only supported processing .tiff files.
>
> So, I think that upgrading tesseract or moving to ubuntu 12 or higher will
> solve my problems.
>
> I will let the listserv know if that fixes it.
>
>
> Kevin Slote
>
>
>
> On Wed, Oct 1, 2014 at 5:13 PM, Mattmann, Chris A (3980) <
> chris.a.mattmann@jpl.nasa.gov> wrote:
>
>> What type of image is it, Kevin?
>>
>> If it’s a TIFF, you need to install tesseract with special lib tiff
>> parameters. See:
>>
>> https://gist.github.com/henrik/1967035
>>
>>
>> Can you parse the image file with tesseract by itself, without
>> Tika’s tmp image?
>>
>> -----Original Message-----
>> From: <Ramirez>, "Paul M   (398J)" <pa...@jpl.nasa.gov>
>> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>> Date: Wednesday, October 1, 2014 at 1:47 PM
>> To: "<de...@tika.apache.org>" <de...@tika.apache.org>
>> Subject: Re: OCR with tika-server
>>
>> >Nothing to be embarrassed about at all, Kevin. I actually thought maybe it
>> >was just a typo issue and I randomly happened to catch that. I've
>> >definitely done that one before myself.
>> >
>> >Bummed that was not the problem.
>> >
>> >--Paul
>> >
>> >On Oct 1, 2014, at 1:00 PM, kevin slote <ks...@gmail.com>
>> > wrote:
>> >
>> >> What I wrote there did have a typo in it. (It's not every day you get to
>> >> embarrass yourself in front of a bunch of guys from NASA)
>> >>
>> >> But that was not what I had in my terminal when I tested it.
>> >>
>> >>
>> >>
>> >> The actual PATH was:
>> >>
>> >> "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"
>> >>
>> >>
>> >>
>> >> I think what was actually wrong with the path is that I added the entire
>> >> path to the tesseract executable, which was in my /usr/bin/ directory,
>> >> instead of just the directory where tesseract lives.  Is this true?
>> >>
>> >>
>> >>
>> >> I deleted the hard coding from TesseractOCRConfig.java and then
>> >> printed config.getTesseractPath() to stdout.  This field was empty.
>> >>
>> >> However, I have tesseract installed system wide on this ubuntu vm.
>> >>
>> >> So the canRun method evaluated as true whether or not the
>> >> tesseractPath was configured correctly.
>> >>
>> >>
>> >>
>> >> I have been slowly trying to debug this all day.  It looks like tika is
>> >> making a tmp file with the .tmp suffix.
>> >>
>> >> I commented out some of the code so that the files remained in /tmp/.
>> >>
>> >>
>> >>
>> >> It looks like tesseract doesn't like that.
>> >>
>> >> I tried to ocr these .tmp files to see if I could isolate what was
>> >> going wrong for me.
>> >>
>> >>
>> >>
>> >> kslote@ubuntu:~/tika/tika$ tesseract /tmp/apache-tika-7112319184053570698.tmp out
>> >> Tesseract Open Source OCR Engine
>> >> name_to_image_type:Error:Unrecognized image type:/tmp/apache-tika-7112319184053570698.tmp
>> >> IMAGE::read_header:Error:Can't read this image type:/tmp/apache-tika-7112319184053570698.tmp
>> >> tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp
>> >> Segmentation fault
>> >>
>> >>
>> >>
>> >> On the wiki it mentions something about getting tesseract to work with
>> >> .tiff files.  For whatever reason, the tesseract I have installed only
>> >> works for .tiff files.  Would it be recommended that I reinstall
>> >> tesseract from source?
>> >>
>> >> On Tue, Sep 30, 2014 at 7:28 PM, Ramirez, Paul M (398J) <
>> >> paul.m.ramirez@jpl.nasa.gov> wrote:
>> >>
>> >>> Is that a typo in your path to tesseract?
>> >>>
>> >>> /urs/bin/tesseract => /usr/bin/tesseract
>> >>>
>> >>> --Paul
>> >>>
>> >>>> On Sep 30, 2014, at 1:48 PM, "kevin slote" <ks...@gmail.com> wrote:
>> >>>>
>> >>>> Unfortunately, that did not do it either.
>> >>>>
>> >>>> I did:
>> >>>>
>> >>>>  $ export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>> >>>>
>> >>>> Here is the output from printenv
>> >>>>
>> >>>> kslote@ubuntu:~/tika/tika$ printenv
>> >>>> SHELL=/bin/bash
>> >>>> USERNAME=kslote
>> >>>> XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
>> >>>> DESKTOP_SESSION=gnome
>> >>>> PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>> >>>> PWD=/home/kslote/tika/tika
>> >>>> HOME=/home/kslote
>> >>>> LOGNAME=kslote
>> >>>> _=/usr/bin/printenv
>> >>>>
>> >>>>
>> >>>> On Tue, Sep 30, 2014 at 4:13 PM, Tyler Palsulich <tp...@gmail.com>
>> >>>> wrote:
>> >>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>> Hmm. Could you try adding tesseract to your PATH? How did you install
>> >>>>> Tesseract? You should be able to do a straightforward
>> >>>>> `sudo apt-get install tesseract-ocr`. After that, the OCR tests should
>> >>>>> pass. We're still running into TIKA-1422, where a mail test fails. But
>> >>>>> you can run just the OCR tests with
>> >>>>> `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest -DfailIfNoTests=false`.
>> >>>>>
>> >>>>> Let me know if that works for you!
>> >>>>> Tyler
>> >>>>>
>> >>>>>> On Tue, Sep 30, 2014 at 4:00 PM, kevin slote <ks...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> I am working on Ubuntu 10.04 and I am having some trouble.
>> >>>>>> Tesseract is installed correctly, but just doing a clone from the repo
>> >>>>>> and installing with maven, I am getting some errors.
>> >>>>>>
>> >>>>>> This is before I did anything with tesseract installed.
>> >>>>>>
>> >>>>>> Failed tests:
>> >>>>>> testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest): Check for the image's text.
>> >>>>>> testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>> >>>>>> testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>> >>>>>>
>> >>>>>> Next I hard coded the tesseractPath:
>> >>>>>>
>> >>>>>> I went into TesseractOCRConfig.java and hard coded 'tesseractPath'.
>> >>>>>> Then all tests passed and it built successfully, but then I went to
>> >>>>>> post some TIFFs to the server.
>> >>>>>> That didn't work. So I tried adding some
>> >>>>>> System.out.println("hello world") (a little crude, I know) inside the
>> >>>>>> unit tests to confirm that tesseract was working correctly.  It looks
>> >>>>>> like something happens in the unit test in TesseractOCRTest.java on
>> >>>>>> the line that says TesseractOCRConfig config = new
>> >>>>>> TesseractOCRConfig();. Printing to stdout before works, but I get
>> >>>>>> nothing after. That happens before the assumeTrue(canRun(config));. So
>> >>>>>> an exception is not being raised.
>> >>>>>>
>> >>>>>> Then once everything is built, ocr does not work.  That was why I
>> >>>>>> figured I would ask to see if I missed some sort of configuration step
>> >>>>>> in building it.
>> >>>>>>
>> >>>>>> Thanks a ton.
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <
>> >>>>>> chris.a.mattmann@jpl.nasa.gov> wrote:
>> >>>>>>
>> >>>>>>> Dear Kevin,
>> >>>>>>>
>> >>>>>>> Sure, it already works :) 1.7-SNAPSHOT.
>> >>>>>>>
>> >>>>>>> See this wiki page:
>> >>>>>>>
>> >>>>>>> https://wiki.apache.org/tika/TikaOCR
>> >>>>>>>
>> >>>>>>> I'd be happy to discuss more.
>> >>>>>>>
>> >>>>>>> Thanks!
>> >>>>>>>
>> >>>>>>> Cheers,
>> >>>>>>> Chris
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> -----Original Message-----
>> >>>>>>> From: kevin slote <ks...@gmail.com>
>> >>>>>>> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>> >>>>>>> Date: Tuesday, September 30, 2014 at 8:52 AM
>> >>>>>>> To: "dev@tika.apache.org" <de...@tika.apache.org>
>> >>>>>>> Subject: OCR with tika-server
>> >>>>>>>
>> >>>>>>>> Hello all,
>> >>>>>>>>
>> >>>>>>>> I have been testing out the integration of tika with tesseract.
>> >>>>>>>> I was wondering if there is  a way to get tika-server to run with
>> >>>>>>>> tesseract's OCR capabilities?
>> >>>>>>>>
>> >>>>>>>> Best
>> >>>>>>>>
>> >>>>>>>> Kevin Slote
>> >>>>>
>> >>>
>> >
>>
>>
>

Re: OCR with tika-server

Posted by kevin slote <ks...@gmail.com>.

Thanks for following up!

I was trying to dig deeper before I responded.

Tyler,

I followed those instructions.  My version of Tesseract does not OCR the
Google logo because it is not a TIFF.  I used ImageMagick to convert it to
a TIFF, and tesseract returned the error
"check_legal_image_size:Error:Only 1,2,4,5,6,8 bpp are supported:32",
which usually means the image needs to be converted to a supported bit
depth with ImageMagick.


Chris,

I wrote a Python wrapper for tesseract that can parse the OCR documents
in your test-document repository (testOCR.pdf, etc.).  It looks like right
now, in TesseractOCRParser.java, the command line that is passed to the OS
points to a .tmp file in /tmp/.

So the command that is executed is

   "tesseract /tmp/apache-tika-2409864150710514587.tmp
/tmp/apache-tika-1277985370508249503.tmp -l eng -psm 1"

This is not working for me.  When I grab those .tmp files and try to ocr
them from the command line, tesseract gets thrown for a loop.
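
For reference, the invocation above can be reproduced outside Tika.  A
minimal Python sketch, assuming a hypothetical helper
build_tesseract_command (not part of Tika or Tesseract, and not the
wrapper mentioned above) and the -l and -psm options exactly as they
appear in the quoted command:

```python
import shlex


def build_tesseract_command(image_path, output_base, language="eng", page_seg_mode=1):
    """Return the argument list for the command line quoted above.

    Hypothetical helper for illustration: it only assembles the arguments
    (tesseract <image> <output> -l <lang> -psm <mode>) and runs nothing.
    """
    return [
        "tesseract",
        image_path,
        output_base,
        "-l", language,
        "-psm", str(page_seg_mode),
    ]


# The two temporary file names are the ones from the quoted command.
cmd = build_tesseract_command(
    "/tmp/apache-tika-2409864150710514587.tmp",
    "/tmp/apache-tika-1277985370508249503.tmp",
)

# Print the command in a form that can be pasted into a shell.
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the list to subprocess.run(cmd) would execute it; keeping the
arguments as a list avoids shell quoting problems with unusual file names.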

From what I can tell, the tesseract I have installed can only handle
.tif files.
I can back this up by citing the tesseract page:
https://code.google.com/p/tesseract-ocr/wiki/ReadMe

 If Tesseract isn't available for your distribution, or you want to use a
newer version than they offer, you can compile your own
<https://code.google.com/p/tesseract-ocr/wiki/Compiling>. Note that  older
versions of Tesseract only supported processing .tiff files.
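
One way to check what those .tmp files actually contain, independent of
their extension, is to look at the leading bytes.  A rough sketch only:
sniff_image_type and sniff_file are hypothetical helpers, and they cover
just the TIFF, PNG and JPEG signatures.

```python
def sniff_image_type(header):
    """Guess an image type from the first bytes of a file, or return None."""
    # TIFF: "II*\0" (little-endian) or "MM\0*" (big-endian)
    if header[:4] in (b"II*\x00", b"MM\x00*"):
        return "tiff"
    # PNG: fixed 8-byte signature
    if header[:8] == b"\x89PNG\r\n\x1a\n":
        return "png"
    # JPEG: starts with the SOI marker followed by another marker byte
    if header[:3] == b"\xff\xd8\xff":
        return "jpeg"
    return None


def sniff_file(path):
    """Read the first 8 bytes of `path` and guess its image type."""
    with open(path, "rb") as handle:
        return sniff_image_type(handle.read(8))
```

If sniff_file() reports "png" or "jpeg" for one of the temporary files, a
tesseract build that only reads TIFF would be expected to reject it
whatever the file is called.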

So, I think that upgrading tesseract or moving to ubuntu 12 or higher will
solve my problems.
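
A quick way to see which version is installed is to parse the banner that
the tesseract binary prints.  A sketch only: parse_version and is_at_least
are hypothetical helpers, and both the sample banner text and the 3.0
threshold are assumptions for illustration, not values confirmed in this
thread.

```python
import re


def parse_version(banner):
    """Extract a version tuple such as (3, 2, 2) from banner text, or None."""
    match = re.search(r"tesseract\s+(\d+(?:\.\d+)*)", banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def is_at_least(banner, minimum=(3, 0)):
    """True if the banner reports a version >= `minimum` (assumed threshold)."""
    version = parse_version(banner)
    return version is not None and version >= minimum


# Assumed sample banner; real output may be formatted differently.
sample = "tesseract 3.02.02\n leptonica-1.69"
print(parse_version(sample), is_at_least(sample))
```

On 3.x builds the banner can be obtained with "tesseract -v"; depending
on the build it may arrive on stderr rather than stdout.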

I will let the listserv know if that fixes it.


Kevin Slote




Re: OCR with tika-server

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

Hi Kevin, just checking back: did you get it working?

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++







Re: OCR with tika-server

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

What type of image is it, Kevin?

If it’s a TIFF, you need to install tesseract with special libtiff
parameters. See:

https://gist.github.com/henrik/1967035


Can you parse the image file with tesseract by itself, without
Tika’s tmp image?

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: <Ramirez>, "Paul M   (398J)" <pa...@jpl.nasa.gov>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Wednesday, October 1, 2014 at 1:47 PM
To: "<de...@tika.apache.org>" <de...@tika.apache.org>
Subject: Re: OCR with tika-server

>Nothing to be embarrassed about at all Kevin. I actually thought maybe it
>was just a typo issue and I randomly happen to catch that. I've
>definitely done that one before myself.
>
>Bummed that was not the problem.
>
>--Paul
>
>On Oct 1, 2014, at 1:00 PM, kevin slote <ks...@gmail.com>
> wrote:
>
>> What I wrote there did have a typo in it. (It's not every day you get to
>> embarrass yourself in front of a bunch of guys from NASA)
>> 
>> But that was not what I had in my terminal when I tested it.
>> 
>> 
>> 
>> The actual PATH was:
>> 
>> 
>> 
>> 
>> 
>> "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"
>> 
>> 
>> 
>> I think what was actually wrong with the path is that I added the entire
>> path to the tesseract executable, which was in my /usr/bin/ directory,
>> instead of just the directory where tesseract lives.  Is this true?
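
On the PATH question: entries in PATH are directories that get searched
for the program name, so appending /usr/bin/tesseract (the executable
itself) adds nothing useful, and /usr/bin was already listed.  A minimal
Python sketch of that lookup, assuming a hypothetical helper find_on_path
(the standard library's shutil.which does the real thing):

```python
import os


def find_on_path(program, path_value):
    """Return the first executable named `program` found via a PATH-style string.

    Each entry of `path_value` is treated as a directory, which is why an
    entry that points at the executable file itself never matches.
    """
    for directory in path_value.split(os.pathsep):
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

With an entry like /usr/bin/tesseract the candidate becomes
/usr/bin/tesseract/tesseract, which does not exist, so that entry is
simply skipped.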
>> 
>> 
>> 
>> I deleted the hard coding from TesseractOCRConfig.java and then
>> printed config.getTesseractPath() to stdout.  This field was empty.
>>
>> However, I have tesseract installed system wide on this ubuntu vm.
>>
>> So the canRun method evaluated as true whether or not the tesseractPath
>> was configured correctly.
>> 
>> 
>> 
>> I have been slowly trying to debug this all day.  It looks like tika is
>> making a tmp file with the .tmp suffix.
>>
>> I commented out some of the code so that the files remained in /tmp/.
>> 
>> 
>> 
>> It looks like tesseract doesn't like that.
>> 
>> I tried to ocr these .tmp files to see if I could isolate what was going
>> wrong for me.
>> 
>> 
>> 
>> kslote@ubuntu:~/tika/tika$ tesseract /tmp/apache-tika-7112319184053570698.tmp out
>> Tesseract Open Source OCR Engine
>> name_to_image_type:Error:Unrecognized image type:/tmp/apache-tika-7112319184053570698.tmp
>> IMAGE::read_header:Error:Can't read this image type:/tmp/apache-tika-7112319184053570698.tmp
>> tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp
>> Segmentation fault
>> 
>> 
>> 
>> On the wiki it mentions something about getting tesseract to work with
>> .tiff files.  For whatever reason, the tesseract I have installed only
>> works for .tiff files.  Would it be recommended that I reinstall
>> tesseract from source?
>> 
>> On Tue, Sep 30, 2014 at 7:28 PM, Ramirez, Paul M (398J) <
>> paul.m.ramirez@jpl.nasa.gov> wrote:
>> 
>>> Is that a typo in your path to tesseract?
>>> 
>>> /urs/bin/tesseract => /usr/bin/tesseract
>>> 
>>> --Paul
>>> 
>>>> On Sep 30, 2014, at 1:48 PM, "kevin slote" <ks...@gmail.com> wrote:
>>>> 
>>>> Unfortunately, that did not do it either.
>>>> 
>>>> I did:
>>>> 
>>>>  $ export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>>> 
>>>> Here is the output from printenv
>>>> 
>>>> kslote@ubuntu:~/tika/tika$ printenv
>>>> SHELL=/bin/bash
>>>> USERNAME=kslote
>>>> XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
>>>> DESKTOP_SESSION=gnome
>>>> PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>>> PWD=/home/kslote/tika/tika
>>>> HOME=/home/kslote
>>>> LOGNAME=kslote
>>>> _=/usr/bin/printenv
>>>> 
>>>> 
>>>> On Tue, Sep 30, 2014 at 4:13 PM, Tyler Palsulich <tp...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> Hmm. Could you try adding tesseract to your PATH? How did you install
>>>>> Tesseract? You should be able to do a straightforward `sudo apt-get
>>> install
>>>>> tesseract-ocr`. After that, the OCR tests should pass. We're still
>>> running
>>>>> into TIKA-1422, where a mail test fails. But, you can run just the
>>>>>OCR
>>>>> tests with `mvn test
>>>>>-Dtest=org.apache.tika.parser.ocr.TesseractOCRTest
>>>>> -DfailIfNoTests=false`.
>>>>> 
>>>>> Let me know if that works for you!
>>>>> Tyler
>>>>> 
>>>>>> On Tue, Sep 30, 2014 at 4:00 PM, kevin slote <ks...@gmail.com>
>>> wrote:
>>>>>> 
>>>>>> I am working on ubuntu 10.4. and I am having some trouble.
>>>>>> Tesseract is installed correctly, but just doing a clone from the
>>>>>>repo
>>>>> and
>>>>>> installing with maven, I am getting some errors.
>>>>>> 
>>>>>> This is before I did anything with tesseract installed.
>>>>>> 
>>>>>> Failed tests:
>>> testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest):
>>>>>> Check for the image's text.
>>>>>> testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>>>>>> testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>>>>>> 
>>>>>> Next I hard coded the tesseractPath:
>>>>>> 
>>>>>> I went into the TesseractOCRConfig.java and hard coded
>>>>>>'tesseractPath.'
>>>>>> The all tests passed and it built successfully, but then I went to
>>>>>>post
>>>>>> some tiff's to the server.
>>>>>> That didn't work. So I tried adding some System.out.println("hello
>>>>> world")
>>>>>> (a little crude I know) inside the unit tests to confirm that
>>>>>>tesseract
>>>>>> was working correctly.  It looks like something happens in the unit
>>> test
>>>>> in
>>>>>> TesseractOCRTest.java
>>>>>> on the line that says TesseractOCRConfig config = new
>>>>>> TesseractOCRConfig();. Printing to stdout before works, but I get
>>> nothing
>>>>>> after. That happens before the assumeTrue(canRun(config));. So an
>>>>> exception
>>>>>> is not get raised.
>>>>>> 
>>>>>> Then once everything is built, ocr does not work.  That was why I
>>>>> figured I
>>>>>> would ask to see if I missed some sort of configuration step in
>>> building
>>>>>> it.
>>>>>> 
>>>>>> Thanks a ton.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <
>>>>>> chris.a.mattmann@jpl.nasa.gov> wrote:
>>>>>> 
>>>>>>> Dear Kevin,
>>>>>>> 
>>>>>>> Sure, it already works :) 1.7-SNAPSHOT.
>>>>>>> 
>>>>>>> See this wiki page:
>>>>>>> 
>>>>>>> https://wiki.apache.org/tika/TikaOCR
>>>>>>> 
>>>>>>> I¹d be happy to discuss more.
>>>>>>> 
>>>>>>> Thanks!
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Chris
>>>>>>> 
>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>> Chris Mattmann, Ph.D.
>>>>>>> Chief Architect
>>>>>>> Instrument Software and Science Data Systems Section (398)
>>>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>>> Office: 168-519, Mailstop: 168-527
>>>>>>> Email: chris.a.mattmann@nasa.gov
>>>>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>> Adjunct Associate Professor, Computer Science Department
>>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: kevin slote <ks...@gmail.com>
>>>>>>> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>>>>>>> Date: Tuesday, September 30, 2014 at 8:52 AM
>>>>>>> To: "dev@tika.apache.org" <de...@tika.apache.org>
>>>>>>> Subject: OCR with tika-server
>>>>>>> 
>>>>>>>> Hello all,
>>>>>>>> 
>>>>>>>> I have been testing out the integration of tika with tesseract.
>>>>>>>> I was wondering if there is  a way to get tika-server to run with
>>>>>>>> tesseract's OCR capabilities?
>>>>>>>> 
>>>>>>>> Best
>>>>>>>> 
>>>>>>>> Kevin Slote
>>>>> 
>>> 
>

Re: OCR with tika-server

Posted by "Ramirez, Paul M (398J)" <pa...@jpl.nasa.gov>.

Nothing to be embarrassed about at all, Kevin. I actually thought maybe it was just a typo issue and I randomly happened to catch it. I've definitely done that one before myself.

Bummed that was not the problem.

--Paul

On Oct 1, 2014, at 1:00 PM, kevin slote <ks...@gmail.com>
 wrote:

> What I wrote there did have a typo in it. (It's not every day you get to
> embarrass yourself in front of a bunch of guys from NASA)
> 
> But that was not what I had in my terminal when I tested it.

Re: OCR with tika-server

Posted by Tyler Palsulich <tp...@gmail.com>.

Hmm. Can you run Tesseract on a simple png file? This example doesn't work
very well for OCR... But, for the sake of example:

$ sudo apt-get install tesseract-ocr
$ java -jar tika-server/target/tika-server-1.7-SNAPSHOT.jar
// New terminal.
// Grab the Google logo.
$ curl -O https://www.google.com/images/srpr/logo11w.png
$ curl -sT ./logo11w.png localhost:9998/tika
(300316
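
For anyone who would rather send the same request from Java than from curl, here is a minimal sketch of the upload step. It assumes Java 11+ (for java.net.http) and a tika-server listening on localhost:9998 as in the commands above. The class name TikaServerPut and the Accept header are illustrative assumptions, not something taken from the Tika code base.

```java
import java.io.FileNotFoundException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: mirrors `curl -sT ./logo11w.png localhost:9998/tika`.
class TikaServerPut {

    // Builds a PUT request whose body is the raw bytes of the given file.
    static HttpRequest buildRequest(URI endpoint, Path file) throws FileNotFoundException {
        return HttpRequest.newBuilder(endpoint)
                .header("Accept", "text/plain") // assumption: ask for plain text back
                .PUT(HttpRequest.BodyPublishers.ofFile(file))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Needs a running tika-server and the image in the working directory.
        HttpRequest request = buildRequest(
                URI.create("http://localhost:9998/tika"), Paths.get("logo11w.png"));
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

If the server has OCR working, the response body should contain whatever text Tesseract extracted from the image.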

Hope that works,
Tyler


Re: OCR with tika-server

Posted by Tom Barber <to...@meteorite.bi>.

Ah you get used to it after a while! ;)

On 01/10/14 21:00, kevin slote wrote:
> (It's not every day you get to
> embarrass yourself in front of a bunch of guys from NASA)
>
>
-- 
*Tom Barber* | Technical Director

meteorite bi
*T:* +44 20 8133 3730
*W:* www.meteorite.bi | *Skype:* meteorite.consulting
*A:* Surrey Technology Centre, Surrey Research Park, Guildford, GU2 7YG, UK

Re: OCR with tika-server

Posted by kevin slote <ks...@gmail.com>.

What I wrote there did have a typo in it. (It's not every day you get to
embarrass yourself in front of a bunch of guys from NASA)

But that was not what I had in my terminal when I tested it.

The actual PATH was:

"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"

I think what was actually wrong with the path is that I added the entire
path to the tesseract executable, which was in my /usr/bin/ directory,
instead of just the directory where tesseract lives.  Is this true?
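
To illustrate the question, here is a minimal sketch of how a PATH-style lookup behaves. Each entry is treated as a directory and the program name is appended to it, so an entry that is itself the executable never matches. The class and method names are made up for illustration; this is not Tika code, and real shells have extra rules (for example, an empty entry means the current directory) that are skipped here.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper that imitates a PATH lookup for one program name.
class PathLookup {

    // Returns the first "<entry>/<name>" that is an executable regular file.
    static Optional<Path> resolve(String pathVar, String name) {
        for (String entry : pathVar.split(File.pathSeparator)) {
            if (entry.isEmpty()) {
                continue; // simplification: ignore empty entries
            }
            Path candidate = Paths.get(entry, name);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String path = System.getenv("PATH");
        System.out.println(resolve(path == null ? "" : path, "tesseract"));
    }
}
```

With an entry of /usr/bin/tesseract the candidate becomes /usr/bin/tesseract/tesseract, which does not exist, so that entry contributes nothing. The lookup can still succeed through the plain /usr/bin entry earlier in the list.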



I deleted the hard coding from TesseractOCRConfig.java and then printed
config.getTesseractPath() to stdout.  This field was empty.

However, I have tesseract installed system-wide on this Ubuntu VM.

So the canRun method evaluated as true whether or not the tesseractPath was
configured correctly.
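
One possible explanation, sketched below, is that a check of this kind only tries to start the program. The sketch assumes the configured path is used as a prefix in front of the program name, which I have not verified against the Tika source. With an empty prefix the bare name is handed to the operating system, which searches PATH, so a system-wide install passes the check. LaunchCheck and canLaunch are hypothetical names, and Redirect.DISCARD needs Java 9+.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a "can we run tesseract?" check.
class LaunchCheck {

    // True if "<prefix><program>" can be started at all.
    static boolean canLaunch(String prefix, String program) {
        Process process;
        try {
            process = new ProcessBuilder(prefix + program)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
        } catch (IOException e) {
            return false; // program not found or not executable
        }
        try {
            // Give the process a moment to print its usage text and exit.
            process.waitFor(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            if (process.isAlive()) {
                process.destroyForcibly();
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(canLaunch("", "tesseract"));
    }
}
```

Note that a check like this says nothing about which version was started or which image formats it can read.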



I have been slowly trying to debug this all day.  It looks like tika is
making a tmp file with the .tmp suffix.

I commented out some of the code so that the files remained in /tmp/.

It looks like tesseract doesn't like that.

I tried to OCR these .tmp files to see if I could isolate what was going
wrong for me.

kslote@ubuntu:~/tika/tika$ tesseract /tmp/apache-tika-7112319184053570698.tmp out
Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:/tmp/apache-tika-7112319184053570698.tmp
IMAGE::read_header:Error:Can't read this image type:/tmp/apache-tika-7112319184053570698.tmp
tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp
Segmentation fault

On the wiki it mentions something about getting tesseract to work with
.tiff files.  For whatever reason, the tesseract I have installed only
works for .tiff files.  Would it be recommended that I reinstall tesseract
from the source?
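
One way to test whether the .tmp name is what trips tesseract up: the "Unrecognized image type" message suggests, though it does not prove, that this build picks the image reader from the file name. The diagnostic sketch below copies the temporary file to a name with an image extension so the copy can be passed to tesseract by hand. TempImageCopy is a hypothetical helper, not something Tika does, and the ".tif" extension is only an assumption about what the file really contains.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical diagnostic helper: same bytes, different file extension.
class TempImageCopy {

    // Copies source to a new temp file whose name ends with the given extension.
    static Path copyWithExtension(Path source, String extension) throws IOException {
        Path target = Files.createTempFile("apache-tika-", extension);
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java TempImageCopy /tmp/apache-tika-<id>.tmp
        Path copy = copyWithExtension(Paths.get(args[0]), ".tif");
        System.out.println("Now try: tesseract " + copy + " out");
    }
}
```

If tesseract reads the copy but not the original, the extension is the problem. If both fail, the file contents or the tesseract build itself are more likely at fault.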


Re: OCR with tika-server

Posted by "Ramirez, Paul M (398J)" <pa...@jpl.nasa.gov>.

Is that a typo in your path to tesseract?

/urs/bin/tesseract => /usr/bin/tesseract 

--Paul


Re: OCR with tika-server

Posted by kevin slote <ks...@gmail.com>.

Unfortunately, that did not do it either.

I did:

   $ export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract

Here is the output from printenv

kslote@ubuntu:~/tika/tika$ printenv
SHELL=/bin/bash
USERNAME=kslote
XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
DESKTOP_SESSION=gnome
PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
PWD=/home/kslote/tika/tika
HOME=/home/kslote
LOGNAME=kslote
_=/usr/bin/printenv
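
A quick way to sanity-check a PATH value like the one printed above is to split it and list every entry that is not an existing directory. The sketch below does only that. PathAudit is a made-up name, and the check is deliberately simple (it does not look at permissions or symlink targets).

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that flags suspicious PATH entries.
class PathAudit {

    // Returns the non-empty entries that do not name an existing directory.
    static List<String> invalidEntries(String pathVar) {
        List<String> invalid = new ArrayList<>();
        for (String entry : pathVar.split(File.pathSeparator)) {
            if (!entry.isEmpty() && !Files.isDirectory(Paths.get(entry))) {
                invalid.add(entry);
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        String path = System.getenv("PATH");
        for (String entry : invalidEntries(path == null ? "" : path)) {
            System.out.println("not a directory: " + entry);
        }
    }
}
```

Run against the printenv value above, a check like this would be expected to flag the first entry, which begins with a literal "PATH=", and the last entry, which points at a file path rather than a directory.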



Re: OCR with tika-server

Posted by Tyler Palsulich <tp...@gmail.com>.

Hi,

Hmm. Could you try adding tesseract to your PATH? How did you install
Tesseract? You should be able to do a straightforward `sudo apt-get install
tesseract-ocr`. After that, the OCR tests should pass. We're still running
into TIKA-1422, where a mail test fails. But, you can run just the OCR
tests with `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest
-DfailIfNoTests=false`.

Let me know if that works for you!
Tyler


Re: OCR with tika-server

Posted by kevin slote <ks...@gmail.com>.

I am working on ubuntu 10.4. and I am having some trouble.
Tesseract is installed correctly, but just doing a clone from the repo and
installing with maven, I am getting some errors.

This is before I did anything with tesseract installed.

Failed tests:   testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest):
Check for the image's text.
  testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
  testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)

Next I hard coded the tesseractPath:

I went into TesseractOCRConfig.java and hard coded 'tesseractPath'.
Then all tests passed and it built successfully, but when I went to post
some TIFFs to the server, that didn't work. So I tried adding some
System.out.println("hello world") calls (a little crude, I know) inside the
unit tests to confirm that tesseract was working correctly. It looks like
something happens in the unit test in TesseractOCRTest.java on the line that
says TesseractOCRConfig config = new TesseractOCRConfig();. Printing to
stdout before that line works, but I get nothing after it. That happens
before the assumeTrue(canRun(config));. So an exception is not getting
raised.
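
One way to rule out a path problem independently of the unit tests is to try
launching the binary from the same directory that was hard coded. The
following is a minimal sketch, not Tika code: the class TesseractCheck and
its methods binaryPath and canLaunch are hypothetical names, and it assumes
that the configured path is a directory to which the program name
"tesseract" is appended and that the binary accepts a -v flag to print its
version.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;

class TesseractCheck {

    // Builds the full path to the binary from a directory.
    // An empty directory means "let the OS search the PATH".
    static String binaryPath(String dir) {
        if (dir == null || dir.isEmpty()) {
            return "tesseract";
        }
        boolean hasSeparator = dir.endsWith("/") || dir.endsWith(File.separator);
        return hasSeparator ? dir + "tesseract" : dir + File.separator + "tesseract";
    }

    // Returns true if the process could be started at all.
    // The exit code is ignored; only the ability to launch is checked.
    static boolean canLaunch(String dir) {
        ProcessBuilder pb = new ProcessBuilder(binaryPath(dir), "-v");
        pb.redirectErrorStream(true); // in case the version is printed on stderr
        try {
            Process p = pb.start();
            try (InputStream out = p.getInputStream()) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = out.read(buf)) != -1) {
                    System.out.write(buf, 0, n); // echo whatever the binary reports
                }
                System.out.flush();
            }
            p.waitFor();
            return true;
        } catch (IOException e) {
            return false; // binary missing or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        String dir = args.length > 0 ? args[0] : "";
        System.out.println(binaryPath(dir) + " launchable: " + canLaunch(dir));
    }
}
```

Running it with the hard coded directory as the only argument (or with no
argument to search the PATH) shows whether the JVM can start the binary at
all, and echoes whatever version output the binary prints.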

Then, once everything is built, OCR does not work. That is why I figured I
would ask whether I missed some sort of configuration step in building it.
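
For reference, the kind of request being described can be sketched in plain
Java. This is only an illustration under stated assumptions: the class
TikaServerPut and the method putBytes are hypothetical names, and the sketch
assumes a tika-server listening on localhost port 9998 that accepts an HTTP
PUT of the raw file on /tika and answers with the extracted text.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class TikaServerPut {

    // Sends the bytes with HTTP PUT and returns the response body as text.
    static String putBytes(String endpoint, byte[] body, String contentType)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", contentType);
        conn.setRequestProperty("Accept", "text/plain");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        int status = conn.getResponseCode();
        if (status != HttpURLConnection.HTTP_OK) {
            throw new IOException("Unexpected HTTP status " + status);
        }
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed defaults: server on localhost:9998, first argument is a TIFF.
        byte[] image = Files.readAllBytes(Paths.get(args[0]));
        String text = putBytes("http://localhost:9998/tika", image, "image/tiff");
        System.out.println(text.trim().isEmpty() ? "(no text returned)" : text);
    }
}
```

An empty response for an image that clearly contains text would suggest that
the server answered but the OCR step did not run.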

Thanks a ton.

On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Dear Kevin,
>
> Sure, it already works :) 1.7-SNAPSHOT.
>
> See this wiki page:
>
> https://wiki.apache.org/tika/TikaOCR
>
> I'd be happy to discuss more.
>
> Thanks!
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
>
> -----Original Message-----
> From: kevin slote <ks...@gmail.com>
> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
> Date: Tuesday, September 30, 2014 at 8:52 AM
> To: "dev@tika.apache.org" <de...@tika.apache.org>
> Subject: OCR with tika-server
>
> >Hello all,
> >
> >I have been testing out the integration of tika with tesseract.
> >I was wondering if there is  a way to get tika-server to run with
> >tesseract's OCR capabilities?
> >
> >Best
> >
> >Kevin Slote
>
>

Re: OCR with tika-server

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

Dear Kevin,

Sure, it already works :) 1.7-SNAPSHOT.

See this wiki page:

https://wiki.apache.org/tika/TikaOCR

I'd be happy to discuss more.

Thanks!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: kevin slote <ks...@gmail.com>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Tuesday, September 30, 2014 at 8:52 AM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: OCR with tika-server

>Hello all,
>
>I have been testing out the integration of tika with tesseract.
>I was wondering if there is  a way to get tika-server to run with
>tesseract's OCR capabilities?
>
>Best
>
>Kevin Slote