Posted to dev@tika.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2014/10/07 01:47:36 UTC

Tesseract OCR always activated parser for images

Hi Folks,
Now that I have installed Tesseract, it runs on every image I pass through
Tika server or Tika app.
This is not okay, as it does not give me the type of metadata (MD) I am
looking for.
This is just a note to folks to say that, AFAIK, you would need to
unregister the parser from [0] and then rebuild from source in order to
maintain backwards compatibility in this regard.
Before I log a ticket for this, can anyone else confirm this, please?
Thanks
Lewis

[0]
https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser

-- 
*Lewis*
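
For readers who want to try the workaround described above, here is a minimal
sketch of the "unregister" step. It is not part of the original message. It
assumes the OCR parser is listed in the service file [0] under the class name
org.apache.tika.parser.ocr.TesseractOCRParser, and it relies on the Java
ServiceLoader convention that everything after '#' on a line is a comment.
The class UnregisterParser and its unregister method are hypothetical names
used only for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: comments out one provider entry in a
// META-INF/services file so ServiceLoader no longer discovers it.
class UnregisterParser {
    // Assumed class name of the Tesseract parser entry in the service file.
    static final String OCR_PARSER = "org.apache.tika.parser.ocr.TesseractOCRParser";

    // Returns a copy of the lines with the matching provider commented out.
    static List<String> unregister(List<String> lines, String className) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            // Ignore any trailing '#' comment before comparing the provider name.
            int hash = line.indexOf('#');
            String provider = (hash >= 0 ? line.substring(0, hash) : line).trim();
            out.add(provider.equals(className) ? "# " + line : line);
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java UnregisterParser <path-to-service-file>");
            return;
        }
        // Rewrites the service file in place; rebuild from source afterwards.
        Path file = Paths.get(args[0]);
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        Files.write(file, unregister(lines, OCR_PARSER), StandardCharsets.UTF_8);
    }
}
```

Run it against a local checkout's copy of the file in [0], then rebuild from
source as described in the message. Commenting the entry out, rather than
deleting it, keeps the change easy to spot and to revert.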

Re: Tesseract OCR always activated parser for images

Posted by Tyler Palsulich <tp...@gmail.com>.
I have a local patch with the two combined, but I was getting different
results between Mac and Linux... I'm not sure why. I'll post it in a couple
of hours.

Tyler
On Oct 7, 2014 10:55 AM, "Mattmann, Chris A (3980)" <
chris.a.mattmann@jpl.nasa.gov> wrote:

> I'll try and combine mine and Tyler's patch for 1422 and see if it
> fixes it :) Will test today.

Re: Tesseract OCR always activated parser for images

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
I'll try and combine mine and Tyler's patch for 1422 and see if it
fixes it :) Will test today.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++







Re: Tesseract OCR always activated parser for images

Posted by Tyler Palsulich <tp...@gmail.com>.
Confirmed. This is why we ran into TIKA-1422. But, Chris' patch may provide
the backwards compatibility you're looking for. What do you think?

Tyler
