Posted to user@tika.apache.org by Sergey Tsalkov <st...@gmail.com> on 2015/08/20 08:19:52 UTC

want to disable tesseract ocr parser

Hey awesome Tika folks!
I'm writing because I want to disable the
TesseractOCRParser: it makes Tika take longer to finish, and I don't
need the OCRed results.

I can't simply uninstall tesseract from the system because I use it
for other things.

I thought about sending Tika a custom PATH that excludes /usr/bin so
it can't find tesseract, but that seems ugly and likely to break
things.

Then I thought I could pass a custom config.xml to disable it, but I
can't figure out how to write the config file.

I would greatly appreciate any help!

Thanks,
Sergey

Re: want to disable tesseract ocr parser

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

Hey Sergey - why not just take the directory holding your tesseract
binary out of your $PATH environment variable? Leave the binary where
it is (e.g., /usr/bin) and simply exclude that directory from your path.

If you want to go the exclude route, check out:

http://s.apache.org/3O0

You could use mime-excludes to exclude the TesseractOCRParser for
MIME types that you don’t want it called on.
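
For later readers, a config along the following lines might express the
mime-exclude idea. It is an untested sketch, assuming the <mime-exclude>
element described on the Tika configuration page can be applied to an
explicitly listed TesseractOCRParser entry. The image/jpeg type is only
an illustration and does not come from this thread.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default parser minus OCR, so Tesseract is only reached via the entry below -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <!-- Assumption: re-add OCR explicitly, but skip the listed MIME types.
         image/jpeg is a placeholder; substitute the types you want left alone. -->
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <mime-exclude>image/jpeg</mime-exclude>
    </parser>
  </parsers>
</properties>
```

One way to point Tika at such a file is the tika.config system property.
If OCR is never wanted for any type, excluding the parser outright is
simpler than listing MIME types.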

HTH.

Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





Re: want to disable tesseract ocr parser

Posted by Chris Mattmann <ch...@gmail.com>.

Perms granted! :-) https://wiki.apache.org/tika/ContributorsGroup

—
Chris Mattmann
chris.mattmann@gmail.com






-----Original Message-----
From: Sergey Tsalkov <st...@gmail.com>
Reply-To: <us...@tika.apache.org>
Date: Thursday, August 20, 2015 at 10:31 PM
To: <us...@tika.apache.org>
Subject: Re: want to disable tesseract ocr parser

>Happy to do that, Chris! I've created my account, username is
>SergeyTsalkov.

Re: want to disable tesseract ocr parser

Posted by Chris Mattmann <ch...@gmail.com>.

You are totally right, and your use case is as
valid as any other! :-)

Thanks for clarifying. Yes, we can and should make this
clearer in the documentation.

Cheers,
Chris

—
Chris Mattmann
chris.mattmann@gmail.com






-----Original Message-----
From: Sergey Tsalkov <st...@gmail.com>
Reply-To: <us...@tika.apache.org>
Date: Thursday, August 20, 2015 at 10:51 PM
To: <us...@tika.apache.org>
Subject: Re: want to disable tesseract ocr parser

>It's great functionality to have, Chris, and tesseract is certainly my
>choice for OCR, too! I'm certainly not suggesting that it be removed
>-- maybe just that the user be made aware of it more deliberately,
>with the official documentation mentioning that this happens and how
>to disable it. In my case, it triggered on an image embedded within an
>office doc, so it caught me by surprise more so than if I'd been
>throwing jpegs at Tika directly.
>
>But then again, maybe my use case is the oddball here -- most people
>aren't cranking servers around the clock parsing countless millions of
>documents, and therefore wouldn't notice some increase in CPU use!

Re: want to disable tesseract ocr parser

Posted by Sergey Tsalkov <st...@gmail.com>.

It's great functionality to have, Chris, and tesseract is certainly my
choice for OCR, too! I'm certainly not suggesting that it be removed
-- maybe just that the user be made aware of it more deliberately,
with the official documentation mentioning that this happens and how
to disable it. In my case, it triggered on an image embedded within an
office doc, so it caught me by surprise more so than if I'd been
throwing jpegs at Tika directly.

But then again, maybe my use case is the oddball here -- most people
aren't cranking servers around the clock parsing countless millions of
documents, and therefore wouldn't notice some increase in CPU use!



On Thu, Aug 20, 2015 at 10:41 PM, Chris Mattmann <ma...@apache.org> wrote:
> Thanks Sergey. It’s certainly something that adds overhead;
> I’ve seen it too. But with all the capability that tesseract
> adds (the OCR), it’s a trade we’re willing to make, since
> we can disable it pretty easily via configuration.
>
> Speaking from a biased perspective of helping to implement it ;)
>
> Cheers,
> Chris

Re: want to disable tesseract ocr parser

Posted by Chris Mattmann <ma...@apache.org>.

Thanks Sergey. It’s certainly something that adds overhead;
I’ve seen it too. But with all the capability that tesseract
adds (the OCR), it’s a trade we’re willing to make, since
we can disable it pretty easily via configuration.

Speaking from a biased perspective of helping to implement it ;)

Cheers,
Chris

—
Chris Mattmann
chris.mattmann@gmail.com






-----Original Message-----
From: Sergey Tsalkov <st...@gmail.com>
Reply-To: <us...@tika.apache.org>
Date: Thursday, August 20, 2015 at 10:40 PM
To: <us...@tika.apache.org>
Subject: Re: want to disable tesseract ocr parser

>I'll mention my situation again on the wiki, but if any Tika
>developers are reading this, I'd consider this a bug report! I've been
>using Tika for quite a while. I use very expensive hardware to churn
>through tens of millions of documents very rapidly, pulling out
>plaintext and metadata. Tika has generally performed extremely well
>under this stress -- never a crash or screwup!
>
>But then one day it got about 50% slower, and I couldn't figure out
>why for a while. I just happened to run 'ps xf' and noticed that Tika
>was spawning all these tesseract processes. Turned out that I'd never
>had tesseract installed before. I had installed it just recently for a
>separate project, and Tika's behavior silently changed because of
>that.
>
>Not sure if that fits your definition of a bug, but it's certainly
>unexpected behavior as far as I'm concerned!
>
>Thanks again everyone!

Re: want to disable tesseract ocr parser

Posted by Sergey Tsalkov <st...@gmail.com>.

I'll mention my situation again on the wiki, but if any Tika
developers are reading this, I'd consider this a bug report! I've been
using Tika for quite a while. I use very expensive hardware to churn
through tens of millions of documents very rapidly, pulling out
plaintext and metadata. Tika has generally performed extremely well
under this stress -- never a crash or screwup!

But then one day it got about 50% slower, and I couldn't figure out
why for a while. I just happened to run 'ps xf' and noticed that Tika
was spawning all these tesseract processes. Turned out that I'd never
had tesseract installed before. I had installed it just recently for a
separate project, and Tika's behavior silently changed because of
that.

Not sure if that fits your definition of a bug, but it's certainly
unexpected behavior as far as I'm concerned!

Thanks again everyone!

On Thu, Aug 20, 2015 at 10:31 PM, Sergey Tsalkov <st...@gmail.com> wrote:
> Happy to do that, Chris! I've created my account, username is SergeyTsalkov.

Re: want to disable tesseract ocr parser

Posted by Sergey Tsalkov <st...@gmail.com>.

Happy to do that, Chris! I've created my account, username is SergeyTsalkov.

On Thu, Aug 20, 2015 at 10:24 PM, Mattmann, Chris A (3980)
<ch...@jpl.nasa.gov> wrote:
> Thanks Sergey!
>
> Please feel free to add a page on the wiki:
>
> http://wiki.apache.org/tika/
>
> describing your use case. I would appreciate it!
> Once you sign up, tell me your username, or tell anyone
> on this list (dev@tika), and we’ll get you permissions so you can
> create the page.
>
> Cheers,
> Chris

Re: want to disable tesseract ocr parser

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

Thanks Sergey!

Please feel free to add a page on the wiki:

http://wiki.apache.org/tika/

describing your use case. I would appreciate it!
Once you sign up, tell me your username, or tell anyone
on this list (dev@tika), and we’ll get you permissions so you can
create the page.

Cheers,
Chris

-----Original Message-----
From: Sergey Tsalkov <st...@gmail.com>
Reply-To: "user@tika.apache.org" <us...@tika.apache.org>
Date: Thursday, August 20, 2015 at 10:22 PM
To: "user@tika.apache.org" <us...@tika.apache.org>
Subject: Re: want to disable tesseract ocr parser

>Thanks guys! Nick, your config file was exactly what I was looking
>for, though it took a minor tweak because you forgot to open the
>parser tag. I'm posting the corrected config below for anyone who
>refers to this thread in the future:
>
><?xml version="1.0" encoding="UTF-8"?>
><properties>
>  <parsers>
>    <parser class="org.apache.tika.parser.DefaultParser">
>      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>    </parser>
>  </parsers>
></properties>

Re: want to disable tesseract ocr parser

Posted by Sergey Tsalkov <st...@gmail.com>.

Thanks guys! Nick, your config file was exactly what I was looking
for, though it took a minor tweak because you forgot to open the
parser tag. I'm posting the corrected config below for anyone who
refers to this thread in the future:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>

On Thu, Aug 20, 2015 at 1:26 AM, Nick Burch <ni...@apache.org> wrote:
> On 20/08/15 07:19, Sergey Tsalkov wrote:
>>
>> Then I thought I could pass a custom config.xml to disable it, but I
>> can't figure out how to write the config file.
>
>
> See http://tika.apache.org/1.10/configuring.html#Configuring_Parsers for
> details of the parser configuration
>
> You should be fine with a config file like:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>   <parsers>
>     <!-- Default Parser except no OCR -->
>       <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>     </parser>
>   </parsers>
> </properties>
>
> Thanks
> Nick

Re: want to disable tesseract ocr parser

Posted by Nick Burch <ni...@apache.org>.

On 20/08/15 07:19, Sergey Tsalkov wrote:
> Then I thought I could pass a custom config.xml to disable it, but I
> can't figure out how to write the config file.

See http://tika.apache.org/1.10/configuring.html#Configuring_Parsers for 
details of the parser configuration

You should be fine with a config file like:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
   <parsers>
     <!-- Default Parser except no OCR -->
       <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
     </parser>
   </parsers>
</properties>

Thanks
Nick