Posted to user@tika.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2016/02/11 20:45:01 UTC

RE: How is Tika used with Solr

x-post to Tika user's

Y and n.  If you run tika-app as:

java -jar tika-app.jar <input_dir> <output_dir>

it runs tika-batch under the hood (TIKA-1330, done as part of TIKA-1302).  This creates a parent and a child process: if the child notices a hung thread, it dies and the parent restarts it.  Likewise, if your OS gets upset with the child and kills it out of self-preservation, or if there's an OOM, the parent restarts the child.  You can also configure how often the child shuts itself down (with the parent restarting it) to mitigate memory leaks.

So, y, if your use case allows <input_dir> <output_dir>, then we now have that in Tika.

I've been wanting to add a similar watchdog to tika-server ... any interest in that?
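The parent/child pattern described above can be sketched roughly like this. This is a hypothetical harness, not tika-batch's actual implementation: the command line, timeout, and restart policy are all assumptions, and in real use the child command would be the tika-app invocation shown earlier.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class ChildWatchdog {
    private final String[] childCmd;   // e.g. {"java", "-jar", "tika-app.jar", in, out}
    private final long timeoutSeconds; // how long the child may run before we kill it

    public ChildWatchdog(String[] childCmd, long timeoutSeconds) {
        this.childCmd = childCmd;
        this.timeoutSeconds = timeoutSeconds;
    }

    /** Runs the child, killing and restarting it on hangs or crashes.
     *  Returns the number of restarts performed before a clean exit (or giving up). */
    public int run(int maxRestarts) throws IOException, InterruptedException {
        int restarts = 0;
        while (true) {
            Process child = new ProcessBuilder(childCmd)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .start();
            boolean finished = child.waitFor(timeoutSeconds, TimeUnit.SECONDS);
            if (!finished) {                      // hung child: kill it outright
                child.destroyForcibly().waitFor();
            } else if (child.exitValue() == 0) {  // clean exit: we're done
                return restarts;
            }
            if (++restarts > maxRestarts) {       // crash/OOM loop: give up eventually
                return restarts;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Demo with a trivial, well-behaved child; swap in the tika-app command in real use.
        int restarts = new ChildWatchdog(new String[]{"java", "-version"}, 60).run(3);
        System.out.println("restarts: " + restarts);
    }
}
```

The key point is that the parent only ever touches the child from the outside (waitFor/destroyForcibly), so a hung or corrupted child JVM can't take the parent down with it.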


-----Original Message-----
From: xavi jmlucjav [mailto:jmlucjav@gmail.com] 
Sent: Thursday, February 11, 2016 2:16 PM
To: solr-user <so...@lucene.apache.org>
Subject: Re: How is Tika used with Solr

I have found that when you deal with large amounts of all sorts of files, you eventually hit stuff (PDFs are typically nasty) that will hang Tika.  That is even worse than a crash or an OOM.
We used Aperture instead of Tika because at the time it provided a watchdog feature to kill what looked like a hung extracting thread.  That feature is super important for a robust text-extraction pipeline.  Has Tika gained such a feature yet?

xavier

On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson <er...@gmail.com>
wrote:

> Timothy's points are absolutely spot-on. In production scenarios, if 
> you use the simple "run Tika in a SolrJ program" approach you _must_ 
> abort the program on OOM errors and the like and  figure out what's 
> going on with the offending document(s). Or record the name somewhere 
> and skip it next time 'round. Or........
>
> How much you have to build in here really depends on your use case.
> For "small enough"
> sets of documents or one-time indexing, you can get by with dealing 
> with errors one at a time.
> For robust systems where you have to have indexing available at all 
> times and _especially_ where you don't control the document corpus, 
> you have to build something far more tolerant as per Tim's comments.
>
> FWIW,
> Erick
>
> On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B. 
> <ta...@mitre.org>
> wrote:
> > I completely agree on the impulse, and for the vast majority of the 
> > time
> (regular catchable exceptions), that'll work.  And, by vast majority, 
> aside from oom on very large files, we aren't seeing these problems 
> any more in our 3 million doc corpus (y, I know, small by today's 
> standards) from
> govdocs1 and Common Crawl over on our Rackspace vm.
> >
> > Given my focus on Tika, I'm overly sensitive to the worst case
> scenarios.  I find it encouraging, Erick, that you haven't seen these 
> types of problems, that users aren't complaining too often about 
> catastrophic failures of Tika within Solr Cell, and that this thread 
> is not yet swamped with integrators agreeing with me. :)
> >
> > However, because oom can leave memory in a corrupted state (right?),
> because you can't actually kill a thread for a permanent hang and 
> because Tika is a kitchen sink and we can't prevent memory leaks in 
> our dependencies, one needs to be aware that bad things can 
> happen...if only very, very rarely.  For a fellow traveler who has run 
> into these issues on massive data sets, see also [0].
> >
> > Configuring Hadoop to work around these types of problems is not too
> difficult -- it has to be done with some thought, though.  On 
> conventional single box setups, the ForkParser within Tika is one 
> option, tika-batch is another.  Hand rolling your own parent/child 
> process is non-trivial and is not necessary for the vast majority of use cases.
> >
> >
> > [0]
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w
> eb-content-nanite/
> >
> >
> >
> > -----Original Message-----
> > From: Erick Erickson [mailto:erickerickson@gmail.com]
> > Sent: Tuesday, February 09, 2016 10:05 PM
> > To: solr-user <so...@lucene.apache.org>
> > Subject: Re: How is Tika used with Solr
> >
> > My impulse would be to _not_ run Tika in its own JVM, just catch any
> exceptions in my code and "do the right thing". I'm not sure I see any 
> real benefit in yet another JVM.
> >
> > FWIW,
> > Erick
> >
> > On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B. 
> > <ta...@mitre.org>
> wrote:
> >> I have one answer here [0], but I'd be interested to hear what Solr
> users/devs/integrators have experienced on this topic.
> >>
> >> [0]
> >> http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CC
> >> Y1P 
> >> R09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.ou
> >> tlo
> >> ok.com%3E
> >>
> >> -----Original Message-----
> >> From: Steven White [mailto:swhite4141@gmail.com]
> >> Sent: Tuesday, February 09, 2016 6:33 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: How is Tika used with Solr
> >>
> >> Thank you Erick and Alex.
> >>
> >> My main question is with a long running process using Tika in the 
> >> same
> JVM as my application.  I'm running my file-system-crawler in its own 
> JVM (not Solr's).  On the Tika mailing list, it is suggested to run Tika's
> code in its own JVM and invoke it from my file-system-crawler using
> Runtime.getRuntime().exec().
> >>
> >> I fully understand Alex's suggestion, and the link provided by Erick,
> >> to
> use Tika outside Solr.  But what about using Tika within the same JVM
> as my file-system-crawler application -- or should I be making a system
> call to invoke another JAR that runs in its own JVM to extract the
> raw text?  Are there known issues with Tika when used in a long-running process?
> >>
> >> Steve
> >>
> >>
>

RE: How is Tika used with Solr

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Y, and you can't actually kill a thread.  You can ask nicely via Thread.interrupt(), but some of our dependencies don't bother to listen for that.  So you're pretty much left with a separate process as the only robust solution.
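A minimal illustration of the problem -- the worker below is a toy stand-in for a dependency that never checks its interrupt flag, so interrupt() is a no-op as far as it's concerned:

```java
public class InterruptDemo {
    // A worker that never checks its interrupt status -- like a parser stuck
    // in a tight loop inside a third-party library.
    public static Thread stubbornWorker() {
        Thread t = new Thread(() -> {
            long x = 0;
            while (true) {            // ignores interruption entirely
                x += System.nanoTime() % 7;
                if (x == -1) break;   // unreachable; keeps the loop from being optimized away
            }
        });
        t.setDaemon(true);            // so the JVM can still exit underneath it
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = stubbornWorker();
        t.start();
        t.interrupt();                // ask nicely...
        Thread.sleep(200);
        // ...but interruption is cooperative: the thread is still running.
        System.out.println("alive after interrupt: " + t.isAlive());
    }
}
```

The only way out of this in-process is Thread.stop(), which is deprecated for good reason -- hence the separate-process approach.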

So, we did the parent-child process thing for directory -> directory processing in tika-app via tika-batch.

The next step is to harden tika-server and to kick that off in a child process in a similar way.

For those who want to test their Tika harnesses (whether on a single box, Hadoop/Spark, etc.), we added a MockParser that will do whatever you tell it when it hits an "application/xml+mock" file.  The full set of options:

<mock>
    <!-- action can be "add" or "set" -->
    <metadata action="add" name="author">Nikolai Lobachevsky</metadata>

    <!-- element is the name of the sax event to write, p=paragraph;
         if the element is not specified, the default is <p> -->
    <write element="p">some content</write>

    <!-- write something to System.out -->
    <print_out>writing to System.out</print_out>

    <!-- write something to System.err -->
    <print_err>writing to System.err</print_err>

    <!-- hang
         millis: how many milliseconds to pause.  The actual hang time will
             probably be a bit longer than the value specified.

         heavy: whether or not the hang should do something computationally
             expensive.  If false, this just does a Thread.sleep(millis).
             This attribute is optional, with a default of heavy=false.

         pulse_millis: (required if "heavy" is true) how often to check whether
             the thread was interrupted or the total hang time exceeded millis

         interruptible: whether or not the parser will check to see if its
             thread has been interrupted; optional, with a default of true
    -->
    <hang millis="100" heavy="true" pulse_millis="10" interruptible="true" />

    <!-- throw an exception or error; optionally include a message or not -->
    <throw class="java.io.IOException">not another IOException</throw>

    <!-- trigger a genuine OutOfMemoryError -->
    <oom/>
</mock>
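Generating a small torture corpus of such files to point a harness at could look like the sketch below.  The file names, directory, and the choice of which mock elements to emit are all made up for illustration; the mock document shape is the one described above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class MockCorpus {
    // Wraps a body in a <mock> document and writes it to dir under the given name.
    public static Path write(Path dir, String name, String body) throws IOException {
        String doc = "<mock>\n" + body + "\n</mock>\n";
        return Files.writeString(dir.resolve(name), doc);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("mock-corpus");
        // One file per failure mode: a long hang, a thrown exception, an OOM.
        write(dir, "hang.xml", "  <hang millis=\"60000\" />");
        write(dir, "throw.xml", "  <throw class=\"java.io.IOException\">boom</throw>");
        write(dir, "oom.xml", "  <oom/>");
        try (Stream<Path> files = Files.list(dir)) {
            System.out.println("wrote " + files.count() + " mock files to " + dir);
        }
    }
}
```

Run your harness over the resulting directory and confirm it survives all three: the hang should trip the watchdog, the exception should be logged and skipped, and the OOM should kill (and restart) the child.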


Re: How is Tika used with Solr

Posted by xavi jmlucjav <jm...@gmail.com>.
Of course, but that code is very tricky, so if the extraction library takes
care of all that, it's a huge gain. The Aperture library I used worked very
well in that regard, and even though it did not use separate processes as
Timothy says, it never got stuck, if I remember correctly.


Re: How is Tika used with Solr

Posted by Erick Erickson <er...@gmail.com>.
Well, I'd imagine you could spawn threads and monitor/kill them as
necessary, although that doesn't deal with OOM errors....

FWIW,
Erick
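The monitor/kill pattern Erick describes can be sketched with a Future timeout.  Here extract() is a toy stand-in for a real parse call; as noted earlier in the thread, cancel(true) only helps when the worker actually listens for interrupts, and none of this catches an in-process OOM:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutExtractor {
    // Simulated extraction; a real version would call a parser that may never return.
    static String extract(long workMillis) throws InterruptedException {
        Thread.sleep(workMillis);
        return "text";
    }

    /** Runs extraction on the pool, giving up (returning null) after timeoutMillis. */
    public static String extractWithTimeout(ExecutorService pool,
                                            long workMillis, long timeoutMillis) {
        Future<String> f = pool.submit(() -> extract(workMillis));
        try {
            return f.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);  // interrupts the worker -- only effective if it listens
            return null;     // treat as a failed document and move on
        } catch (Exception e) {
            return null;     // parse error: also skip the document
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        System.out.println(extractWithTimeout(pool, 10, 2000));  // fast doc
        System.out.println(extractWithTimeout(pool, 5000, 100)); // "hung" doc
        pool.shutdownNow();
    }
}
```

Because extract() here sleeps (which is interruptible), the cancelled task actually dies and frees the pool thread; with a genuinely uninterruptible parser, the thread would leak -- which is the whole argument for a child process instead.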


Re: How is Tika used with Solr

Posted by xavi jmlucjav <jm...@gmail.com>.
For sure, if I need heavy-duty text extraction again, Tika would be the
obvious choice if it covers dealing with hangs. I never used tika-server
myself (not sure if it existed at the time); I just used Tika from my own JVM.

> > > Subject: Re: How is Tika used with Solr
> > >
> > > My impulse would be to _not_ run Tika in its own JVM, just catch any
> > exceptions in my code and "do the right thing". I'm not sure I see any
> > real benefit in yet another JVM.
> > >
> > > FWIW,
> > > Erick
> > >
> > > On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B.
> > > <ta...@mitre.org>
> > wrote:
> > >> I have one answer here [0], but I'd be interested to hear what Solr
> > users/devs/integrators have experienced on this topic.
> > >>
> > >> [0]
> > >> http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CC
> > >> Y1P
> > >> R09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.ou
> > >> tlo
> > >> ok.com%3E
> > >>
> > >> -----Original Message-----
> > >> From: Steven White [mailto:swhite4141@gmail.com]
> > >> Sent: Tuesday, February 09, 2016 6:33 PM
> > >> To: solr-user@lucene.apache.org
> > >> Subject: Re: How is Tika used with Solr
> > >>
> > >> Thank you Erick and Alex.
> > >>
> > >> My main question is with a long running process using Tika in the
> > >> same
> > JVM as my application.  I'm running my file-system-crawler in its own
> > JVM (not Solr's).  On Tika mailing list, it is suggested to run Tika's
> > code in it's own JVM and invoke it from my file-system-crawler using
> > Runtime.getRuntime().exec().
> > >>
> > >> I fully understand from Alex suggestion and link provided by Erick
> > >> to
> > use Tika outside Solr.  But what about using Tika within the same JVM
> > as my file-system-crawler application or should I be making a system
> > call to invoke another JAR, that runs in its own JVM to extract the
> > raw text?  Are there known issues with Tika when used in a long running
> process?
> > >>
> > >> Steve
> > >>
> > >>
> >
>

RE: How is Tika used with Solr

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Right.  If you can dump to a mirrored output directory, then tika-app will do all of the watchdog work for you.  If you can't, then, y, you'll have to do your own monitoring.

If you want to get fancy, you could try implementing FileResourceConsumer in tika-batch.  Look at FSFileResourceConsumer as an example; I've done this for reading Tika output and indexing w/ Lucene.

You might also look at StrawmanTikaAppDriver in the tika-batch module for an example of basic multithreaded code that does what you suggest.
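For reference, the parent-side "watchdog stuff" boils down to a supervision loop: start the child, and if it dies abnormally (OOM kill, hung-thread suicide, crash), start a fresh one.  A bare-bones, JDK-only sketch of that idea (the command and restart policy are illustrative, not actual tika-batch code):

```java
import java.io.IOException;
import java.util.List;

// Sketch of the parent/child restart loop that a batch watchdog performs:
// if the child exits abnormally, the parent simply launches a fresh one.
public class RestartingParent {

    /**
     * Runs the command, restarting it on non-zero exit, up to maxRestarts
     * times.  Returns the number of restarts actually performed.
     */
    public static int superviseCommand(List<String> command, int maxRestarts)
            throws IOException, InterruptedException {
        int restarts = 0;
        while (true) {
            Process child = new ProcessBuilder(command).inheritIO().start();
            int exit = child.waitFor();
            if (exit == 0 || restarts >= maxRestarts) {
                return restarts;
            }
            restarts++;   // child died abnormally: start a fresh one
        }
    }

    public static void main(String[] args) throws Exception {
        // "false" always exits 1, so the parent restarts it maxRestarts times.
        int n = superviseCommand(List.of("false"), 3);
        System.out.println("restarts: " + n);
    }
}
```

The real tika-batch parent adds more (hung-thread detection inside the child, periodic planned restarts to flush leaks), but the supervision skeleton is the same.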


Re: How is Tika used with Solr

Posted by Steven White <sw...@gmail.com>.
Tim,

In my case, I have to use Tika as follows:

    java -jar tika-app.jar -t <input_file>

I will be invoking the above command from my Java app using
Runtime.getRuntime().exec().  I will capture stdout and stderr to get
back the raw text I need.  My app's use case will not allow me to use
<input_dir> <output_dir>; that is out of the question.

Reading your summary, it looks like I won't get this watchdog monitoring
and thus have to implement my own.  Can you confirm?

Thanks

Steve

