Posted to dev@lucene.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/03/21 18:06:41 UTC

[jira] [Comment Edited] (SOLR-9552) Upgrade to Tika 1.14 when available

    [ https://issues.apache.org/jira/browse/SOLR-9552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15934996#comment-15934996 ] 

Tim Allison edited comment on SOLR-9552 at 3/21/17 6:06 PM:
------------------------------------------------------------

Um, sure, we call it Tika 2.0. Do take a look at that branch and please do contribute to the design.  Let us know if that will meet your needs.

[~elyograg] already mentioned the key problem.  Even if there is a clean break in the design between native libs and non-native libs, even though we're now running Tika against a TB of data (3 million files) from Common Crawl before we do a release, and even though we try to be as responsive as we possibly can to JIRA issues about Tika behaving badly, some parser at some point is going to do something awful: an OOM blowout, a permanent hang, execution of malicious code, a slow-burning memory leak, etc.  These are things that cannot be handled by catch blocks or by asking a Thread to kindly stop.  So it is better to keep Tika in a separate JVM via tika-server, or to encapsulate it in another way: tika-batch (spawns a child process) or ForkParser (child jar and child process).
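
To make the "separate JVM" option concrete, here is a rough sketch of a caller that hands a document to a tika-server instance over HTTP and gives up after a timeout instead of hanging with the parser.  The class name RemoteTikaClient and the choice to return an empty Optional on failure are my own assumptions for illustration, not anything in Solr or Tika; the PUT to /tika with "Accept: text/plain" is how I understand tika-server's text extraction endpoint to work, so check it against the version you run.  It uses only the JDK's java.net.http client (Java 11+).

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;

/** Hypothetical client: sends bytes to a tika-server in another JVM. */
public class RemoteTikaClient {
    private final HttpClient client;
    private final URI endpoint;
    private final Duration timeout;

    public RemoteTikaClient(URI serverBase, Duration timeout) {
        this.client = HttpClient.newBuilder().connectTimeout(timeout).build();
        // Assumed endpoint: tika-server's plain-text extraction resource.
        this.endpoint = serverBase.resolve("/tika");
        this.timeout = timeout;
    }

    /**
     * Returns the extracted text, or empty if the server is unreachable,
     * answers with a non-200 status, or does not respond within the timeout.
     */
    public Optional<String> extractText(byte[] document) throws InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(endpoint)
                .timeout(timeout) // bounds the wait for the response
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200
                    ? Optional.of(response.body())
                    : Optional.empty();
        } catch (IOException e) {
            // Includes HttpTimeoutException and connection failures: a parser
            // that misbehaves remotely costs us one document, not our own JVM.
            return Optional.empty();
        }
    }
}
```

Usage would be something like new RemoteTikaClient(URI.create("http://localhost:9998"), Duration.ofSeconds(60)).extractText(bytes), with host, port and timeout being whatever your deployment uses.  Note that the timeout only protects the caller; restarting a server process that has hung or run out of memory is a separate concern.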

In short, [~daddywri], I regret that there is no bright line between well-behaved and ill-behaved parsers.  There is a bright line for native libs and non-Apache-friendly parsers: users have to know about them and "opt in".  But if there's anything else we can do to help ManifoldCF, let us know!

There are a few things that are holding up 2.0 (e.g., I've chosen to work on tactical issues instead of the more strategic 2.0 stuff), but take a look at the architecture and see if it's something that makes sense.

As for GSOC, yes, please, please, please let a bright GSOC'er figure out how to integrate Tika at a distance, whether that's through tika-server (SOLR-7632) or some other means.

And, yes, please let the same or another GSOC'er figure out how to allow handling of child documents and their metadata (SOLR-7229).

Oh, and if you have some extra GSOC cycles, there's always TIKA-1443. :)
I'm more than happy to chip in.



> Upgrade to Tika 1.14 when available
> -----------------------------------
>
>                 Key: SOLR-9552
>                 URL: https://issues.apache.org/jira/browse/SOLR-9552
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: contrib - DataImportHandler
>            Reporter: Tim Allison
>
>  Let's upgrade Solr as soon as 1.14 is available.
> P.S. I _think_ we're soon to wrap up work on 1.14.  Any last requests? 



