Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2010/04/30 16:39:53 UTC

[jira] Created: (TIKA-416) Out-of-process text extraction

Out-of-process text extraction
------------------------------

                 Key: TIKA-416
                 URL: https://issues.apache.org/jira/browse/TIKA-416
             Project: Tika
          Issue Type: New Feature
          Components: parser
            Reporter: Jukka Zitting
            Priority: Minor


There's currently no easy way to guard against JVM crashes or excessive memory or CPU use caused by parsing very large, broken, or intentionally malicious input documents. To better protect against such cases, and to make Tika's resource consumption more manageable in general, it would be great if we had a way to run Tika parsers in separate JVM processes. This could be handled either as a separate "Tika parser daemon" or as an explicitly managed pool of forked JVMs.
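
To make the forked-JVM option a bit more concrete, here is a minimal sketch of what launching one guarded child JVM per document might look like. It is an illustration under stated assumptions, not a proposed implementation: the class ForkedExtractor, its methods, and the worker class name passed to it are all hypothetical, and the worker (which would do the actual Tika parsing and print the extracted text) is not shown. The heap cap uses the standard -Xmx option and the time limit uses Process.waitFor with a timeout.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika: runs one child JVM per document.
final class ForkedExtractor {

    // Builds the command line for a child JVM with a capped heap.
    // workerClass is an assumed main class that parses args[0] and
    // writes the extracted text to standard output.
    static List<String> buildCommand(String workerClass, int maxHeapMb, String... args) {
        List<String> cmd = new ArrayList<>();
        cmd.add(Paths.get(System.getProperty("java.home"), "bin", "java").toString());
        cmd.add("-Xmx" + maxHeapMb + "m");               // bound memory use
        cmd.add("-cp");
        cmd.add(System.getProperty("java.class.path"));  // reuse the parent's classpath
        cmd.add(workerClass);
        cmd.addAll(Arrays.asList(args));
        return cmd;
    }

    // Runs the command, kills the child if it exceeds the time limit,
    // and returns whatever the child wrote to stdout/stderr.
    static String run(List<String> cmd, long timeoutSeconds)
            throws IOException, InterruptedException {
        // Output goes to a temp file so a chatty child cannot block on a full pipe.
        Path out = Files.createTempFile("extracted", ".txt");
        try {
            Process child = new ProcessBuilder(cmd)
                    .redirectErrorStream(true)
                    .redirectOutput(out.toFile())
                    .start();
            if (!child.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                child.destroyForcibly().waitFor();       // bound CPU/wall-clock use
                throw new IOException("Extraction timed out after " + timeoutSeconds + "s");
            }
            if (child.exitValue() != 0) {
                // A crash or OutOfMemoryError in the child surfaces here
                // instead of taking down the parent JVM.
                throw new IOException("Child JVM exited with code " + child.exitValue());
            }
            return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out);
        }
    }
}
```

A caller would do something like ForkedExtractor.run(ForkedExtractor.buildCommand("example.Worker", 64, "doc.pdf"), 30). Starting a fresh JVM per document is the simplest form of isolation but pays the JVM startup cost every time; a daemon or a managed pool of forked JVMs would amortize that cost while keeping the same crash, memory, and time boundaries.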

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TIKA-416) Out-of-process text extraction

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862690#action_12862690 ] 

Chris A. Mattmann commented on TIKA-416:
----------------------------------------

+1, this sounds like a great idea!

We did some work on this in OODT, in the form of simple external metadata ("met") extractors and the like. Maybe we could follow a similar approach here. Check out:

http://svn.apache.org/repos/asf/incubator/oodt/cas-metadata/trunk/src/main/java/gov/nasa/jpl/oodt/cas/metadata/extractors/ExternMetExtractor.java

and 

http://svn.apache.org/repos/asf/incubator/oodt/cas-metadata/trunk/src/main/resources/examples/extern-config.xml

as some examples of how to deal with this. (Note: under OODT-3 we are still in the process of converting over the licenses, and there are no "official" incubator releases of OODT yet; I just wanted to mention it as a pointer to one way of getting this done.) You rock and I can't wait for this feature!
