Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/04/07 15:23:13 UTC

[jira] [Commented] (TIKA-456) Support timeouts for parsers

    [ https://issues.apache.org/jira/browse/TIKA-456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483137#comment-14483137 ] 

Tim Allison commented on TIKA-456:
----------------------------------

bq. And in Hadoop we can limit the number of times a child JVM is reused, so eventually the hung threads get cleaned up.

[~kkrugler], thank you for this pointer.  We're in [discussion | https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0] with Common Crawl to integrate Tika into their process.  If you'd like to participate in that discussion or contribute your lessons learned, we'd very much appreciate your input!

At the very least, please consider sharing your hints [here|https://wiki.apache.org/tika/TikaInHadoop].

> Support timeouts for parsers
> ----------------------------
>
>                 Key: TIKA-456
>                 URL: https://issues.apache.org/jira/browse/TIKA-456
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Ken Krugler
>            Assignee: Chris A. Mattmann
>
> There are a number of reasons why Tika could hang while parsing. One common case is when a parser is fed an incomplete document, such as what happens when limiting the amount of data fetched during a web crawl.
> One solution is to create a TikaCallable that wraps the Tika parser, and then use this with a FutureTask. For example, when using a ParsedDatum POJO for the results of the parse operation, I do something like this:
>     Parser parser = new AutoDetectParser();
>     Callable<ParsedDatum> c = new TikaCallable(parser, contenthandler, inputstream, metadata);
>     FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
>     Thread t = new Thread(task);
>     t.start();
>     ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);
> And the TikaCallable class looks like:
> class TikaCallable implements Callable<ParsedDatum> {
>     public TikaCallable(Parser parser, ContentHandler handler, InputStream is, Metadata metadata) {
>         _parser = parser;
>         _handler = handler;
>         _input = is;
>         _metadata = metadata;
>         ...
>     }
>     public ParsedDatum call() throws Exception {
>         ....
>         _parser.parse(_input, _handler, _metadata, new ParseContext());
>         ....
>     }
> }
> This seems like it would be generally useful, as I doubt that we'd ever be able to guarantee that none of the parsers being wrapped by Tika could ever hang.
> One idea is to create a TimeoutParser that wraps a regular Tika Parser. E.g. something like:
>   Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);
> Then the call to p.parse(...) would create a Callable (similar to the code above) and use the specified timeout when calling task.get().
> One drawback of this approach is that it creates a new thread for each parse request, but I don't think the thread overhead is significant compared to a typical parse operation.
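
For illustration only, the FutureTask-with-timeout pattern described in the issue could be sketched roughly as follows. This is not Tika code: TimedCall and its run method are hypothetical names, and the sketch uses only java.util.concurrent so that it stands alone. In the Tika case the Callable passed in would presumably wrap the parser.parse(...) call shown in TikaCallable above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of Tika): runs a task on its own thread with a deadline. */
final class TimedCall {
    private TimedCall() {}

    static <T> T run(Callable<T> work, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        FutureTask<T> task = new FutureTask<>(work);
        // One new thread per call, as in the issue description.
        Thread worker = new Thread(task, "timed-call");
        // Daemon thread: a hung worker should not keep the JVM from exiting.
        worker.setDaemon(true);
        worker.start();
        try {
            // Blocks until the task finishes or the timeout elapses.
            return task.get(timeout, unit);
        } catch (TimeoutException e) {
            // Best effort only: this interrupts the worker, but code that
            // ignores interrupts will keep running in the background.
            task.cancel(true);
            throw e;
        }
    }
}
```

Note that the timeout only releases the caller. A worker that never checks its interrupt status stays alive, which is why limiting child JVM reuse (as mentioned for Hadoop above) is still useful for cleaning up hung threads.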


