Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2010/07/05 22:48:54 UTC

[jira] Commented: (TIKA-456) Support timeouts for parsers

    [ https://issues.apache.org/jira/browse/TIKA-456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885340#action_12885340 ] 

Ken Krugler commented on TIKA-456:
----------------------------------

Here is the comment thread on this issue with Jukka - it's about the right thing to use in place of the <ParsedDatum> response type.

Jukka said:

This would certainly be useful and something I've been hoping to focus
more on (see the "security" point in
http://markmail.org/message/ggihw2cns53t6ayl from 2007 :-).

The biggest problem with the FutureTask approach is that it's quite
difficult if not impossible to support proper streaming with it. One
possible approach that would avoid this problem is to modify the
ParsingReader class to include timeout support in the pipe it uses.
Modifying that approach to support the full SAX event stream should be
doable, though not necessarily trivial.
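
Jukka's suggestion is to build the timeout into ParsingReader's pipe itself; as a rough external approximation of the same effect, here is a sketch of a Reader wrapper that bounds each read() with a timeout. The TimeoutReader name and the per-read timeout policy are made up for illustration - this is not existing Tika code:

    import java.io.IOException;
    import java.io.Reader;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Sketch only: wraps a Reader (e.g. the one ParsingReader exposes) so that
    // each read() fails if no data arrives within the given timeout.
    class TimeoutReader extends Reader {

        private final Reader wrapped;
        private final long timeout;
        private final TimeUnit unit;
        private final ExecutorService executor = Executors.newSingleThreadExecutor();

        TimeoutReader(Reader wrapped, long timeout, TimeUnit unit) {
            this.wrapped = wrapped;
            this.timeout = timeout;
            this.unit = unit;
        }

        @Override
        public int read(final char[] cbuf, final int off, final int len) throws IOException {
            Future<Integer> result = executor.submit(new Callable<Integer>() {
                public Integer call() throws IOException {
                    return wrapped.read(cbuf, off, len);
                }
            });
            try {
                return result.get(timeout, unit);
            } catch (TimeoutException e) {
                // The worker thread may stay blocked until close() shuts things down.
                result.cancel(true);
                throw new IOException("Timed out waiting for parser output");
            } catch (Exception e) {
                throw new IOException(e);
            }
        }

        @Override
        public void close() throws IOException {
            executor.shutdownNow();
            wrapped.close();
        }
    }

This only bounds the plain text output, though - handling the full SAX event stream is the non-trivial part, which is where the queue idea below comes in.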

Jukka added:

The main problem is that in the streaming case there is no single
ParsedDatum instance that would be the result of the parsing process.
Instead you have a sequence of SAX events that you should be able to
deliver to the client application as soon as they become available
without waiting for all the rest of the events to arrive first.

Instead of a FutureTask, I'd envision us using a BlockingQueue or
something similar to insert timeout handling between the parser that
generates the SAX events and the client that consumes them.
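
As a very rough sketch of that queue idea (the event modelling, class names, and end-of-stream handling here are made up, not existing Tika code), the parser thread could push events into a queue via a ContentHandler like this:

    import java.util.concurrent.BlockingQueue;
    import org.xml.sax.Attributes;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.AttributesImpl;
    import org.xml.sax.helpers.DefaultHandler;

    // Producer side: runs on the parser thread and queues each SAX event.
    class QueueingContentHandler extends DefaultHandler {

        // Each queued item knows how to replay itself against the real handler.
        interface SaxEvent {
            void deliver(ContentHandler target) throws SAXException;
        }

        private final BlockingQueue<SaxEvent> queue;

        QueueingContentHandler(BlockingQueue<SaxEvent> queue) {
            this.queue = queue;
        }

        private void enqueue(SaxEvent event) throws SAXException {
            try {
                queue.put(event);
            } catch (InterruptedException e) {
                throw new SAXException("Interrupted while queueing SAX event", e);
            }
        }

        @Override
        public void startElement(final String uri, final String local, final String name,
                Attributes atts) throws SAXException {
            // Attributes instances may be reused by the parser, so copy them first.
            final Attributes copy = new AttributesImpl(atts);
            enqueue(new SaxEvent() {
                public void deliver(ContentHandler target) throws SAXException {
                    target.startElement(uri, local, name, copy);
                }
            });
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            final char[] copy = new char[length];
            System.arraycopy(ch, start, copy, 0, length);
            enqueue(new SaxEvent() {
                public void deliver(ContentHandler target) throws SAXException {
                    target.characters(copy, 0, copy.length);
                }
            });
        }

        // endElement(), endDocument(), etc. would be queued the same way.
    }

and the consuming side would drain the queue with a timeout, something like this (assuming a BlockingQueue named queue and the client application's handler clientHandler):

    QueueingContentHandler.SaxEvent event;
    while ((event = queue.poll(30, TimeUnit.SECONDS)) != null) {
        event.deliver(clientHandler);  // forward to the real ContentHandler
    }
    // null means the parser stalled; a real version needs an explicit
    // end-of-stream marker to tell "finished" apart from "timed out".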

Ken adds (now):

My thoughts on this: if you provide the content handler to the Callable constructor, then you can just return <Metadata>, even though that's not strictly needed, since the incoming metadata map is modified in place.

The content handler would be called during processing, so this should take care of streaming.
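
A minimal sketch of that shape (just how I'd see TikaCallable changing, not committed code):

    import java.io.InputStream;
    import java.util.concurrent.Callable;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.xml.sax.ContentHandler;

    // The caller supplies the ContentHandler, so SAX events stream to it while
    // the parse runs; the Metadata is modified in place and returned only for
    // convenience when calling FutureTask.get().
    class TikaCallable implements Callable<Metadata> {

        private final Parser parser;
        private final ContentHandler handler;
        private final InputStream input;
        private final Metadata metadata;

        TikaCallable(Parser parser, ContentHandler handler, InputStream input, Metadata metadata) {
            this.parser = parser;
            this.handler = handler;
            this.input = input;
            this.metadata = metadata;
        }

        public Metadata call() throws Exception {
            parser.parse(input, handler, metadata, new ParseContext());
            return metadata;
        }
    }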


> Support timeouts for parsers
> ----------------------------
>
>                 Key: TIKA-456
>                 URL: https://issues.apache.org/jira/browse/TIKA-456
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Ken Krugler
>            Assignee: Chris A. Mattmann
>
> There are a number of reasons why Tika could hang while parsing. One common case is when a parser is fed an incomplete document, such as what happens when limiting the amount of data fetched during a web crawl.
> One solution is to create a TikaCallable that wraps the Tika parser, and then use this with a FutureTask. For example, when using a ParsedDatum POJO for the results of the parse operation, I do something like this:
>     Parser parser = new AutoDetectParser();
>     Callable<ParsedDatum> c = new TikaCallable(parser, contentHandler, inputStream, metadata);
>     FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
>     Thread t = new Thread(task);
>     t.start();
>     // Blocks until the parse finishes or the timeout expires (then throws TimeoutException).
>     ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);
> And TikaCallable looks like:
> class TikaCallable implements Callable<ParsedDatum> {
>     private final Parser _parser;
>     private final ContentHandler _handler;
>     private final InputStream _input;
>     private final Metadata _metadata;
>     public TikaCallable(Parser parser, ContentHandler handler, InputStream is, Metadata metadata) {
>         _parser = parser;
>         _handler = handler;
>         _input = is;
>         _metadata = metadata;
>         ...
>     }
>     public ParsedDatum call() throws Exception {
>         ....
>         _parser.parse(_input, _handler, _metadata, new ParseContext());
>         ....
>     }
> }
> This seems like it would be generally useful, as I doubt that we'd ever be able to guarantee that none of the parsers wrapped by Tika could hang.
> One idea is to create a TimeoutParser that wraps a regular Tika Parser. E.g. something like:
>   Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);
> Then the call to p.parse(...) would create a Callable (similar to the code above) and use the specified timeout when calling task.get() - see the sketch after this quoted description.
> One minus with this approach is that it creates a new thread for each parse request, but I don't think the thread overhead is significant when compared to the typical parser operation.
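
A rough sketch of that TimeoutParser idea (names and details are illustrative only; a real version would implement the full Parser interface and delegate methods like getSupportedTypes() to the wrapped parser, and would handle exceptions properly):

    import java.io.InputStream;
    import java.util.concurrent.Callable;
    import java.util.concurrent.FutureTask;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.xml.sax.ContentHandler;

    // Sketch only: shows just the timeout plumbing around parse().
    class TimeoutParser {

        private final Parser wrapped;
        private final long timeout;
        private final TimeUnit unit;

        TimeoutParser(Parser wrapped, long timeout, TimeUnit unit) {
            this.wrapped = wrapped;
            this.timeout = timeout;
            this.unit = unit;
        }

        public void parse(final InputStream stream, final ContentHandler handler,
                final Metadata metadata, final ParseContext context) throws Exception {
            FutureTask<Void> task = new FutureTask<Void>(new Callable<Void>() {
                public Void call() throws Exception {
                    wrapped.parse(stream, handler, metadata, context);
                    return null;
                }
            });
            Thread t = new Thread(task, "Tika parser");
            t.setDaemon(true);  // don't keep the JVM alive for a hung parser
            t.start();
            try {
                task.get(timeout, unit);
            } catch (TimeoutException e) {
                task.cancel(true);  // interrupts the parser thread, which may or may not stop it
                throw e;
            }
        }
    }

Since the caller's ContentHandler is invoked directly on the parser thread while the parse runs, streaming output is preserved; only the wait for completion is bounded.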
