Posted to dev@tika.apache.org by "Sergey Beryozkin (JIRA)" <ji...@apache.org> on 2014/07/11 12:30:05 UTC

[jira] [Commented] (TIKA-1351) Parser implementations should accept null content handlers

    [ https://issues.apache.org/jira/browse/TIKA-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058619#comment-14058619 ] 

Sergey Beryozkin commented on TIKA-1351:
----------------------------------------

See r1609677 for an initial update. PDFParser now checks whether the content handler is null and skips content extraction if it is. Other parsers can be updated gradually in the same way. This lets application services that want to offer metadata-only search significantly optimize the parse process.
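
A minimal sketch of the null-handler check described above, using only JDK classes so that it compiles without Tika on the classpath. The class NullHandlerSketch, its parse method, and the "title" metadata key are hypothetical stand-ins for illustration; this is not the code committed in r1609677, and Tika's real entry point is Parser.parse(InputStream, ContentHandler, Metadata, ParseContext).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a parser; not Tika's PDFParser. */
final class NullHandlerSketch {

    /**
     * Always records metadata; emits content events only when a handler is given.
     * Returns true if content was extracted, false if only metadata was read.
     */
    static boolean parse(String title, String body, ContentHandler handler,
                         Map<String, String> metadata) {
        // Metadata is populated whether or not a handler is supplied.
        metadata.put("title", title);

        // Assumption: a null handler means "metadata only", so the
        // (potentially expensive) content pass is skipped entirely.
        if (handler == null) {
            return false;
        }

        try {
            char[] chars = body.toCharArray();
            handler.startDocument();
            handler.characters(chars, 0, chars.length);
            handler.endDocument();
        } catch (SAXException e) {
            // Wrapped only to keep the sketch's signature free of checked exceptions.
            throw new IllegalStateException(e);
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();

        // Metadata-only call: no handler, no content events.
        boolean extracted = parse("Report", "Body text", null, metadata);
        System.out.println("content extracted: " + extracted + ", metadata: " + metadata);

        // Full call: a handler that collects the character events.
        StringBuilder text = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        extracted = parse("Report", "Body text", collector, metadata);
        System.out.println("content extracted: " + extracted + ", text: " + text);
    }
}
```

The point of the check is that a no-op handler only discards events after the parser has already done the work of producing them, whereas a null check lets the parser avoid that work altogether.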

> Parser implementations should accept null content handlers
> ----------------------------------------------------------
>
>                 Key: TIKA-1351
>                 URL: https://issues.apache.org/jira/browse/TIKA-1351
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Sergey Beryozkin
>            Priority: Minor
>
> Applications that let users search documents based only on their metadata do not need the content parsed.
> The only workaround I've found so far is to pass a no-op content handler, which ignores the content events but does not stop a parser such as PDFParser from parsing the content.
> Proposal: update the parser API docs to let implementers know that ContentHandler can be null, and update the shipped implementations to parse only the metadata if ContentHandler is null.



--
This message was sent by Atlassian JIRA
(v6.2#6252)