Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/07/30 17:17:04 UTC

[jira] [Updated] (TIKA-1701) Fix DigestingParser to handle truncated package files more robustly

     [ https://issues.apache.org/jira/browse/TIKA-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1701:
------------------------------
    Description: 
On a recent run against Common Crawl data, I found that the DigestingParser's strategy of mark() --> digest stream --> reset() _before_ the parse causes problems with truncated package files: the digester hits the EOF exception before the embedded files can be parsed.

We might want to do the digesting after the parse (?) or wrap the InputStream to digest each byte as it is read.

In a very few cases, more attachments could be read with the DigestingParser than without, but the opposite was far more common.
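
The second option above (wrapping the InputStream so that each byte is digested as it is read) could look roughly like the sketch below. It uses only the JDK's java.security.DigestInputStream and is not taken from the Tika code base. The class name DigestWhileParsingSketch, the StreamConsumer interface (a stand-in for the real parser) and the method names are hypothetical. The sketch assumes that, once the parser returns, any unread remainder should be drained so the digest covers the whole stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DigestWhileParsingSketch {

    /** Hypothetical stand-in for a parser; it may stop reading before EOF. */
    public interface StreamConsumer {
        void consume(InputStream in) throws IOException;
    }

    /**
     * Runs the consumer against a digesting wrapper instead of digesting
     * up front with mark()/reset(). Bytes are fed to the digest as the
     * consumer reads them; the unread remainder is drained afterwards.
     */
    public static String digestWhileConsuming(InputStream raw, String algorithm,
                                              StreamConsumer parser) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown digest algorithm: " + algorithm, e);
        }
        try (DigestInputStream in = new DigestInputStream(raw, md)) {
            parser.consume(in);
            // Drain what the consumer left unread so the digest covers every byte.
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // the read itself updates the digest; nothing else to do
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return toHex(md.digest());
    }

    /** Lower-case hex encoding of a byte array. */
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.US_ASCII);
        // The "parser" here reads only the first two bytes and then stops.
        String digest = digestWhileConsuming(
                new ByteArrayInputStream(data), "MD5", in -> in.read(new byte[2]));
        System.out.println(digest);
    }
}
```

With this arrangement the parse is the first thing to touch the stream, so a failure in the digesting pass can no longer prevent the embedded files from being handled. Whether a digest of a truncated file is still worth reporting is a separate decision that this sketch does not make.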

  was:
On a recent run against Common Crawl data, I found that the DigestingParser's strategy of mark() -> read stream -> reset() _before_ the parse causes problems with truncated package files: the digester hits the EOF exception before the embedded files can be parsed.

We might want to do the digesting after the parse (?) or wrap the InputStream to digest each byte as it is read.

In a very few cases, more attachments could be read with the DigestingParser than without, but the opposite was far more common.


> Fix DigestingParser to handle truncated package files more robustly
> -------------------------------------------------------------------
>
>                 Key: TIKA-1701
>                 URL: https://issues.apache.org/jira/browse/TIKA-1701
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Trivial
>
> On a recent run against Common Crawl data, I found that the DigestingParser's strategy of mark() --> digest stream --> reset() _before_ the parse causes problems with truncated package files: the digester hits the EOF exception before the embedded files can be parsed.
> We might want to do the digesting after the parse (?) or wrap the InputStream to digest each byte as it is read.
> In a very few cases, more attachments could be read with the DigestingParser than without, but the opposite was far more common.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)