Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/03/29 13:06:25 UTC

[jira] [Commented] (TIKA-1912) Figure out how to parse truncated PDFs that were handled by PDFBox 1.8.x but not by 2.0.0

    [ https://issues.apache.org/jira/browse/TIKA-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15215850#comment-15215850 ] 

Tim Allison commented on TIKA-1912:
-----------------------------------

Overall, I see two options:

1. Improve PDFBox 2.0.x's handling of truncated files.
2. Shade 1.8.x and use that as a backoff parser.

As [~jahewson] pointed out, it would be far better to improve PDFBox 2.0.0's handling of truncated files, and I agree.  On this [thread|http://mail-archives.apache.org/mod_mbox/pdfbox-dev/201603.mbox/%3C56F9466D.90305%40lehmi.de%3E], it looked like there may be some willingness on the PDFBox team to work on this.

For the second option, I've set up a standalone project on GitHub that shades PDFBox 1.8.11 [here|https://github.com/tballison/tika-addons] and uses Tika's last pre-2.0.0 PDFParser.  It was mildly tricky because TextStripper loads classes from a .properties file that wasn't automatically shaded.  I'm sure there is a more elegant solution than what I did; advice is welcome!
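
To make the "backoff parser" idea concrete, here is a rough sketch of the control flow only.  It does not use Tika's or PDFBox's real APIs: TextExtractor and extractWithBackoff are hypothetical names made up for illustration.  The assumption is that the first extractor would wrap the PDFBox 2.0.x-based parser and the second the shaded 1.8.11-based one, and that the input is buffered as a byte array so the second attempt can re-read it from the start.

```java
import java.util.List;

class BackoffParserSketch {

    /** Hypothetical stand-in for a parser; this is NOT Tika's Parser interface. */
    interface TextExtractor {
        String extract(byte[] data) throws Exception;
    }

    /**
     * Tries each extractor in order and returns the first result that does not throw.
     * If every extractor fails, the individual failures are attached as suppressed
     * exceptions so that none of them is lost.
     */
    static String extractWithBackoff(byte[] data, List<TextExtractor> extractors) {
        IllegalStateException allFailed = new IllegalStateException("all extractors failed");
        for (TextExtractor extractor : extractors) {
            try {
                return extractor.extract(data);
            } catch (Exception e) {
                // Remember the failure and fall through to the next extractor.
                allFailed.addSuppressed(e);
            }
        }
        throw allFailed;
    }
}
```

A real version would also need to decide what to do with content the first parser had already emitted before it failed; the sketch sidesteps that by returning a whole string per attempt.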

> Figure out how to parse truncated PDFs that were handled by PDFBox 1.8.x but not by 2.0.0
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-1912
>                 URL: https://issues.apache.org/jira/browse/TIKA-1912
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>
> While working on TIKA-1285, we found that PDFBox 2.0.0 is not able to handle truncated files as well as PDFBox 1.8.11.  Let's figure out how to gain the benefits from 2.0.0 without losing the ability to extract some content from truncated files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)