Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/06/23 18:35:24 UTC

[jira] [Updated] (TIKA-1352) Upgrade to PDFBox 1.8.6

     [ https://issues.apache.org/jira/browse/TIKA-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1352:
------------------------------

    Attachment: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip

Diffs between PDFBox 1.8.5 and 1.8.6 on Tika 1.6 SNAPSHOT (with the PDFBox-1130 workarounds removed), run on a random selection of 10k PDF files from govdocs1.

Both runs used the older "sequential" parser.

The table file is a tab-delimited UTF-16LE file.
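
For anyone who wants to load the table programmatically, here is a minimal Python sketch. It assumes the first row is a header row and that fields are not quoted in any unusual way; neither is confirmed here, and the function name is only illustrative.

```python
import csv


def read_comparison_table(path):
    """Read a tab-delimited UTF-16LE file into a list of dicts keyed by header.

    Assumption: the first row holds the column names.
    """
    # newline="" lets the csv module handle line endings itself.
    with open(path, encoding="utf-16-le", newline="") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    if not rows or not rows[0]:
        return []
    header = rows[0]
    # The "utf-16-le" codec does not strip a byte order mark, so drop one if present.
    header[0] = header[0].lstrip("\ufeff")
    return [dict(zip(header, row)) for row in rows[1:]]
```

Pass it the path of the table file extracted from the attached zip. All values come back as strings.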

This is a first pass at the raw output of the comparison code for TIKA-1302.  Much more work remains.

The ZipBomb exceptions show that we cannot yet remove the PDFBox-1130 workarounds.

Other than that, we should probably look at the few hundred files that have token overlap of < 98%.
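
A rough sketch of that filter in Python, given rows already loaded as dicts. The column name is passed in because the table's actual header is not stated here, and the sketch assumes the overlap is stored as a percentage (0 to 100); if it is stored as a fraction, pass `threshold=0.98` instead.

```python
def files_below_overlap(rows, overlap_column, threshold=98.0):
    """Return the rows whose overlap value is strictly below the threshold.

    Assumptions: `overlap_column` is whatever the table calls its token
    overlap column, and values are percentages unless threshold says otherwise.
    Rows with a missing or non-numeric value are skipped.
    """
    selected = []
    for row in rows:
        try:
            value = float(row[overlap_column])
        except (KeyError, TypeError, ValueError):
            continue
        if value < threshold:
            selected.append(row)
    return selected
```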

To view the original files from govdocs1 (e.g. 765470), navigate to:

http://digitalcorpora.org/corp/nps/files/govdocs1/765/765470.pdf
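
To build these links for a batch of file ids, something like the following Python sketch should work. It assumes the directory is always the first three digits of a six-digit id, which is inferred from the single example above, and that shorter ids are zero-padded, which is a guess.

```python
BASE_URL = "http://digitalcorpora.org/corp/nps/files/govdocs1"


def govdocs1_pdf_url(file_id):
    """Build the download URL for a govdocs1 PDF from its numeric id.

    Assumptions: ids are six digits (zero-padded if shorter) and the
    directory name is the first three digits of the id.
    """
    name = str(file_id).zfill(6)
    if len(name) != 6 or not name.isdigit():
        raise ValueError(f"unexpected govdocs1 id: {file_id!r}")
    return f"{BASE_URL}/{name[:3]}/{name}.pdf"
```

For example, `govdocs1_pdf_url(765470)` reproduces the link given above.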

> Upgrade to PDFBox 1.8.6
> -----------------------
>
>                 Key: TIKA-1352
>                 URL: https://issues.apache.org/jira/browse/TIKA-1352
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)