Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/06/23 18:35:24 UTC
[jira] [Updated] (TIKA-1352) Upgrade to PDFBox 1.8.6
[ https://issues.apache.org/jira/browse/TIKA-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-1352:
------------------------------
Attachment: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip
Diffs between PDFBox 1.8.5 and 1.8.6 on Tika 1.6 SNAPSHOT (with the PDFBox-1130 workarounds removed), run on a random selection of 10k PDF files from govdocs1.
Both runs used the older "sequential" parser.
The table file is a tab-delimited UTF-16LE file.
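For anyone who wants to load the table programmatically, here is a minimal sketch in Python. It assumes only what is stated above (tab-delimited, UTF-16LE) plus that the first row is a header; the function name read_table is made up for illustration, and a leading byte-order mark is stripped in case one is present.

```python
import csv
import io


def read_table(path):
    """Read a tab-delimited UTF-16LE table into a list of dicts.

    Assumes the first row holds the column headers.
    """
    with open(path, encoding="utf-16-le", newline="") as f:
        text = f.read()
    # The utf-16-le codec does not consume a byte-order mark, so drop it here.
    text = text.lstrip("\ufeff")
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))
```

Each row comes back as a dict keyed by the header names, with all values as strings.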
This is a first pass: the raw output of the comparison code for TIKA-1302. Much more work remains.
The ZipBomb exceptions show that we cannot yet remove the PDFBox-1130 workarounds.
Other than that, we should probably look at the few hundred files that have a token overlap below 98%.
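A rough sketch of how the low-overlap files could be pulled out of already-parsed rows follows. The column name TOKEN_OVERLAP is hypothetical (check the actual header in the table), and the sketch assumes the overlap is stored as a fraction between 0 and 1 rather than as a percentage.

```python
def low_overlap(rows, column="TOKEN_OVERLAP", threshold=0.98):
    """Return the rows whose overlap value is below the threshold.

    `column` is a hypothetical header name; rows with a missing or
    non-numeric value in that column are skipped.
    """
    flagged = []
    for row in rows:
        try:
            value = float(row[column])
        except (KeyError, TypeError, ValueError):
            continue
        if value < threshold:
            flagged.append(row)
    return flagged
```

If the table stores percentages instead, pass threshold=98.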
To view an original file from govdocs1 (e.g. 765470), navigate to:
http://digitalcorpora.org/corp/nps/files/govdocs1/765/765470.pdf
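The URL pattern can be sketched as a small helper. This is inferred from the single example above: it assumes the directory is the first three digits of a six-digit, zero-padded file id, which may not hold for every file. The function name govdocs1_url is made up for illustration.

```python
def govdocs1_url(file_id, ext="pdf"):
    """Build a govdocs1 download URL from a numeric file id.

    Assumption: ids are six digits (zero-padded) and the directory
    is the first three digits, as in the 765470 example.
    """
    name = f"{int(file_id):06d}"
    base = "http://digitalcorpora.org/corp/nps/files/govdocs1"
    return f"{base}/{name[:3]}/{name}.{ext}"
```

For the example id, govdocs1_url(765470) gives the link shown above.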
> Upgrade to PDFBox 1.8.6
> -----------------------
>
> Key: TIKA-1352
> URL: https://issues.apache.org/jira/browse/TIKA-1352
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Attachments: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip
>
>
--
This message was sent by Atlassian JIRA
(v6.2#6252)