Posted to dev@pdfbox.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/03/10 16:26:40 UTC

[jira] [Commented] (PDFBOX-3058) Support TIKA Migration to PDFBox 2.0

    [ https://issues.apache.org/jira/browse/PDFBOX-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189421#comment-15189421 ] 

Tim Allison commented on PDFBOX-3058:
-------------------------------------

Finished a run comparing PDFBox 1.8.11 with the snapshot pdfbox-2.0.0-20160304.180026-2013 on ~500k PDFs.

Reports are here: https://github.com/tballison/share/blob/master/pdfbox_comparisons/reports_pdfbox_2_0_20160310.zip

The vast majority of new exceptions are caused by truncated files (as before).  More work remains to remove those.

As for content, we're extracting ~1.5 million more "common English" words in 2.0 vs 1.8.11. 

There are still some files that yielded more "common English" words in 1.8.11 than in 2.0, but overall there is an increase.
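
For readers who want to reproduce this kind of comparison on their own extracts, here is a minimal sketch of a per-file "common English" word count. It is not the code behind the reports above: the word list, the tokenizer, and the names count_common_words and compare_extractions are all assumptions made for illustration, and it works on already-extracted text rather than calling PDFBox or Tika.

```python
import re

# Hypothetical stand-in for a real "common English" word list; the list used
# for the reports is not given in this message.
COMMON_WORDS = frozenset({"the", "and", "of", "to", "in", "is", "that", "for"})

# Assumed tokenizer: runs of ASCII letters, compared case-insensitively.
_TOKEN = re.compile(r"[a-z]+")


def count_common_words(text, common_words=COMMON_WORDS):
    """Count tokens in text that appear in common_words."""
    return sum(1 for tok in _TOKEN.findall(text.lower()) if tok in common_words)


def compare_extractions(texts_a, texts_b, common_words=COMMON_WORDS):
    """Compare two {file_id: extracted_text} mappings (A = old, B = new).

    Returns (net_change, regressed_files): net_change is the total number of
    common words in B minus the total in A, and regressed_files lists the ids
    for which A yielded more common words than B.
    """
    net_change = 0
    regressed_files = []
    # A file missing from one side counts as empty text on that side.
    for file_id in sorted(set(texts_a) | set(texts_b)):
        a = count_common_words(texts_a.get(file_id, ""), common_words)
        b = count_common_words(texts_b.get(file_id, ""), common_words)
        net_change += b - a
        if a > b:
            regressed_files.append(file_id)
    return net_change, regressed_files
```

A positive net_change together with a non-empty regressed_files list corresponds to the situation described here: an overall increase, with some individual files still doing better under the older version.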

Only a handful of files have significantly more metadata in 1.8.11 than in 2.0.

> Support TIKA Migration to PDFBox 2.0
> ------------------------------------
>
>                 Key: PDFBOX-3058
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3058
>             Project: PDFBox
>          Issue Type: Task
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Maruan Sahyoun
>            Assignee: Andreas Lehmkühler
>             Fix For: 2.0.0
>
>         Attachments: NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU_1_8_10.json, NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU_2_0.json, content_diffs-1.8-to-2.0.xlsx, content_diffs-4.xlsx, textLostFromACausedByNewExceptionsInB.zip
>
>
> This issue is to track fixing issues which came up as part of TIKA-1285 (Upgrade to PDFBox 2.0.0 when available), mainly:
> - new exceptions compared to PDFBox 1.8.x
> - regressions in text extraction
> - lower quality text extraction
> There should be individual issues to track tasks/bugs arising from that.


