Posted to dev@pdfbox.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2014/09/22 20:34:22 UTC

comparing 1.8.6 and 1.8.7 on 50k govdocs1

In case you're interested, see below. Thank you all for the improvements in the 1.8.7 release!

Best,
  
         Tim

-----Original Message-----
From: Tim Allison (JIRA) [mailto:jira@apache.org] 
Sent: Monday, September 22, 2014 2:31 PM
To: dev@tika.apache.org
Subject: [jira] [Commented] (TIKA-1419) Upgrade to PDFBox 1.8.7


    [ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143588#comment-14143588 ] 

Tim Allison commented on TIKA-1419:
-----------------------------------

I just finished the run on 50,000 random PDFs from govdocs1. With the move to PDFBox 1.8.7, we've gone from 53 exceptions down to 32. In manually reviewing the handful of docs with a token overlap < 0.80, I found quite a few improvements. It also looks like there may be some regressions in character mapping in several of the files. I'll submit issues for these over on PDFBox. Unless there are objections, I'll bump Tika to PDFBox 1.8.7.
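
The comment above does not define "token overlap", so what follows is only a minimal sketch of one plausible definition: a Dice-style coefficient over token counts from the text extracted by each PDFBox version. The function names, the tokenization rule, and the formula itself are assumptions for illustration, not the comparison code used for this run.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenization: lowercase, split on runs of non-word characters.
    return [t for t in re.split(r"\W+", text.lower()) if t]


def token_overlap(text_a, text_b):
    # Hypothetical metric: 2 * (shared token count) / (total token count),
    # where tokens are counted as a multiset (repeats matter).
    a = Counter(tokenize(text_a))
    b = Counter(tokenize(text_b))
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        # Two empty extractions are treated as identical.
        return 1.0
    shared = sum((a & b).values())  # multiset intersection
    return 2.0 * shared / total


def needs_review(text_a, text_b, threshold=0.80):
    # Flag a document pair for manual review when overlap falls below the threshold.
    return token_overlap(text_a, text_b) < threshold
```

Under this assumed definition, identical extractions score 1.0 and extractions with no tokens in common score 0.0, so a pair scoring below 0.80 would be one where the two versions produced noticeably different text.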

Unfortunately, the individual file links don't seem to be working today on the govdocs1 site.

> Upgrade to PDFBox 1.8.7
> -----------------------
>
>                 Key: TIKA-1419
>                 URL: https://issues.apache.org/jira/browse/TIKA-1419
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>         Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv
>
>
> Will run against govdocs1 early next week and then upgrade if no major regressions are found.


