Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/03/13 16:48:38 UTC

[jira] [Comment Edited] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

    [ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14360492#comment-14360492 ] 

Tim Allison edited comment on TIKA-1575 at 3/13/15 3:47 PM:
------------------------------------------------------------

Form clutter. This file was embedded inside 776568.

With PDFBox 1.8.8, we extracted the keys for every subform field (although this document has no meaningful content):
{noformat}
Briefings\n\nNo\n\n   NWSI 10-814  November 10, 2008\n\n 19\n\n\n\tform1[0]: \n\t#subform[0]: \n\tPrintButton1[0]: \n\tCheckBox1[0]: \n\tCheckBox2[0]: \n\tTextField1[0]: \n\tCheckBox5[0]: \n\tCheckBox6[0]: \n\tTextField2[0]: \n\tTextField3[0]: \n\tCheckBox9[0]: \n\tCheckBox10[0]: \n\tCheckBox11[0]: \n\tCheckBox12[0]: \n\tCheckBox11[1]: \n\tCheckBox12[1]: \n\tCheckBox11[2]: \n\tCheckBox12[2]: \n\tTextField4[0]: \n\tTextField2[1]: \n\tTextField9[0]: \n\n\t#subform[1]: \n\tCheckBox1[1]: \n\tCheckBox2[1]: \n\tTextField1[1]: \n\tCheckBox5[1]: \n\tCheckBox6[1]: \n\tCheckBox9[1]: \n\tCheckBox10[1]: \n\tCheckBox11[3]: \n\tCheckBox12[3]: \n\tCheckBox11[4]: \n\tCheckBox12[4]: \n\tTextField4[1]: \n\tTextField5[0]: \n\tCheckBox5[2]: \n\tCheckBox6[2]: \n\n\t#subform[2]: \n\tCheckBox1[2]: \n\tCheckBox2[2]: \n\tCheckBox9[2]: \n\tCheckBox10[2]: \n\tTextField4[2]: \n\tCheckBox5[3]: \n\tCheckBox6[3]: \n\tCheckBox1[3]: \n\tCheckBox2[3]: \n\tCheckBox5[4]: \n\tCheckBox6[4]: \n\tCheckBox9[3]: \n\tCheckBox10[3]: \n\tTextField4[3]: \n\tCheckBox9[4]: \n\tCheckBox10[4]: \n\tTextField6[0]: \n\tTextField7[0]: \n\tCheckBox9[5]: \n\tCheckBox10[5]: \n\tTextField6[1]: \n\tTextField6[2]: \n\tTextField8[0]: \n\tTextField8[1]: \n\n\t#subform[3]: \n\tCheckBox1[4]: \n\tCheckBox2[4]: \n\tCheckBox5[5]: \n\tCheckBox6[5]: \n\tCheckBox9[6]: \n\tCheckBox10[6]: \n\tTextField4[4]: \n\tCheckBox5[6]: \n\tCheckBox6[6]: \n\tCheckBox1[5]: \n\tCheckBox2[5]: \n\tCheckBox5[7]: \n\tCheckBox6[7]: \n\tCheckBox5[8]: \n\tCheckBox5[9]: \n\tCheckBox6[8]: \n\tCheckBox6[9]: \n\tTextField8[2]: \n\tCheckBox9[7]: \n\tCheckBox10[7]: \n\tTextField6[3]: \n\tTextField6[4]: \n\tCheckBox5[10]: \n\tCheckBox5[11]: \n\tCheckBox6[10]: \n\tCheckBox6[11]: \n\n\n\n\n
{noformat}

With 1.8.9, only the top-level form key remains:
{noformat}
Briefings\n\nNo\n\n   NWSI 10-814  November 10, 2008\n\n 19\n\n\n\tform1[0]: \n\n\n\n
{noformat}


The PDFBox app's ExtractText produces identical output for this file with 1.8.8 and 1.8.9.
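
To make the change easier to quantify, here is a minimal sketch that lists which form keys appear in the 1.8.8 output but not in the 1.8.9 output. It is only an illustration: the two strings are abbreviated excerpts of the outputs quoted above, and the names `KEY_RE`, `form_keys`, and `dropped_keys` are hypothetical helpers, not part of Tika or PDFBox. It also assumes every key line has the shape "tab, name[index], colon" seen in this document.

```python
import re

# Abbreviated excerpts of the two outputs quoted above; only the first few
# keys of the 1.8.8 output are kept so the sketch stays short.
OUT_1_8_8 = (
    "19\n\n\n\tform1[0]: \n\t#subform[0]: \n\tPrintButton1[0]: "
    "\n\tCheckBox1[0]: \n\tCheckBox2[0]: \n\tTextField1[0]: \n"
)
OUT_1_8_9 = "19\n\n\n\tform1[0]: \n\n\n\n"

# Assumption: a key line is a tab, a name (optionally prefixed with '#'),
# a bracketed index, and a colon.
KEY_RE = re.compile(r"^\t(#?\w+\[\d+\]):", re.MULTILINE)


def form_keys(text):
    """Return the form keys found in the extracted text, in order."""
    return KEY_RE.findall(text)


def dropped_keys(old, new):
    """Return keys present in `old` but absent from `new`, in `old`'s order."""
    new_keys = set(form_keys(new))
    return [key for key in form_keys(old) if key not in new_keys]


if __name__ == "__main__":
    for key in dropped_keys(OUT_1_8_8, OUT_1_8_9):
        print(key)
```

Running the same comparison over the full extracted text for each version would give the complete list of keys that 1.8.9 no longer emits for this file.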



> Upgrade to PDFBox 1.8.9 when available
> --------------------------------------
>
>                 Key: TIKA-1575
>                 URL: https://issues.apache.org/jira/browse/TIKA-1575
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: 10-814_Appendix B_v3.pdf, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip
>
>
> The PDFBox community is about to release 1.8.9.  Let's use this issue to track discussions before the release and Tika's upgrade to PDFBox 1.8.9.


