Posted to dev@tika.apache.org by "Jeremy Anderson (Created) (JIRA)" <ji...@apache.org> on 2011/12/12 20:29:30 UTC

[jira] [Created] (TIKA-810) Upgrade to PDFbox 1.7.0 as available

Upgrade to PDFbox 1.7.0 as available
------------------------------------

                 Key: TIKA-810
                 URL: https://issues.apache.org/jira/browse/TIKA-810
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.0
            Reporter: Jeremy Anderson
            Priority: Minor


This issue is to track upgrading the PDFbox dependency 1.7.0 Final once it's available, and the daily build before then

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (TIKA-810) Upgrade to PDFbox 1.7.0 as available

Posted by "Jeremy Anderson (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171040#comment-13171040 ] 

Jeremy Anderson edited comment on TIKA-810 at 12/16/11 4:28 PM:
----------------------------------------------------------------

It appears that the PdfParserTest failures I'm seeing are related to the inclusion of Tika's PDFParser and PDF2XHTML files into PDFBox on October 13, rev 1182880.  Subsequent patches made to Tika's PDFParser file, on which the test case relies, are overridden by the parser version contained in PDFBox.

I believe this has already been discussed to some extent, in the context of which parser is used when dependencies are or are not present.

But as is, when using the daily builds of PDFBox and Tika, fixes applied to these two files in Tika should probably be replicated in the PDFBox versions of the files as well.  Currently, as of 12/16, the following TIKA issues have caused changes to these files: TIKA-612, TIKA-724, TIKA-738, TIKA-767, TIKA-778.
                
      was (Author: rpialum):
    It appears that the PdfParserTest failures I'm seeing are related to the inclusion of Tika's PDFParser and PDF2XHTML files into PDFBox on October 13, rev 1182880.  Subsequent patches made to Tika's PDFParser file, on which the test case relies, are overridden by the version in PDFBox.
                  
> Upgrade to PDFbox 1.7.0 as available
> ------------------------------------
>
>                 Key: TIKA-810
>                 URL: https://issues.apache.org/jira/browse/TIKA-810
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Jeremy Anderson
>            Priority: Minor
>         Attachments: pdfbox-1.7.0.diff
>
>
> This issue is to track upgrading the PDFbox dependency to 1.7.0 Final once it's available, and the daily build before then


[jira] [Commented] (TIKA-810) Upgrade to PDFbox 1.7.0 as available

Posted by "Jeremy Anderson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175417#comment-13175417 ] 

Jeremy Anderson commented on TIKA-810:
--------------------------------------

Thanks Jukka...

I assume the discussion you're referring to is the "Pushing Parsers Upstream" thread ( http://www.lucidimagination.com/search/document/a792e63b788051f ).

If you know of any other pertinent threads, or other search terms to use within these dev lists, please feel free to share them.



(I'm still relatively new to working with and contributing to these projects.  I'm picking up the workflow and the proper methods for being a helpful contributor on the fly.  Any suggestions for refining my practices or submissions are always welcome.)

Have a great holiday season!!

                
> Upgrade to PDFbox 1.7.0 as available
> ------------------------------------
>
>                 Key: TIKA-810
>                 URL: https://issues.apache.org/jira/browse/TIKA-810
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Jeremy Anderson
>            Priority: Minor
>         Attachments: pdfbox-1.7.0.diff
>
>
> This issue is to track upgrading the PDFbox dependency to 1.7.0 Final once it's available, and the daily build before then


[jira] [Resolved] (TIKA-810) Upgrade to PDFbox 1.7.0 as available

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-810.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 1.2
         Assignee: Jukka Zitting

Upgraded to PDFBox 1.7.0 in revision 1355744.

The annotation setting that was handled in TIKA-612 is now enabled by default, so there is no need to change the existing test cases.
                
> Upgrade to PDFbox 1.7.0 as available
> ------------------------------------
>
>                 Key: TIKA-810
>                 URL: https://issues.apache.org/jira/browse/TIKA-810
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Jeremy Anderson
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 1.2
>
>         Attachments: pdfbox-1.7.0.diff
>
>
> This issue is to track upgrading the PDFbox dependency to 1.7.0 Final once it's available, and the daily build before then


[jira] [Commented] (TIKA-810) Upgrade to PDFbox 1.7.0 as available

Posted by "Antoni Mylka (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171098#comment-13171098 ] 

Antoni Mylka commented on TIKA-810:
-----------------------------------

That's a very important question IMHO, crucial to the feasibility of pushing parsers outside the Tika codebase. Why don't we just remove the PDFParser classes from tika-parsers and completely drop the dependency from tika-parsers to pdfbox? We could deal with the resulting user-unfriendliness in a different way. With this dependency, and the separate release cycles of the two libraries, life becomes difficult.
                
> Upgrade to PDFbox 1.7.0 as available
> ------------------------------------
>
>                 Key: TIKA-810
>                 URL: https://issues.apache.org/jira/browse/TIKA-810
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Jeremy Anderson
>            Priority: Minor
>         Attachments: pdfbox-1.7.0.diff
>
>
> This issue is to track upgrading the PDFbox dependency to 1.7.0 Final once it's available, and the daily build before then


[jira] [Updated] (TIKA-810) Upgrade to PDFbox 1.7.0 as available

Posted by "Jeremy Anderson (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy Anderson updated TIKA-810:
---------------------------------

    Attachment: pdfbox-1.7.0.diff

Upgraded to 1.7.0 in revision 1213227 as of 2011-12-12.

The change is to the test case, because annotation text extraction is now off by default in PDFBox. (It appeared to be on in the 1.6.0 release but no longer is in the 1.7.0 daily build.)

Note that a proper fix may require changing the Tika PDF parser to turn on annotation extraction by default and then modifying the test case appropriately, or submitting a fix to PDFBox so that 1.7.0 behaves the same as 1.6.0.
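
As a rough illustration of the first option (setting the option explicitly on the Tika side instead of relying on whatever default the dependency ships with), here is a minimal sketch. The class and method names are hypothetical and exist only for this example; they are not PDFBox or Tika API.

```java
// Hypothetical option holder for illustration; this is not a PDFBox or Tika class.
final class AnnotationOptionSketch {
    private final boolean extractAnnotationText;

    private AnnotationOptionSketch(boolean extractAnnotationText) {
        this.extractAnnotationText = extractAnnotationText;
    }

    // Models a dependency whose built-in default differs between versions.
    static AnnotationOptionSketch withLibraryDefault(boolean libraryDefault) {
        return new AnnotationOptionSketch(libraryDefault);
    }

    // Returns a copy with the option pinned explicitly by the caller.
    AnnotationOptionSketch withExtractAnnotationText(boolean value) {
        return new AnnotationOptionSketch(value);
    }

    boolean extractsAnnotationText() {
        return extractAnnotationText;
    }

    public static void main(String[] args) {
        // Assumed defaults for the sketch: "on" in the older version, "off" in the newer one.
        AnnotationOptionSketch older = withLibraryDefault(true).withExtractAnnotationText(true);
        AnnotationOptionSketch newer = withLibraryDefault(false).withExtractAnnotationText(true);
        // Once the caller pins the value, both versions behave the same.
        System.out.println(older.extractsAnnotationText() == newer.extractsAnnotationText());
    }
}
```

With the value pinned by the caller, a test that expects annotation text would not need to change when the dependency's default flips.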
                
> Upgrade to PDFbox 1.7.0 as available
> ------------------------------------
>
>                 Key: TIKA-810
>                 URL: https://issues.apache.org/jira/browse/TIKA-810
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Jeremy Anderson
>            Priority: Minor
>         Attachments: pdfbox-1.7.0.diff
>
>
> This issue is to track upgrading the PDFbox dependency 1.7.0 Final once it's available, and the daily build before then


[jira] [Commented] (TIKA-810) Upgrade to PDFbox 1.7.0 as available

Posted by "Jukka Zitting (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172335#comment-13172335 ] 

Jukka Zitting commented on TIKA-810:
------------------------------------

In revision 1220781 I updated the parser code in PDFBox to match the latest changes in Tika.

See discussions on dev@tika and dev@pdfbox on how and where to maintain the code going forward.
                
> Upgrade to PDFbox 1.7.0 as available
> ------------------------------------
>
>                 Key: TIKA-810
>                 URL: https://issues.apache.org/jira/browse/TIKA-810
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Jeremy Anderson
>            Priority: Minor
>         Attachments: pdfbox-1.7.0.diff
>
>
> This issue is to track upgrading the PDFbox dependency to 1.7.0 Final once it's available, and the daily build before then


[jira] [Updated] (TIKA-810) Upgrade to PDFbox 1.7.0 as available

Posted by "Jeremy Anderson (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy Anderson updated TIKA-810:
---------------------------------

    Description: This issue is to track upgrading the PDFbox dependency to 1.7.0 Final once it's available, and the daily build before then  (was: This issue is to track upgrading the PDFbox dependency 1.7.0 Final once it's available, and the daily build before then)
    
> Upgrade to PDFbox 1.7.0 as available
> ------------------------------------
>
>                 Key: TIKA-810
>                 URL: https://issues.apache.org/jira/browse/TIKA-810
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Jeremy Anderson
>            Priority: Minor
>         Attachments: pdfbox-1.7.0.diff
>
>
> This issue is to track upgrading the PDFbox dependency to 1.7.0 Final once it's available, and the daily build before then


[jira] [Issue Comment Edited] (TIKA-810) Upgrade to PDFbox 1.7.0 as available

Posted by "Jeremy Anderson (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171040#comment-13171040 ] 

Jeremy Anderson edited comment on TIKA-810 at 12/16/11 4:50 PM:
----------------------------------------------------------------

It appears that the PdfParserTest failures I'm seeing are related to the inclusion of Tika's PDFParser and PDF2XHTML files into PDFBox on October 13, rev 1182880 (PDFBOX-1132).  Subsequent patches made to Tika's PDFParser file, on which the test case relies, are overridden by the parser version contained in PDFBox. (The AutoDetectParser returns the parser contained in PDFBox, rather than Tika's PDFParser.)

I believe this has already been discussed to some extent, in the context of which parser is used when dependencies are or are not present.

But as is, when using the daily builds of PDFBox and Tika, fixes applied to these two files in Tika should probably be replicated in the PDFBox versions of the files as well.  Currently, as of 12/16, the following TIKA issues have caused changes to these files and should likely be applied to the files on PDFBox's side: TIKA-612, TIKA-724, TIKA-738, TIKA-767, TIKA-778.
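
To make the override described above more concrete, here is a minimal sketch of how one parser copy can shadow another when both are registered for the same media type. It assumes a lookup table in which a later registration replaces an earlier one; that is an assumption made for illustration, not a statement of how AutoDetectParser is implemented, and every class and method name here is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parser implementation; not Tika's Parser interface.
final class SketchParser {
    final String origin;     // which library ships this copy of the parser
    final String mediaType;  // media type this copy claims to support

    SketchParser(String origin, String mediaType) {
        this.origin = origin;
        this.mediaType = mediaType;
    }
}

final class ParserRegistrySketch {
    // Builds a media-type lookup in which a later registration replaces an earlier one.
    static Map<String, SketchParser> index(List<SketchParser> discovered) {
        Map<String, SketchParser> byType = new LinkedHashMap<>();
        for (SketchParser p : discovered) {
            byType.put(p.mediaType, p); // put() overwrites any earlier entry for the same type
        }
        return byType;
    }

    // Returns the origin of the parser selected for a media type, or null if none is registered.
    static String selectedOrigin(List<SketchParser> discovered, String mediaType) {
        SketchParser p = index(discovered).get(mediaType);
        return p == null ? null : p.origin;
    }

    public static void main(String[] args) {
        List<SketchParser> discovered = new ArrayList<>();
        // Assumed discovery order for the sketch: Tika's copy first, then the copy in PDFBox.
        discovered.add(new SketchParser("tika-parsers", "application/pdf"));
        discovered.add(new SketchParser("pdfbox", "application/pdf"));
        System.out.println(selectedOrigin(discovered, "application/pdf")); // prints "pdfbox"
    }
}
```

Under that assumption, a patch applied only to the first copy never takes effect for callers that go through the lookup, which is consistent with the test failures described above.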
                
      was (Author: rpialum):
    It appears that the PdfParserTest failures I'm seeing are related to the inclusion of Tika's PDFParser and PDF2XHTML files into PDFBox on October 13, rev 1182880.  Subsequent patches made to Tika's PDFParser file, on which the test case relies, are overridden by the parser version contained in PDFBox.

I believe this has already been discussed to some extent, in the context of which parser is used when dependencies are or are not present.

But as is, when using the daily builds of PDFBox and Tika, fixes applied to these two files in Tika should probably be replicated in the PDFBox versions of the files as well.  Currently, as of 12/16, the following TIKA issues have caused changes to these files: TIKA-612, TIKA-724, TIKA-738, TIKA-767, TIKA-778.
                  
> Upgrade to PDFbox 1.7.0 as available
> ------------------------------------
>
>                 Key: TIKA-810
>                 URL: https://issues.apache.org/jira/browse/TIKA-810
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Jeremy Anderson
>            Priority: Minor
>         Attachments: pdfbox-1.7.0.diff
>
>
> This issue is to track upgrading the PDFbox dependency to 1.7.0 Final once it's available, and the daily build before then
