Posted to dev@tika.apache.org by "Sasha Goodman (Created) (JIRA)" <ji...@apache.org> on 2012/02/14 02:18:59 UTC

[jira] [Created] (TIKA-861) Parse links in PDF

Parse links in PDF
------------------

                 Key: TIKA-861
                 URL: https://issues.apache.org/jira/browse/TIKA-861
             Project: Tika
          Issue Type: New Feature
          Components: parser
    Affects Versions: 1.0
            Reporter: Sasha Goodman
            Priority: Minor
             Fix For: 1.1


Currently the XHTML output doesn't contain links, although PDFBox parses them. I'm new to Tika and haven't written Java for 6 years, but someone more experienced could probably add this in a few hours.

The PDF2XHTML class already loops through each page's annotations.

See line 136:
{code:java}
for(Object o : page.getAnnotations()) {
{code}

 I found some code for dealing with links in annotations:
http://stackoverflow.com/questions/7174709/pdfbox-not-recognizing-a-link

It involves checking the class of each annotation:
{code:java}
if (annotation instanceof PDAnnotationLink) {
    PDAnnotationLink link = (PDAnnotationLink) annotation;
    // ... read the link's target from here ...
}
{code}
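
To make the idea concrete, here is a minimal sketch of the class check applied to a whole list of annotations. The Stub* classes are hypothetical stand-ins so that the example compiles without PDFBox on the classpath; in PDFBox the corresponding types would be PDAnnotation, PDAnnotationLink, PDAction and PDActionURI. It is an illustration under those assumptions, not the code from any patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for PDFBox's PDAnnotation, PDAction, PDActionURI
// and PDAnnotationLink, so the sketch runs without PDFBox.
class StubAnnotation {}

class StubAction {}

class StubUriAction extends StubAction {
    private final String uri;
    StubUriAction(String uri) { this.uri = uri; }
    String getURI() { return uri; }
}

class StubLinkAnnotation extends StubAnnotation {
    private final StubAction action;
    StubLinkAnnotation(StubAction action) { this.action = action; }
    StubAction getAction() { return action; }
}

class LinkExtractionSketch {
    /** Collects the target URI of every link annotation that carries a URI action. */
    static List<String> extractLinkUris(List<?> annotations) {
        List<String> uris = new ArrayList<String>();
        for (Object o : annotations) {
            if (!(o instanceof StubLinkAnnotation)) {
                continue; // not a link annotation, skip it
            }
            StubAction action = ((StubLinkAnnotation) o).getAction();
            // instanceof is false for null, so links without an action are skipped
            if (action instanceof StubUriAction) {
                String uri = ((StubUriAction) action).getURI();
                if (uri != null) {
                    uris.add(uri);
                }
            }
        }
        return uris;
    }

    public static void main(String[] args) {
        List<Object> annotations = Arrays.<Object>asList(
                new StubAnnotation(),
                new StubLinkAnnotation(new StubUriAction("http://tika.apache.org/")),
                new StubLinkAnnotation(null));
        System.out.println(extractLinkUris(annotations));
    }
}
```

The null and type checks matter because a link annotation is not guaranteed to carry a URI action; links that point elsewhere in the same document, for example, would be skipped by this sketch.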

I hope this helps someone.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (TIKA-861) Parse links in PDF

Posted by "Chris A. Mattmann (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-861:
-----------------------------------

    Fix Version/s:     (was: 1.1)
                   1.2

- push out to 1.2
                
> Parse links in PDF
> ------------------
>
>                 Key: TIKA-861
>                 URL: https://issues.apache.org/jira/browse/TIKA-861
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Sasha Goodman
>            Priority: Minor
>              Labels: links, pdfbox
>             Fix For: 1.2
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h

        

[jira] [Commented] (TIKA-861) Parse links in PDF

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260429#comment-13260429 ] 

Nick Burch commented on TIKA-861:
---------------------------------

testPDFVarious.pdf in /tika-parsers/src/test/resources/test-documents/ contains a hyperlink on page one, so it would be a good file to use for a unit test.

Is anyone able to work up a unit test for link parsing to go with this patch? (PDFParserTest already has some XHTML-based tests, which could be used as a pattern.)
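
As a rough illustration of what an XHTML-based check could look for, the sketch below parses a well-formed XHTML string with the JDK's DOM parser and collects the href of every anchor. The class name, the helper and the sample markup are hypothetical and are not taken from PDFParserTest or from the attached patches; a real test would obtain the XHTML by running the parser on the test document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XhtmlLinkCheckSketch {
    /** Returns the href of every a element in a well-formed XHTML string. */
    static List<String> hrefs(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList anchors = doc.getElementsByTagName("a");
            List<String> result = new ArrayList<String>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                if (a.hasAttribute("href")) {
                    result.add(a.getAttribute("href"));
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical parser output; the real markup may differ.
        String xhtml = "<html><body><div class=\"page\"><p>Some text</p>"
                + "<a href=\"http://tika.apache.org/\"/></div></body></html>";
        System.out.println(hrefs(xhtml).contains("http://tika.apache.org/"));
    }
}
```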
                

        

[jira] [Updated] (TIKA-861) Parse links in PDF

Posted by "Ryan Quam (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Quam updated TIKA-861:
---------------------------

    Attachment: TIKA-861-test.patch

Here is a simple unit test for the PDF link parsing.
                

        

[jira] [Resolved] (TIKA-861) Parse links in PDF

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-861.
-----------------------------

    Resolution: Fixed

Thanks, patches committed in r1331434.

Note that, for now, links are emitted at the end of each page. Further work may be needed in future to match them to the text they apply to.
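
To show what "at the end of the page" means for the output, here is a small sketch that writes a page as SAX events using only JDK classes: first the page text, then one anchor per link. The element names, the class attribute and the renderPage helper are assumptions made for the example and are not copied from the committed patch.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

class PageLinksSketch {
    /** Builds a one-attribute SAX attribute list. */
    private static AttributesImpl attr(String name, String value) {
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", name, name, "CDATA", value);
        return attrs;
    }

    /** Writes the page text first, then one anchor per link at the end of the page. */
    static String renderPage(String pageText, List<String> linkUris) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            handler.startDocument();
            handler.startElement("", "div", "div", attr("class", "page"));
            handler.startElement("", "p", "p", new AttributesImpl());
            handler.characters(pageText.toCharArray(), 0, pageText.length());
            handler.endElement("", "p", "p");
            // Links follow all of the page's text rather than the words they cover.
            for (String uri : linkUris) {
                handler.startElement("", "a", "a", attr("href", uri));
                handler.endElement("", "a", "a");
            }
            handler.endElement("", "div", "div");
            handler.endDocument();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize page", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(renderPage("Page one text",
                Arrays.asList("http://tika.apache.org/")));
    }
}
```

Matching each link to the text it covers would presumably require comparing the annotation's rectangle with the positions of the extracted text, which this sketch does not attempt.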
                

        

[jira] [Updated] (TIKA-861) Parse links in PDF

Posted by "Ryan Quam (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Quam updated TIKA-861:
---------------------------

    Attachment: TIKA-861.patch

Patch that adds PDF links to the DOM.
                