Posted to dev@tika.apache.org by "Sasha Goodman (Created) (JIRA)" <ji...@apache.org> on 2012/02/14 02:18:59 UTC
[jira] [Created] (TIKA-861) Parse links in PDF
Parse links in PDF
------------------
Key: TIKA-861
URL: https://issues.apache.org/jira/browse/TIKA-861
Project: Tika
Issue Type: New Feature
Components: parser
Affects Versions: 1.0
Reporter: Sasha Goodman
Priority: Minor
Fix For: 1.1
Currently the XHTML output doesn't contain links, even though PDFBox parses them. I'm new to Tika and haven't written Java in six years, but someone more experienced could probably add this in a few hours.
The PDF2XHTML class already loops through each page's annotations.
See:
{code:java}
// PDF2XHTML.java, line 136
for (Object o : page.getAnnotations()) {
{code}
I found some code for dealing with links in annotations:
http://stackoverflow.com/questions/7174709/pdfbox-not-recognizing-a-link
The approach is to check each annotation's class:
{code:java}
if (annotation instanceof PDAnnotationLink) {
    PDAnnotationLink link = (PDAnnotationLink) annotation;
    // read the link's action (e.g. its URI) from here
}
{code}
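To make the idea concrete, here is a rough, self-contained sketch of that check. It is only an illustration: `Annotation` and `LinkAnnotation` are hypothetical stand-ins for PDFBox's annotation types (so that it compiles without PDFBox), and the `uri` field and the example URL are assumptions, not PDFBox's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for PDFBox's annotation classes, so the sketch
// compiles without PDFBox on the classpath.
class Annotation {}

class LinkAnnotation extends Annotation {
    final String uri; // link target; null when the link has no URI action
    LinkAnnotation(String uri) { this.uri = uri; }
}

class LinkFilterSketch {
    // Collect the URI of every link annotation and skip all other objects.
    static List<String> linkTargets(List<?> annotations) {
        List<String> targets = new ArrayList<>();
        for (Object o : annotations) {
            if (o instanceof LinkAnnotation) {
                LinkAnnotation link = (LinkAnnotation) o;
                if (link.uri != null) {
                    targets.add(link.uri);
                }
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        List<Object> annotations = new ArrayList<>();
        annotations.add(new Annotation());                       // not a link
        annotations.add(new LinkAnnotation("http://example.com/"));
        annotations.add(new LinkAnnotation(null));               // link without a URI
        System.out.println(linkTargets(annotations));
    }
}
```

In the real parser the loop would run over `page.getAnnotations()` and the cast would be to `PDAnnotationLink`; the null check is there because not every link necessarily carries a URI.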
I hope this helps someone.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-861) Parse links in PDF
Posted by "Chris A. Mattmann (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann updated TIKA-861:
-----------------------------------
Fix Version/s: 1.2 (was: 1.1)
- Push out to 1.2
> Parse links in PDF
> ------------------
>
> Key: TIKA-861
> URL: https://issues.apache.org/jira/browse/TIKA-861
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.0
> Reporter: Sasha Goodman
> Priority: Minor
> Labels: links, pdfbox
> Fix For: 1.2
>
> Original Estimate: 4h
> Remaining Estimate: 4h
[jira] [Commented] (TIKA-861) Parse links in PDF
Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260429#comment-13260429 ]
Nick Burch commented on TIKA-861:
---------------------------------
testPDFVarious.pdf in /tika-parsers/src/test/resources/test-documents/ contains a hyperlink on page one, so it would be a good file to use for a unit test.
Is anyone able to work up a unit test for link parsing to go with this patch? (PDFParserTest already has some XHTML-based tests, which could be used as a pattern.)
> Parse links in PDF
> ------------------
>
> Key: TIKA-861
> URL: https://issues.apache.org/jira/browse/TIKA-861
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.0
> Reporter: Sasha Goodman
> Priority: Minor
> Labels: links, pdfbox
> Fix For: 1.2
>
> Attachments: TIKA-861.patch
>
> Original Estimate: 4h
> Remaining Estimate: 4h
[jira] [Updated] (TIKA-861) Parse links in PDF
Posted by "Ryan Quam (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ryan Quam updated TIKA-861:
---------------------------
Attachment: TIKA-861-test.patch
Here is a simple unit test for the PDF link parsing.
> Parse links in PDF
> ------------------
>
> Key: TIKA-861
> URL: https://issues.apache.org/jira/browse/TIKA-861
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.0
> Reporter: Sasha Goodman
> Priority: Minor
> Labels: links, pdfbox
> Fix For: 1.2
>
> Attachments: TIKA-861-test.patch, TIKA-861.patch
>
> Original Estimate: 4h
> Remaining Estimate: 4h
[jira] [Resolved] (TIKA-861) Parse links in PDF
Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nick Burch resolved TIKA-861.
-----------------------------
Resolution: Fixed
Thanks, patches committed in r1331434.
One thing to note is that, for now, links are emitted at the end of each page rather than inline. Further work may be needed in future to match each link to the text it applies to.
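As a rough illustration of what "at the end of the page" means for the XHTML, here is a minimal sketch that uses only the JDK's SAX and TrAX classes. It is not the committed patch: the class name, the `div class="page"` wrapper, the empty `a` elements and the example URL are all assumptions made for the example.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

class EndOfPageLinksSketch {
    // Write one page: the text first, then one <a href="..."/> per link.
    static void writePage(ContentHandler out, String text, List<String> linkUris)
            throws SAXException {
        AttributesImpl pageAttrs = new AttributesImpl();
        pageAttrs.addAttribute("", "class", "class", "CDATA", "page");
        out.startElement("", "div", "div", pageAttrs);

        out.startElement("", "p", "p", new AttributesImpl());
        out.characters(text.toCharArray(), 0, text.length());
        out.endElement("", "p", "p");

        // Links follow the page text; they are not tied to the words they cover.
        for (String uri : linkUris) {
            AttributesImpl linkAttrs = new AttributesImpl();
            linkAttrs.addAttribute("", "href", "href", "CDATA", uri);
            out.startElement("", "a", "a", linkAttrs);
            out.endElement("", "a", "a");
        }
        out.endElement("", "div", "div");
    }

    // Serialize a single page to a string using the JDK's identity transformer.
    static String render(String text, List<String> linkUris) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter buffer = new StringWriter();
            handler.setResult(new StreamResult(buffer));
            handler.startDocument();
            writePage(handler, text, linkUris);
            handler.endDocument();
            return buffer.toString();
        } catch (SAXException | TransformerException e) {
            throw new IllegalStateException("could not serialize page", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render("See the project site.",
                java.util.Arrays.asList("http://example.com/")));
    }
}
```

Matching links to their text would mean comparing each link's rectangle with the positions of the extracted text, which is the further work mentioned above.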
[jira] [Updated] (TIKA-861) Parse links in PDF
Posted by "Ryan Quam (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ryan Quam updated TIKA-861:
---------------------------
Attachment: TIKA-861.patch
Patch that adds PDF links to the DOM.