Posted to user@tika.apache.org by "Svensson, Kristian" <Kr...@wsp.com> on 2019/02/28 09:56:11 UTC

Extract link annotations (hyperlinks) with tika app?

Using the tika app (tika-app-1.20.jar), is it possible to extract link annotations (hyperlinks)? Ideally I would like to get an href in the xhtml output. I could not find any documentation about this.

I found out that pdfbox can extract link annotations:
https://stackoverflow.com/questions/38587567/how-to-extract-hyperlink-information-pdfbox
But I'm not sure how to use this with the tika app.

I think the tika app is using pdfbox for pdf content extraction, but I might be wrong😊

Any help greatly appreciated!

Best Regards,

Kristian

________________________________


NOTICE: This communication and any attachments ("this message") may contain information which is privileged, confidential, proprietary or otherwise subject to restricted disclosure under applicable law. This message is for the sole use of the intended recipient(s). Any unauthorized use, disclosure, viewing, copying, alteration, dissemination or distribution of, or reliance on, this message is strictly prohibited. If you have received this message in error, or you are not an authorized or intended recipient, please notify the sender immediately by replying to this message, delete this message and all copies from your e-mail system and destroy any printed copies.

Re: Extract link annotations (hyperlinks) with tika app?

Posted by Tim Allison <ta...@apache.org>.

Got it.  Thank you for following up.  Please do let us know if you have any
other surprises.

RE: Extract link annotations (hyperlinks) with tika app?

Posted by "Svensson, Kristian" <Kr...@wsp.com>.

OK! Looking at some other PDF files, it does seem to be extracting the links. The one that did not work looks like it has a clickable link in PDF-XChange Viewer, but it's probably the viewer itself that detects the URL in the free text and makes it clickable. All is well, my mistake!

Thank you for your answer!

Re: Extract link annotations (hyperlinks) with tika app?

Posted by Tim Allison <ta...@apache.org>.

Hmmmm....we should be extracting links. Tilman's code on SO is slightly
different from ours at this point, but ours should be working, with two
caveats. First, we aren't capturing the anchor text as Tilman's code does:
we're just repeating the link as the anchor text. Second, we're dumping the
hrefs at the end of the page; we're not currently trying to integrate them
where they actually belong in the text.

We have one unit test for this:

    @Test
    public void testLinks() throws Exception {
        // getXML and assertContains are helpers from Tika's test base class
        final XMLResult result = getXML("testPDFVarious.pdf");
        assertContains("<div class=\"annotation\"><a href=\"http://tika.apache.org/\">" +
                "http://tika.apache.org/</a></div>", result.xml);
    }
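
If you want to pull those hrefs back out of the XHTML that the app writes
(for example with -x / --xml), something like the following might be a
starting point. It is only a sketch using the JDK's DOM parser, not code from
Tika. It assumes the output is well-formed XML and that links appear as
<div class="annotation"><a href="..."> as in the test above. The class name
AnnotationLinks and the method extract are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika: collects link targets from XHTML text.
class AnnotationLinks {

    /** Returns the href of every <a> whose direct parent is <div class="annotation">. */
    static List<String> extract(String xhtml) throws Exception {
        // The default factory is not namespace-aware, so elements are
        // matched by their plain tag names ("a", "div").
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        List<String> hrefs = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            Node parent = a.getParentNode();
            if (parent instanceof Element
                    && "div".equals(((Element) parent).getTagName())
                    && "annotation".equals(((Element) parent).getAttribute("class"))
                    && a.hasAttribute("href")) {
                hrefs.add(a.getAttribute("href"));
            }
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        // Hand-written sample shaped like the markup in the unit test above.
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div class=\"page\"><p>Some page text</p>"
                + "<div class=\"annotation\"><a href=\"http://tika.apache.org/\">"
                + "http://tika.apache.org/</a></div></div></body></html>";
        System.out.println(extract(sample)); // prints [http://tika.apache.org/]
    }
}
```

Since the anchor text just repeats the link, the sketch reads only the href
attribute and ignores the element text.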

Is there any chance that you have extractAnnotationText set to false?  The
default is true and is required to be true to extract hrefs.

This could be a bug, though...let us know...

