Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2018/08/03 17:07:00 UTC

[jira] [Resolved] (TIKA-2702) Different behavior between TIKA and pdfbox

     [ https://issues.apache.org/jira/browse/TIKA-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2702.
-------------------------------
    Resolution: Not A Problem

Let's continue the discussion on our user list user@tika.apache.org if you have any more questions.  Thank you!

> Different behavior between TIKA and pdfbox
> ------------------------------------------
>
>                 Key: TIKA-2702
>                 URL: https://issues.apache.org/jira/browse/TIKA-2702
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 1.18
>            Reporter: Lior
>            Priority: Major
>
> As far as I understand, Tika uses PDFBox to extract text from PDF files.
> During a side benchmark, I'm seeing that the text I get from PDFBox 2.0.9 and the text I get from Tika are not 100% the same. In most cases, when there is a hyperlink inside the PDF file, PDFBox ignores the link itself, while Tika extracts its text, for example:
> https://www.linkedin.com/in/jhonDo
> mailto:jhondo@yahoo.com
>  
> This is really a deal breaker for me: I use PDFBox in another process and need the text to be identical, so I can't use Tika at the moment.
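
To pin down exactly where the two outputs diverge, a small line-level comparison can help. The following is only a minimal sketch, not part of Tika or PDFBox: the class name ExtractionDiff, the method linesOnlyInSecond, and the sample strings are hypothetical, and it assumes both extracted texts are already available as plain strings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for comparing two text extractions of the same PDF.
public class ExtractionDiff {

    // Returns the trimmed, non-blank lines of `second` that do not occur
    // anywhere in `first`, in the order they appear in `second`.
    public static List<String> linesOnlyInSecond(String first, String second) {
        Set<String> seen = new HashSet<>();
        for (String line : first.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                seen.add(trimmed);
            }
        }
        List<String> extra = new ArrayList<>();
        for (String line : second.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !seen.contains(trimmed)) {
                extra.add(trimmed);
            }
        }
        return extra;
    }

    public static void main(String[] args) {
        // Placeholder strings standing in for the real extraction results.
        String pdfboxText = "Jhon Do\nSoftware Engineer\n";
        String tikaText = "Jhon Do\nhttps://www.linkedin.com/in/jhonDo\nSoftware Engineer\n";
        for (String line : linesOnlyInSecond(pdfboxText, tikaText)) {
            System.out.println("only in Tika output: " + line);
        }
    }
}
```

If the extra lines turn out to be only link targets such as the ones quoted above, that would suggest the difference comes from hyperlink handling rather than from the page text itself.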



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)