Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2018/09/04 18:57:00 UTC

[jira] [Commented] (PDFBOX-4311) Unable to parse some pdf's using pdfbox.

    [ https://issues.apache.org/jira/browse/PDFBOX-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16603449#comment-16603449 ] 

Tilman Hausherr commented on PDFBOX-4311:
-----------------------------------------

The "text" you see in this PDF is an image, so there is no text to extract. You can verify this by trying to select, copy, and paste it in Adobe Reader.

[https://pdfbox.apache.org/2.0/faq.html#text-extraction]

So, sadly, there is nothing we can do this time. By the way, the current version is 2.0.11, but upgrading won't help here either. Sorry for not having better news.
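
For anyone hitting the same symptom (empty output, no exception), a rough way to flag such files up front is to check whether the extracted text contains anything besides whitespace. The sketch below is only an illustration, not part of PDFBox: the class name ImageOnlyCheck and the method looksImageOnly are made up for this example, and treating whitespace-only output as "probably a scanned image" is a heuristic assumption. The PDFBox 2.0.x calls that would supply the text are shown only in a comment, so the snippet runs without the library.

```java
public class ImageOnlyCheck {

    /**
     * Hypothetical helper: returns true when the extracted text has no
     * visible characters, which suggests the page content is an image.
     * Written without String.isBlank() so it also compiles on Java 7.
     */
    public static boolean looksImageOnly(String extractedText) {
        if (extractedText == null) {
            return true;
        }
        for (int i = 0; i < extractedText.length(); i++) {
            if (!Character.isWhitespace(extractedText.charAt(i))) {
                return false; // found a visible character, so real text exists
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // With PDFBox 2.0.x the input would typically be obtained like this:
        //   try (PDDocument doc = PDDocument.load(new File(path))) {
        //       String text = new PDFTextStripper().getText(doc);
        //   }
        // Here two literal strings stand in for that extracted text.
        System.out.println(looksImageOnly("\r\n \r\n"));       // whitespace only
        System.out.println(looksImageOnly("Claim nr 283909709")); // has text
    }
}
```

Files flagged this way would need OCR before any text can be parsed out of them; that step is outside PDFBox text extraction.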

> Unable to parse some pdf's using pdfbox.
> ----------------------------------------
>
>                 Key: PDFBOX-4311
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4311
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.9
>         Environment: Pdfbox -2.0.9
> Pdfbox-tools - 2.0.9
> Java - 1.7
> Scala - 2.10.6
>            Reporter: Krishna Dheeraj
>            Priority: Major
>         Attachments: upload_user4024353_claimnr283909709_healthpartners_2018-06-17.pdf
>
>
> When I tried to convert the PDF file into HTML for parsing, the content in the body was empty and no errors or exceptions were thrown. It happens for only a few files; others work as expected. I am attaching the file that we are unable to parse. Let us know if any resolution is available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
