Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/12/08 21:13:13 UTC

[jira] [Commented] (PDFBOX-2547) maybe encoding error

    [ https://issues.apache.org/jira/browse/PDFBOX-2547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238407#comment-14238407 ] 

John Hewson commented on PDFBOX-2547:
-------------------------------------

Text extraction of this PDF does not produce good results with Acrobat either, although the problems are not as bad as with PDFBox. Acrobat extracts nothing for 'ę' and 'ą', but 'na przykład miłe' is extracted correctly.

Calling setSpacingTolerance(0.3) on PDFTextStripper seems to produce better results.
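
To illustrate why a lower tolerance can recover the missing spaces, here is a minimal, self-contained sketch of a gap-based heuristic. It is a simplified model and not PDFBox's actual implementation: the Glyph class, the join method, and all the numbers are hypothetical. The only assumption carried over from PDFBox is the idea that a space is emitted when the gap between two glyphs exceeds the space width scaled by the tolerance. Note that the real setter takes a float, so in Java the call would be written as stripper.setSpacingTolerance(0.3f).

```java
public class SpacingToleranceSketch {

    /** A positioned glyph: the character, its left edge, and its width (hypothetical units). */
    static final class Glyph {
        final char ch;
        final float x;
        final float width;

        Glyph(char ch, float x, float width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins glyphs into a line of text, inserting a space whenever the gap
     * between one glyph's right edge and the next glyph's left edge exceeds
     * spaceWidth * tolerance. Simplified model for illustration only.
     */
    static String join(Glyph[] glyphs, float spaceWidth, float tolerance) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > spaceWidth * tolerance) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 'n' and 'a' touch each other; 'p' follows after a narrow 2-unit gap.
        // The font's nominal space width is assumed to be 5 units.
        Glyph[] glyphs = {
            new Glyph('n', 0f, 5f),
            new Glyph('a', 5f, 5f),
            new Glyph('p', 12f, 5f)
        };
        System.out.println(join(glyphs, 5f, 0.5f)); // prints "nap"  (gap 2 <= 2.5)
        System.out.println(join(glyphs, 5f, 0.3f)); // prints "na p" (gap 2 >  1.5)
    }
}
```

With tightly set text such as 'na przykład miłe', the word gaps are presumably narrower than the threshold derived from the higher tolerance, so lowering the value lets them register as spaces. The trade-off is that too low a value may start splitting words at ordinary letter spacing.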

> maybe encoding error
> --------------------
>
>                 Key: PDFBOX-2547
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2547
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7
>            Reporter: Michał
>            Priority: Minor
>
> Hi,
> I just downloaded a PDF from this page:
> http://download.jw.org/files/media_books/32/es15_P.pdf
> and want to extract text from this document.
> I use the command:
> java -jar pdfbox-app-1.8.7.jar ExtractText -encoding UTF-8 es15_P.pdf resultFile-UTF-8.txt
> But I see some problems, for example:
> 1. In the text file I see 'STX' and 'ETX' instead of 'ę' and 'ą'.
> 2. The extractor returns the text 'naprzykładmiłe' instead of 'na przykład miłe' (page 4, line 6).
> Maybe these are just small problems.


