Posted to dev@pdfbox.apache.org by "Michał (JIRA)" <ji...@apache.org> on 2014/12/07 13:39:12 UTC

[jira] [Created] (PDFBOX-2547) maybe encoding error

Michał created PDFBOX-2547:
------------------------------

             Summary: maybe encoding error
                 Key: PDFBOX-2547
                 URL: https://issues.apache.org/jira/browse/PDFBOX-2547
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.7
            Reporter: Michał
            Priority: Minor


Hi,
I just downloaded a PDF from this page:
http://download.jw.org/files/media_books/32/es15_P.pdf
and want to extract text from this document.
I use this command:
java -jar pdfbox-app-1.8.7.jar ExtractText -encoding UTF-8 es15_P.pdf resultFile-UTF-8.txt
But I see some problems, for example:
1. In the text file I see 'STX' and 'ETX' instead of 'ę' and 'ą'.
2. The extractor returns the text 'naprzykładmiłe' instead of 'na przykład miłe' (page 4, line 6).
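
To make problem 1 easier to reproduce, here is a minimal sketch that lists where such characters occur in the extracted text. It assumes that 'STX' and 'ETX' are how an editor displays the control characters U+0002 and U+0003; that is a guess, not something confirmed for this file. The class name ControlCharScan, the method findControlChars and the built-in sample string are made up for illustration and are not part of PDFBox.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ControlCharScan {

    // Assumption: 'STX' and 'ETX' in the report mean U+0002 and U+0003.
    static final char STX = '\u0002';
    static final char ETX = '\u0003';

    /** Returns the zero-based offsets of every STX or ETX character in text. */
    static List<Integer> findControlChars(String text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == STX || c == ETX) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        // With a file argument, scan that file as UTF-8;
        // otherwise scan a made-up sample string.
        String text = args.length > 0
                ? new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8)
                : "si" + STX + " uda" + ETX;
        for (int offset : findControlChars(text)) {
            System.out.printf("offset %d: U+%04X%n", offset, (int) text.charAt(offset));
        }
    }
}
```

After compiling, it could be run as: java ControlCharScan resultFile-UTF-8.txt
If it reports no offsets, the assumption about U+0002 and U+0003 does not hold for this output file.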

These may be minor problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)