Posted to dev@pdfbox.apache.org by "Michał (JIRA)" <ji...@apache.org> on 2014/12/07 13:39:12 UTC
[jira] [Created] (PDFBOX-2547) maybe encoding error
Michał created PDFBOX-2547:
------------------------------
Summary: maybe encoding error
Key: PDFBOX-2547
URL: https://issues.apache.org/jira/browse/PDFBOX-2547
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.7
Reporter: Michał
Priority: Minor
Hi,
I just downloaded a PDF from this page:
http://download.jw.org/files/media_books/32/es15_P.pdf
and want to extract the text from this document.
I use the command:
java -jar pdfbox-app-1.8.7.jar ExtractText -encoding UTF-8 es15_P.pdf resultFile-UTF-8.txt
But I see some problems, for example:
1. The text file contains 'STX' and 'ETX' instead of 'ę' and 'ą'.
2. The extractor returns the text 'naprzykładmiłe' instead of 'na przykład miłe' (page 4, line 6).
Maybe these are just small problems.
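To narrow down problem 1, a check along these lines could list where the stray characters appear in the output file. This is only a sketch using the Java standard library, not part of PDFBox; the class name ControlCharScan is hypothetical, and it assumes the 'STX' and 'ETX' shown by the editor are the control characters U+0002 and U+0003.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a PDFBox class.
public class ControlCharScan {

    // Returns one entry per control character found, formatted as "offset N: U+XXXX".
    // Tab, line feed and carriage return are skipped because they are expected in text output.
    static List<String> findControlChars(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isISOControl(c) && c != '\t' && c != '\n' && c != '\r') {
                hits.add(String.format("offset %d: U+%04X", i, (int) c));
            }
        }
        return hits;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the extracted text file, e.g. resultFile-UTF-8.txt from the command above.
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        for (String hit : findControlChars(text)) {
            System.out.println(hit);
        }
    }
}
```

Running it as `java ControlCharScan resultFile-UTF-8.txt` prints one line per control character, which may make it easier to match each occurrence to the expected 'ę' or 'ą' in the PDF.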
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)