Posted to dev@pdfbox.apache.org by "Adrian Romano (JIRA)" <ji...@apache.org> on 2009/01/05 20:57:44 UTC

[jira] Updated: (PDFBOX-398) Russian extraction encoding failure

     [ https://issues.apache.org/jira/browse/PDFBOX-398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrian Romano updated PDFBOX-398:
---------------------------------

    Attachment: 7.pdf

This is an example file that demonstrates the described behavior when extracted.

> Russian extraction encoding failure
> -----------------------------------
>
>                 Key: PDFBOX-398
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-398
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.7.3, 0.8.0-incubator
>         Environment: Windows XP 32-bit, CentOS 5.2 32-bit
>            Reporter: Adrian Romano
>         Attachments: 7.pdf
>
>
> I am extracting text from Russian documents with PDFTextStripper, and some of them do not extract correctly.
> When I extract on Windows using UTF-8 encoding, the output is garbage.
> When I extract on Linux using any encoding, the output is garbage.
> The only way I can get viable output is to extract the PDF on Windows without specifying an encoding. In that case the output is correct when viewed with UltraEdit, but not in Notepad. I can view the output in Notepad only after I convert the file to UTF-8 with iconv.
> It appears to me that the encoding isn't being read correctly from the PDF, and that when the text is
> written out as UTF-8 it is being double encoded or something similar. I can detect this double encoding,
> rerun the extraction with no encoding specified, and then convert the result to UTF-8 using iconv, and it is OK.
> However, this workaround does not work on Linux, where I cannot get the file to extract correctly
> using any encoding.
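
For readers unfamiliar with the "double encoding" suspected in the report, the following is a minimal Java sketch of the general effect. It does not use PDFBox and is not a confirmed diagnosis of this issue. The class and method names are hypothetical, and the choice of ISO-8859-1 as the wrongly assumed single-byte charset is an assumption made only for illustration (it maps every byte to a character, so the damage is reversible).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of PDFBox.
class DoubleEncodingDemo {

    // Simulates the suspected fault: text is encoded as UTF-8, but the bytes
    // are then misread as a single-byte charset (ISO-8859-1 is an assumption).
    // Writing the returned string out as UTF-8 encodes the text a second time.
    public static String doubleEncode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the damage: recover the original UTF-8 bytes from the
    // misdecoded characters, then decode them correctly.
    public static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A short Russian word, written with escapes so the source file's own
        // encoding does not matter.
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442";

        String garbled = doubleEncode(russian);
        String repaired = repair(garbled);

        System.out.println("UTF-8 bytes, encoded once:  "
                + russian.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("UTF-8 bytes, encoded twice: "
                + garbled.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("Round trip restores text:   "
                + repaired.equals(russian));
    }
}
```

Each Cyrillic letter occupies two bytes in UTF-8, and each of those bytes becomes its own two-byte character when encoded a second time, so doubly encoded Russian text is roughly twice the expected size. That size difference is one way such output might be detected, though whether this matches what happens inside PDFTextStripper for 7.pdf is not established here.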

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.