Posted to dev@pdfbox.apache.org by "d ferbas (JIRA)" <ji...@apache.org> on 2009/11/18 15:00:40 UTC

[jira] Updated: (PDFBOX-561) Text extraction with PDFTextStripper is system file.encoding dependent. Override does not work.

     [ https://issues.apache.org/jira/browse/PDFBOX-561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

d ferbas updated PDFBOX-561:
----------------------------

    Description: 
Text extraction depends on the JVM file.encoding setting. The "override" constructor new PDFTextStripper("utf-8") (available since version 0.8.0) has no effect.
If a PDF file contains critical characters, the extracted string differs depending on the JVM system encoding.
It must be possible to set the encoding for the extraction, so that the results are the same regardless of the default system encoding.

Sample file: see attachment "blindtext_mit_bullets.pdf"
Bullets #3 to #8 differ between utf-8 and cp1252.

Be aware that the file.encoding setting only takes effect if it is passed when starting the JVM (-Dfile.encoding=utf-8). System.setProperty(..) does not work.

  was:
Text extraction depends on the JVM file.encoding setting. The "override" constructor new PDFTextStripper("utf-8") (available since version 0.8.0) has no effect.
If a PDF file contains critical characters, the extracted string differs depending on the JVM system encoding.
It must be possible to set the encoding for the extraction, so that the results are the same regardless of the default system encoding.

Sample file: http://e-nnovation.at/downloads/blindtext_mit_bullets_unsigned.pdf
Bullets #3 to #8 differ between utf-8 and cp1252.

Be aware that the file.encoding setting only takes effect if it is passed when starting the JVM (-Dfile.encoding=utf-8). System.setProperty(..) does not work.


> Text extraction with PDFTextStripper is system file.encoding dependent. Override does not work.
> -----------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-561
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-561
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.7.3, 0.8.0-incubator
>            Reporter: d ferbas
>         Attachments: blindtext_mit_bullets.pdf
>
>
> Text extraction depends on the JVM file.encoding setting. The "override" constructor new PDFTextStripper("utf-8") (available since version 0.8.0) has no effect.
> If a PDF file contains critical characters, the extracted string differs depending on the JVM system encoding.
> It must be possible to set the encoding for the extraction, so that the results are the same regardless of the default system encoding.
> Sample file: see attachment "blindtext_mit_bullets.pdf"
> Bullets #3 to #8 differ between utf-8 and cp1252.
> Be aware that the file.encoding setting only takes effect if it is passed when starting the JVM (-Dfile.encoding=utf-8). System.setProperty(..) does not work.
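
The two JVM behaviours described above can be illustrated without PDFBox. The following is only a minimal sketch using the standard library: the class and method names (EncodingDemo, encode, defaultCharsetIgnoresRuntimeChange) are hypothetical, and the bullet character U+2022 is an assumption about what the "critical characters" in the sample file might be. It shows that the same string encodes to different bytes under utf-8 and cp1252 (windows-1252), and that changing file.encoding via System.setProperty after startup does not change the default charset the JVM already cached.

```java
import java.nio.charset.Charset;

class EncodingDemo {

    // Encodes text with an explicitly named charset, so the result does not
    // depend on the JVM's file.encoding setting.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    // Changes file.encoding at runtime and reports whether the JVM's default
    // charset stayed the same. The original property value is restored.
    static boolean defaultCharsetIgnoresRuntimeChange() {
        Charset before = Charset.defaultCharset();
        String original = System.getProperty("file.encoding");
        // Pick a value that differs from the current default.
        String other = before.name().equalsIgnoreCase("UTF-8") ? "windows-1252" : "UTF-8";
        System.setProperty("file.encoding", other);
        try {
            return Charset.defaultCharset().equals(before);
        } finally {
            if (original != null) {
                System.setProperty("file.encoding", original);
            } else {
                System.clearProperty("file.encoding");
            }
        }
    }

    public static void main(String[] args) {
        String bullet = "\u2022"; // assumed example of a "critical character"
        System.out.println("utf-8 byte count:  " + encode(bullet, "UTF-8").length);
        System.out.println("cp1252 byte count: " + encode(bullet, "windows-1252").length);
        System.out.println("default charset unchanged after setProperty: "
                + defaultCharsetIgnoresRuntimeChange());
    }
}
```

Any code path that converts characters to bytes without naming a charset falls back to the cached default, which would be consistent with the extraction result depending on how the JVM was started. Whether this is the actual cause inside PDFTextStripper is not established by this report.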
