Posted to dev@pdfbox.apache.org by "d ferbas (JIRA)" <ji...@apache.org> on 2009/11/18 14:40:39 UTC

[jira] Created: (PDFBOX-561) Text extraction with PDFTextStripper is system file.encoding dependent. Override does not work.

Text extraction with PDFTextStripper is system file.encoding dependent. Override does not work.
-----------------------------------------------------------------------------------------------

                 Key: PDFBOX-561
                 URL: https://issues.apache.org/jira/browse/PDFBOX-561
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 0.8.0-incubator, 0.7.3
            Reporter: d ferbas


Text extraction depends on the JVM file.encoding setting. The "override" new PDFTextStripper("utf-8") (available since version 0.8.0) has no effect.
If a PDF file contains critical characters (for example the bullets in the sample file), the extracted string differs depending on the JVM's default encoding.
It must be possible to set the encoding for the extraction so that the results are the same regardless of the system's default encoding.

Sample file: http://e-nnovation.at/downloads/blindtext_mit_bullets_unsigned.pdf
Bullets #3 to #8 are extracted differently under utf-8 and cp1252.

Note that the file.encoding setting only takes effect when passed at JVM startup (-Dfile.encoding=utf-8). Calling System.setProperty(..) at runtime does not work.
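
To illustrate why the default encoding matters, here is a minimal sketch that uses only the JDK and does not call PDFBox. It assumes the affected characters include the bullet U+2022; the report does not say which code points are involved. The class name EncodingDemo and its helper methods are made up for this example. It shows that the same character produces different bytes under utf-8 and cp1252, and that naming the charset explicitly gives the same result whatever file.encoding is set to.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Encode with an explicitly named charset; the JVM default is never consulted.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Decode with an explicitly named charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Render bytes as upper-case hex so the difference is visible.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: a bullet like the ones in the sample file.
        String bullet = "\u2022";
        Charset cp1252 = Charset.forName("windows-1252");

        System.out.println("JVM default charset: " + Charset.defaultCharset());
        System.out.println("utf-8 bytes:  " + hex(encode(bullet, StandardCharsets.UTF_8)));
        System.out.println("cp1252 bytes: " + hex(encode(bullet, cp1252)));

        // Writing as utf-8 and reading back as cp1252 garbles the character.
        byte[] utf8 = encode(bullet, StandardCharsets.UTF_8);
        System.out.println("round trip ok: " + bullet.equals(decode(utf8, cp1252)));
    }
}
```

Running this with different -Dfile.encoding values should change only the first line of output, because every conversion names its charset explicitly. An extraction API that honours its encoding argument would behave the same way.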

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PDFBOX-561) Text extraction with PDFTextStripper is system file.encoding dependent. Override does not work.

Posted by "d ferbas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

d ferbas updated PDFBOX-561:
----------------------------

    Description: 
Text extraction depends on the JVM file.encoding setting. The "override" new PDFTextStripper("utf-8") (available since version 0.8.0) has no effect.
If a PDF file contains critical characters (for example the bullets in the sample file), the extracted string differs depending on the JVM's default encoding.
It must be possible to set the encoding for the extraction so that the results are the same regardless of the system's default encoding.

Sample file: see attachment "blindtext_mit_bullets.pdf"
Bullets #3 to #8 are extracted differently under utf-8 and cp1252.

Note that the file.encoding setting only takes effect when passed at JVM startup (-Dfile.encoding=utf-8). Calling System.setProperty(..) at runtime does not work.

  was:
Text extraction depends on the JVM file.encoding setting. The "override" new PDFTextStripper("utf-8") (available since version 0.8.0) has no effect.
If a PDF file contains critical characters (for example the bullets in the sample file), the extracted string differs depending on the JVM's default encoding.
It must be possible to set the encoding for the extraction so that the results are the same regardless of the system's default encoding.

Sample file: http://e-nnovation.at/downloads/blindtext_mit_bullets_unsigned.pdf
Bullets #3 to #8 are extracted differently under utf-8 and cp1252.

Note that the file.encoding setting only takes effect when passed at JVM startup (-Dfile.encoding=utf-8). Calling System.setProperty(..) at runtime does not work.




[jira] Updated: (PDFBOX-561) Text extraction with PDFTextStripper is system file.encoding dependent. Override does not work.

Posted by "d ferbas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

d ferbas updated PDFBOX-561:
----------------------------

    Attachment: blindtext_mit_bullets.pdf

sample file for encoding problems
