Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2011/03/03 18:09:36 UTC

[jira] Updated: (PDFBOX-612) Unknown encoding for 'GBK-EUC-H'

     [ https://issues.apache.org/jira/browse/PDFBOX-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-612:
--------------------------------------

    Attachment: PDFBOX612-1DE9A100d011.png
                PDFBOX612-1DE9A100d01.txt

This issue appears to be resolved in the current version of PDFBox. [1]
The text extraction result is nearly perfect. A rendering issue remains: some of the characters are shown as boxes.


[1] http://pdfbox.apache.org/download

> Unknown encoding for 'GBK-EUC-H'
> --------------------------------
>
>                 Key: PDFBOX-612
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-612
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 0.8.0-incubator
>         Environment: Windows
>            Reporter: Gang Luo
>              Labels: encoding
>         Attachments: 1DE9A100d01.pdf, PDFBOX612-1DE9A100d01.txt, PDFBOX612-1DE9A100d011.png
>
>
> PDFBox reports "Unknown encoding for 'GBK-EUC-H'" for a Chinese PDF document. Proposed fix:
> 1. Add this method to org.apache.pdfbox.pdmodel.font.PDFont.java
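>     // Returns the name of the font's /Encoding entry, or null if it is missing or not a name.
>     public String getEncodingName() {
>         COSBase encoding = font.getDictionaryObject(COSName.ENCODING);
>         // instanceof is false for null, so no separate null check is needed
>         if (encoding instanceof COSName) {
>             return ((COSName) encoding).getName();
>         }
>         return null;
>     }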
> 2. Modify the encode method
> from
>         if( retval == null && cmap != null )
>         {
>                 retval = cmap.lookup( c, offset, length );
>         }
>         //if we haven't found a value yet and
>         //we are still on the first byte and
>         //there is no cmap or the cmap does not have 2 byte mappings then try to encode
>         //using fallback methods.
> to
>         if( retval == null && cmap != null )
>         {
>             String encodingStr = getEncodingName();
>             if (encodingStr != null) {
>                 EncodingConverter converter = EncodingConversionManager.getConverter(encodingStr);
>                 if (converter != null) {
>                     if (length == 1) return null;
>                     retval = converter.convertBytes(c, offset, length, cmap);
>                 } else {
>                     retval = cmap.lookup( c, offset, length );
>                 }
>             } else {
>                 retval = cmap.lookup( c, offset, length );
>             }
>         }
>         //if we haven't found a value yet and
>         //we are still on the first byte and
>         //there is no cmap or the cmap does not have 2 byte mappings then try to encode
>         //using fallback methods.
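
The dispatch proposed in the quoted description can be illustrated outside PDFBox with a small stand-alone sketch. Everything here is an assumption for illustration: the report does not show how EncodingConverter or EncodingConversionManager are implemented, so the interface, the registry map, and the CMapLookup type below are hypothetical stand-ins, and decoding 'GBK-EUC-H' bytes with the JDK's GBK charset is a guess at what such a converter might do, not PDFBox's actual behaviour.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

public class EncodingDispatchSketch {

    /** Hypothetical stand-in for the EncodingConverter named in the report. */
    public interface EncodingConverter {
        String convertBytes(byte[] c, int offset, int length);
    }

    /** Hypothetical stand-in for the font's CMap lookup. */
    public interface CMapLookup {
        String lookup(byte[] c, int offset, int length);
    }

    /** Hypothetical registry keyed by the font's /Encoding name. */
    static final Map<String, EncodingConverter> CONVERTERS = new HashMap<>();

    static {
        // Assumption: GBK-EUC-H byte sequences can be decoded with the JDK's
        // GBK charset. The charset is optional in some runtimes, so check first.
        if (Charset.isSupported("GBK")) {
            Charset gbk = Charset.forName("GBK");
            CONVERTERS.put("GBK-EUC-H", (c, off, len) -> new String(c, off, len, gbk));
        }
    }

    /** Mirrors the shape of the proposed change: converter first, CMap otherwise. */
    public static String decode(String encodingName, byte[] c, int offset, int length,
                                CMapLookup cmap) {
        EncodingConverter converter =
                encodingName == null ? null : CONVERTERS.get(encodingName);
        if (converter != null) {
            // As in the quoted patch, a one-byte call yields null, presumably so
            // that the caller retries with two bytes.
            if (length == 1) {
                return null;
            }
            return converter.convertBytes(c, offset, length);
        }
        return cmap.lookup(c, offset, length);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xD6, (byte) 0xD0}; // one two-byte GBK code
        CMapLookup cmap = (c, off, len) -> "<cmap lookup>";
        System.out.println(decode("GBK-EUC-H", bytes, 0, 2, cmap));
        System.out.println(decode("Identity-H", bytes, 0, 2, cmap));
    }
}
```

One consequence of returning null for every one-byte call is that single-byte codes, such as ASCII characters, never go through the converter and depend entirely on the caller's fallback path.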

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira