Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2011/03/03 18:09:36 UTC
[jira] Updated: (PDFBOX-612) Unknown encoding for 'GBK-EUC-H'
[ https://issues.apache.org/jira/browse/PDFBOX-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler updated PDFBOX-612:
--------------------------------------
Attachment: PDFBOX612-1DE9A100d011.png
PDFBOX612-1DE9A100d01.txt
This issue seems to be resolved in the current version of PDFBox. [1]
The text extraction result is nearly perfect. There is still a rendering issue, however: some of the characters are shown as boxes.
[1] http://pdfbox.apache.org/download
> Unknown encoding for 'GBK-EUC-H'
> --------------------------------
>
> Key: PDFBOX-612
> URL: https://issues.apache.org/jira/browse/PDFBOX-612
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel
> Affects Versions: 0.8.0-incubator
> Environment: Windows
> Reporter: Gang Luo
> Labels: encoding
> Attachments: 1DE9A100d01.pdf, PDFBOX612-1DE9A100d01.txt, PDFBOX612-1DE9A100d011.png
>
>
> PDFBox reports an unknown encoding for 'GBK-EUC-H' when processing a Chinese PDF document. To fix it:
> 1. Add this method to org.apache.pdfbox.pdmodel.font.PDFont.java:
> public String getEncodingName() {
>     COSBase encoding = font.getDictionaryObject(COSName.ENCODING);
>     if (encoding != null) {
>         if (encoding instanceof COSName) {
>             return ((COSName) encoding).getName();
>         }
>     }
>     return null;
> }
> 2. Modify the encode method.
> From:
> if( retval == null && cmap != null )
> {
>     retval = cmap.lookup( c, offset, length );
> }
> //if we haven't found a value yet and
> //we are still on the first byte and
> //there is no cmap or the cmap does not have 2 byte mappings then try to encode
> //using fallback methods.
> To:
> if( retval == null && cmap != null )
> {
>     String encodingStr = getEncodingName();
>     if (encodingStr != null) {
>         EncodingConverter converter = EncodingConversionManager.getConverter(encodingStr);
>         if (converter != null) {
>             if (length == 1) return null;
>             retval = converter.convertBytes(c, offset, length, cmap);
>         } else {
>             retval = cmap.lookup( c, offset, length );
>         }
>     } else {
>         retval = cmap.lookup( c, offset, length );
>     }
> }
> //if we haven't found a value yet and
> //we are still on the first byte and
> //there is no cmap or the cmap does not have 2 byte mappings then try to encode
> //using fallback methods.
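As a rough illustration of what a converter for 'GBK-EUC-H' might do, here is a minimal, self-contained Java sketch. It is not the EncodingConverter or EncodingConversionManager referred to in the report, whose implementations are not shown there. The class GbkDecodeSketch, its method signature, and the mapping from the CMap names to Java's "GBK" charset are all assumptions made for this example, and it relies on the running JVM providing the GBK charset.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names exist in PDFBox.
class GbkDecodeSketch {

    // Assumed mapping from PDF CMap names to Java charset names.
    private static final Map<String, String> CHARSET_BY_CMAP = new HashMap<>();
    static {
        CHARSET_BY_CMAP.put("GBK-EUC-H", "GBK"); // horizontal writing mode
        CHARSET_BY_CMAP.put("GBK-EUC-V", "GBK"); // vertical writing mode
    }

    // Decodes length bytes of c starting at offset, or returns null when
    // this sketch cannot handle the request.
    static String convertBytes(String encodingName, byte[] c, int offset, int length) {
        String charsetName = CHARSET_BY_CMAP.get(encodingName);
        if (charsetName == null || !Charset.isSupported(charsetName)) {
            return null; // no converter known: the caller would fall back to cmap.lookup
        }
        if (length == 1) {
            return null; // mirrors "if (length == 1) return null;" in the proposed change
        }
        return new String(c, offset, length, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0xD6D0 and 0xCEC4 are the GBK codes for U+4E2D and U+6587.
        byte[] data = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        System.out.println(convertBytes("GBK-EUC-H", data, 0, 2));
        System.out.println(convertBytes("GBK-EUC-H", data, 2, 2));
    }
}
```

Returning null for a single byte follows the length check in the proposed change, presumably so that the caller retries with two bytes. One consequence is that single-byte (ASCII) codes never go through the converter and are left to the fallback methods.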