Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2017/10/12 16:14:00 UTC

[jira] [Commented] (PDFBOX-3962) No unicode mapping / Text not extracting

    [ https://issues.apache.org/jira/browse/PDFBOX-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202190#comment-16202190 ] 

Tilman Hausherr commented on PDFBOX-3962:
-----------------------------------------

Even Adobe Reader isn't able to extract that one. The glyph names are non-standard. Your workaround will not work in JDK 9 (or will need additional command-line options) because it is kind of "hacky".

What also works is a change in the source code that I haven't made because there is no guarantee that it will work for all files. In PDSimpleFont.toUnicode() change this part (the first 4 lines exist):
{code}
            unicode = unicodeGlyphList.toUnicode(name);
            if (unicode != null)
            {
                return unicode;
            }
            // can't remember what issue
            if (name.matches("C\\d\\d\\d\\d"))
            {
                unicode = new String(new byte[]{ (byte) Integer.parseInt(name.substring(1)) });
                return unicode;
            }
            // PDFBOX-3962
            if (name.matches("G[A-F0-9][A-F0-9]"))
            {
                unicode = new String(new byte[]{ (byte) Integer.parseInt(name.substring(1), 16) });
                return unicode;
            }
{code}
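
For anyone who wants to try the idea outside of PDFBox, here is a minimal standalone sketch of the same glyph-name fallback. It is not the PDFBox implementation: the class name GlyphNameFallback and the helper decode() are hypothetical. It also departs from the snippet above in two ways that are my own assumptions: it decodes the byte as ISO-8859-1 explicitly instead of relying on the platform default charset, and it rejects decimal names above 255 rather than truncating them to a byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: maps non-standard glyph names
// such as "C0065" or "G41" to a one-character string.
final class GlyphNameFallback
{
    // "C" followed by exactly four decimal digits, e.g. C0065
    private static final Pattern DECIMAL_NAME = Pattern.compile("C\\d{4}");
    // "G" followed by exactly two uppercase hex digits, e.g. G41
    private static final Pattern HEX_NAME = Pattern.compile("G[A-F0-9]{2}");

    static String toUnicode(String name)
    {
        if (name == null)
        {
            return null;
        }
        if (DECIMAL_NAME.matcher(name).matches())
        {
            int code = Integer.parseInt(name.substring(1));
            // Assumption: only single-byte codes are meaningful here.
            return code <= 0xFF ? decode(code) : null;
        }
        if (HEX_NAME.matcher(name).matches())
        {
            return decode(Integer.parseInt(name.substring(1), 16));
        }
        // Unknown naming scheme: let the caller fall back to something else.
        return null;
    }

    // Assumption: the code is treated as ISO-8859-1 so that the result does
    // not depend on the platform default charset.
    private static String decode(int code)
    {
        return new String(new byte[]{ (byte) code }, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args)
    {
        System.out.println(toUnicode("G41"));
        System.out.println(toUnicode("C0065"));
        System.out.println(toUnicode("space"));
    }
}
```

Whether ISO-8859-1 is the right interpretation for a given file is exactly the part that cannot be guaranteed; the font's glyph names carry no formal meaning, so this remains a guess that happens to fit some files.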


> No unicode mapping / Text not extracting
> ----------------------------------------
>
>                 Key: PDFBOX-3962
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3962
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Roman
>         Attachments: 72083_qdf.pdf
>
>
> From the attached [^72083_qdf.pdf] file, this text (the large letters at the top) is not extracted using PDFTextStripper:
> {code}
> AGGIE NIGHT
> AT ENRON FIELD
> FRIDAY, JUNE 15, 2001 at 7:05
> HOUSTON ASTROS VS. TEXAS RANGERS
> {code}
> It does not work well in Acrobat Reader either. However, some PDF viewers can extract it properly.
> I also found a workaround that makes it work; see below.
> 1. Find this code block in *LegacyPDFStreamEngine.java*
> {code}
>         if(unicode == null) {
>             if(!(font instanceof PDSimpleFont)) {
>                 return;
>             }
>             char c = (char)code;
>             unicode = new String(new char[]{c});
>         }
> {code}
> 2. Insert this code block just before the one found.
> {code}
>         if (unicode == null) {
>             if (font instanceof PDType1CFont) {
>                 String name = ((PDType1CFont) font).codeToName(code);
>                 try {
>                     Method method = PDType1CFont.class.getDeclaredMethod("readEncodingFromFont");
>                     method.setAccessible(true);
>                     Encoding encoding = (Encoding) method.invoke(font);
>                     Integer newCode = encoding.getNameToCodeMap().get(name);
>                     if (newCode != null && newCode.intValue() != 0) {
>                         unicode = new String(new char[]{(char) newCode.byteValue()});
>                     }
>                 } catch (NoSuchMethodException e) {
>                     e.printStackTrace();
>                 } catch (IllegalAccessException e) {
>                     e.printStackTrace();
>                 } catch (InvocationTargetException e) {
>                     e.printStackTrace();
>                 }
>             }
>         }
> {code}


