Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2017/10/13 18:47:00 UTC

[jira] [Closed] (PDFBOX-3962) No unicode mapping / Text not extracting

     [ https://issues.apache.org/jira/browse/PDFBOX-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed PDFBOX-3962.
-----------------------------------
    Resolution: Won't Fix

> No unicode mapping / Text not extracting
> ----------------------------------------
>
>                 Key: PDFBOX-3962
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3962
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Roman
>         Attachments: 72083_qdf.pdf
>
>
> In the attached [^72083_qdf.pdf] file, the following text (the large letters at the top) is not extracted by PDFTextStripper:
> {code}
> AGGIE NIGHT
> AT ENRON FIELD
> FRIDAY, JUNE 15, 2001 at 7:05
> HOUSTON ASTROS VS. TEXAS RANGERS
> {code}
> Copying this text does not work well in Acrobat Reader either. However, some other PDF viewers extract it correctly.
> I also found a workaround that makes the extraction work; see below.
> 1. Find this code block in *LegacyPDFStreamEngine.java*:
> {code}
>         if(unicode == null) {
>             if(!(font instanceof PDSimpleFont)) {
>                 return;
>             }
>             char c = (char)code;
>             unicode = new String(new char[]{c});
>         }
> {code}
> 2. Insert the following code block just before the block found in step 1:
> {code}
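>         // Needs java.lang.reflect.Method and java.lang.reflect.InvocationTargetException,
>         // plus PDType1CFont and Encoding from the PDFBox font packages.
>         if (unicode == null) {
>             if (font instanceof PDType1CFont) {
>                 // Glyph name the font assigns to this character code.
>                 String name = ((PDType1CFont) font).codeToName(code);
>                 try {
>                     // readEncodingFromFont() is not public, so it is called via reflection.
>                     Method method = PDType1CFont.class.getDeclaredMethod("readEncodingFromFont");
>                     method.setAccessible(true);
>                     Encoding encoding = (Encoding) method.invoke(font);
>                     // Look the glyph name up in the font's built-in encoding.
>                     Integer newCode = encoding.getNameToCodeMap().get(name);
>                     if (newCode != null && newCode.intValue() != 0) {
>                         // intValue() instead of byteValue(): a byte is sign-extended for codes above 127.
>                         unicode = new String(new char[]{(char) newCode.intValue()});
>                     }
>                 } catch (NoSuchMethodException e) {
>                     e.printStackTrace();
>                 } catch (IllegalAccessException e) {
>                     e.printStackTrace();
>                 } catch (InvocationTargetException e) {
>                     e.printStackTrace();
>                 }
>             }
>         }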
> {code}
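
The lookup chain in the workaround above (character code, then glyph name, then code in the font's built-in encoding, then a one-character string) can be sketched without PDFBox. This is only an illustration of the idea as I read it: the class GlyphNameFallback, the method resolve, and the two maps are hypothetical stand-ins for codeToName() and getNameToCodeMap(), not PDFBox API. The sketch casts through intValue() so that codes above 127 are not sign-extended.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the fallback lookup; not part of PDFBox. */
final class GlyphNameFallback {
    private GlyphNameFallback() {
    }

    /**
     * Resolves a character code by going code -> glyph name -> code in a
     * second encoding, and returns that second code as a one-character string.
     * Returns null if either lookup fails or the resolved code is 0.
     */
    static String resolve(int code,
                          Map<Integer, String> codeToName,
                          Map<String, Integer> nameToCode) {
        String name = codeToName.get(code);
        if (name == null) {
            return null;
        }
        Integer newCode = nameToCode.get(name);
        if (newCode == null || newCode.intValue() == 0) {
            return null;
        }
        // Cast the int, not a byte, so values 128..255 stay positive.
        return String.valueOf((char) newCode.intValue());
    }

    public static void main(String[] args) {
        // Assumed sample data: code 1 is named "A" and the built-in encoding puts "A" at 65.
        Map<Integer, String> codeToName = new HashMap<Integer, String>();
        codeToName.put(1, "A");
        Map<String, Integer> nameToCode = new HashMap<String, Integer>();
        nameToCode.put("A", 65);
        System.out.println(resolve(1, codeToName, nameToCode));
    }
}
```

Treating the resolved code as a Unicode value is only reasonable where the built-in encoding agrees with Unicode for that code, as it does for the ASCII capitals in the sample text; this is a heuristic and not a general Unicode mapping.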



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
