Posted to dev@pdfbox.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2021/05/04 17:48:00 UTC

[jira] [Comment Edited] (PDFBOX-5155) Error extracting text from PDF - Can't read the embedded Type1 font FDFBJU+NewsGothic

    [ https://issues.apache.org/jira/browse/PDFBOX-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339189#comment-17339189 ] 

Tilman Hausherr edited comment on PDFBOX-5155 at 5/4/21, 5:47 PM:
------------------------------------------------------------------

Maybe we're getting closer now. The text on the top right should be "DictionaryEncoding with differences" but it's just "DictionaryEncoding".
The glyph names are the correct ones. Could you please post another screenshot that shows the encoding Differences array of F15?

I'm wondering if the differences array has a flaw, or whether we ignore the differences because of the bad /ToUnicode stream.
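
For readers following along: in a PDF font dictionary, a /Differences array lists overrides on top of a base encoding. Each integer sets the current character code, and each glyph name that follows is assigned to that code, after which the code advances by one. The sketch below only illustrates that rule in plain Java. It is not PDFBox code; the class name DifferencesSketch, the method applyDifferences, and the sample array are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of PDFBox: shows how a /Differences array
// overrides entries of a base encoding (character code -> glyph name).
final class DifferencesSketch {

    static Map<Integer, String> applyDifferences(Map<Integer, String> base,
                                                 List<Object> differences) {
        // Copy the base encoding so the caller's map is left untouched.
        Map<Integer, String> result = new TreeMap<>(base);
        int code = -1; // no starting code seen yet
        for (Object item : differences) {
            if (item instanceof Integer) {
                // An integer resets the current character code.
                code = (Integer) item;
            } else if (item instanceof String) {
                if (code < 0) {
                    throw new IllegalArgumentException(
                            "glyph name '" + item + "' appears before any character code");
                }
                // A name is assigned to the current code, then the code advances.
                result.put(code, (String) item);
                code++;
            } else {
                throw new IllegalArgumentException("unexpected entry: " + item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Tiny stand-in for a base encoding.
        Map<Integer, String> base = new TreeMap<>();
        base.put(65, "A");
        base.put(66, "B");
        base.put(67, "C");

        // Made-up sample, equivalent to [ 66 /fi /fl 128 /bullet ] in PDF syntax.
        List<Object> differences = Arrays.<Object>asList(66, "fi", "fl", 128, "bullet");

        System.out.println(applyDifferences(base, differences));
    }
}
```

If the differences are ignored, or the array is malformed, codes 66 and 67 in this example would still resolve to the base names B and C, which is the kind of mismatch the question above is trying to narrow down.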


was (Author: tilman):
Maybe we're getting closer now. The text on the top right should be "DictionaryEncoding with differences" but it's just "DictionaryEncoding".
The glyphname is the correct one. Could you please post another screenshot that shows the encoding Differences array of F15?

I'm wondering if the differences array has a flaw, or whether we ignore the differences because of the bad /ToUnicode stream.

> Error extracting text from PDF - Can't read the embedded Type1 font FDFBJU+NewsGothic
> -------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5155
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5155
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.22, 2.0.23
>         Environment: Java 11
>            Reporter: nithin nambiar
>            Assignee: Tilman Hausherr
>            Priority: Major
>             Fix For: 2.0.24, 3.0.0 PDFBox
>
>         Attachments: FDFBJU+NewsGothic-0034.pfa, FDFBJU+NewsGothic-Bold-0050.pfa, FDFBJU+NewsGothic-Bold-0050.pfa, Screenshot 2021-04-30 at 12.34.20.png, image-2021-04-07-17-11-10-048.png, image-2021-04-30-13-22-09-187.png, image-2021-05-01-09-49-26-222.png, image-2021-05-01-12-54-26-202.png, image-2021-05-01-18-07-38-406.png, image-2021-05-04-09-45-53-271.png, image-2021-05-04-09-47-17-536.png, image-2021-05-04-09-47-46-988.png, image-2021-05-04-17-39-26-079.png, image-2021-05-04-17-41-37-186.png
>
>
> When I try to extract text from the command line using PDFBox versions 2.0.22 and 2.0.23, I get the following error. The PDF is customer-specific, so I can't share it here. Is this error because this particular font is not supported by PDFBox?
> {code:java}
> Apr 07, 2021 1:55:06 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNING: Invalid ToUnicode CMap in font FDFBJU+NewsGothic
> Apr 07, 2021 1:55:06 PM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
> SEVERE: Can't read the embedded Type1 font FDFBJU+NewsGothic
> java.io.IOException: Expected INTEGER or REAL but got NAME
>     at org.apache.fontbox.type1.Type1Parser.arrayToNumbers(Type1Parser.java:256)
>     at org.apache.fontbox.type1.Type1Parser.readSimpleValue(Type1Parser.java:168)
>     at org.apache.fontbox.type1.Type1Parser.parseASCII(Type1Parser.java:139)
>     at org.apache.fontbox.type1.Type1Parser.parse(Type1Parser.java:61)
>     at org.apache.fontbox.type1.Type1Font.createWithSegments(Type1Font.java:85)
>     at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:263)
>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:76)
>     at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
>     at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:933)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:515)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:489)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
>     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:144)
>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:394)
>     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:322)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:269)
>     at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:377)
>     at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:274)
>     at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:97)
>     at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
> {code}


