
Posted to dev@pdfbox.apache.org by "Peter Costello (JIRA)" <ji...@apache.org> on 2010/05/10 22:48:30 UTC

[jira] Created: (PDFBOX-725) Text extraction fails due to font problem with Type0, supplement-0 font

Text extraction fails due to font problem with Type0, supplement-0 font
-----------------------------------------------------------------------

                 Key: PDFBOX-725
                 URL: https://issues.apache.org/jira/browse/PDFBOX-725
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.2.0
         Environment: Fedora 11 or windows
            Reporter: Peter Costello
             Fix For: 1.2.0


Text extraction fails. In particular, download and view page 3 (1-based) of:
http://www.encana.com/investors/financial/annualreports/2008/pdfs/annual-report-2008.pdf

With pdfbox text extraction, most of the page is displayed as "?". Other pages in the file have similar problems.
To reproduce, set a breakpoint at line 376 of org.apache.pdfbox.util.PDFStreamEngine (i.e. at "String c = font.encode( string, i, codeLength );").
Make the breakpoint conditional on "string.length==52"; this is the second occurrence of the problem.

Text extraction yields multiple "?" because the font encoding is not found.
The characters to be extracted are normal western characters.

The font COSDictionary contains:
COSName{Subtype}=COSName{Type0}
COSName{DescendantFonts}=COSArray{[COSObject{554, 0}]}
COSName{BaseFont}=COSName{HelveticaNeueLTStd-Lt-Identity-H}
COSName{Encoding}=COSName{Identity-H}
COSName{Type}=COSName{Font}

The "font.descendentFont" has the following COSDictionary items:
COSName{Subtype}=COSName{CIDFontType0}
COSName{FontDescriptor}=COSObject{540, 0}
COSName{BaseFont}=COSName{ALJOHE+HelveticaNeueLTStd-Lt}
COSName{W}=...
COSName{CIDSystemInfo}=COSDictionary{(COSName{Supplement}:COSInt{0}) (COSName{Ordering}:COSString{Identity})
      (COSName{Registry}:COSString{Adobe}) }
etc.

All the "CMap" lookups return null.
Code added to PDFont in February 2010 for Japanese characters tries "Adobe-Identity-UCS2", but that does not work. I think that code fragment should also
try "Adobe-Identity-0", or should not run at all if the dictionary contains "Supplement=0". I manually tried "Adobe-Identity-0", but it does not exist.
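
For illustration only, the two CMap names mentioned above can be derived from the CIDSystemInfo entries roughly as follows. This is a sketch, not the PDFont code: the class and method names are hypothetical, and only the two resulting names come from this report.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: builds candidate CMap names
// from the three CIDSystemInfo entries shown in the font dump.
class CMapNameCandidates {

    static List<String> candidates(String registry, String ordering, int supplement) {
        List<String> names = new ArrayList<String>();
        String prefix = registry + "-" + ordering;
        // Name that the existing PDFont code tries, according to this report.
        names.add(prefix + "-UCS2");
        // Name suggested in this report: Registry-Ordering-Supplement.
        names.add(prefix + "-" + supplement);
        return names;
    }

    public static void main(String[] args) {
        // Values taken from the CIDSystemInfo dictionary above.
        System.out.println(candidates("Adobe", "Identity", 0));
        // prints [Adobe-Identity-UCS2, Adobe-Identity-0]
    }
}
```

Neither name resolved to a CMap in my tests, so the sketch only shows which names are being discussed.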

The descendentFont has an encoding that corresponds to Supplement-0, but modifying PDFont to use the descendentFont encoding is not sufficient.
In particular, "descendentFont.getEncoding()" looks promising as a source for the encoding.

What is very odd is that the second "string" displays in Acrobat Reader as "Portfolio of ....", but the byte[] contains:
[0, 49, 0, 80, 0, 83, 0, 85, 0, 71, 0, 80, 0, 77, 0, 74, 0, 80, 0, 1, 0, 80, 0, 71, 0, 1, 0, 70, 0, 84, 0, 85, 0, 66, 0, 67, 0, 77, 0, 74, 0, 84, 0, 73, 0, 70, 0, 69, 0, 1, 0, 83]
These are not the corresponding ASCII characters, so it seems that these values are either indexes into a font table, or the byte[] was loaded with incorrect values.
I have confirmed that each pair of bytes should be converted into a single char.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PDFBOX-725) Text extraction fails due to font problem with Type0, supplement-0 font

Posted by "Peter Costello (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880920#action_12880920 ] 

Peter Costello commented on PDFBOX-725:
---------------------------------------

I updated to the latest repository revision, and it contains no fix for the issue
reported in PDFBOX-725.  I compared the code, and your updates from v955005 and
v955007 are there.  Your updates were to PDType1CFont.java.  Most of the text in
the problem PDF is PDType1C, but the font with difficulties is at the end of the
page; it is a PDType0Font and its descendentFont is a PDCIDFontType0Font.

 - Peter Costello






[jira] Commented: (PDFBOX-725) Text extraction fails due to font problem with Type0, supplement-0 font

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882080#action_12882080 ] 

Andreas Lehmkühler commented on PDFBOX-725:
-------------------------------------------

My updates are related to the issue, but they didn't solve it. I'm still investigating. The problem is the CID encoding.



[jira] Updated: (PDFBOX-725) Text extraction fails due to font problem with Type0, supplement-0 font

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated PDFBOX-725:
---------------------------------

    Fix Version/s:     (was: 1.2.0)



[jira] Updated: (PDFBOX-725) Text extraction fails due to font problem with Type0, supplement-0 font

Posted by "Peter Costello (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Costello updated PDFBOX-725:
----------------------------------

    Description: 
Text extraction fails. In particular, download and view page 23 (1-based), or other pages, of:
http://www.encana.com/investors/financial/annualreports/2008/pdfs/annual-report-2008.pdf

With pdfbox text extraction, the last 5 lines of the page are displayed as "?". Other pages in the file have similar problems.
Text extraction yields multiple "?" because "font.encode(buf,i,2)" returns null.

The font COSDictionary contains:
COSName{Subtype}=COSName{Type0}
COSName{DescendantFonts}=COSArray{[COSObject{554, 0}]}
COSName{BaseFont}=COSName{HelveticaNeueLTStd-Lt-Identity-H}
COSName{Encoding}=COSName{Identity-H}
COSName{Type}=COSName{Font}

The "font.descendentFont" has the following COSDictionary items:
COSName{Subtype}=COSName{CIDFontType0}
COSName{FontDescriptor}=COSObject{540, 0}
COSName{BaseFont}=COSName{ALJOHE+HelveticaNeueLTStd-Lt}
COSName{W}=...
COSName{CIDSystemInfo}=COSDictionary{(COSName{Supplement}:COSInt{0}) (COSName{Ordering}:COSString{Identity}) (COSName{Registry}:COSString{Adobe}) }
COSName{DW}=COSInt{1000}
COSName{Type}=COSName{Font}

The "fontDescriptor" of the descendentFont is:
{COSName{StemV}=COSInt{58}, 
COSName{FontName}=COSName{ALJOHE+HelveticaNeueLTStd-Lt}, 
COSName{FontFile3}=COSObject{543, 0}, 
COSName{CIDSet}=COSObject{545, 0}, 
COSName{Flags}=COSInt{6}, 
COSName{Descent}=COSInt{-271}, 
COSName{FontBBox}=COSArray{[COSInt{-166}, COSInt{-214}, COSInt{1050}, COSInt{967}]}, 
COSName{Ascent}=COSInt{752}, 
COSName{CapHeight}=COSInt{737}, 
COSName{XHeight}=COSInt{553}, 
COSName{Type}=COSName{FontDescriptor}, 
COSName{ItalicAngle}=COSInt{0}, 
COSName{StemH}=COSInt{45}}

The last 5 lines on the page are:
"Increased Cash Flow by 11 percent to $9,386 million;"
"Increased Operating Earnings by ..."
etc

These 5 lines are encoded as 2 bytes per character (it is a Type0 font).
Each 2-byte code is 31 less than the ASCII value of the character it displays.
For instance, code "0x00, 0x01" should convert to ASCII "0x0020" (a space).
The font is an "Identity" font, which means codes should just map to Latin ISO chars.
Yeah, this is a Type0 font which can display a subset of another font (the Latin ISO),
but how come the codes differ from the ASCII values by 31?
This same offset of 31 is found on all other pages of the file using this font.

The font descriptor for the descendentFont has "Flags=6". Bit 3 is "Symbolic", which
PDF Spec 5.7.1 describes as "Font contains glyphs outside the Adobe standard Latin character set."
Maybe because the font is "Symbolic" there is not a 1:1 map from codes to ASCII.
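
As a rough illustration of the Flags value above, the following sketch decodes it bit by bit. The class and constant names are made up for this example and are not PDFBox API; the bit positions are the ones given in the font descriptor flags table of the PDF Reference, where bit 1 is the lowest-order bit.

```java
// Hypothetical helper for reading a font descriptor "Flags" value.
class FontFlags {
    static final int FIXED_PITCH = 1 << 0; // bit 1
    static final int SERIF       = 1 << 1; // bit 2
    static final int SYMBOLIC    = 1 << 2; // bit 3
    static final int SCRIPT      = 1 << 3; // bit 4
    static final int NONSYMBOLIC = 1 << 5; // bit 6

    // Returns true if every bit of the mask is set in flags.
    static boolean isSet(int flags, int mask) {
        return (flags & mask) == mask;
    }

    public static void main(String[] args) {
        int flags = 6; // value reported in the font descriptor above
        System.out.println("Serif:       " + isSet(flags, SERIF));       // true
        System.out.println("Symbolic:    " + isSet(flags, SYMBOLIC));    // true
        System.out.println("Nonsymbolic: " + isSet(flags, NONSYMBOLIC)); // false
    }
}
```

So Flags=6 sets bits 2 and 3: the font is marked both Serif and Symbolic, and not Nonsymbolic.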

The question is whether the PDF file specifies the +31 offset and pdfbox fails to account for it properly. I can't find any reference to such an offset in the PDF spec. The 'getFirstChar()' in the descendentFont is -1, but the real value is "32". Maybe this +31 offset just equals 'firstChar-1'?

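The real firstChar can be found via:
  // "font" is the Type0 PDFont whose text fails to extract.
  COSDictionary fontDict = (COSDictionary)font.getCOSObject();
  COSArray descendantFontArray = (COSArray)fontDict.getDictionaryObject(COSName.DESCENDANT_FONTS);
  if (descendantFontArray != null)  {
    COSDictionary descendantFontDictionary = (COSDictionary)descendantFontArray.getObject(0);
    PDFont descendentFont = PDFontFactory.createFont(descendantFontDictionary);
    Encoding encoding = descendentFont.getEncoding();
    // codeMap was not defined in the snippet as first posted. It is assumed here to be
    // the encoding's code-to-name map; getCodeToNameMap() may not exist in every release.
    Map codeMap = encoding.getCodeToNameMap();
    // The smallest code in the map is the real firstChar.
    Iterator keyIterator = codeMap.keySet().iterator();
    int firstChar=Integer.MAX_VALUE;
    while (keyIterator.hasNext()) firstChar = Math.min(firstChar,((Integer)keyIterator.next()).intValue());
  }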

Another example, on page 3 of the document:
The text "Portfolio of ...." displays in Acrobat Reader, but the byte[] contains:
[0, 49, 0, 80, 0, 83, 0, 85, 0, 71, 0, 80, 0, 77, 0, 74, 0, 80, 0, 1, 0, 80, 0, 71, 0, 1, 0, 70, 0, 84, 0, 85, 0, 66, 0, 67, 0, 77, 0, 74, 0, 84, 0, 73, 0, 70, 0, 69, 0, 1, 0, 83]
Again, if +31 is added to each of these 2-byte codes, the ASCII values are found.
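
The observation can be checked with a small standalone sketch. It is not a proposed fix: it hard-codes the offset of 31 seen in this file, and the class and method names are invented for the example.

```java
// Hypothetical demo: applies the observed +31 offset to the 2-byte codes.
class OffsetDecode {

    // Reads the data as big-endian 2-byte codes and adds the offset to each.
    static String decode(byte[] data, int offset) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < data.length; i += 2) {
            int code = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
            sb.append((char) (code + offset));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The 52 bytes listed above.
        byte[] data = {
            0, 49, 0, 80, 0, 83, 0, 85, 0, 71, 0, 80, 0, 77, 0, 74, 0, 80,
            0, 1, 0, 80, 0, 71, 0, 1, 0, 70, 0, 84, 0, 85, 0, 66, 0, 67,
            0, 77, 0, 74, 0, 84, 0, 73, 0, 70, 0, 69, 0, 1, 0, 83
        };
        System.out.println(decode(data, 31)); // prints "Portfolio of established r"
    }
}
```

The 52 bytes give 26 codes, and the decoded text matches what Acrobat Reader displays.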

Where does this "+31" come from?  Acrobat Reader gets it right.  How about pdfbox?


         Labels: PDFStreamEngine "Type0 Font" "Symbolic Flag" IDENTITY-H +31 map offset  (was: PDFStreamEngine PDFont IDENTITY-H descendentFont descendantFont PDCIDFontType0Font)



[jira] Updated: (PDFBOX-725) Text extraction fails due to font problem with Type0, supplement-0 font

Posted by "Peter Costello (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Costello updated PDFBOX-725:
----------------------------------

    Attachment: annual-report-2008_pg23.pdf

Page 23 from Encana annual-report-2008.pdf

There are other pages with comparable problems, and these problems are undoubtedly related to similar problems with Chinese/Japanese/Korean Type0 fonts.

> Text extraction fails due to font problem with Type0, supplement-0 font
> -----------------------------------------------------------------------
>
>                 Key: PDFBOX-725
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-725
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.0
>         Environment: Fedora 11 or windows
>            Reporter: Peter Costello
>             Fix For: 1.2.0
>
>         Attachments: annual-report-2008_pg23.pdf
>
>
> Text extraction fails. In particular, download and view pg23 or others (1-based) of:
> http://www.encana.com/investors/financial/annualreports/2008/pdfs/annual-report-2008.pdf
> With pdfbox text extraction, the last 5 lines of the page are displayed as "?". Other pages in the file have similar problems.
> Text extraction yields multiple "?" because "font.encode(buf,i,2)" returns null.
> The font COSDictionary contains:
> COSName{Subtype}=COSName{Type0}
> COSName{DescendantFonts}=COSArray{[COSObject{554, 0}]}
> COSName{BaseFont}=COSName{HelveticaNeueLTStd-Lt-Identity-H}
> COSName{Encoding}=COSName{Identity-H}
> COSName{Type}=COSName{Font}
> The "font.descendentFont" has the following COSDictionary items:
> COSName{Subtype}=COSName{CIDFontType0}
> COSName{FontDescriptor}=COSObject{540, 0}
> COSName{BaseFont}=COSName{ALJOHE+HelveticaNeueLTStd-Lt}
> COSName{W}=...
> COSName{CIDSystemInfo}=COSDictionary{(COSName{Supplement}:COSInt{0}) (COSName{Ordering}:COSString{Identity},(COSName{Registry}:COSString{Adobe}) }
> COSName{DW}=COSInt{1000}
> COSName{Type}=COSName{Font}
> The "fontDescriptor" of the descendentFont is:
> {COSName{StemV}=COSInt{58}, 
> COSName{FontName}=COSName{ALJOHE+HelveticaNeueLTStd-Lt}, 
> COSName{FontFile3}=COSObject{543, 0}, 
> COSName{CIDSet}=COSObject{545, 0}, 
> COSName{Flags}=COSInt{6}, 
> COSName{Descent}=COSInt{-271}, 
> COSName{FontBBox}=COSArray{[COSInt{-166}, COSInt{-214}, COSInt{1050}, COSInt{967}]}, 
> COSName{Ascent}=COSInt{752}, 
> COSName{CapHeight}=COSInt{737}, 
> COSName{XHeight}=COSInt{553}, 
> COSName{Type}=COSName{FontDescriptor}, 
> COSName{ItalicAngle}=COSInt{0}, 
> COSName{StemH}=COSInt{45}}
> The last 5 lines on the page are:
> "Increased Cash Flow by 11 percent to $9,386 million;"
> "Increased Operating Earnings by ..."
> etc
> These 5 lines are encoded as 2 bytes per character (it is a Type0 font).
> Each 2-byte code is offset by 31 from its displayed value.
> For instance, code "0x00, 0x01" should convert to ASCII "0x0020" (a space).
> The font is an "Identity" font, which means codes should just map to Latin ISO chars.
> Yes, this is a Type0 font which can display a subset of another font (the Latin ISO), 
> but why do the codes differ from the ASCII values by 31?
> This same 31 offset is found on all other pages of the file using this font.
> The font descriptor for the descendentFont has "Flags=6". Bit 3 is "Symbolic".
> PDF Spec 5.7.1 "Font contains glyphs outside the Adobe standard Latin character set."
> Maybe because the Font is "Symbolic" there is not a 1:1 map from codes to ascii.
> The question is whether the PDF file specifies the +31 offset and pdfbox fails to account for it. I can't find any reference to such an offset in the PDF spec. The 'getFirstChar()' of the descendentFont returns -1, but the real value is "32". Maybe this +31 offset is just 'firstChar-1'?
> The real firstChar can be found via:
>   COSDictionary fontDict = (COSDictionary)font.getCOSObject();
>   COSArray descendantFontArray = (COSArray)fontDict.getDictionaryObject(COSName.DESCENDANT_FONTS);
>   if (descendantFontArray != null)  {
>     COSDictionary descendantFontDictionary = (COSDictionary)descendantFontArray.getObject(0);
>     PDFont descendentFont = PDFontFactory.createFont(descendantFontDictionary);
>     Encoding encoding = descendentFont.getEncoding();
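>     // NOTE: 'codeMap' is not declared in this snippet as posted; presumably it is the code-to-name map of 'encoding'.
>     Iterator keyIterator = codeMap.keySet().iterator();
>     // firstChar ends up as the smallest character code present in the map.
>     int firstChar = Integer.MAX_VALUE;
>     while (keyIterator.hasNext()) {
>       firstChar = Math.min(firstChar, ((Integer)keyIterator.next()).intValue());
>     }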
>   }
> Another example, on page 3 of the document:
> The text "Portfolio of ...." displays in Acrobat Reader, but the byte[] contains:
> [0, 49, 0, 80, 0, 83, 0, 85, 0, 71, 0, 80, 0, 77, 0, 74, 0, 80, 0, 1, 0, 80, 0, 71, 0, 1, 0, 70, 0, 84, 0, 85, 0, 66, 0, 67, 0, 77, 0, 74, 0, 84, 0, 73, 0, 70, 0, 69, 0, 1, 0, 83]
> Again, adding 31 to each of these 2-byte codes yields the ASCII value.
> Where does this "+31" come from?  Acrobat Reader gets it right.  How about pdfbox?
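
The "Flags=6" value quoted for the descendant font's descriptor can be decoded bit by bit. The sketch below assumes the 1-based bit numbering of the font descriptor flags in the PDF spec (bit 1 is the low-order bit: bit 2 Serif, bit 3 Symbolic, bit 6 Nonsymbolic). The class and constant names are illustrative only and are not PDFBox API.

```java
// Standalone illustration; does not use PDFBox.
class FontFlagsSketch {

    // Masks for the 1-based bit positions of the font descriptor flags.
    static final int FIXED_PITCH = 1 << 0; // bit 1
    static final int SERIF       = 1 << 1; // bit 2
    static final int SYMBOLIC    = 1 << 2; // bit 3
    static final int NONSYMBOLIC = 1 << 5; // bit 6

    // Returns true if every bit of 'mask' is set in 'flags'.
    static boolean isSet(int flags, int mask) {
        return (flags & mask) == mask;
    }

    public static void main(String[] args) {
        int flags = 6; // value reported for ALJOHE+HelveticaNeueLTStd-Lt
        System.out.println("serif=" + isSet(flags, SERIF)
                + " symbolic=" + isSet(flags, SYMBOLIC)
                + " nonsymbolic=" + isSet(flags, NONSYMBOLIC));
        // prints "serif=true symbolic=true nonsymbolic=false"
    }
}
```

So the value 6 sets the Serif and Symbolic bits, which is consistent with the report's reading that the font is flagged as "Symbolic".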
