Posted to dev@pdfbox.apache.org by Leleu Eric <er...@gmail.com> on 2012/03/07 09:15:50 UTC

Questions about toUnicode Cmap

Hi all,


I'm currently working on the preflight issue PDFBOX-1236 [1]

The error seems to come from the management of the "toUnicode" CMap in a
Type0 font.

The "toUnicode" CMap overrides the "Encoding" CMap of the font. Because of
this behaviour, the preflight validator receives the Unicode value for each
character code present in a text operator instead of the CID value defined
in the Encoding CMap.

So I have two questions:
- Is overriding the Encoding CMap the right thing to do?
- Why is the "toUnicode" CMap used to display text? According to my
understanding of the PDF Reference v1.7, the toUnicode CMap is used to
extract text from a PDF file and to create a text file with Unicode
characters. To display the text in a PDF reader, the font content and the
Encoding CMap seem to be enough.

What is your point of view on these two points?

BR,
Eric

[1] https://issues.apache.org/jira/browse/PDFBOX-1236

Re: Questions about toUnicode Cmap

Posted by Leleu Eric <er...@gmail.com>.
Hi,

OK, thank you Andreas.
I will implement the "getCID" method [1].

BR,
Eric

[1] https://issues.apache.org/jira/browse/PDFBOX-1253


2012/3/13 Andreas Lehmkuehler <an...@lehmi.de>

> Hi,
>
> On 13.03.2012 19:10, Andreas Lehmkuehler wrote:
>> Hi
>>
>> On 09.03.2012 07:30, Andreas Lehmkuehler wrote:
>>> Hi,
>>>
>>> On 08.03.2012 09:52, Leleu Eric wrote:
>>>> Hi,
>>>>
>>>> 2012/3/8 Andreas Lehmkuehler<an...@lehmi.de>
>>>>
>>>>> Hi,
>>>>>
>>>>> On 07.03.2012 09:15, Leleu Eric wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> <SNIP>
>
>>>> I don't need to render the Text in the preflight component, I only check
>>>> that the glyph is present and I check the consistency of the width.
>>>>
>>>> Bypass the AWT-Font will be great but it is a huge work.
>>> Yes, but we need to do that, because some of the needed fonts aren't
>>> supported or the support is buggy, see PDFBOX-490.
>>>
>>>>>> What is your point of view about these two points?
>>>>>
>>>>> Probably we can find a workaround for your issue, but I need some more
>>>>> details on how the preflight code works (see above).
>> I had a look and I guess there is no workaround.
>>
>> I don't know the original purpose of PDFont#encode but nowadays it tries to
>> provide a readable version of the encoded text. AFAIK it's used in 3
>> different cases:
>>
>> - text extraction: works fine as long as PDFBox knows how to encode the
>> text
>> - rendering: the rendering uses java.awt.Font#drawString and therefore it
>> also needs the readable text. BUT this doesn't work in many cases (CID
>> fonts, substituted fonts etc.). In the long run we have to use the cid too
>> to support every kind of font
>> - preflight: ContentStreamWrapper#validText expects to get the CID when
>> calling PDFont#encode but that only works if cid == string
>>
>> To make it more complicated, the encoding cmap is overwritten if a
>> ToUnicode cmap is used at the same time.
>>
>> TODO:
>>
>> - separate the ToUnicode cmap from the encoding cmap
> I guess that's done [1]
>
>> - split PDFont#encode, to get one method providing the string and one
>> providing the cid.
>
> BR
> Andreas Lehmkühler
>
> [1] https://issues.apache.org/jira/browse/PDFBOX-1252
>

Re: Questions about toUnicode Cmap

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,

On 13.03.2012 19:10, Andreas Lehmkuehler wrote:
> Hi
>
> Am 09.03.2012 07:30, schrieb Andreas Lehmkuehler:
>> Hi,
>>
>> Am 08.03.2012 09:52, schrieb Leleu Eric:
>>> Hi,
>>>
>>> 2012/3/8 Andreas Lehmkuehler<an...@lehmi.de>
>>>
>>>> Hi,
>>>>
>>>> Am 07.03.2012 09:15, schrieb Leleu Eric:
>>>>
>>>> Hi all,
>>>>>
<SNIP>

>>> I don't need to render the Text in the preflight component, I only check
>>> that the glyph is present and I check the consistency of the width.
>>>
>>> Bypass the AWT-Font will be great but it is a huge work.
>> Yes, but we need to do that, because some of the needed fonts aren't supported
>> or the support is buggy, see PDFBOX-490.
>>
>>>> What is your point of view about these two points?
>>>>>
>>>> Probably we can find a workaround for your issue, but I need some more
>>>> details on how the preflight code works (see above).
> I had a look and I guess there is no workaround.
>
> I don't know the origin purpose of PDFont#encode but nowadays it tries to
> provide a readable version of the encoded text. AFAIK it's used in 3 different
> cases:
>
> - text extraction: works fine as long as PDFBox knows how to encode the text
> - rendering: the rendering uses java.awt.Font#drawString and therefore it also
> needs the readable text. BUT this doesn't work in many cases (CID fonts,
> substituted fonts etc.). In the long run we have to use the cid too to support
> every kind of font
> - preflight: ContentStreamWrapper#validText expects to get the CID when calling
> PDFont#encode but that only works if cid == string
>
> To make it more complicated, the encoding cmap is overwritten if a ToUnicode
> cmap is used at the same time.
>
> TODO:
>
> - separate the ToUnicode cmap from the encoding cmap
I guess that's done [1]

> - split PDFont#encode, to get one methode providing the string and one providing
> the cid.

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-1252

Re: Questions about toUnicode Cmap

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi

On 09.03.2012 07:30, Andreas Lehmkuehler wrote:
> Hi,
>
> Am 08.03.2012 09:52, schrieb Leleu Eric:
>> Hi,
>>
>> 2012/3/8 Andreas Lehmkuehler<an...@lehmi.de>
>>
>>> Hi,
>>>
>>> Am 07.03.2012 09:15, schrieb Leleu Eric:
>>>
>>> Hi all,
>>>>
>>>>
>>>> I'm currently working on the preflight issue PDFBOX-1236 [1]
>>>>
>>>> The error seems to come from the management of the "toUnicode" CMap in a
>>>> Type0 font.
>>>>
>>>> The "toUnicode" CMap overrides the "Encoding" CMap of the font. Due to
>>>> this
>>>> behaviour,
>>>> the preflight validator receives the unicode value for each character code
>>>> present in a Text operator instead of the CID value present in the
>>>> Encoding
>>>> CMap.
>>>>
>>> Can you give me a pointer where in the preflight code that exactly happens.
>>
>> You can find the Text validation in the
>> "org.apache.padaf.preflight.contentstream.ConstentStreamWrapper" class.
>> The method is validText(byte[] string).
>>
>> We ask the character to the font.encode method to know how many bytes are
>> used to describe the CID.
>> When we have the CID, the checkCID on the
>> "org.apache.padaf.preflight.font.CFFType2FontContainer" is called and an
>> exception occurred when we search the GlyphId with this CID.
>>
>> If I comment the initialization of the toUnicode map, I found the right
>> glyphs.
>> The first one is the 'W' glyph58 linked to the CID 1. (If I extract the
>> font and I read it with fontforge, the glyph 58 is the 'W' too)
> I'll have a look at the weekend.
>
>>> So I have two questions :
>>>> - Is the "Encoding overriding" the right thing to do ?
>>>> - Why the "toUnicode" Cmap is used to display text? According to my
>>>> understanding of the PDF References v1.7, the toUnicode CMap is used to
>>>> extract Text from a PDF File and to create a text file with unicode
>>>> characters. To display the text on a PDFReader, the font content and the
>>>> Encoding Cmap seem enough.
>>>>
>>> PDFBox uses Graphics2D#drawString and, more recently, java.awt.Font#createGlyphVector
>>> to render the text. The text has to be provided as a Unicode string when
>>> calling those methods.
>>> IMO we have to change that in the longrun. It would be better to create
>>> the glyphs using the font directly instead of converting it to an AWT-font.
>>>
>>
>> I don't need to render the Text in the preflight component, I only check
>> that the glyph is present and I check the consistency of the width.
>>
>> Bypass the AWT-Font will be great but it is a huge work.
> Yes, but we need to do that, because some of the needed fonts aren't supported
> or the support is buggy, see PDFBOX-490.
>
>>> What is your point of view about these two points?
>>>>
>>> Probably we can find a workaround for your issue, but I need some more
>>> details on how the preflight code works (see above).
I had a look and I guess there is no workaround.

I don't know the original purpose of PDFont#encode, but nowadays it tries to
provide a readable version of the encoded text. AFAIK it's used in 3 different
cases:

- text extraction: works fine as long as PDFBox knows how to encode the text
- rendering: the rendering uses java.awt.Graphics2D#drawString and therefore
also needs the readable text. BUT this doesn't work in many cases (CID fonts,
substituted fonts etc.). In the long run we have to use the CID here too, to
support every kind of font
- preflight: ContentStreamWrapper#validText expects to get the CID when calling
PDFont#encode, but that only works if cid == string

To make it more complicated, the encoding cmap is overwritten if a ToUnicode
cmap is used at the same time.

TODO:

- separate the ToUnicode cmap from the encoding cmap
- split PDFont#encode into one method providing the string and one providing
the CID.
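
A minimal sketch of what the two TODO items could look like, assuming a
strongly simplified font model. SplitCMapFont, mapCid, mapUnicode, codeToCid
and codeToUnicode are hypothetical names chosen for this illustration and are
not part of the PDFBox API. The two tables are kept separate so that a
ToUnicode entry can never replace a CID lookup.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a Type0 font; this is not the PDFBox PDFont class. */
final class SplitCMapFont {
    // Character code -> CID, as an Encoding CMap would define it.
    private final Map<Integer, Integer> encodingCMap = new HashMap<>();
    // Character code -> Unicode text, as a ToUnicode CMap would define it.
    private final Map<Integer, String> toUnicodeCMap = new HashMap<>();

    void mapCid(int code, int cid) {
        encodingCMap.put(code, cid);
    }

    void mapUnicode(int code, String text) {
        toUnicodeCMap.put(code, text);
    }

    /** For glyph lookup and validation: returns the CID, or -1 if the code is unmapped. */
    int codeToCid(int code) {
        Integer cid = encodingCMap.get(code);
        return cid == null ? -1 : cid;
    }

    /** For text extraction: returns the readable text, or null if the code is unmapped. */
    String codeToUnicode(int code) {
        return toUnicodeCMap.get(code);
    }
}
```

With this split, a validator would call codeToCid and a text extractor would
call codeToUnicode, so neither caller depends on which CMaps the font happens
to carry.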


>>>> BR,
>>>> Eric
>>>>
>>>> [1] https://issues.apache.org/jira/browse/PDFBOX-1236
>>>>
>>>>
>>>
>>> BR
>>> Andreas Lehmkühler
>>>
>>
>> BR
>> Eric
>
>

BR
Andreas Lehmkühler


Re: Questions about toUnicode Cmap

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,

On 08.03.2012 09:52, Leleu Eric wrote:
> Hi,
>
> 2012/3/8 Andreas Lehmkuehler<an...@lehmi.de>
>
>> Hi,
>>
>> Am 07.03.2012 09:15, schrieb Leleu Eric:
>>
>>   Hi all,
>>>
>>>
>>> I'm currently working on the preflight issue PDFBOX-1236 [1]
>>>
>>> The error seems to come from the management of the "toUnicode" CMap in a
>>> Type0 font.
>>>
>>> The "toUnicode" CMap overrides the "Encoding" CMap of the font. Due to
>>> this
>>> behaviour,
>>> the preflight validator receives the unicode value for each character code
>>> present in a Text operator instead of the CID value present in the
>>> Encoding
>>> CMap.
>>>
>> Can you give me a pointer where in the preflight code that exactly happens.
>
> You can find the Text validation in the
> "org.apache.padaf.preflight.contentstream.ConstentStreamWrapper" class.
> The method is validText(byte[] string).
>
> We ask the character to the font.encode method to know how many bytes are
> used to describe the CID.
> When we have the CID, the checkCID on the
> "org.apache.padaf.preflight.font.CFFType2FontContainer" is called and an
> exception occurred when we search the GlyphId with this CID.
>
> If I comment the initialization of the toUnicode map, I found the right
> glyphs.
> The first one is the 'W' glyph58 linked to the CID 1. (If I extract the
> font and I read it with fontforge, the glyph 58 is the 'W' too)
I'll have a look at the weekend.

>>   So I have two questions :
>>> - Is the "Encoding overriding" the right thing to do ?
>>> - Why the "toUnicode" Cmap is used to display text? According to my
>>> understanding of the PDF References v1.7, the toUnicode CMap is used to
>>> extract Text from a PDF File and to create a text file with unicode
>>> characters. To display the text on a PDFReader, the font content and the
>>> Encoding Cmap seem enough.
>>>
>> PDFBox uses Graphics2D#drawString and, more recently, java.awt.Font#createGlyphVector
>> to render the text. The text has to be provided as a Unicode string when
>> calling those methods.
>> IMO we have to change that in the longrun. It would be better to create
>> the glyphs using the font directly instead of converting it to an AWT-font.
>>
>
> I don't need to render the Text in the preflight component, I only check
> that the glyph is present and I check the consistency of the width.
>
> Bypass the AWT-Font will be great but it is a huge work.
Yes, but we need to do that, because some of the needed fonts aren't supported
or the support is buggy; see PDFBOX-490.

>>   What is your point of view about these two points?
>>>
>> Probably we can find a workaround for your issue, but I need some more
>> details on how the preflight code works (see above).
>>
>>
>>> BR,
>>> Eric
>>>
>>> [1] https://issues.apache.org/jira/browse/PDFBOX-1236
>>>
>>
>> BR
>> Andreas Lehmkühler
>>
>
> BR
> Eric


BR
Andreas Lehmkühler

Re: Questions about toUnicode Cmap

Posted by Leleu Eric <er...@gmail.com>.
Hi,



2012/3/8 Andreas Lehmkuehler <an...@lehmi.de>

> Hi,
>
> Am 07.03.2012 09:15, schrieb Leleu Eric:
>
>  Hi all,
>>
>>
>> I'm currently working on the preflight issue PDFBOX-1236 [1]
>>
>> The error seems to come from the management of the "toUnicode" CMap in a
>> Type0 font.
>>
>> The "toUnicode" CMap overrides the "Encoding" CMap of the font. Due to
>> this
>> behaviour,
>> the preflight validator receives the unicode value for each character code
>> present in a Text operator instead of the CID value present in the
>> Encoding
>> CMap.
>>
> Can you give me a pointer where in the preflight code that exactly happens.
>
>

You can find the text validation in the
"org.apache.padaf.preflight.contentstream.ContentStreamWrapper" class.
The method is validText(byte[] string).

We call the font.encode method on the character to find out how many bytes
are used to describe the CID.
Once we have the CID, checkCID on the
"org.apache.padaf.preflight.font.CFFType2FontContainer" is called, and an
exception occurs when we search for the glyph ID with this CID.

If I comment out the initialization of the toUnicode map, I find the right
glyphs.
The first one is the 'W', glyph 58, linked to CID 1. (If I extract the
font and open it in FontForge, glyph 58 is the 'W' too.)
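
The steps above can be sketched roughly as follows. This is only an
illustration under assumptions, not the actual preflight code:
TextValidationSketch, missingCids and key are hypothetical names, the
code-to-CID table stands in for the Encoding CMap, and character codes are
assumed to be one or two bytes long.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; not the padaf ContentStreamWrapper. */
final class TextValidationSketch {

    /** Combines code length and code value so 0x01 and 0x0001 stay distinct keys. */
    static int key(int length, int code) {
        return (length << 16) | code;
    }

    /**
     * Walks the bytes of a text operand, tries a one-byte code and then a
     * two-byte code at each position, and returns the CIDs that have no glyph.
     */
    static List<Integer> missingCids(byte[] text,
                                     Map<Integer, Integer> codeToCid,
                                     Set<Integer> cidsWithGlyph) {
        List<Integer> missing = new ArrayList<>();
        int i = 0;
        while (i < text.length) {
            Integer cid = null;
            int used = 0;
            int code = 0;
            for (int length = 1; length <= 2 && i + length <= text.length; length++) {
                code = (code << 8) | (text[i + length - 1] & 0xFF);
                cid = codeToCid.get(key(length, code));
                if (cid != null) {
                    used = length;
                    break;
                }
            }
            if (cid == null) {
                throw new IllegalArgumentException("No CID for code at offset " + i);
            }
            if (!cidsWithGlyph.contains(cid)) {
                missing.add(cid);
            }
            i += used;
        }
        return missing;
    }
}
```

The point of the sketch is that the lookup table must hold CIDs. If it is
filled from the toUnicode map instead, the values handed to the glyph check
are Unicode values and the search fails.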



>  So I have two questions :
>> - Is the "Encoding overriding" the right thing to do ?
>> - Why the "toUnicode" Cmap is used to display text? According to my
>> understanding of the PDF References v1.7, the toUnicode CMap is used to
>> extract Text from a PDF File and to create a text file with unicode
>> characters. To display the text on a PDFReader, the font content and the
>> Encoding Cmap seem enough.
>>
> PDFBox uses Graphics2D#drawString and, more recently, java.awt.Font#createGlyphVector
> to render the text. The text has to be provided as a Unicode string when
> calling those methods.
> IMO we have to change that in the longrun. It would be better to create
> the glyphs using the font directly instead of converting it to an AWT-font.
>

I don't need to render the text in the preflight component; I only check
that the glyph is present and that its width is consistent.
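
As a rough illustration of the width check, assuming that both the widths
declared in the PDF font dictionary and the widths read from the embedded
font program are available per CID in the same units. WidthCheckSketch,
inconsistentCids and the tolerance parameter are hypothetical and only serve
this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a width consistency check; not the preflight code. */
final class WidthCheckSketch {

    /**
     * Returns, in ascending order, the CIDs whose declared width differs from
     * the font program width by more than the tolerance, or that have no
     * width in the font program at all.
     */
    static List<Integer> inconsistentCids(Map<Integer, Float> declared,
                                          Map<Integer, Float> program,
                                          float tolerance) {
        List<Integer> result = new ArrayList<>();
        for (Map.Entry<Integer, Float> entry : new TreeMap<>(declared).entrySet()) {
            Float actual = program.get(entry.getKey());
            if (actual == null || Math.abs(actual - entry.getValue()) > tolerance) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Neither check needs a readable string, only the CID, which is why no
rendering is involved here.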

Bypassing the AWT font would be great, but it is a huge amount of work.


>  What is your point of view about these two points?
>>
> Probably we can find a workaround for your issue, but I need some more
> details on how the preflight code works (see above).
>
>
>> BR,
>> Eric
>>
>> [1] https://issues.apache.org/jira/browse/PDFBOX-1236
>>
>
> BR
> Andreas Lehmkühler
>

BR
Eric

Re: Questions about toUnicode Cmap

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,

On 07.03.2012 09:15, Leleu Eric wrote:
> Hi all,
>
>
> I'm currently working on the preflight issue PDFBOX-1236 [1]
>
> The error seems to come from the management of the "toUnicode" CMap in a
> Type0 font.
>
> The "toUnicode" CMap overrides the "Encoding" CMap of the font. Due to this
> behaviour,
> the preflight validator receives the unicode value for each character code
> present in a Text operator instead of the CID value present in the Encoding
> CMap.
Can you give me a pointer to where exactly in the preflight code that happens?

> So I have two questions :
> - Is the "Encoding overriding" the right thing to do ?
> - Why the "toUnicode" Cmap is used to display text? According to my
> understanding of the PDF References v1.7, the toUnicode CMap is used to
> extract Text from a PDF File and to create a text file with unicode
> characters. To display the text on a PDFReader, the font content and the
> Encoding Cmap seem enough.
PDFBox uses Graphics2D#drawString and, more recently, java.awt.Font#createGlyphVector
to render the text. The text has to be provided as a Unicode string when calling
those methods.
IMO we have to change that in the long run. It would be better to create the
glyphs using the font directly instead of converting it to an AWT font.

> What is your point of view about these two points?
We can probably find a workaround for your issue, but I need some more details
on how the preflight code works (see above).

> BR,
> Eric
>
> [1] https://issues.apache.org/jira/browse/PDFBOX-1236

BR
Andreas Lehmkühler