Posted to fop-dev@xmlgraphics.apache.org by mehdi houshmand <me...@gmail.com> on 2010/11/09 12:08:36 UTC

TrueType Font Embedding

Hi,

I'm working on making TTF subset embedding configurable such that a
user can opt for either full font embedding, subset embedding, or just
referencing; this would extend the work Jeremias submitted. I
was considering adding a parameter to the font configuration file
called "embedding" with 3 possible values: "none", "subset", and "full".
This would allow the user to configure the embedding mode on a
font-by-font basis. What do people think about this proposal?
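
For illustration, a font entry in the configuration might then look
like this (the "embedding" attribute is only the proposal above, not an
existing option, and the font path is made up):

<font embed-url="fonts/DejaVuSans.ttf" embedding="subset">
  <font-triplet name="DejaVu Sans" style="normal" weight="normal"/>
</font>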

Thanks

Mehdi

RE: wrong glyph shown for box (was: TrueType Font Embedding)

Posted by Eric Douglas <ed...@blockhouse.com>.
Hi Jeremias,
	I finally got time to check on this and here's what I found.
Our web designer said she looked at the Lucida Sans Typewriter font in
the Adobe professional font editor and it appeared to have the Unicode
characters 25A1 and 25AB in it.  I just opened the Lucida Sans
Typewriter Regular font LUCON.TTF in the Windows Character Map program
and those characters are missing.  There is another font, Lucida Sans
Unicode (L_10646.TTF), a much larger file, which does show those
characters.  If I just add that file to my list of embedded fonts for
the renderer and specify font=Lucida Sans Unicode for the square
characters, FOP 1.0 prints a square.  Apparently it's just an odd
coincidence that FOP 0.95 prints the Unicode character 0x25A1 for a
missing glyph, which happened to be the glyph I was trying to print.
Using FOP 1.0, which is apparently doing what 0.95 was supposed to be
doing, just makes the 56KB 2-page PDF file into a 59KB file.
 

-----Original Message-----
From: Jeremias Maerki [mailto:dev@jeremias-maerki.ch] 
Sent: Friday, November 19, 2010 3:29 AM
To: fop-dev@xmlgraphics.apache.org
Subject: Re: wrong glyph shown for box (was: TrueType Font Embedding)

Hi Eric

Sorry for the delay. I've looked at the example you sent me
off-list.

On 12.11.2010 09:32:13 Jeremias Maerki wrote:
<snip/>
> > The problem I'm currently having with output is rendering special
> > Unicode glyphs.  I sent one Unicode character as 25AB with the font
> > file LTYPE.TTF which came installed with Windows XP.  In FOP 0.95 it
> > produced a square, which is what I want.  That character is supposed
> > to be a square.  If I'm wrong and that character is not in the font,
> > then the square was the default print for a character not found.  I'd
> > like to be able to run a routine through FOP to get out a list of all
> > Unicode code points and which glyphs they map to for a particular font.
> > When I tried FOP 1.0, that same code produced a pound sign (#).
> 
> Hmm, sounds like a regression. I guess we'll have to look into that
> then. And such a glyph dump utility is definitely something FOP could
> profit from. Has anybody already written something like that? We could
> integrate it into org.apache.fop.tools.fontlist maybe.

It's not really a regression although the change is curious. Anyway, the
box you got with FOP 0.95 was not the 0x25AB character (WHITE SMALL
SQUARE) but actually the .notdef glyph which often is a big square (not
a small square). In later versions, FOP seems to catch the missing
character and replace it with "#". The "#" character is not really the
right one to display for a glyph that was not found, but FOP has been
doing that for 10 years. Maybe that will get looked at at some point. But I
don't know (and won't investigate) why FOP 0.95 didn't produce a "#".

Anyway, the 0x25AB glyph is not in the font you're using. If you want a
little square glyph, you need to use a different font. You can use
Windows' "Character Map" tool to find a suitable one.

<snip/>

Jeremias Maerki


Re: wrong glyph shown for box (was: TrueType Font Embedding)

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.
Hi Eric

Sorry for the delay. I've looked at the example you sent me off-list.

On 12.11.2010 09:32:13 Jeremias Maerki wrote:
<snip/>
> > The problem I'm currently having with output is rendering special
> > Unicode glyphs.  I sent one Unicode character as 25AB with the font file
> > LTYPE.TTF which came installed with Windows XP.  In FOP 0.95 it produced
> > a square, which is what I want.  That character is supposed to be a
> > square.  If I'm wrong and that character is not in the font, then the
> > square was the default print for a character not found.  I'd like to be
> > able to run a routine through FOP to get out a list of all Unicode code
> > points and which glyphs they map to for a particular font.  When I tried
> > FOP 1.0, that same code produced a pound sign (#).
> 
> Hmm, sounds like a regression. I guess we'll have to look into that then.
> And such a glyph dump utility is definitely something FOP could profit
> from. Has anybody already written something like that? We could
> integrate it into org.apache.fop.tools.fontlist maybe.

It's not really a regression although the change is curious. Anyway, the
box you got with FOP 0.95 was not the 0x25AB character (WHITE SMALL
SQUARE) but actually the .notdef glyph which often is a big square (not
a small square). In later versions, FOP seems to catch the missing
character and replace it with "#". The "#" character is not really the
right one to display for a glyph that was not found, but FOP has been
doing that for 10 years. Maybe that will get looked at at some point. But
I don't know (and won't investigate) why FOP 0.95 didn't produce a "#".

Anyway, the 0x25AB glyph is not in the font you're using. If you want a
little square glyph, you need to use a different font. You can use
Windows' "Character Map" tool to find a suitable one.

<snip/>

Jeremias Maerki


Re: TrueType Font Embedding

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.
On 11.11.2010 22:10:57 Eric Douglas wrote:
> If using installed fonts is an option to save space in the file / data
> stream, using embedded fonts still needs to be an option.

Eric, we're not talking about removing anything. We're talking about
adding TrueType support to PostScript output and handling referenced
TrueType fonts with possibly full Unicode support.

> I am assigning specific fonts from specific files to get consistent
> output, so everything must be embedded.  I don't want to have to care
> what is installed where.  I am glad to be rid of the headaches I've had
> with Windows 98 trying to use Courier New when different PCs with the
> same OS had different font files, and with rendering on the server
> versus the client when a font was missing or a different font was
> installed under the same name.
> 
> The problem I'm currently having with output is rendering special
> Unicode glyphs.  I sent one Unicode character as 25AB with the font file
> LTYPE.TTF which came installed with Windows XP.  In FOP 0.95 it produced
> a square, which is what I want.  That character is supposed to be a
> square.  If I'm wrong and that character is not in the font, then the
> square was the default print for a character not found.  I'd like to be
> able to run a routine through FOP to get out a list of all Unicode code
> points and which glyphs they map to for a particular font.  When I tried
> FOP 1.0, that same code produced a pound sign (#).

Hmm, sounds like a regression. I guess we'll have to look into that then.
And such a glyph dump utility is definitely something FOP could profit
from. Has anybody already written something like that? We could
integrate it into org.apache.fop.tools.fontlist maybe.
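
Until something like that exists in FOP, a stand-alone sketch along
these lines (plain Java2D, not FOP, and without any error handling)
can at least list which 16-bit code points a TTF file can display:

import java.awt.Font;
import java.io.File;

public class GlyphDump {
    public static void main(String[] args) throws Exception {
        // Load the TTF file directly; no need to install the font
        Font font = Font.createFont(Font.TRUETYPE_FONT, new File(args[0]));
        for (char c = 0; c < 0xFFFF; c++) {
            if (font.canDisplay(c)) {
                System.out.printf("U+%04X%n", (int) c);
            }
        }
    }
}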

> The biggest problem I'm having running FOP 0.95 is the threading.  I've
> tried calling it from a Java SwingWorker and it's not resolving the
> issue.  I'm running a javax.swing.JProgressBar as indeterminate and it
> freezes while I'm transforming FOP output, so the users think the
> program is just stuck and I have to explain to them it's supposed to do
> that the first time.  If they run it twice in a row the second one is
> much smoother.

I've never used FOP in a way that it interacts with a Swing GUI. Maybe
there's some interaction with AWT/Java2D, since FOP uses Java2D
extensively depending on the output format. But it absolutely makes
sense to run FOP in a thread other than AWT's event dispatch thread.
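
For reference, the usual pattern is to do all the FOP work in
SwingWorker.doInBackground() and touch Swing components only in done().
A minimal sketch (the file names and progress bar wiring are
placeholders, and error handling is omitted):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.swing.JProgressBar;
import javax.swing.SwingWorker;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;
import org.apache.fop.apps.Fop;
import org.apache.fop.apps.FopFactory;
import org.apache.fop.apps.MimeConstants;

class RenderTask extends SwingWorker<Void, Void> {
    private final JProgressBar bar;

    RenderTask(JProgressBar bar) {
        this.bar = bar;
    }

    protected Void doInBackground() throws Exception {
        // Everything here runs off the event dispatch thread
        FopFactory fopFactory = FopFactory.newInstance();
        OutputStream out =
            new BufferedOutputStream(new FileOutputStream("out.pdf"));
        try {
            Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, out);
            Transformer t =
                TransformerFactory.newInstance().newTransformer();
            t.transform(new StreamSource("input.fo"),
                    new SAXResult(fop.getDefaultHandler()));
        } finally {
            out.close();
        }
        return null;
    }

    protected void done() {
        bar.setIndeterminate(false); // back on the EDT here
    }
}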

> Getting smaller results is nice but not necessarily a priority.
> Reducing a 2 MB file to 35 K is high priority.  Reducing a 46 K file to
> 35 K is not a big deal.  Getting consistent output is top priority.
> 
> 
> -----Original Message-----
> From: Jeremias Maerki [mailto:dev@jeremias-maerki.ch] 
> Sent: Thursday, November 11, 2010 3:35 PM
> To: fop-dev@xmlgraphics.apache.org
> Subject: Re: TrueType Font Embedding
> 
> Hi Chris
> 
> I fully understand the desire to install the font on a PostScript
> printer to keep the PS files smaller. To answer your question: I did not
> ask for the business use case. The problem I'm struggling with in this
> context is how to know about the CID meaning of the font, i.e. the
> multi-byte encoding of the font.
> 
> When we do subsets in FOP, we re-index the glyphs starting with index 1
> (or 3) by occurrence in the document. Only FOP knows which Unicode
> character is represented by which CID. That's why we need the ToUnicode
> CMap in PDF. Otherwise, text extraction would not be so easy.
> 
> In single-byte mode, the whole font is embedded (right now probably with
> the same problems I've just fixed with rev1034094 for the TTF subset).
> In this mode the Adobe character names map into the font, so 8-bit
> encodings can be built to properly address the right characters even if
> the font is not embedded. That's also how we currently do referenced TTF
> fonts for PDF output.
> 
> If we fully embed the font as a CID font, we currently lose the
> knowledge about which index represents which Unicode character.
> Combining the font with a suitable CMap resolves the problem but at the
> moment we only use Identity-H which is a 1:1 mapping. One solution would
> be to turn the Unicode "cmap" table in the TrueType font into a custom
> PS CMap and then use 16-bit Unicode characters directly. FOP currently
> doesn't support that.
> 
> Also, if some PS platform allows uploading naked TrueType fonts, how
> will they be represented in the PS VM? Are they CID fonts then or
> single-byte fonts? If they are CID fonts, which CID system are they
> following? I have no idea. The only way to be sure about this is by
> installing a CID font plus CMap that is generated by FOP (which can be
> done by extracting these resources from one of the PS streams). After
> that, the font can be referenced, but it may not be portable to other
> PS-generating applications.
> 
> And then, as Glenn mentioned, we have to have a strategy to deal with
> glyphs with no representation in Unicode. I think I get where he goes
> with that and it seems to be close to the CMap I mentioned above that is
> derived from the Unicode "cmap" table in the TrueType font. At any rate,
> FOP then has to learn to output Unicode characters (including private
> area chars) instead of arbitrary CIDs coming from subsetting.
> 
> In the end, I'm not 100% sure I've understood all implications here. I hope
> we'll get there soon. I guess a Wiki page would do us good here.
> 
> 
> 
> Jeremias Maerki
> 




Jeremias Maerki


Re: TrueType Font Embedding

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.
On 12.11.2010 09:17:44 Chris Bowditch wrote:
<snip/>
> Thanks for the detailed explanation. I think I follow what you mean. 
> IIUC what you say above, then when we fully embed the CID TTF it would
> not be extractable? In the same way, a subsetted font is
> meaningless when extracted.

I think so, but I'm not 100% sure. Theoretically, if the Unicode cmap
tables are preserved (or even generated for the subset fonts), that
information is retained. But with a separate CMap resource that is
detached from the actual cidfont resource, it's difficult. Of course,
there's also the font resource that combines the two (the CMap and the
cidfont) again. So to make this work, all three resources
have to be kept together somehow. Example:

%%BeginResource: font EAAACC+HYb1gj
/EAAACC+HYb1gj /Identity-H [/EAAACC+HYb1gj] composefont pop
%%EndResource

> If this is true then clearly there is little 
> value in making this configurable without also adding the extra tables 
> you mention above, which I am guessing is a lot of work and probably not 
> worth it.
> 
> What about Type1 fonts? Do we always embed the font fully and can they 
> be extracted for re-use?

The good thing about Type1 fonts is that they are PostScript programs
which can be embedded with almost no changes. And you've also always
got each glyph referenced by its Adobe glyph name. But then we're also
not talking about CID Type1 fonts, where the same problem probably
applies.

> <snip/>
> 
> Thanks,
> 
> Chris




Jeremias Maerki


Re: TrueType Font Embedding

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.
Useful info. Thanks!

On 16.11.2010 17:48:47 Vincent Hennebert wrote:
> Installing a font on a printer is a problem, post-processing
> a PostScript file is another one.
> 
> It is indeed an issue to determine how to reference a font that has been
> manually installed on a printer. I tried once to install a TrueType font
> on a Xerox printer, and the Xerox utility I used for that tried to
> convert it. Into what? No idea.
> 
> I tried to reference the Kochi Gothic font manually installed on an HP
> printer, and using the PostScript name (Kochi-Gothic) didn’t work.
> Printing the font list gave ‘Kochi Gothic’ with the space in
> between, and AFAIK it’s not possible to use a space in a font name in
> PostScript.

Yes, the PS names use hyphens instead. I would also expect the
PostScript name from the TTF font to be used.

> I tried to reference an ornaments font installed on the Xerox printer,
> using the TrueType file provided with the printer to get the metrics.
> I got it working by deriving a font with a custom encoding. I don’t know
> whether the actual font on the printer was in Type 1 or TrueType format.
> The method of deriving a custom encoding would have been the same
> anyway. Maybe it was even some proprietary format.

Yes, the glyph description type really doesn't matter, as long as you know
how to address the individual glyphs (Adobe names or CIDs).

> So, it’s difficult to know whether a font that is manually installed on
> a printer will be converted or not, accessible as a single-byte font or
> a CIDFont, etc. And each make is likely to do it differently.

That's what I feared. When analysing the last problem (HP printed some
glyphs badly), I started to write some PS code to dump a font dictionary
to the console. I didn't get too far. If a PostScript program could be
written that creates a report for one or more PS fonts on a printer, we
might learn more about how the printer offers the fonts.

But if some printers present pre-installed TTF fonts differently than
others, that would make the whole thing rather complicated for users,
which means we should recommend embedding full fonts.

> However, AFAIU from Chris, there is still an interest in fully embedding
> a font to allow post-processing by a print bureau. For example,
> concatenating several FOP-produced documents into a single big print
> job. In that case we don’t care about the printer. Everything remains in
> the control of FOP. It’s up to us whether we want to use base fonts or
> CID-keyed fonts. And I don’t think the user even wants to know how we do
> it, as long as they have the option to either fully embed, or
> subset-embed the font.

Agreed.

<snip/>

Jeremias Maerki


Re: TrueType Font Embedding

Posted by Vincent Hennebert <vh...@gmail.com>.
Installing a font on a printer is a problem, post-processing
a PostScript file is another one.

It is indeed an issue to determine how to reference a font that has been
manually installed on a printer. I tried once to install a TrueType font
on a Xerox printer, and the Xerox utility I used for that tried to
convert it. Into what? No idea.

I tried to reference the Kochi Gothic font manually installed on an HP
printer, and using the PostScript name (Kochi-Gothic) didn’t work.
Printing the font list gave ‘Kochi Gothic’ with the space in
between, and AFAIK it’s not possible to use a space in a font name in
PostScript.

I tried to reference an ornaments font installed on the Xerox printer,
using the TrueType file provided with the printer to get the metrics.
I got it working by deriving a font with a custom encoding. I don’t know
whether the actual font on the printer was in Type 1 or TrueType format.
The method of deriving a custom encoding would have been the same
anyway. Maybe it was even some proprietary format.

So, it’s difficult to know whether a font that is manually installed on
a printer will be converted or not, accessible as a single-byte font or
a CIDFont, etc. And each make is likely to do it differently.

However, AFAIU from Chris, there is still an interest in fully embedding
a font to allow post-processing by a print bureau. For example,
concatenating several FOP-produced documents into a single big print
job. In that case we don’t care about the printer. Everything remains in
the control of FOP. It’s up to us whether we want to use base fonts or
CID-keyed fonts. And I don’t think the user even wants to know how we do
it, as long as they have the option to either fully embed, or
subset-embed the font.


Vincent


On 11/11/10 20:35, Jeremias Maerki wrote:
> Hi Chris
> 
> I fully understand the desire to install the font on a PostScript
> printer to keep the PS files smaller. To answer your question: I did not
> ask for the business use case. The problem I'm struggling with in this
> context is how to know about the CID meaning of the font, i.e. the
> multi-byte encoding of the font.
> 
> When we do subsets in FOP, we re-index the glyphs starting with index 1
> (or 3) by occurrence in the document. Only FOP knows which Unicode
> character is represented by which CID. That's why we need the ToUnicode
> CMap in PDF. Otherwise, text extraction would not be so easy.
> 
> In single-byte mode, the whole font is embedded (right now probably with
> the same problems I've just fixed with rev1034094 for the TTF subset).
> In this mode the Adobe character names map into the font, so 8-bit
> encodings can be built to properly address the right characters even if
> the font is not embedded. That's also how we currently do referenced TTF
> fonts for PDF output.
> 
> If we fully embed the font as a CID font, we currently lose the
> knowledge about which index represents which Unicode character.
> Combining the font with a suitable CMap resolves the problem but at the
> moment we only use Identity-H which is a 1:1 mapping. One solution would
> be to turn the Unicode "cmap" table in the TrueType font into a custom PS
> CMap and then use 16-bit Unicode characters directly. FOP currently
> doesn't support that.
> 
> Also, if some PS platform allows uploading naked TrueType fonts, how
> will they be represented in the PS VM? Are they CID fonts then or
> single-byte fonts? If they are CID fonts, which CID system are they
> following? I have no idea. The only way to be sure about this is by
> installing a CID font plus CMap that is generated by FOP (which can be
> done by extracting these resources from one of the PS streams). After
> that, the font can be referenced, but it may not be portable to other
> PS-generating applications.
> 
> And then, as Glenn mentioned, we have to have a strategy to deal with
> glyphs with no representation in Unicode. I think I get where he goes
> with that and it seems to be close to the CMap I mentioned above that is
> derived from the Unicode "cmap" table in the TrueType font. At any rate,
> FOP then has to learn to output Unicode characters (including private
> area chars) instead of arbitrary CIDs coming from subsetting.
> 
> In the end, I'm not 100% sure I've understood all implications here. I hope
> we'll get there soon. I guess a Wiki page would do us good here.
> 
> On 11.11.2010 17:50:46 Chris Bowditch wrote:
>> Hi All,
>>
>> On 09/11/2010 14:43, Jeremias Maerki wrote:
>>> On 09.11.2010 14:48:30 Vincent Hennebert wrote:
>>>> There may be an interest in fully embedding a font for PostScript
>>>> output. IIUC there may be a print manager that pre-processes PostScript
>>>> files, extracts embedded fonts to store them somewhere and re-use them
>>>> whenever needed. It can then strip the font off subsequent files and
>>>> substantially lighten them, speeding up the printing process.
>>> It makes the files smaller, but that will be the only thing that
>>> improves printing performance. The PS interpreter still has to parse and
>>> process the actual resource. It also needs to be noted that extracting
>>> subset fonts doesn't make sense. I've already added the unique-ification
>>> prefix to the TTF font names (like in PDF) to avoid problems like that.
>>
>> Yes, I agree extracting subset fonts doesn't make sense, but extracting a
>> fully embedded font does have plenty of business applications, which is
>> precisely why the introduction of a setting is required here. In some
>> cases it is important to bring the file size down; enter the subsetting
>> feature. Subsetting is particularly useful when creating print streams
>> with a relatively small number of pages, i.e. 100 or fewer, and you have
>> large Unicode fonts to support Eastern character sets.
>>
>>   In other situations people using FOP want to be able to create large 
>> Print streams to send to Print Bureaus. Print Bureaus tend to use
>> software to parse Print streams rather than sending them directly to a
>> printer. Those processes will often need to be able to process the
>> fonts, which they can only do if the full font is embedded rather than a
>> subset. As you already noted above, extracting a subset is useless.
>>
>>>> What’s the purpose of the ‘encoding’ parameter? It looks to me like
>>>> users don’t care about what encoding is used in the PDF or PostScript
>>>> file. All they want to have is properly printed documents that use their
>>>> own fonts. I think that parameter should be removed in favour of Mehdi’s
>>>> proposal, which IMO makes much more sense from a user perspective.
>>> I don't know if it's necessary. That's why I wrote that additional
>>> research may be necessary. If we don't have it, we may have to build up
>>> a /CIDMap that covers Unicode because there is otherwise no information
>>> in the font about which character indices correspond to which glyph as long as
>>> we use /Registry (Adobe) /Ordering (Identity). Or: you configure a CID
>>> map (encoding) that is tailored to the kind of document you want to
>>> produce. The Unicode /CIDMap could result in rather big /CIDMap arrays
>>> (65535 * 4 = 256KB) with lots of pointers to ".notdef".
>>
>>  From a user's perspective, the encoding parameter is too technical and
>> most users will not understand its purpose. If possible I would like to
>> reach a consensus on what we should do and then remove the parameter to
>> help cut down the complexity of configuring fonts. As you noted there 
>> are now a bewildering number of options.
>>
>>> Before continuing with this there should be a broad understanding how
>>> non-subset TrueType fonts shall be handled in PostScript (and PDF where
>>> you can make the same case). Otherwise, a change like Mehdi proposed
>>> doesn't improve anything.
>>
>> Are you asking what the business use case is for fully embedded fonts as
>> opposed to subset fonts? The ability to post-process is the most
>> important use case. If the fonts are subset it becomes difficult to merge
>> PostScript files together or extract the font. Both are fairly common at
>> Print bureaus.
>>>> Granted, there would be some redundancy with the referenced-fonts
>>>> element. But is the additional flexibility of regexp really useful in
>>>> the first place? I’m not too sure. Maybe that could be removed too.
>>> I don't want that removed. I've been grateful for its existence more
>>> than once. With the regexp I can make sure that, for example, all
>>> variants of the "Frutiger" font are not embedded: Frutiger 45 Light,
>>> Frutiger 55 Roman etc. etc.
>>
>> I concur the regexp stuff in the font referencing is useful. We can use 
>> it to change the way whole font families are referenced without having 
>> to list every font.
>>> Anyway, I don't like constantly changing the way fonts are configured.
>>> There's enough confusion with the way it's currently done already. I
>>> won't veto a change like that but I'm not happy with it.
>>
>> I understand what you are saying; there are a lot of options, but then
>> the requirements around fonts are complex, so there is no escaping a
>> complex configuration file.
>>
>> Thanks,
>>
>> Chris
>>
>>>> Vincent
>>>>
>>>>
>>>> On 09/11/10 12:45, Jeremias Maerki wrote:
>>>>> Hi Mehdi,
>>>>> I'm against that since we already have mechanisms to control some of
>>>>> these traits and this would overlap with them. For example, we have the
>>>>> referenced-fonts element
>>>>> (http://xmlgraphics.apache.org/fop/trunk/fonts.html#embedding)
>>>>> which controls whether we embed or not. And we have the encoding-mode
>>>>> attribute on the font element to control if single-byte or cid mode
>>>>> should be used. Granted, that's not exactly what you're after, but I
>>>>> believe this already covers 95% of the use cases if not more.
>>>>>
>>>>> The only thing you can't currently do is embed a full font in CID mode
>>>>> (or reference it). The problem here is the character map that should be
>>>>> used when in CID mode. I think that would require some research first so
>>>>> we know how best to handle this. For example, referencing only makes
>>>>> sense if a TrueType font can be installed directly on the printer. But
>>>>> then, the question is in which mode the characters can be addressed.
>>>>> Single-byte (like we currently fall back to) is probably not a problem
>>>>> unless you need to print Asian documents. Please note that we also don't
>>>>> support full TTF embedding/referencing in CID mode in PDF documents. So
>>>>> I'm not sure if we really need that at the moment.
>>>>>
>>>>> If we do, I believe it would generally suffice to extend encoding-mode
>>>>> from (auto|single-byte|cid) to (auto|single-byte|cid|cid-full). We may
>>>>> need a "cmap" parameter then to change the default CMap (currently
>>>>> "Identity-H" like in PDF) since our subsetting code uses custom mappings,
>>>>> not Unicode or any other encoding scheme (like "90ms-RKSJ-H").
>>>>>
>>>>> On 09.11.2010 12:08:36 mehdi houshmand wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm working on making TTF subset embedding configurable such that a
>>>>>> user can opt for either full font embedding, subset embedding, or just
>>>>>> referencing; this would extend the work Jeremias submitted. I
>>>>>> was considering adding a parameter to the font configuration file
>>>>>> called "embedding" with 3 possible values: "none", "subset", and "full".
>>>>>> This would allow the user to configure the embedding mode on a
>>>>>> font-by-font basis. What do people think about this proposal?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Mehdi
>>>>>
>>>>>
>>>>>
>>>>> Jeremias Maerki
>>>>>
>>>
>>>
>>>
>>> Jeremias Maerki
>>>
>>>
>>>
> 
> 
> 
> 
> Jeremias Maerki
> 

RE: TrueType Font Embedding

Posted by Eric Douglas <ed...@blockhouse.com>.
If using installed fonts is an option to save space in the file / data
stream, using embedded fonts still needs to be an option.
I am assigning specific fonts from specific files to get consistent
output, so everything must be embedded.  I don't want to have to care
what is installed where.  I am glad to be rid of the headaches I've had
with Windows 98 trying to use Courier New when different PCs with the
same OS had different font files, and with rendering on the server
versus the client when a font was missing or a different font was
installed under the same name.

The problem I'm currently having with output is rendering special
Unicode glyphs.  I sent one Unicode character as 25AB with the font file
LTYPE.TTF which came installed with Windows XP.  In FOP 0.95 it produced
a square, which is what I want.  That character is supposed to be a
square.  If I'm wrong and that character is not in the font, then the
square was the default print for a character not found.  I'd like to be
able to run a routine through FOP to get out a list of all Unicode code
points and which glyphs they map to for a particular font.  When I tried
FOP 1.0, that same code produced a pound sign (#).

The biggest problem I'm having running FOP 0.95 is the threading.  I've
tried calling it from a Java SwingWorker and it's not resolving the
issue.  I'm running a javax.swing.JProgressBar as indeterminate and it
freezes while I'm transforming FOP output, so the users think the
program is just stuck and I have to explain to them it's supposed to do
that the first time.  If they run it twice in a row the second one is
much smoother.

Getting smaller results is nice but not necessarily a priority.
Reducing a 2 MB file to 35 K is high priority.  Reducing a 46 K file to
35 K is not a big deal.  Getting consistent output is top priority.


-----Original Message-----
From: Jeremias Maerki [mailto:dev@jeremias-maerki.ch] 
Sent: Thursday, November 11, 2010 3:35 PM
To: fop-dev@xmlgraphics.apache.org
Subject: Re: TrueType Font Embedding

Hi Chris

I fully understand the desire to install the font on a PostScript
printer to keep the PS files smaller. To answer your question: I did not
ask for the business use case. The problem I'm struggling with in this
context is how to know about the CID meaning of the font, i.e. the
multi-byte encoding of the font.

When we do subsets in FOP, we re-index the glyphs starting with index 1
(or 3) by occurrence in the document. Only FOP knows which Unicode
character is represented by which CID. That's why we need the ToUnicode
CMap in PDF. Otherwise, text extraction would not be so easy.

In single-byte mode, the whole font is embedded (right now probably with
the same problems I've just fixed with rev1034094 for the TTF subset).
In this mode the Adobe character names map into the font, so 8-bit
encodings can be built to properly address the right characters even if
the font is not embedded. That's also how we currently do referenced TTF
fonts for PDF output.

If we fully embed the font as a CID font, we currently lose the
knowledge about which index represents which Unicode character.
Combining the font with a suitable CMap resolves the problem but at the
moment we only use Identity-H which is a 1:1 mapping. One solution would
be to turn the Unicode "cmap" table in the TrueType font into a custom
PS CMap and then use 16-bit Unicode characters directly. FOP currently
doesn't support that.

Also, if some PS platform allows uploading naked TrueType fonts, how
will they be represented in the PS VM? Are they CID fonts then or
single-byte fonts? If they are CID fonts, which CID system are they
following? I have no idea. The only way to be sure about this is by
installing a CID font plus CMap that is generated by FOP (which can be
done by extracting these resources from one of the PS streams). After
that, the font can be referenced, but it may not be portable to other
PS-generating applications.

And then, as Glenn mentioned, we have to have a strategy to deal with
glyphs with no representation in Unicode. I think I get where he goes
with that and it seems to be close to the CMap I mentioned above that is
derived from the Unicode "cmap" table in the TrueType font. At any rate,
FOP then has to learn to output Unicode characters (including private
area chars) instead of arbitrary CIDs coming from subsetting.

In the end, I'm not 100% sure I've understood all implications here. I hope
we'll get there soon. I guess a Wiki page would do us good here.



Jeremias Maerki


Re: TrueType Font Embedding

Posted by Chris Bowditch <bo...@hotmail.com>.
On 11/11/2010 20:35, Jeremias Maerki wrote:
> Hi Chris

Hi Jeremias,
> I fully understand the desire to install the font on a PostScript
> printer to keep the PS files smaller. To answer your question: I did not
> ask for the business use case. The problem I'm struggling with in this
> context is how to know about the CID meaning of the font, i.e. the
> multi-byte encoding of the font.
>
> When we do subsets in FOP, we re-index the glyphs starting with index 1
> (or 3) by occurrence in the document. Only FOP knows which Unicode
> character is represented by which CID. That's why we need the ToUnicode
> CMap in PDF. Otherwise, text extraction would not be so easy.
>
> In single-byte mode, the whole font is embedded (right now probably with
> the same problems I've just fixed with rev1034094 for the TTF subset).
> In this mode the Adobe character names map into the font, so 8-bit
> encodings can be built to properly address the right characters even if
> the font is not embedded. That's also how we currently do referenced TTF
> fonts for PDF output.
>
> If we fully embed the font as a CID font, we currently lose the
> knowledge about which index represents which Unicode character.
> Combining the font with a suitable CMap resolves the problem but at the
> moment we only use Identity-H which is a 1:1 mapping. One solution would
> be to turn the Unicode "cmap" table in the TrueType font into a custom PS
> CMap and then use 16-bit Unicode characters directly. FOP currently
> doesn't support that.
Thanks for the detailed explanation. I think I follow what you mean.
IIUC what you say above, then when we fully embed the CID TTF it would
not be extractable? In the same way, a subsetted font is
meaningless when extracted. If this is true, then clearly there is little
value in making this configurable without also adding the extra tables
you mention above, which I am guessing is a lot of work and probably not
worth it.

What about Type1 fonts? Do we always embed the font fully and can they 
be extracted for re-use?

<snip/>

Thanks,

Chris

Re: TrueType Font Embedding

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.
Hi Chris

I fully understand the desire to install the font on a PostScript
printer to keep the PS files smaller. To answer your question: I did not
ask for the business use case. The problem I'm struggling with in this
context is how to know about the CID meaning of the font, i.e. the
multi-byte encoding of the font.

When we do subsets in FOP, we re-index the glyphs starting with index 1
(or 3) by occurrence in the document. Only FOP knows which Unicode
character is represented by which CID. That's why we need the ToUnicode
CMap in PDF. Otherwise, text extraction would not be so easy.
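
To illustrate the principle (this is not FOP's actual code): subset
indices are handed out in order of first use, and the index-to-Unicode
table built alongside is what the ToUnicode CMap is generated from.

import java.util.LinkedHashMap;
import java.util.Map;

class SubsetIndexSketch {
    private final Map<Integer, Integer> glyphToSubsetIndex =
        new LinkedHashMap<Integer, Integer>();
    private final Map<Integer, Character> subsetIndexToUnicode =
        new LinkedHashMap<Integer, Character>();
    private int nextIndex = 1; // starts at 1 (or 3), as described above

    /** Maps a Unicode char plus its glyph index in the original font. */
    int mapChar(char unicode, int glyphIndex) {
        Integer subsetIndex = glyphToSubsetIndex.get(glyphIndex);
        if (subsetIndex == null) {
            subsetIndex = Integer.valueOf(nextIndex++);
            glyphToSubsetIndex.put(glyphIndex, subsetIndex);
            // Only this table knows which Unicode character the new
            // index stands for
            subsetIndexToUnicode.put(subsetIndex,
                    Character.valueOf(unicode));
        }
        return subsetIndex.intValue();
    }
}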

In single-byte mode, the whole font is embedded (right now probably with
the same problems I've just fixed with rev1034094 for the TTF subset).
In this mode the Adobe character names map into the font, so 8-bit
encodings can be built to properly address the right characters even if
the font is not embedded. That's also how we currently do referenced TTF
fonts for PDF output.

If we fully embed the font as a CID font, we currently lose the
knowledge about which index represents which Unicode character.
Combining the font with a suitable CMap resolves the problem but at the
moment we only use Identity-H which is a 1:1 mapping. One solution would
be to turn the Unicode "cmap" table in the TrueType font into a custom PS
CMap and then use 16-bit Unicode characters directly. FOP currently
doesn't support that.

Also, if some PS platform allows uploading naked TrueType fonts, how
will they be represented in the PS VM? Are they CID fonts then or
single-byte fonts? If they are CID fonts, which CID system are they
following? I have no idea. The only way to be sure about this is by
installing a CID font plus CMap that is generated by FOP (which can be
done by extracting these resources from one of the PS streams). After
that, the font can be referenced, but it may not be portable to other
PS-generating applications.

And then, as Glenn mentioned, we have to have a strategy to deal with
glyphs with no representation in Unicode. I think I get where he goes
with that and it seems to be close to the CMap I mentioned above that is
derived from the Unicode "cmap" table in the TrueType font. At any rate,
FOP then has to learn to output Unicode characters (including private
area chars) instead of arbitrary CIDs coming from subsetting.

In the end, I'm not 100% sure I've understood all implications here. I hope
we'll get there soon. I guess a Wiki page would do us good here.

On 11.11.2010 17:50:46 Chris Bowditch wrote:
> Hi All,
> 
> On 09/11/2010 14:43, Jeremias Maerki wrote:
> > On 09.11.2010 14:48:30 Vincent Hennebert wrote:
> >> There may be an interest in fully embedding a font for PostScript
> >> output. IIUC there may be a print manager that pre-processes PostScript
> >> files, extracts embedded fonts to store them somewhere and re-use them
> >> whenever needed. It can then strip the font off subsequent files and
> >> substantially lighten them, speeding up the printing process.
> > It makes the files smaller, but that will be the only thing that
> > improves printing performance. The PS interpreter still has to parse and
> > process the actual resource. It also needs to be noted that extracting
> > subset fonts doesn't make sense. I've already added the unique-ification
> > prefix to the TTF font names (like in PDF) to avoid problems like that.
> 
> Yes, I agree extracting subset fonts doesn't make sense, but extracting a
> fully embedded font does have plenty of business applications, which is
> precisely why the introduction of a setting is required here. In some
> cases it is important to bring the file size down; enter the subsetting
> feature. Subsetting is particularly useful when creating print streams
> with a relatively small number of pages, i.e. 100 or fewer, and you have
> large Unicode fonts to support Eastern character sets.
> 
>   In other situations people using FOP want to be able to create large 
> Print streams to send to Print Bureaus. Print Bureaus tend to use
> software to parse Print streams rather than sending them directly to a
> printer. Those processes will often need to be able to process the
> fonts, which they can only do if the full font is embedded rather than a
> subset. As you already noted above, extracting a subset is useless.
> 
> >> What’s the purpose of the ‘encoding’ parameter? It looks to me like
> >> users don’t care about what encoding is used in the PDF or PostScript
> >> file. All they want to have is properly printed documents that use their
> >> own fonts. I think that parameter should be removed in favour of Mehdi’s
> >> proposal, which IMO makes much more sense from a user perspective.
> > I don't know if it's necessary. That's why I wrote that additional
> > research may be necessary. If we don't have it, we may have to build up
> > a /CIDMap that covers Unicode because there is otherwise no information
> > in the font about which character indices correspond to which glyph as long as
> > we use /Registry (Adobe) /Ordering (Identity). Or: you configure a CID
> > map (encoding) that is tailored to the kind of document you want to
> > produce. The Unicode /CIDMap could result in rather big /CIDMap arrays
> > (65535 * 4 = 256KB) with lots of pointers to ".notdef".
> 
>  From a user's perspective, the encoding parameter is too technical and
> most users will not understand its purpose. If possible I would like to
> reach a consensus on what we should do and then remove the parameter to
> help cut down the complexity of configuring fonts. As you noted there 
> are now a bewildering number of options.
> 
> > Before continuing with this there should be a broad understanding how
> > non-subset TrueType fonts shall be handled in PostScript (and PDF where
> > you can make the same case). Otherwise, a change like Mehdi proposed
> > doesn't improve anything.
> 
> Are you asking what the business use case is for fully embedded fonts as
> opposed to subset fonts? The ability to post-process is the most
> important use case. If the fonts are subset it becomes difficult to merge
> PostScript files together or extract the font. Both are fairly common at
> Print bureaus.
> >> Granted, there would be some redundancy with the referenced-fonts
> >> element. But is the additional flexibility of regexp really useful in
> >> the first place? I’m not too sure. Maybe that could be removed too.
> > I don't want that removed. I've been grateful for its existence more
> > than once. With the regexp I can make sure that, for example, all
> > variants of the "Frutiger" font are not embedded: Frutiger 45 Light,
> > Frutiger 55 Roman etc. etc.
> 
> I concur the regexp stuff in the font referencing is useful. We can use 
> it to change the way whole font families are referenced without having 
> to list every font.
> > Anyway, I don't like constantly changing the way fonts are configured.
> > There's enough confusion with the way it's currently done already. I
> > won't veto a change like that but I'm not happy with it.
> 
> I understand what you are saying; there are a lot of options, but then
> the requirements around fonts are complex, so there is no escaping a
> complex configuration file.
> 
> Thanks,
> 
> Chris
> 
> >> Vincent
> >>
> >>
> >> On 09/11/10 12:45, Jeremias Maerki wrote:
> >>> Hi Mehdi,
> >>> I'm against that since we already have mechanisms to control some of
> >>> these traits and this would overlap with them. For example, we have the
> >>> referenced-fonts element
> >>> (http://xmlgraphics.apache.org/fop/trunk/fonts.html#embedding)
> >>> which controls whether we embed or not. And we have the encoding-mode
> >>> attribute on the font element to control if single-byte or cid mode
> >>> should be used. Granted, that's not exactly what you're after, but I
> >>> believe this already covers 95% of the use cases if not more.
> >>>
> >>> The only thing you can't currently do is embed a full font in CID mode
> >>> (or reference it). The problem here is the character map that should be
> >>> used when in CID mode. I think that would require some research first so
> >>> we know how best to handle this. For example, referencing only makes
> >>> sense if a TrueType font can be installed directly on the printer. But
> >>> then, the question is in which mode the characters can be addressed.
> >>> Single-byte (like we currently fall back to) is probably not a problem
> >>> unless you need to print Asian documents. Please note that we also don't
> >>> support full TTF embedding/referencing in CID mode in PDF documents. So
> >>> I'm not sure if we really need that at the moment.
> >>>
> >>> If we do, I believe it would generally suffice to extend encoding-mode
> >>> from (auto|single-byte|cid) to (auto|single-byte|cid|cid-full). We may
> >>> need a "cmap" parameter then to change the default CMap (currently
> >>> "Identity-H" like in PDF) since our subsetting code uses custom mappings,
> >>> not Unicode or any other encoding scheme (like "90ms-RKSJ-H").
> >>>
> >>> On 09.11.2010 12:08:36 mehdi houshmand wrote:
> >>>> Hi,
> >>>>
> >>>> I'm working on making TTF subset embedding configurable such that a
> >>>> user can opt for either full font embedding, subset embedding, or just
> >>>> referencing; this would extend the work Jeremias submitted. I
> >>>> was considering adding a parameter to the font configuration file
> >>>> called "embedding" with 3 possible values: "none", "subset", and "full".
> >>>> This would allow the user to configure the embedding mode on a
> >>>> font-by-font basis. What do people think about this proposal?
> >>>>
> >>>> Thanks
> >>>>
> >>>> Mehdi
> >>>
> >>>
> >>>
> >>> Jeremias Maerki
> >>>
> >
> >
> >
> > Jeremias Maerki
> >
> >
> >




Jeremias Maerki


Re: TrueType Font Embedding

Posted by mehdi houshmand <me...@gmail.com>.
Hi Jeremias,

Without trying to flog a dead horse here, would you mind if I changed
your TTFSubSetFile.GlyphHandler interface by putting in the
implementation from PSFontUtils (i.e. making it an inner class rather
than an interface)?  My reason for this change is that I'm creating
unit tests, so that I can implement the changes we discussed above
confidently. I do appreciate that this would be slightly
counter-intuitive, but I do think my reason is a valid one, since unit
tests (and regression testing) are one area that the fonts package is
in dire need of. Anyway, is this something you have any particular
objections to?

By the way, I'd like to think that generating "proper" CMaps would be
the "correct" solution, if there is a correct solution. I'm not sure
how difficult that would be; I'm still a little green in terms of
fonts and how PS/PDF interpret them.

Mehdi

On 19 November 2010 08:00, Jeremias Maerki <de...@jeremias-maerki.ch> wrote:
> On 18.11.2010 16:05:25 mehdi houshmand wrote:
>> All that being said, I could implement my initial proposal; obviously
>> it would have to be user-friendly and not conflict with the settings
>> already available. So maybe a parameter called "embedding" with two
>> possible values, "full" and "subset" (since "none" is already
>> covered by referenced fonts).
>
> +1
> embedding="auto|full|subset"
> (default: auto, where auto=type1:full/ttf:subset like we have for PDF).
>
> Also, we have to keep in mind that with the current code, embedding the
> full font (in /sfnts) will likely result in the same problem as before I
> switched to the GlyphDirectory approach for subset TTF. Might make sense
> to switch that, too.
>
>> As for the unique prefix for the font name, may I suggest moving it
>> from the font level (o.a.f.fonts.MultiByteFont) to the PS level (maybe
>> implemented somewhere like o.a.f.render.ps.PSFontUtils)? This would
>> allow a more intelligent implementation, since PS and PDF don't have
>> the same requirements in this case: PDF prefixes only need to be
>> unique within the document.
>
> +1
>
> And to answer Chris' question:
>
> On 17.11.2010 09:42:19 Chris Bowditch wrote:
>> So now where does
>> that leave us in terms of the configuration and/or implementation details?
>
> - I think we agree that the regex mechanism for referencing is useful.
>
> - FOP should automatically switch to single-byte encoding if referencing
> is activated for a font (like we already do for PDF output). That
> increases the probability that a pre-installed font is going to work
> (because it probably has a /CharStrings dict).
>
> - We can switch between single-byte and cid with the encoding attribute
> which is useful, but mostly an advanced option for people who know what
> they are doing.
>
> - For PDF we have full embedding for Type 1 and CID subset embedding for
> TTF by default. I think that's a good default especially since we don't
> support Type1 subsetting. That behaviour should be applied to PostScript,
> too, IMO.
>
> - Now, for some advanced use cases (PS post-processing), we need full
> TTF embedding for PS output. Mehdi's embedding="full" will do that trick,
> but the /sfnts boundary problem needs to be sorted out (possibly by also
> switching to /GlyphDirectory there).
>
> - Obviously, embedding="subset" with a Type 1 font currently needs to
> result in a "NYI" exception.
>
>
> Just a thought: we could think about an encoding="unicode" option which
> would use 16-bit Unicode values as character codes (instead of the
> direct glyph addressing by index with Identity-H). That would mean
> generating appropriate CMaps. I'm not sure if it would solve any problem
> other than make debugging/reading PS files easier. Of course, it would
> not allow Unicode characters above 0xFFFF. As I said: just a thought.
>
> Jeremias Maerki
>
>

Re: TrueType Font Embedding

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.
On 18.11.2010 16:05:25 mehdi houshmand wrote:
> All that being said, I could implement my initial proposal, obviously
> it would have to be user friendly and not conflict with the settings
> already available, so maybe a parameter called "embedding" with two
> possible values, "full" and "subset" (since the "none" is already
> covered by referenced fonts).

+1
embedding="auto|full|subset"
(default: auto, where auto=type1:full/ttf:subset like we have for PDF).
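
In code, resolving "auto" might boil down to something like this (all
names hypothetical, just to pin down the rule):

class EmbeddingDefaults {
    enum EmbeddingMode { AUTO, FULL, SUBSET }
    enum FontFormat { TYPE1, TRUETYPE }

    static EmbeddingMode resolve(EmbeddingMode configured,
            FontFormat format) {
        if (configured != EmbeddingMode.AUTO) {
            return configured; // an explicit setting always wins
        }
        // auto: Type 1 -> full, TTF -> subset (like we have for PDF)
        return format == FontFormat.TYPE1
            ? EmbeddingMode.FULL : EmbeddingMode.SUBSET;
    }
}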

Also, we have to keep in mind that with the current code, embedding the
full font (in /sfnts) will likely result in the same problem as before I
switched to the GlyphDirectory approach for subset TTF. Might make sense
to switch that, too.

> As for the unique prefix for the font name, may I suggest moving it
> from the font level (o.a.f.fonts.MultiByteFont) to the PS level (maybe
> implemented somewhere like o.a.f.render.ps.PSFontUtils)? This would
> allow a more intelligent implementation, since PS and PDF don't have
> the same requirements in this case: PDF prefixes only need to be
> unique within the document.

+1

And to answer Chris' question:

On 17.11.2010 09:42:19 Chris Bowditch wrote:
> So now where does 
> that leave us in terms of the configuration and/or implementation details?

- I think we agree that the regex mechanism for referencing is useful.

- FOP should automatically switch to single-byte encoding if referencing
is activated for a font (like we already do for PDF output). That
increases the probability that a pre-installed font is going to work
(because it probably has a /CharStrings dict).

- We can switch between single-byte and cid with the encoding attribute
which is useful, but mostly an advanced option for people who know what
they are doing.

- For PDF we have full embedding for Type 1 and CID subset embedding for
TTF by default. I think that's a good default especially since we don't
support Type1 subsetting. That behaviour should be applied to PostScript,
too, IMO.

- Now, for some advanced use cases (PS post-processing), we need full
TTF embedding for PS output. Mehdi's embedding="full" will do that trick,
but the /sfnts boundary problem needs to be sorted out (possibly by also
switching to /GlyphDirectory there).

- Obviously, embedding="subset" with a Type 1 font currently needs to
result in a "NYI" exception.


Just a thought: we could think about an encoding="unicode" option which
would use 16-bit Unicode values as character codes (instead of the
direct glyph addressing by index with Identity-H). That would mean
generating appropriate CMaps. I'm not sure if it would solve any problem
other than make debugging/reading PS files easier. Of course, it would
not allow Unicode characters above 0xFFFF. As I said: just a thought.

Jeremias Maerki


Re: TrueType Font Embedding

Posted by mehdi houshmand <me...@gmail.com>.
All that being said, I could implement my initial proposal; obviously
it would have to be user-friendly and not conflict with the settings
already available. So maybe a parameter called "embedding" with two
possible values, "full" and "subset" (since "none" is already
covered by referenced fonts).

As for the unique prefix for the font name, may I suggest moving it
from the font level (o.a.f.fonts.MultiByteFont) to the PS level (maybe
implemented somewhere like o.a.f.render.ps.PSFontUtils)? This would
allow a more intelligent implementation, since PS and PDF don't have
the same requirements in this case: PDF prefixes only need to be
unique within the document.

<snip/>

Re: TrueType Font Embedding

Posted by Chris Bowditch <bo...@hotmail.com>.
On 17/11/2010 07:55, Jeremias Maerki wrote:

Hi Jeremias,
> So my take:
>
> 1. if you want to print many smaller PS files, use font subsetting.
> 2. if you are dealing with larger print streams and do PS
> post-processing, don't use font subsetting.

+1. This is also my view of the business requirements. So now where does 
that leave us in terms of the configuration and/or implementation details?

> ...which makes the uniqueness discussion less of an issue.
>
>
> Jeremias Maerki
>
>
Thanks,

Chris


Re: TrueType Font Embedding

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.
On 16.11.2010 16:12:53 Vincent Hennebert wrote:
> On 09/11/10 14:43, Jeremias Maerki wrote:
> > On 09.11.2010 14:48:30 Vincent Hennebert wrote:
> >> There may be an interest in fully embedding a font for PostScript
> >> output. IIUC there may be a print manager that pre-processes PostScript
> >> files, extracts embedded fonts to store them somewhere and re-use them
> >> whenever needed. It can then strip the font off subsequent files and
> >> substantially lighten them, speeding up the printing process.
> > 
> > It makes the files smaller, but that will be the only thing that
> > improves printing performance. The PS interpreter still has to parse and
> > process the actual resource. It also needs to be noted that extracting
> > subset fonts doesn't make sense. I've already added the unique-ification
> > prefix to the TTF font names (like in PDF) to avoid problems like that.
> 
> That doesn’t solve the problem. AFAIU, the same prefix will always be
> used for the same font. If different FO files are being rendered into
> PostScript using the same config file, then the PS outputs will all
> contain the same font name. That may cause serious issues if
> a post-processing tool concatenates all the files but uses the font
> instance from the first file only. I think that violates the DSC
> specification.

Right, I've blindly taken over the approach that was done for PDF where
this problem does not apply because you can have two subset fonts with
the same name. I assumed the prefix was a pseudo-random number but I
failed to verify that.

> The only way to ensure uniqueness is to compute some hash sum based on
> the glyphs used. That may be costly. The alternative is to put the font
> stream directly in the page stream and not advertise it as a DSC
> resource.

Yes, the hash is too costly. But a good hash would also have to be
longer, which also increases the PS file size due to the potentially
numerous findfont calls.

Putting the font in the page stream means you have to subset the font
for every page because the DSC spec is also broken if the font is
incrementally built inside the various pages. Restarting the font for
every page makes the PS files very big.

I think the only thing you can do is to get near-uniqueness by using
pseudo-random values for the prefix. That dials the risk of a collision
down (but doesn't completely rule it out). Anyway, applying lots of DSC
postprocessing to various PS files (split, concat, imposition etc.)
screams for use of non-subset fonts. Even then, it makes sense to try not
to mix PS files from too many sources (FOP or otherwise) for one print
stream. It only produces potential problems.
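
For illustration, a pseudo-random tag in the style of the six-letter
subset prefixes seen earlier in this thread (e.g. "EAAACC+") could be
as simple as this hypothetical helper (not FOP code):

import java.util.Random;

class SubsetPrefix {
    /** Returns a tag like "XHQZTB+": six pseudo-random letters + '+'. */
    static String next(Random rnd) {
        StringBuilder sb = new StringBuilder(7);
        for (int i = 0; i < 6; i++) {
            sb.append((char) ('A' + rnd.nextInt(26)));
        }
        return sb.append('+').toString();
    }
}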

So my take:

1. if you want to print many smaller PS files, use font subsetting.
2. if you are dealing with larger print streams and do PS
post-processing, don't use font subsetting.

...which makes the uniqueness discussion less of an issue.


Jeremias Maerki


Re: TrueType Font Embedding

Posted by Vincent Hennebert <vh...@gmail.com>.
On 09/11/10 14:43, Jeremias Maerki wrote:
> On 09.11.2010 14:48:30 Vincent Hennebert wrote:
>> There may be an interest in fully embedding a font for PostScript
>> output. IIUC there may be a print manager that pre-processes PostScript
>> files, extracts embedded fonts to store them somewhere and re-use them
>> whenever needed. It can then strip the font off subsequent files and
>> substantially lighten them, speeding up the printing process.
> 
> It makes the files smaller, but that will be the only thing that
> improves printing performance. The PS interpreter still has to parse and
> process the actual resource. It also needs to be noted that extracting
> subset fonts doesn't make sense. I've already added the unique-ification
> prefix to the TTF font names (like in PDF) to avoid problems like that.

That doesn’t solve the problem. AFAIU, the same prefix will always be
used for the same font. If different FO files are being rendered into
PostScript using the same config file, then the PS outputs will all
contain the same font name. That may cause serious issues if
a post-processing tool concatenates all the files but uses the font
instance from the first file only. I think that violates the DSC
specification.


The only way to ensure uniqueness is to compute some hash sum based on
the glyphs used. That may be costly. The alternative is to put the font
stream directly in the page stream and not advertise it as a DSC
resource.
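
For what it's worth, the hash variant might look like this sketch
(assuming the bytes of the subset font data are what gets hashed):

import java.security.MessageDigest;

class GlyphHashPrefix {
    static String prefixFor(byte[] subsetFontData) throws Exception {
        byte[] digest =
            MessageDigest.getInstance("MD5").digest(subsetFontData);
        StringBuilder sb = new StringBuilder(7);
        for (int i = 0; i < 6; i++) {
            // Map each digest byte onto A-Z so the PostScript font
            // name stays plain ASCII
            sb.append((char) ('A' + ((digest[i] & 0xFF) % 26)));
        }
        return sb.append('+').toString();
    }
}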

<snip/>

Vincent

Re: TrueType Font Embedding

Posted by Glenn Adams <gl...@skynav.com>.
I haven't been following this thread very closely, but subset fonts are and
will be important for complex script support, since some Arabic and related
fonts have thousands of glyphs, of which only a small subset is typically
used on a given page or in a document.

One further issue with complex script support is that not all glyphs in
certain fonts have an encoding entry; e.g., in an Arabic TTF font, there may
be glyphs whose indices do not have a forward mapping in the CMAP. In
some systems, this is addressed by using a lower level OS interface when
drawing, one that uses glyph indices directly (rather than character codes
to be mapped through an encoding vector); however, in the FOP IF interfaces,
only character codes are used. As a consequence, the complex script support
gets around this by dynamically creating new entries in the CMAP (or
equivalent encoding vector), using Unicode private use characters, to map to
glyph indices that are emitted by complex script processing, but which do
not normally have a CMAP entry.
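
A minimal sketch of that idea (hypothetical class; the actual
complex-script code is more involved):

  import java.util.HashMap;
  import java.util.Map;

  /** Hands out private-use code points for glyphs without a cmap entry. */
  public final class PrivateUseMapper {

      private static final int PUA_START = 0xE000;  // BMP Private Use Area
      private static final int PUA_END = 0xF8FF;

      private final Map<Integer, Integer> glyphToChar =
              new HashMap<Integer, Integer>();
      private int nextPua = PUA_START;

      /** Returns the character code to emit for the given glyph index. */
      public int charForGlyph(int glyphIndex) {
          Integer existing = glyphToChar.get(glyphIndex);
          if (existing != null) {
              return existing;
          }
          if (nextPua > PUA_END) {
              throw new IllegalStateException("Private Use Area exhausted");
          }
          int pua = nextPua++;
          glyphToChar.put(glyphIndex, pua);
          // the embedded font's cmap must also gain the (pua -> glyphIndex) entry
          return pua;
      }
  }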

In this scenario, it is essential to create an embedded font that includes
the referenced glyphs as well as the dynamically generated CMAP entries.
This is presently handled in the PDF renderer with no additional effort;
however, I have not yet addressed this in the PS or other renderers, so I
may very well have to modify the current mechanism if it is not practical to
use this approach with other renderers.

Regards,
Glenn

On Thu, Nov 11, 2010 at 9:50 AM, Chris Bowditch
<bo...@hotmail.com>wrote:

> Hi All,
>
>
> On 09/11/2010 14:43, Jeremias Maerki wrote:
>
>> On 09.11.2010 14:48:30 Vincent Hennebert wrote:
>>
>>> There may be an interest in fully embedding a font for PostScript
>>> output. IIUC there may be a print manager that pre-processes PostScript
>>> files, extracts embedded fonts to store them somewhere and re-use them
>>> whenever needed. It can then strip the font off subsequent files and
>>> substantially lighten them, speeding up the printing process.
>>>
>> It makes the files smaller, but that will be the only thing that
>> improves printing performance. The PS interpreter still has to parse and
>> process the actual resource. It also needs to be noted that extracting
>> subset fonts doesn't make sense. I've already added the unique-ification
>> prefix to the TTF font names (like in PDF) to avoid problems like that.
>>
>
> Yes, I agree extracting subset fonts doesn't make sense, but extracting a
> fully embedded font does have plenty of business applications. That is
> precisely why the introduction of a setting is required here. In some cases
> it is important to bring the file size down; enter the subsetting feature.
> Subsetting is particularly useful when creating print streams with a
> relatively small number of pages, i.e. 100 or fewer, and you have large
> Unicode fonts to support Eastern character sets.
>
> In other situations, people using FOP want to be able to create large print
> streams to send to print bureaus. Print bureaus tend to use software to
> parse print streams rather than sending them directly to a printer. Those
> processes will often need to be able to process the fonts, which they can
> only do if the full font is embedded rather than a subset. As you already
> noted above, extracting a subset is useless.
>
>
>  What’s the purpose of the ‘encoding’ parameter? It looks to me like
>>> users don’t care about what encoding is used in the PDF or PostScript
>>> file. All they want to have is properly printed documents that use their
>>> own fonts. I think that parameter should be removed in favour of Mehdi’s
>>> proposal, which IMO makes much more sense from a user perspective.
>>>
>> I don't know if it's necessary. That's why I wrote that maybe additional
>> research may be necessary. If we don't have it, we may have to build up
>> a /CIDMap that covers Unicode, because otherwise there is no information
>> in the font about which character indices correspond to which glyphs as
>> long as we use /Registry (Adobe) /Ordering (Identity). Or: you configure a CID
>> map (encoding) that is tailored to the kind of document you want to
>> produce. The Unicode /CIDMap could result in rather big /CIDMap arrays
>> (65535 * 4 = 256KB) with lots of pointers to ".notdef".
>>
>
> From a user's perspective, the encoding parameter is too technical and most
> users will not understand its purpose. If possible I would like to reach a
> consensus on what we should do and then remove the parameter to help cut down
> the complexity of configuring fonts. As you noted, there are now a
> bewildering number of options.
>
>
>  Before continuing with this there should be a broad understanding how
>> non-subset TrueType fonts shall be handled in PostScript (and PDF where
>> you can make the same case). Otherwise, a change like Mehdi proposed
>> doesn't improve anything.
>>
>
> Are you asking what the business use case is for fully embedded fonts as
> opposed to subset fonts? The ability to post-process is the most important
> use case. If the fonts are subset it becomes difficult to merge PostScript
> files together or extract the font. Both are fairly common at print bureaus.
>
>  Granted, there would be some redundancy with the referenced-fonts
>>> element. But is the additional flexibility of regexp really useful in
>>> the first place? I’m not too sure. Maybe that could be removed too.
>>>
>> I don't want that removed. I've been grateful for its existence more
>> than once. With the regexp I can make sure that, for example, all
>> variants of the "Frutiger" font are not embedded: Frutiger 45 Light,
>> Frutiger 55 Roman etc. etc.
>>
>
> I concur the regexp stuff in the font referencing is useful. We can use it
> to change the way whole font families are referenced without having to list
> every font.
>
>  Anyway, I don't like constantly changing the way fonts are configured.
>> There's enough confusion with the way it's currently done already. I
>> won't veto a change like that but I'm not happy with it.
>>
>
> I understand what you are saying: there are a lot of options, but then the
> requirements around fonts are complex, so there is no escaping a complex
> configuration file.
>
> Thanks,
>
> Chris
>
>
>  Vincent
>>>
>>>
>>> On 09/11/10 12:45, Jeremias Maerki wrote:
>>>
>>>> Hi Mehdi,
>>>> I'm against that since we already have mechanisms to control some of
>>>> these traits and this would overlap with them. For example, we have the
>>>> referenced-fonts element
>>>> (http://xmlgraphics.apache.org/fop/trunk/fonts.html#embedding)
>>>> which controls whether we embed or not. And we have the encoding-mode
>>>> attribute on the font element to control if single-byte or cid mode
>>>> should be used. Granted, that's not exactly what you're after, but I
>>>> believe this already covers 95% of the use cases if not more.
>>>>
>>>> The only thing you can't currently do is embed a full font in CID mode
>>>> (or reference it). The problem here is the character map that should be
>>>> used when in CID mode. I think that would require some research first so
>>>> we know how best to handle this. For example, referencing only makes
>>>> sense if a TrueType font can be installed directly on the printer. But
>>>> then, the question is in which mode the characters can be addressed.
>>>> Single-byte (like we currently fall back to) is probably not a problem
>>>> unless you need to print Asian documents. Please note that we also don't
>>>> support full TTF embedding/referencing in CID mode in PDF documents. So
>>>> I'm not sure if we really need that at the moment.
>>>>
>>>> If we do, I believe it would generally suffice to extend encoding-mode
>>>> from (auto|single-byte|cid) to (auto|single-byte|cid|cid-full). We may
>>>> need a "cmap" parameter then to change the default CMap (currently
>>>> "Identity-H" like in PDF) since our subsetting code uses custom
>>>> mappings,
>>>> not Unicode or any other encoding scheme (like "90ms-RKSJ-H").
>>>>
>>>> On 09.11.2010 12:08:36 mehdi houshmand wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm working on making TTF subset embedding configurable such that a
>>>>> user can opt for either full font embedding, subset embedding or just
>>>>> referencing, this would be extending the work Jeremias submitted. I
>>>>> was considering adding a parameter to the font configuration file
>>>>> called "embedding" with 3 possible values "none", "subset" and "full".
>>>>> This would allow the user to configure the embedding mode on a font by
>>>>> font basis. What do people think about this proposal?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Mehdi
>>>>>
>>>>
>>>>
>>>>
>>>> Jeremias Maerki
>>>>
>>>>
>>
>>
>> Jeremias Maerki
>>
>>
>>
>>
>

Re: TrueType Font Embedding

Posted by Chris Bowditch <bo...@hotmail.com>.
Hi All,

On 09/11/2010 14:43, Jeremias Maerki wrote:
> On 09.11.2010 14:48:30 Vincent Hennebert wrote:
>> There may be an interest in fully embedding a font for PostScript
>> output. IIUC there may be a print manager that pre-processes PostScript
>> files, extracts embedded fonts to store them somewhere and re-use them
>> whenever needed. It can then strip the font off subsequent files and
>> substantially lighten them, speeding up the printing process.
> It makes the files smaller, but that will be the only thing that
> improves printing performance. The PS interpreter still has to parse and
> process the actual resource. It also needs to be noted that extracting
> subset fonts doesn't make sense. I've already added the unique-ification
> prefix to the TTF font names (like in PDF) to avoid problems like that.

Yes, I agree extracting subset fonts doesn't make sense, but extracting a
fully embedded font does have plenty of business applications. That is
precisely why the introduction of a setting is required here. In some
cases it is important to bring the file size down; enter the subsetting
feature. Subsetting is particularly useful when creating print streams
with a relatively small number of pages, i.e. 100 or fewer, and you have
large Unicode fonts to support Eastern character sets.

In other situations, people using FOP want to be able to create large
print streams to send to print bureaus. Print bureaus tend to use
software to parse print streams rather than sending them directly to a
printer. Those processes will often need to be able to process the
fonts, which they can only do if the full font is embedded rather than a
subset. As you already noted above, extracting a subset is useless.
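
Sketched in code, the setting Mehdi proposed (embedding = none | subset
| full) might be modelled like this (a hypothetical sketch, not
committed code):

  /** The three proposed values of the "embedding" font parameter. */
  enum EmbeddingMode {
      NONE,    // reference only, nothing embedded
      SUBSET,  // embed just the glyphs actually used
      FULL;    // embed the complete font program

      static EmbeddingMode fromAttribute(String value) {
          // defaulting to SUBSET is an assumption, not part of the proposal
          return value == null ? SUBSET : valueOf(value.trim().toUpperCase());
      }
  }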

>> What’s the purpose of the ‘encoding’ parameter? It looks to me like
>> users don’t care about what encoding is used in the PDF or PostScript
>> file. All they want to have is properly printed documents that use their
>> own fonts. I think that parameter should be removed in favour of Mehdi’s
>> proposal, which IMO makes much more sense from a user perspective.
> I don't know if it's necessary. That's why I wrote that maybe additional
> research may be necessary. If we don't have it, we may have to build up
> a /CIDMap that covers Unicode, because otherwise there is no information
> in the font about which character indices correspond to which glyphs as
> long as we use /Registry (Adobe) /Ordering (Identity). Or: you configure a CID
> map (encoding) that is tailored to the kind of document you want to
> produce. The Unicode /CIDMap could result in rather big /CIDMap arrays
> (65535 * 4 = 256KB) with lots of pointers to ".notdef".

From a user's perspective, the encoding parameter is too technical and
most users will not understand its purpose. If possible I would like to
reach a consensus on what we should do and then remove the parameter to
help cut down the complexity of configuring fonts. As you noted, there
are now a bewildering number of options.

> Before continuing with this there should be a broad understanding how
> non-subset TrueType fonts shall be handled in PostScript (and PDF where
> you can make the same case). Otherwise, a change like Mehdi proposed
> doesn't improve anything.

Are you asking what the business use case is for fully embedded fonts as
opposed to subset fonts? The ability to post-process is the most
important use case. If the fonts are subset it becomes difficult to merge
PostScript files together or extract the font. Both are fairly common at
print bureaus.
>> Granted, there would be some redundancy with the referenced-fonts
>> element. But is the additional flexibility of regexp really useful in
>> the first place? I’m not too sure. Maybe that could be removed too.
> I don't want that removed. I've been grateful for its existence more
> than once. With the regexp I can make sure that, for example, all
> variants of the "Frutiger" font are not embedded: Frutiger 45 Light,
> Frutiger 55 Roman etc. etc.

I concur the regexp stuff in the font referencing is useful. We can use 
it to change the way whole font families are referenced without having 
to list every font.
> Anyway, I don't like constantly changing the way fonts are configured.
> There's enough confusion with the way it's currently done already. I
> won't veto a change like that but I'm not happy with it.

I understand what you are saying: there are a lot of options, but then
the requirements around fonts are complex, so there is no escaping a
complex configuration file.

Thanks,

Chris

>> Vincent
>>
>>
>> On 09/11/10 12:45, Jeremias Maerki wrote:
>>> Hi Mehdi,
>>> I'm against that since we already have mechanisms to control some of
>>> these traits and this would overlap with them. For example, we have the
>>> referenced-fonts element
>>> (http://xmlgraphics.apache.org/fop/trunk/fonts.html#embedding)
>>> which controls whether we embed or not. And we have the encoding-mode
>>> attribute on the font element to control if single-byte or cid mode
>>> should be used. Granted, that's not exactly what you're after, but I
>>> believe this already covers 95% of the use cases if not more.
>>>
>>> The only thing you can't currently do is embed a full font in CID mode
>>> (or reference it). The problem here is the character map that should be
>>> used when in CID mode. I think that would require some research first so
>>> we know how best to handle this. For example, referencing only makes
>>> sense if a TrueType font can be installed directly on the printer. But
>>> then, the question is in which mode the characters can be addressed.
>>> Single-byte (like we currently fall back to) is probably not a problem
>>> unless you need to print Asian documents. Please note that we also don't
>>> support full TTF embedding/referencing in CID mode in PDF documents. So
>>> I'm not sure if we really need that at the moment.
>>>
>>> If we do, I believe it would generally suffice to extend encoding-mode
>>> from (auto|single-byte|cid) to (auto|single-byte|cid|cid-full). We may
>>> need a "cmap" parameter then to change the default CMap (currently
>>> "Identity-H" like in PDF) since our subsetting code uses custom mappings,
>>> not Unicode or any other encoding scheme (like "90ms-RKSJ-H").
>>>
>>> On 09.11.2010 12:08:36 mehdi houshmand wrote:
>>>> Hi,
>>>>
>>>> I'm working on making TTF subset embedding configurable such that a
>>>> user can opt for either full font embedding, subset embedding or just
>>>> referencing, this would be extending the work Jeremias submitted. I
>>>> was considering adding a parameter to the font configuration file
>>>> called "embedding" with 3 possible values "none", "subset" and "full".
>>>> This would allow the user to configure the embedding mode on a font by
>>>> font basis. What do people think about this proposal?
>>>>
>>>> Thanks
>>>>
>>>> Mehdi
>>>
>>>
>>>
>>> Jeremias Maerki
>>>
>
>
>
> Jeremias Maerki
>
>
>


Re: TrueType Font Embedding

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.
On 09.11.2010 14:48:30 Vincent Hennebert wrote:
> There may be an interest in fully embedding a font for PostScript
> output. IIUC there may be a print manager that pre-processes PostScript
> files, extracts embedded fonts to store them somewhere and re-use them
> whenever needed. It can then strip the font off subsequent files and
> substantially lighten them, speeding up the printing process.

It makes the files smaller, but that will be the only thing that
improves printing performance. The PS interpreter still has to parse and
process the actual resource. It also needs to be noted that extracting
subset fonts doesn't make sense. I've already added the unique-ification
prefix to the TTF font names (like in PDF) to avoid problems like that.

> What’s the purpose of the ‘encoding’ parameter? It looks to me like
> users don’t care about what encoding is used in the PDF or PostScript
> file. All they want to have is properly printed documents that use their
> own fonts. I think that parameter should be removed in favour of Mehdi’s
> proposal, which IMO makes much more sense from a user perspective.

I don't know if it's necessary. That's why I wrote that maybe additional
research may be necessary. If we don't have it, we may have to build up
a /CIDMap that covers Unicode, because otherwise there is no information
in the font about which character indices correspond to which glyphs as long as
we use /Registry (Adobe) /Ordering (Identity). Or: you configure a CID
map (encoding) that is tailored to the kind of document you want to
produce. The Unicode /CIDMap could result in rather big /CIDMap arrays
(65535 * 4 = 256KB) with lots of pointers to ".notdef".
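
As a rough illustration of where that size comes from (a sketch; the
cmap argument is a stand-in, not FOP's actual font API):

  import java.util.Map;

  public final class UnicodeCidMap {

      /** Builds a /CIDMap covering the BMP. Everything absent from the
       *  font's cmap stays at glyph 0 (.notdef); 65536 entries at four
       *  bytes each is where the ~256KB figure comes from. */
      public static int[] build(Map<Integer, Integer> cmap) {
          int[] cidMap = new int[0x10000];  // all zero = .notdef
          for (Map.Entry<Integer, Integer> e : cmap.entrySet()) {
              int codePoint = e.getKey();
              if (codePoint < 0x10000) {
                  cidMap[codePoint] = e.getValue();  // glyph index
              }
          }
          return cidMap;
      }
  }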

Before continuing with this there should be a broad understanding how
non-subset TrueType fonts shall be handled in PostScript (and PDF where
you can make the same case). Otherwise, a change like Mehdi proposed
doesn't improve anything.

> Granted, there would be some redundancy with the referenced-fonts
> element. But is the additional flexibility of regexp really useful in
> the first place? I’m not too sure. Maybe that could be removed too.

I don't want that removed. I've been grateful for its existence more
than once. With the regexp I can make sure that, for example, all
variants of the "Frutiger" font are not embedded: Frutiger 45 Light,
Frutiger 55 Roman etc. etc.
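
The matching itself is ordinary regular-expression behaviour, roughly
(illustration only):

  import java.util.regex.Pattern;

  public class FontMatchDemo {
      public static void main(String[] args) {
          Pattern frutiger = Pattern.compile("Frutiger.*");
          // one pattern covers every variant of the family
          System.out.println(frutiger.matcher("Frutiger 45 Light").matches()); // true
          System.out.println(frutiger.matcher("Frutiger 55 Roman").matches()); // true
          System.out.println(frutiger.matcher("Helvetica").matches());         // false
      }
  }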

Anyway, I don't like constantly changing the way fonts are configured.
There's enough confusion with the way it's currently done already. I
won't veto a change like that but I'm not happy with it.

> Vincent
> 
> 
> On 09/11/10 12:45, Jeremias Maerki wrote:
> > Hi Mehdi,
> > I'm against that since we already have mechanisms to control some of
> > these traits and this would overlap with them. For example, we have the
> > referenced-fonts element
> > (http://xmlgraphics.apache.org/fop/trunk/fonts.html#embedding)
> > which controls whether we embed or not. And we have the encoding-mode
> > attribute on the font element to control if single-byte or cid mode
> > should be used. Granted, that's not exactly what you're after, but I
> > believe this already covers 95% of the use cases if not more.
> > 
> > The only thing you can't currently do is embed a full font in CID mode
> > (or reference it). The problem here is the character map that should be
> > used when in CID mode. I think that would require some research first so
> > we know how best to handle this. For example, referencing only makes
> > sense if a TrueType font can be installed directly on the printer. But
> > then, the question is in which mode the characters can be addressed.
> > Single-byte (like we currently fall back to) is probably not a problem
> > unless you need to print Asian documents. Please note that we also don't
> > support full TTF embedding/referencing in CID mode in PDF documents. So
> > I'm not sure if we really need that at the moment.
> > 
> > If we do, I believe it would generally suffice to extend encoding-mode
> > from (auto|single-byte|cid) to (auto|single-byte|cid|cid-full). We may
> > need a "cmap" parameter then to change the default CMap (currently
> > "Identity-H" like in PDF) since our subsetting code uses custom mappings,
> > not Unicode or any other encoding scheme (like "90ms-RKSJ-H").
> > 
> > On 09.11.2010 12:08:36 mehdi houshmand wrote:
> >> Hi,
> >>
> >> I'm working on making TTF subset embedding configurable such that a
> >> user can opt for either full font embedding, subset embedding or just
> >> referencing, this would be extending the work Jeremias submitted. I
> >> was considering adding a parameter to the font configuration file
> >> called "embedding" with 3 possible values "none", "subset" and "full".
> >> This would allow the user to configure the embedding mode on a font by
> >> font basis. What do people think about this proposal?
> >>
> >> Thanks
> >>
> >> Mehdi
> > 
> > 
> > 
> > 
> > Jeremias Maerki
> > 




Jeremias Maerki


Re: TrueType Font Embedding

Posted by Vincent Hennebert <vh...@gmail.com>.
There may be an interest in fully embedding a font for PostScript
output. IIUC there may be a print manager that pre-processes PostScript
files, extracts embedded fonts to store them somewhere and re-use them
whenever needed. It can then strip the font off subsequent files and
substantially lighten them, speeding up the printing process.

What’s the purpose of the ‘encoding’ parameter? It looks to me like
users don’t care about what encoding is used in the PDF or PostScript
file. All they want to have is properly printed documents that use their
own fonts. I think that parameter should be removed in favour of Mehdi’s
proposal, which IMO makes much more sense from a user perspective.

Granted, there would be some redundancy with the referenced-fonts
element. But is the additional flexibility of regexp really useful in
the first place? I’m not too sure. Maybe that could be removed too.

Vincent


On 09/11/10 12:45, Jeremias Maerki wrote:
> Hi Mehdi,
> I'm against that since we already have mechanisms to control some of
> these traits and this would overlap with them. For example, we have the
> referenced-fonts element
> (http://xmlgraphics.apache.org/fop/trunk/fonts.html#embedding)
> which controls whether we embed or not. And we have the encoding-mode
> attribute on the font element to control if single-byte or cid mode
> should be used. Granted, that's not exactly what you're after, but I
> believe this already covers 95% of the use cases if not more.
> 
> The only thing you can't currently do is embed a full font in CID mode
> (or reference it). The problem here is the character map that should be
> used when in CID mode. I think that would require some research first so
> we know how best to handle this. For example, referencing only makes
> sense if a TrueType font can be installed directly on the printer. But
> then, the question is in which mode the characters can be addressed.
> Single-byte (like we currently fall back to) is probably not a problem
> unless you need to print Asian documents. Please note that we also don't
> support full TTF embedding/referencing in CID mode in PDF documents. So
> I'm not sure if we really need that at the moment.
> 
> If we do, I believe it would generally suffice to extend encoding-mode
> from (auto|single-byte|cid) to (auto|single-byte|cid|cid-full). We may
> need a "cmap" parameter then to change the default CMap (currently
> "Identity-H" like in PDF) since our subsetting code uses custom mappings,
> not Unicode or any other encoding scheme (like "90ms-RKSJ-H").
> 
> On 09.11.2010 12:08:36 mehdi houshmand wrote:
>> Hi,
>>
>> I'm working on making TTF subset embedding configurable such that a
>> user can opt for either full font embedding, subset embedding or just
>> referencing, this would be extending the work Jeremias submitted. I
>> was considering adding a parameter to the font configuration file
>> called "embedding" with 3 possible values "none", "subset" and "full".
>> This would allow the user to configure the embedding mode on a font by
>> font basis. What do people think about this proposal?
>>
>> Thanks
>>
>> Mehdi
> 
> 
> 
> 
> Jeremias Maerki
> 

Re: TrueType Font Embedding

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.
Hi Mehdi,
I'm against that since we already have mechanisms to control some of
these traits and this would overlap with them. For example, we have the
referenced-fonts element
(http://xmlgraphics.apache.org/fop/trunk/fonts.html#embedding)
which controls whether we embed or not. And we have the encoding-mode
attribute on the font element to control if single-byte or cid mode
should be used. Granted, that's not exactly what you're after, but I
believe this already covers 95% of the use cases if not more.

The only thing you can't currently do is embed a full font in CID mode
(or reference it). The problem here is the character map that should be
used when in CID mode. I think that would require some research first so
we know how best to handle this. For example, referencing only makes
sense if a TrueType font can be installed directly on the printer. But
then, the question is in which mode the characters can be addressed.
Single-byte (like we currently fall back to) is probably not a problem
unless you need to print Asian documents. Please note that we also don't
support full TTF embedding/referencing in CID mode in PDF documents. So
I'm not sure if we really need that at the moment.

If we do, I believe it would generally suffice to extend encoding-mode
from (auto|single-byte|cid) to (auto|single-byte|cid|cid-full). We may
need a "cmap" parameter then to change the default CMap (currently
"Identity-H" like in PDF) since our subsetting code uses custom mappings,
not Unicode or any other encoding scheme (like "90ms-RKSJ-H").
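
Sketched in code, the extended attribute might be modelled like this
(hypothetical; not how FOP currently parses the attribute):

  /** The font element's encoding-mode values, with the proposed addition. */
  enum EncodingMode {
      AUTO, SINGLE_BYTE, CID, CID_FULL;  // cid-full would be the new mode

      static EncodingMode fromAttribute(String value) {
          if (value == null) {
              return AUTO;  // assumed default
          }
          return valueOf(value.trim().toUpperCase().replace('-', '_'));
      }
  }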

On 09.11.2010 12:08:36 mehdi houshmand wrote:
> Hi,
> 
> I'm working on making TTF subset embedding configurable such that a
> user can opt for either full font embedding, subset embedding or just
> referencing, this would be extending the work Jeremias submitted. I
> was considering adding a parameter to the font configuration file
> called "embedding" with 3 possible values "none", "subset" and "full".
> This would allow the user to configure the embedding mode on a font by
> font basis. What do people think about this proposal?
> 
> Thanks
> 
> Mehdi




Jeremias Maerki