Posted to fop-dev@xmlgraphics.apache.org by m....@gmx.de on 2002/11/21 14:33:12 UTC

Background information about using unicode characters

Hi,

I'm using FOP 0.20.4 with JDK 1.4 and would like some background
information about how FOP embeds Unicode characters from a TrueType font
that is not one of the 14 PDF standard fonts. I used "Arial Unicode Ms.ttf"
to embed some Asian characters into a PDF document. Based on the information
on the FOP homepage, I assume that only the needed character, its encoding,
and the glyph are embedded into the PDF file to enable Acrobat Reader to
render the character correctly. Is that correct, or is more information
embedded, such as the whole code page the character belongs to? What exactly
is embedded into the PDF file? Can I rely on the assumption that a PDF file
with the embedded font will be displayed correctly regardless of the
operating system and of the fonts available on whichever computer the PDF
file is viewed on?

Is it possible to create a PDF file with FOP that just describes a
character by its Unicode representation, so that the corresponding glyph
is loaded from one of the fonts available on the computer where the PDF
file is viewed? That would be an interesting and valuable feature as an
alternative to embedding the font file.

Thanks a lot for any information to clarify understanding!

Markus





---------------------------------------------------------------------
To unsubscribe, e-mail: fop-dev-unsubscribe@xml.apache.org
For additional commands, email: fop-dev-help@xml.apache.org


Re: Background information about using unicode characters

Posted by Oleg Tkachenko <ol...@multiconn.com>.
Tore Engvig wrote:

> This has the drawback that searching and cut-and-paste from the PDF
> document won't work (e.g. see bug #5335). To fix this, a ToUnicode mapping
> should be created so that the characters in the PDF document are Unicode
> (UTF-16 encoded) and a mapping between Unicode and glyph index is embedded
> with the font. This wasn't implemented to begin with because it wasn't
> documented very well in the PDF spec. Experimenting with different
> settings was hard (development on a slow computer meant a very long and
> frustrating round trip), and the error messages from Acrobat Reader
> weren't very helpful :)
> Since then Adobe has published a tutorial and how-to for creating
> ToUnicode maps:
> http://partners.adobe.com/asn/developer/pdfs/tn/5411.ToUnicode.pdf
> So it should be easier to create now.

Good answer, Tore! I saved a link to your post in bug #5335, if you don't mind.

-- 
Oleg Tkachenko
eXperanto team
Multiconn Technologies, Israel




RE: Background information about using unicode characters

Posted by Tore Engvig <to...@tracetracker.com>.

m.schaeffler@gmx.de wrote:

> Hi,
>
> I'm using FOP 0.20.4 with JDK 1.4 and would like some background
> information about how FOP embeds Unicode characters from a TrueType font
> that is not one of the 14 PDF standard fonts. I used "Arial Unicode
> Ms.ttf" to embed some Asian characters into a PDF document. Based on the
> information on the FOP homepage, I assume that only the needed character,
> its encoding, and the glyph are embedded into the PDF file to enable
> Acrobat Reader to render the character correctly. Is that correct, or is
> more information embedded, such as the whole code page the character
> belongs to? What exactly is embedded into the PDF file?

For TrueType fonts, only the glyphs used are embedded. When you embed a
TrueType font, a new font containing only the glyphs used is created from
the original font and embedded in the PDF document. The embedded font
contains only the minimum needed to be embedded in a PDF document (i.e. no
code page information). The PDF document will then contain indexes to the
glyphs in the font instead of characters using an encoding.
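
As a rough illustration of the subsetting idea described above, here is a
minimal Python sketch. It is not FOP's code: the function name build_subset,
the cmap dictionary and the glyph IDs are all made up for the example, and
renumbering the used glyphs from 1 upward is an assumption about how such a
subset could be laid out.

```python
def build_subset(text, cmap):
    """Return (subset_glyphs, encoded) for the glyphs actually used in text.

    cmap is a hypothetical dict mapping a character to its glyph ID in the
    original font. Glyph 0 (.notdef) is always kept as subset index 0.
    """
    old_to_new = {0: 0}   # original glyph ID -> index in the subset font
    encoded = []          # glyph indexes the page content would reference
    for ch in text:
        old_gid = cmap.get(ch, 0)  # unmapped characters fall back to .notdef
        new_gid = old_to_new.setdefault(old_gid, len(old_to_new))
        encoded.append(new_gid)
    # Position in this list = new index, value = original glyph ID
    subset_glyphs = sorted(old_to_new, key=old_to_new.get)
    return subset_glyphs, encoded


# Made-up glyph IDs, purely for illustration
cmap = {"日": 4021, "本": 5310, "語": 9876}
subset_glyphs, encoded = build_subset("日本語日本", cmap)
print(subset_glyphs)  # [0, 4021, 5310, 9876]
print(encoded)        # [1, 2, 3, 1, 2]
```

Only four glyphs end up in the subset, and the text is stored as subset
glyph indexes rather than as character codes.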

This has the drawback that searching and cut-and-paste from the PDF document
won't work (e.g. see bug #5335). To fix this, a ToUnicode mapping should be
created so that the characters in the PDF document are Unicode (UTF-16
encoded) and a mapping between Unicode and glyph index is embedded with the
font. This wasn't implemented to begin with because it wasn't documented
very well in the PDF spec. Experimenting with different settings was hard
(development on a slow computer meant a very long and frustrating round
trip), and the error messages from Acrobat Reader weren't very helpful :)
Since then Adobe has published a tutorial and how-to for creating ToUnicode
maps:
http://partners.adobe.com/asn/developer/pdfs/tn/5411.ToUnicode.pdf
So it should be easier to create now.
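
To make the ToUnicode idea more concrete, here is a minimal Python sketch
that writes such a CMap for a handful of glyph indexes. It follows the
general layout of ToUnicode CMaps as described in the PDF spec, but it is
only an illustration: the function name to_unicode_cmap and the sample glyph
indexes are assumptions, and this is not what FOP does today.

```python
def to_unicode_cmap(glyph_to_char):
    """Build the text of a ToUnicode CMap from {glyph index: character}."""
    entries = []
    for gid, ch in sorted(glyph_to_char.items()):
        # Unicode values are written as big-endian UTF-16 in hex
        utf16 = ch.encode("utf-16-be").hex().upper()
        entries.append(f"<{gid:04X}> <{utf16}>")

    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<0000> <FFFF>",
        "endcodespacerange",
    ]
    # A single bfchar section may hold at most 100 entries
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        lines.extend(chunk)
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines)


# Hypothetical subset: glyph indexes 1..3 stand for these three characters
cmap_text = to_unicode_cmap({1: "日", 2: "本", 3: "語"})
print(cmap_text)
```

The resulting text would be stored as a stream and referenced from the font
dictionary's ToUnicode entry, which lets a viewer map glyph indexes back to
text for searching and copying.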

This behavior can be avoided if the -ansi option is included when generating
metrics with TTFReader. If you do, the whole font will be embedded in the
PDF document and characters will be ANSI encoded (so you lose the ability
to use non-ANSI characters), or more correctly: WinAnsi encoded as
specified in the PDF spec.
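
A quick way to see which characters you would lose is to test whether a
string fits into a single-byte Windows encoding. The Python sketch below
uses the cp1252 codec as a stand-in for WinAnsiEncoding; that is an
approximation (the two are close but not guaranteed to be identical), and
fits_winansi is a made-up helper name, not part of FOP.

```python
def fits_winansi(text):
    """Return True if every character of text can be encoded as cp1252.

    Assumption: Python's cp1252 codec approximates the PDF spec's
    WinAnsiEncoding closely enough for a rough check.
    """
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


print(fits_winansi("Grüße"))   # True
print(fits_winansi("日本語"))   # False
```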

For PostScript fonts the behavior is different. If you embed a PostScript
font, the whole font is always embedded.

> Can I rely on the assumption that a PDF file with the embedded font will
> be displayed correctly regardless of the operating system and of the
> fonts available on whichever computer the PDF file is viewed on?

Yes. To achieve this, the font name is scrambled: a prefix is inserted to
make sure that the font name won't match the name of an installed font.
Acrobat Reader used to prefer installed fonts over embedded fonts, which
could cause problems if the user had installed different versions of the
embedded fonts (e.g. fonts that lack some characters used in the PDF
document, or that simply look different). Acrobat Reader 5.0 has an option
to select whether system-installed fonts or document-embedded fonts should
be used, so the prefix could be removed. However, there is little to gain
from removing it. The only case where you gain something is if you merge a
lot of PDF documents that embed the same font. Then Acrobat (full version
only) can merge the fonts so you don't embed the same font twice in the
same document.
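
For illustration, here is a small Python sketch of such a prefix. It uses
the convention from the PDF spec for font subsets, a tag of six uppercase
letters followed by a plus sign. Whether FOP 0.20.4 builds its prefix exactly
this way is an assumption, and subset_font_name is a made-up helper name.

```python
import random
import string


def subset_font_name(base_name, used_tags, rng=random):
    """Return base_name prefixed with a fresh six-letter tag and '+'.

    used_tags is a set of tags already handed out in this document, so
    that two subsets never share a tag.
    """
    while True:
        tag = "".join(rng.choice(string.ascii_uppercase) for _ in range(6))
        if tag not in used_tags:
            used_tags.add(tag)
            return f"{tag}+{base_name}"


used_tags = set()
# "ArialUnicodeMS" is just an example font name
name = subset_font_name("ArialUnicodeMS", used_tags)
print(name)
```

Because the tagged name will not match the name of any installed font, a
viewer has no reason to substitute a system font for the embedded one.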

> Is it possible to create a PDF file with FOP that just describes a
> character by its Unicode representation, so that the corresponding glyph
> is loaded from one of the fonts available on the computer where the PDF
> file is viewed? That would be an interesting and valuable feature as an
> alternative to embedding the font file.

You don't have to embed the font. I'm not sure how well this is supported,
but I think that if you use any encoding other than ANSI, you're screwed
(because of the lack of ToUnicode maps).

You most probably want to embed the fonts anyway. Real-life example
(happened this year): I was generating reports for a huge organization. All
the computers in that organization were supposed to be identical:
maintenance was centralized, all computers were upgraded automatically, and
users were not permitted to install anything. Despite all that effort,
there were two different versions of the Arial font installed on different
computers. The newer Arial version had more glyphs and, most importantly,
the glyphs were bigger! That totally ruined the reports, causing page
breaks where there shouldn't be any, overflow, etc.

Moral is: always embed the fonts :)


Tore



RE: Background information about using unicode characters

Posted by Victor Mote <vi...@outfitr.com>.
Markus (Schaeffler?) wrote:

> such as the whole code page the character belongs to? What exactly is
> embedded into the PDF file? Can I rely on the assumption that a PDF file
> with the embedded font will be displayed correctly regardless of the
> operating system and of the fonts available on whichever computer the
> PDF file is viewed on?

AFAIK, the entire font is embedded in the PDF. On OS independence, yes,
that is the whole theory behind PDF. There are occasional differences
between PDF viewers that might be thought of as platform-related, but in
general, PDF is intended to be platform- and device-independent, and in
practice it seems to work pretty well.

> Is it possible to create a PDF file with FOP that just describes a
> character by its Unicode representation, so that the corresponding glyph
> is loaded from one of the fonts available on the computer where the PDF
> file is viewed? That would be an interesting and valuable feature as an
> alternative to embedding the font file.

Yes, simply don't embed the file. The risk is that if the font is not on the
other system or is different, it may not look as intended.

BTW, you will get faster and better responses to questions like this on the
fop-user list, or, better yet, on a PDF list.

Victor Mote

