Posted to fop-users@xmlgraphics.apache.org by David Moles <da...@vykor.com> on 2002/10/02 21:54:37 UTC

Unicode characters vs. character references

Hi there. I need to use the em dash (Unicode character U+2014)
and the multiplication sign (Unicode character U+00D7) in an FO document
I'm generating. When I encode these as numeric character references in
the XML (&#x2014; and &#x00d7;), they turn up fine in the PDF -- FOP has
no trouble mapping these to working characters in Helvetica or whatever
font it's using. But when the XML just has the raw Unicode characters,
I get a "#", meaning FOP can't find that character in the font.

So, "If it hurts when you do that, stop doing it!", right? But the
package (JDOM) I'm using to generate the XML doesn't like to generate
character references -- if I try to put a character reference in there,
it escapes the ampersand, and if I put the Java character literal in
there ('\u2014', '\u00d7'), it outputs raw UTF-8.

I can work around this by doing some ugly postprocessing, but I'd
rather not.

I guess what I don't understand is why FOP interprets the escaped
character reference differently from the character literal -- where
does the magic happen? I would expect the parser to give FOP the same
Java String, internally, regardless. It's nice that with the escaped
character references FOP can figure out how to display them, but it
seems like a bug that with the character literals it can't.

I'm using FOP 20.3 on Linux (JDK version 1.3.1), and viewing my PDFs 
in Acrobat Reader version 4.0.

Any advice appreciated,

David

Re: Unicode characters vs. character references

Posted by David Moles <da...@vykor.com>.

On Wed, 2002-10-02 at 13:35, J.Pietschmann wrote:
> 
> However, I guess something is wrong with your encoding declaration.
> Maybe the file is mangled in a data transfer or something. Load
> the file you feed to FOP in an XML-aware editor, or run an identity
> transformation with the output encoding set to ISO-8859-1 and
> examine the result.

You were right. The problem was that my Java code wasn't setting the
output encoding, so it was writing the file as ISO-8859-1; the
characters outside that character set were simply getting mangled.
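
For anyone hitting the same thing, here is a minimal sketch of the
failure and the fix using only the JDK, not JDOM. It assumes a modern
JDK (not the 1.3.1 mentioned in this thread), and the class and method
names are made up for illustration. Writing through a Latin-1 encoder
silently replaces the em dash with "?", while writing through an
explicit UTF-8 encoder preserves it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of FOP or JDOM.
class EncodingDemo {
    // Encode text with the given charset, the way a serializer writing
    // through an OutputStreamWriter would.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, charset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Write as UTF-8 and read back as UTF-8: declaration and bytes agree.
    static String viaUtf8(String text) {
        return new String(write(text, StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // Write as ISO-8859-1 but read back as UTF-8: the mismatch described above.
    static String viaLatin1(String text) {
        return new String(write(text, StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "a\u2014b";
        System.out.println(viaUtf8(text).equals(text)); // true
        System.out.println(viaLatin1(text));            // a?b
    }
}
```

With JDOM the same idea applies: tell the outputter which encoding to
use and make sure the stream or writer it writes to uses the same one.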

Re: Unicode characters vs. character references

Posted by David Moles <da...@vykor.com>.

On Wed, 2002-10-02 at 13:35, J.Pietschmann wrote:
> David Moles wrote:
>
> > I guess what I don't understand is why FOP interprets the escaped
> > character reference differently from the character literal -- where
> > does the magic happen?
>
> It's the parser. FOP itself never sees a character reference, it
> gets proper Unicode characters.
> Either try a more recent version of the parser (Xerces), or
> another parser.
> 
> However, I guess something is wrong with your encoding declaration.
> Maybe the file is mangled in a data transfer or something. Load
> the file you feed to FOP in an XML-aware editor, or run an identity
> transformation with the output encoding set to ISO-8859-1 and
> examine the result.

Hmm. I'm pretty sure that the encoding is UTF-8 and it's being
declared as UTF-8, but I'll have a look at it.

Re: Unicode characters vs. character references

Posted by "J.Pietschmann" <j3...@yahoo.de>.

David Moles wrote:
> Hi there. I need to use the em dash (Unicode character U+2014)
> and the multiplication sign (Unicode character U+00D7) in an FO document
> I'm generating. When I encode these as numeric character references in
> the XML (&#x2014; and &#x00d7;), they turn up fine in the PDF -- FOP has
> no trouble mapping these to working characters in Helvetica or whatever
> font it's using. But when the XML just has the raw Unicode characters,
> I get a "#", meaning FOP can't find that character in the font.
...
> I guess what I don't understand is why FOP interprets the escaped
> character reference differently from the character literal -- where
> does the magic happen?
It's the parser. FOP itself never sees a character reference, it
gets proper Unicode characters.
Either try a more recent version of the parser (Xerces), or
another parser.
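
A small sketch of that point, using the JAXP DOM parser bundled with
the JDK rather than FOP's own pipeline (the class and method names are
hypothetical, and a modern JDK is assumed): a document with numeric
character references and one with the raw characters, correctly encoded
as UTF-8, come out of the parser as the same Java String.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; not part of FOP.
class CharRefDemo {
    // Parse the XML from UTF-8 bytes and return the root element's text.
    static String parseText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String header = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
        String withRefs = header + "<t>a&#x2014;b &#xd7; c</t>";
        String withRaw = header + "<t>a\u2014b \u00d7 c</t>";
        // Both forms yield identical text once parsed.
        System.out.println(parseText(withRefs).equals(parseText(withRaw))); // true
    }
}
```

If the two forms behave differently downstream, the bytes on disk
probably do not match the declared encoding.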

However, I guess something is wrong with your encoding declaration.
Maybe the file is mangled in a data transfer or something. Load
the file you feed to FOP in an XML-aware editor, or run an identity
transformation with the output encoding set to ISO-8859-1 and
examine the result.
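
One way to run such an identity transformation is through the JAXP
Transformer API, sketched below under the assumption of a modern JDK
and its bundled XSLT implementation (the class and method names are
hypothetical). With the output encoding set to ISO-8859-1, characters
that Latin-1 cannot represent should appear as numeric character
references, which makes it easy to see what the parser really read.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; not part of FOP.
class IdentityDemo {
    // Copy the XML unchanged, but serialize the result as ISO-8859-1.
    static String toLatin1(String xml) {
        try {
            // newTransformer() with no stylesheet gives an identity transform.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            identity.transform(new StreamSource(new StringReader(xml)),
                               new StreamResult(out));
            // Decode with the same charset the serializer was asked to use.
            return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The em dash is outside Latin-1, so it should come back as a
        // numeric character reference rather than a raw character.
        System.out.println(toLatin1("<t>a\u2014b</t>"));
    }
}
```

If the output shows something other than the reference you expect at
that position, the input was already mangled before FOP saw it.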

J.Pietschmann