Posted to general@xerces.apache.org by Dmitry Melekhov <dm...@aspec.ru> on 2000/01/28 08:36:22 UTC

xml encodings, java

Hello!

I'm not sure that this list is the right place
for this question. If I've made a mistake, I'm sorry!

The question is Cocoon-related and is about how Xerces is
supposed to work with encodings.

I write my XML documents in the koi8 encoding,
but whether I set the encoding or not, I always see ???? in the browser
instead of 8-bit characters.
Taras Shumeyko pointed out to me that this is a formatter problem and
that the problem is in org.apache.xml.serialize.BaseMarkupSerializer,
in the function    protected String escape( String source )

I changed it by removing all the re-encoding from it, and now
Cocoon and Xerces work OK for me.
Here is my variant of the function:

  protected String escape( String source )
    {
        StringBuffer    result;
        int             i;
        char            ch;
        String          charRef;

        result = new StringBuffer( source.length() );
        for ( i = 0 ; i < source.length() ; ++i )  {
            ch = source.charAt( i );
            // If the character is not printable, print as character reference.
            // Non printables are below ASCII space but not tab or line
            // terminator, ASCII delete, or above a certain Unicode threshold.
//          if ( ( ch < ' ' && ch != '\t' && ch != '\n' && ch != '\r' ) ||
//               ch > _lastPrintable || ch == 0xF7 )
//                  result.append( "&#" ).append( Integer.toString( ch ) ).append( ';' );
//          else {
                    // If there is a suitable entity reference for this
                    // character, print it. The list of available entity
                    // references is almost but not identical between
                    // XML and HTML.
//                  charRef = getEntityRef( ch );
//                  if ( charRef == null )
                        result.append( ch );
//                  else
//                      result.append( '&' ).append( charRef ).append( ';' );
//          }
        }
        return result.toString();
    }

But this is a dirty hack.

I want to understand how Xerces is supposed to treat encodings and why
it doesn't work now.
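
For comparison, here is a minimal sketch of an escape routine that keeps every character the output encoding can represent and falls back to a numeric character reference only for the rest. It is not the Xerces code: the class name EncodingAwareEscaper and the Charset parameter are assumptions made for illustration, and it relies on java.nio.charset, which does not exist in JDK 1.1.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class EncodingAwareEscaper {
    /**
     * Escapes markup characters and replaces any character that the
     * target charset cannot encode with a decimal character reference.
     * It does not check whether a character is legal in XML at all.
     */
    static String escape(String source, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder result = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); ) {
            int cp = source.codePointAt(i);
            String piece = new String(Character.toChars(cp));
            if (cp == '&') {
                result.append("&amp;");
            } else if (cp == '<') {
                result.append("&lt;");
            } else if (cp == '>') {
                result.append("&gt;");
            } else if (encoder.canEncode(piece)) {
                // Representable in the output encoding: write it as-is.
                result.append(piece);
            } else {
                // Not representable: fall back to a character reference.
                result.append("&#").append(cp).append(';');
            }
            i += piece.length();
        }
        return result.toString();
    }
}
```

With a koi8-capable charset as the target, Cyrillic letters would pass through unchanged, while with US-ASCII they would become references such as &#1046;.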

--
Dmitry Melekhov
http://www.aspec.ru/~dm
2:5050/11.23@fidonet

P.S.
My Java platform is Blackdown JDK 1.1.7 for Linux x86.

Re: xml encodings, java

Posted by Dmitry Melekhov <dm...@aspec.ru>.

----- Original Message -----
From: Mike Pogue <mp...@apache.org>
To: <xe...@xml.apache.org>
Sent: Friday, January 28, 2000 9:23 PM
Subject: Re: xml encodings, java

Hello!

And thank you for your perfect answer!

> I'm not sure how Cocoon works, but let me summarize how encodings work
> in general.
>
> The XML spec says (I'm paraphrasing) that if you use an encoding other
> than UTF-8 or UTF-16, then you must specify the encoding in the initial
> line, like this:
>
> <?xml version="1.0" encoding="koi8"?>
>
> The parser uses the first 4 bytes of the file to determine what the
> encoding of the first line is.  You cannot put any characters before the
> "<?xm".  A byte order mark is required, in the case of UTF-16, which
> specifies BE or LE.
>
> The first 4 bytes are used to guess at the encoding of the first line.
> The first line is read, and if an encoding clause is present, the
> encoding is switched to that one for the rest of the file.  In the Java
> parser, the underlying JDK is called to instantiate the encoder.
> In most cases, the NAME of the encoding is NOT the Java name -- it's the
> MIME or IANA name.  There is a switch on the parser to permit Java
> encoding names, too, but using this switch can result in non-standard
> XML (XML that cannot be read on other parsers).  I do not recommend
> doing this.
>
> Now, if the first line is NOT present, then the parser assumes that the
> file is UTF-8.

As I understand it, the parser re-encodes the XML from the input encoding
to UTF-8. And this does not work for me because my JVM doesn't have the
koi8 encoding installed?

[skip]
>
> > Why not to work with xml content like with raw data, only processing tags?
>
> The XML spec says that arbitrary binary data is NOT allowed in an XML
> document.  It must be data in a recognized encoding, and every character
> is checked to make sure that it is in a legal range.
>

Hmm. I understand that Xerces is more than just a parser for Cocoon,
but, IMHO, there is a problem in the interaction between Xerces and Cocoon.
I always use valid documents with Cocoon, and I don't think
that publishing is the time for validation...

> On output, you need to make sure that your files follow all these
> rules.  An easy way to check this, is to use the parser itself.  Feed
> your input or output file through one of the parser sample programs, and
> let the parser tell you which characters are wrong.  It will tell you
> which characters are illegal, and what line they're on.
>
> This method makes it a LOT easier to track down encoding-related
> problems.
>

Sure. But, IMHO, this is not the right way for a publishing engine.

> Another way to eliminate a lot of problems is to use UTF-8 as an output
> encoding.  This is sometimes not possible, but UTF-8 does contain all
> the Cyrillic characters of koi8 (as far as I know).  And, your resulting
> XML will be portable to more environments, because ALL XML parsers are
> required to understand UTF-8.

I use XML only for publishing. And I use Cocoon with Russian Apache
(Apache with patches by Alex Tutubalin), which is very popular in Russia.
The main feature of RA is re-encoding documents to the client (browser)
encoding. We have at least 4 encodings for Cyrillic characters (5 with
Unicode), and I decided to use only koi8 on my Linux servers, because
Unicode support is too weak now.
So I need Cocoon to read documents in the koi8 encoding (or any encoding)
and output koi8 (or any other, but how do I tell Cocoon this?).
How can I do my work without patching new versions of Xerces?

Dmitry Melekhov
http://www.aspec.ru/~dm
2:5050/11.23@fidonet

P.S.
It looks like this list is not for such questions, but there are no replies
on the Cocoon users list :( Please point me to the right list.

Re: xml encodings, java

Posted by Mike Pogue <mp...@apache.org>.

I'm not sure how Cocoon works, but let me summarize how encodings work
in general.

The XML spec says (I'm paraphrasing) that if you use an encoding other
than UTF-8 or UTF-16, then you must specify the encoding in the initial
line, like this:

	<?xml version="1.0" encoding="koi8"?>

The parser uses the first 4 bytes of the file to determine what the
encoding of the first line is.  You cannot put any characters before the
"<?xm".  A byte order mark is required, in the case of UTF-16, which
specifies BE or LE.  
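
A minimal sketch of that first-four-bytes guess, loosely following the autodetection table in Appendix F of the XML 1.0 spec. It is not the Xerces implementation: the class name XmlEncodingSniffer and the returned labels are assumptions for illustration, and the UCS-4 and EBCDIC cases are left out.

```java
class XmlEncodingSniffer {
    /**
     * Guesses the encoding family from the first bytes of an XML entity.
     * "ASCII-compatible" means the XML declaration can be read as ASCII
     * to find the real encoding name (for example koi8).
     */
    static String sniff(byte[] head) {
        int b0 = at(head, 0), b1 = at(head, 1), b2 = at(head, 2), b3 = at(head, 3);

        // Byte order marks come first.
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";

        // No byte order mark: look at how "<?" or "<?xm" is laid out.
        if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x3F) return "UTF-16BE";
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x3F && b3 == 0x00) return "UTF-16LE";
        if (b0 == 0x3C && b1 == 0x3F && b2 == 0x78 && b3 == 0x6D) return "ASCII-compatible";

        // Nothing recognizable: the default is UTF-8.
        return "UTF-8";
    }

    // Returns the unsigned byte at index i, or -1 if the input is too short.
    private static int at(byte[] bytes, int i) {
        return i < bytes.length ? bytes[i] & 0xFF : -1;
    }
}
```

This also shows why nothing may precede "<?xm": a single stray byte in front of it shifts the pattern, and the guess falls through to the default.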

The first 4 bytes are used to guess at the encoding of the first line. 
The first line is read, and if an encoding clause is present, the
encoding is switched to that one for the rest of the file.  In the Java
parser, the underlying JDK is called to instantiate the encoder.  
In most cases, the NAME of the encoding is NOT the Java name -- it's the
MIME or IANA name.  There is a switch on the parser to permit Java
encoding names, too, but using this switch can result in non-standard
XML (XML that cannot be read on other parsers).  I do not recommend
doing this.
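
To illustrate the difference between Java names and IANA names, here is a small sketch that maps a label to the canonical name the JVM reports for it. It uses java.nio.charset.Charset, which is newer than the JDK 1.1 discussed in this thread, and the class name CharsetNames is an assumption for illustration.

```java
import java.nio.charset.Charset;

class CharsetNames {
    /**
     * Returns the canonical name of the charset that the JVM associates
     * with the given label. Throws if the label is illegal or unsupported.
     */
    static String canonicalName(String label) {
        return Charset.forName(label).name();
    }

    public static void main(String[] args) {
        // "UTF8" and "ISO8859_1" are historical Java names; the canonical
        // names reported by the JVM are the ones to put in an XML declaration.
        for (String label : new String[] {"UTF8", "ISO8859_1", "UTF-8"}) {
            System.out.println(label + " -> " + canonicalName(label));
        }
    }
}
```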

Now, if the first line is NOT present, then the parser assumes that the
file is UTF-8.
There are ways to "override" this behavior, and basically "lie" to the
parser about the encoding, but unless you know what you're doing, I'd
recommend against doing so -- it's dangerous.

On top of that, not all characters are allowed in XML, bringing us to
your next question:

> Why not to work with xml content like with raw data, only processing tags?

The XML spec says that arbitrary binary data is NOT allowed in an XML
document.  It must be data in a recognized encoding, and every character
is checked to make sure that it is in a legal range.
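
The legal range can be sketched directly from the Char production of the XML 1.0 spec (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). This is only an illustration of the check, not the parser's own code, and the class and method names are assumptions.

```java
class XmlCharCheck {
    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the index of the first illegal character, or -1 if none. */
    static int firstIllegalIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

Cyrillic letters are well inside the legal range, so a document in koi8 fails this check only if its bytes were decoded with the wrong encoding or it contains stray control characters.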

On output, you need to make sure that your files follow all these
rules.  An easy way to check this is to use the parser itself.  Feed
your input or output file through one of the parser sample programs, and
let the parser tell you which characters are illegal and what line
they're on.

This method makes it a LOT easier to track down encoding-related
problems.

Another way to eliminate a lot of problems is to use UTF-8 as an output
encoding.  This is sometimes not possible, but UTF-8 does contain all
the Cyrillic characters of koi8 (as far as I know).  And, your resulting
XML will be portable to more environments, because ALL XML parsers are
required to understand UTF-8.
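
Converting existing documents to UTF-8 can be sketched as a decode step followed by an encode step. The class name Transcoder is an assumption for illustration, and the "KOI8-R" charset used in main is assumed to be available in the JVM, which is why the sketch checks for it first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Transcoder {
    /**
     * Decodes the bytes with one charset and re-encodes them with another.
     * Bytes that are malformed in the source charset are replaced by the
     * decoder's replacement character rather than reported as errors.
     */
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        // Assumption: this JVM ships a KOI8-R charset.
        if (!Charset.isSupported("KOI8-R")) {
            System.out.println("KOI8-R is not supported by this JVM");
            return;
        }
        byte[] koi8 = {(byte) 0xC1};  // one byte of KOI8-R text
        byte[] utf8 = transcode(koi8, Charset.forName("KOI8-R"), StandardCharsets.UTF_8);
        System.out.println(koi8.length + " byte(s) in, " + utf8.length + " byte(s) out");
    }
}
```

After converting the body, the encoding declaration on the first line has to be changed to match, or dropped, since UTF-8 is the default.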

Hope this helps!
Mike


Re: xml encodings, java

Posted by Dmitry Melekhov <dm...@aspec.ru>.

----- Original Message -----
From: Mike Pogue <mp...@apache.org>
To: <xe...@xml.apache.org>
Sent: Friday, January 28, 2000 8:33 PM
Subject: Re: xml encodings, java


> The code you have below is a clever workaround, but ultimately, you want
> to use a JVM that has the encoding support built-in.
>
> So, I'd suggest you try to use the IBM 1.1.8 JVM.  It's fairly reliable,
> scalable, and I think it has the encoding support you are looking for.
> (Of course, I am biased in this! :-)
>

OK. I just tried the IBM JDK; it works exactly like Blackdown in this case.

But I want to know how Xerces (or maybe this is a Cocoon problem,
I don't know) is supposed to work with encodings. Why is the code that I
commented out there at all?
Why not work with XML content as raw data, processing only the tags?
How is it supposed to work if I set the encoding in the XML document, and
is that the input encoding (i.e. what I have in the XML) or the output
encoding (i.e. what Cocoon sends to the browser), etc.?
I want to understand how it works! :)

Dmitry Melekhov
http://www.aspec.ru/~dm
2:5050/11.23@fidonet


Re: xml encodings, java

Posted by Mike Pogue <mp...@apache.org>.

Which encodings are available to you depends HEAVILY on the encoding
support in the underlying JVM.  In your case: Blackdown.

Note that Sun does NOT require JVMs to support the same encodings that
the Sun JVM does.
And, the Xerces-J parser does NOT do its own encoding support -- it
uses whatever is there in the JVM. So, if your JVM doesn't support an
encoding that you want to use, bad things will happen (it usually won't
do what you expect).
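
A quick way to see what a particular JVM offers is to ask it directly. The sketch below uses java.nio.charset.Charset, which is not available in the 1.1 JVMs discussed here, and the class name EncodingSupport and the probed names are assumptions for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

class EncodingSupport {
    /**
     * True if this JVM can provide a charset for the given name.
     * Syntactically illegal names count as unsupported.
     */
    static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The answers depend on the JVM this is run on.
        for (String name : new String[] {"UTF-8", "KOI8-R", "koi8"}) {
            System.out.println(name + ": " + isSupported(name));
        }
    }
}
```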

The code you have below is a clever workaround, but ultimately, you want
to use a JVM that has the encoding support built-in.

Here's my experience with encodings:

Has fewest encodings:   Microsoft JVM
                        Blackdown JVM
                        Sun JVM
Has most encodings:     IBM JVM

So, I'd suggest you try to use the IBM 1.1.8 JVM.  It's fairly reliable,
scalable, and I think it has the encoding support you are looking for. 
(Of course, I am biased in this! :-)

Mike

