Posted to java-dev@axis.apache.org by Toshiyuki Kimura <to...@apache.org> on 2005/01/19 04:41:48 UTC

Re: UTF8Encoder question...

Hi Ias, Jongjin and all,

   Sorry for cutting in. I'd like to know the conclusion.

   As you may know, I'm now working on i18n of Axis, and the
Japanese Axis community has already produced Japanese-localized
resources. While testing them, I ran into a UTF-8 encoding problem.

   With the latest CVS code, I get an escaped message from the
server-side Axis, as follows:

   <Admin>&#x51E6;&#x7406;&#x3092;&#x5B9F;&#x884C;&#x3057;&#x307E;
   &#x3057;&#x305F;/ [en]-(Done processing)</Admin>

instead of

   <Admin>[Japanese Message] / [en]-(Done processing)</Admin>

  As a side note, I got valid Japanese characters when I
applied Jongjin's patch to my local 'UTF8Encoder.java'.
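
For reference, the escaped text and the literal text carry the same
characters; an XML parser resolves the references before the application
sees them. The following is a minimal Java sketch, not Axis code: the class
name NcrDemo and its decode method are made up for illustration, and it
handles only hexadecimal references of the form shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Axis: turns &#xHHHH; references back into characters.
final class NcrDemo {
    private static final Pattern HEX_NCR = Pattern.compile("&#x([0-9A-Fa-f]+);");

    static String decode(String xmlText) {
        Matcher m = HEX_NCR.matcher(xmlText);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Parse the hex digits as a Unicode code point.
            int codePoint = Integer.parseInt(m.group(1), 16);
            String literal = new String(Character.toChars(codePoint));
            // quoteReplacement guards against '$' or '\' in the decoded text.
            m.appendReplacement(out, Matcher.quoteReplacement(literal));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String escaped = "&#x51E6;&#x7406;&#x3092;&#x5B9F;&#x884C;"
                + "&#x3057;&#x307E;&#x3057;&#x305F;";
        // Prints the Japanese text only if the console encoding supports it.
        System.out.println(decode(escaped));
    }
}
```

So the escaped message is still correct for a parser; the difference only
shows when the raw XML is read by a person or printed as is.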

Any thoughts?

Regards,
Toshi <to...@apache.org>

On Thu, 30 Dec 2004, Changshin Lee wrote:

>> Ias and all,
>>
>> If you revive the commented and removed code of UTF8Encoder that is :
>>                     /*
>> TODO: Try fixing this block instead of code above.
>>                         if (character < 0x80) {
>>                             writer.write(character);
>>                         } else if (character < 0x800) {
>>                             writer.write((0xC0 | character >> 6));
>>                             writer.write((0x80 | character & 0x3F));
>>                         } else if (character < 0x10000) {
>>                             writer.write((0xE0 | character >> 12));
>>                             writer.write((0x80 | character >> 6 & 0x3F));
>>                             writer.write((0x80 | character & 0x3F));
>>                         } else if (character < 0x200000) {
>>                             writer.write((0xF0 | character >> 18));
>>                             writer.write((0x80 | character >> 12 & 0x3F));
>>                             writer.write((0x80 | character >> 6 & 0x3F));
>>                             writer.write((0x80 | character & 0x3F));
>>                         }
>>                         */
>> and comment out the current escaping code, all-tests will fail.
>> As I mentioned, this code would be necessary for an OutputStream, not a Writer.
>> In this case a Writer is used, so the code can simply be rewritten (as in UTF16Encoder) as
>>
>> writer.write(character);
>>
>> I think all-tests will then succeed. (I can't verify this now because the current CVS all-tests fails.)
>>
>
> Could you run all-tests excluding those that fail chronically (by adding
> them to the excluded list)? If the result is clean, I'm for the change (and
> it's easy to revert as well, so commit it :-).
>
>> As for the readability of SOAP messages, I think that is not the responsibility of Axis.
>
> Human readability is one of the essences of XML (and SOAP). Assuming that
> a SOAP processor processes a SOAP input message readable to a user,
> then the output of the processing as a form of SOAP must be readable
> to the user. Therefore when people use Axis as a SOAP processor, they
> will blame Axis for a result containing unreadably broken characters
> to them. It's not utterly up to Axis, but Axis can cause it, and Axis
> should guarantee that there's no distortion in terms of readability
> from Alpha to Omega of SOAP processing.
>
> Ias
>
>>
>> This is the diff:
>> cvs diff -u UTF8Encoder.java
>> Index: UTF8Encoder.java
>> ===================================================================
>> RCS file: /home/cvspublic/ws-axis/java/src/org/apache/axis/components/encoding/UTF8Encoder.java,v
>> retrieving revision 1.4
>> diff -u -r1.4 UTF8Encoder.java
>> --- UTF8Encoder.java 4 Nov 2004 18:23:12 -0000 1.4
>> +++ UTF8Encoder.java 30 Dec 2004 01:20:03 -0000
>> @@ -82,10 +82,6 @@
>>                                  "invalidXmlCharacter00",
>>                                  Integer.toHexString(character),
>>                                  xmlString));
>> -                    } else if (character > 0x7F) {
>> -                        writer.write("&#x");
>> -                        writer.write(Integer.toHexString(character).toUpperCase());
>> -                        writer.write(";");
>>                      } else {
>>                          writer.write(character);
>>                      }
>>
>>
>> /Jongjin
>>
>> ----- Original Message -----
>> From: "Changshin Lee" <ia...@gmail.com>
>> To: <ax...@ws.apache.org>
>> Sent: Thursday, December 30, 2004 1:20 AM
>> Subject: Re: UTF8Encoder question...
>>
>>> Ias,
>>>
>>> Even if we consider systems that can't display the SOAP message well for lack of a Unicode font,
>>> I think the default encoding should be as-it-is, not escaping.
>>>
>>> The SOAP message is not for display, and from the web services toolkit's point of view it is better to generate a more compact SOAP message.
>>>
>>
>> SOAP messages are not for presentation but should be readable :-)
>>
>>> For display, the application can convert the SOAP message to an appropriate encoding. (As you know, here in Korea we use EUC-KR, and the conversion takes only a few lines of Java code.)
>>> Also, as far as I know, Axis used the as-it-is approach in Axis 1.0 or 1.1.
>>>
>>
>> That's a good point. However, we need to pay attention to those who may
>> want UTF8Encoder to run conversion like now. If we revert Axis 1.2's
>> UTF8Encoder, we should inform users of the regression clearly in order
>> not to puzzle them.
>>
>>> I remember that the reason for escaping in UTF8Encoder, introduced a few months ago, was to handle French accents and German umlauts. This is reflected in the test.encoding.TestString test case.
>>>
>>
>> The current mechanism was introduced in April. At that time the source still contained
>>
>>                     /*
>> TODO: Try fixing this block instead of code above.
>>                         if (character < 0x80) {
>>                             writer.write(character);
>>                         } else if (character < 0x800) {
>>                             writer.write((0xC0 | character >> 6));
>>                             writer.write((0x80 | character & 0x3F));
>>                         } else if (character < 0x10000) {
>>                             writer.write((0xE0 | character >> 12));
>>                             writer.write((0x80 | character >> 6 & 0x3F));
>>                             writer.write((0x80 | character & 0x3F));
>>                         } else if (character < 0x200000) {
>>                             writer.write((0xF0 | character >> 18));
>>                             writer.write((0x80 | character >> 12 & 0x3F));
>>                             writer.write((0x80 | character >> 6 & 0x3F));
>>                             writer.write((0x80 | character & 0x3F));
>>                         }
>>                         */
>>
>> but the commented-out part was gone by the 1_2RC2 tag.
>>
>>> Any thought?
>>>
>>
>> So, what you're saying is that the current UTF8Encoder's behavior
>> comes from the test case. In other words, if you change the encoder to
>> output "as-it-is", then the test fails. Could we make them consistent,
>> I mean, UTF8Encoder outputs without conversion and at the same time
>> the case passes?
>>
>> Ias
>>
>> P.S. I'd like to hear opinions on changing UTF8Encoder's default
>> behavior (and possibly create another encoder or an option for
>> conversion). Once we pass all tests with the changed encoder, it is
>> worth adopting the change, I believe.
>>
>>> /Jongjin
>>>
>>> ----- Original Message -----
>>> From: "Ias" <ia...@hotmail.com>
>>> To: <ax...@ws.apache.org>
>>> Sent: Wednesday, December 29, 2004 1:53 AM
>>> Subject: RE: UTF8Encoder question...
>>>
>>>>
>>>> From: Jongjin Choi [mailto:gunsnroz@hotmail.com]
>>>> Sent: Tuesday, December 28, 2004 11:56 AM
>>>> To: axis-dev@ws.apache.org
>>>> Subject: UTF8Encoder question...
>>>>
>>>>
>>>> Dims and all,
>>>>
>>>> UTF8Encoder writes escaped string when the character is over 0x7F.
>>>> The escaping does not seem to be necessary because
>>>> the Writer (not OutputStream) is used.
>>>>
>>>> I think this could be just : (line 86)
>>>>
>>>> writer.write(character);
>>>>
>>>> instead of : (line 86 ~ 88)
>>>> writer.write("&#x");
>>>> writer.write(Integer.toHexString(character).toUpperCase());
>>>> writer.write(";");
>>>>
>>>> The escaping just increases the message size.
>>>>
>>> ias> Yes, it does. However, I think representing a character whose code point
>>> ias> is above 0x7F as an &#x XML character reference is one of the aims of the encoder,
>>> ias> because some systems can't display such a character properly when they have no
>>> ias> Unicode-wide fonts installed. If it is 100% certain that every node
>>> ias> in a messaging system has no problem with "as-it-is" character
>>> ias> representation in an XML instance, it must be much more efficient to use a
>>> ias> compact encoder, as you pointed out, instead of UTF8Encoder. Interestingly,
>>> ias> AbstractXMLEncoder (which is not instantiable) works that way.
>>> ias> Consequently, it would be a good idea to create a new encoder that optimizes
>>> ias> message size and is easy to configure. (Yes, we can recommend
>>> ias> it to users dealing with non-Latin character systems :-)
>>>>
>>>> Happy new year,
>>>>
>>>> Ias
>>>>
>>>> P.S. I'm going to switch iasandcb@hotmail.com to iasandcb@gmail.com (soon,
>>>> very soon).
>>>>
>>>>
>>>> If the OutputStream is used, the escaping or UTF-8 conversion (which
>>>> existed in old UTF8Encoder.java) will be needed.
>>>>
>>>> Thought?
>>>>
>>>> /Jongjin
>>>>
>>>>
>>
>
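
The Writer versus OutputStream point in the quoted thread can be checked
outside Axis. The sketch below is only an illustration under stated
assumptions: the class and method names are invented, the hand-encoding
covers just the three-byte range (U+0800 to U+FFFF, ignoring surrogates),
and the Writer is assumed to wrap a UTF-8 stream. It compares (a) writing
the character to a Writer, (b) hand-encoding it onto an OutputStream with
the bit arithmetic from the commented-out block, and (c) feeding those
hand-encoded values to a Writer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Axis.
final class WriterVsStreamDemo {

    /** (a) The Writer performs the UTF-8 byte encoding itself. */
    static byte[] viaWriter(char c) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            w.write(c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** (b) Hand-encodes a char in U+0800..U+FFFF directly onto a byte stream. */
    static byte[] viaStream(char c) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        bytes.write(0xE0 | (c >> 12));
        bytes.write(0x80 | ((c >> 6) & 0x3F));
        bytes.write(0x80 | (c & 0x3F));
        return bytes.toByteArray();
    }

    /** (c) Same values written to a Writer: each is taken as a char and encoded again. */
    static byte[] handEncodedIntoWriter(char c) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            w.write(0xE0 | (c >> 12));
            w.write(0x80 | ((c >> 6) & 0x3F));
            w.write(0x80 | (c & 0x3F));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        char c = '\u51E6';
        System.out.println("writer:        " + viaWriter(c).length + " bytes");
        System.out.println("stream:        " + viaStream(c).length + " bytes");
        System.out.println("double-encoded: " + handEncodedIntoWriter(c).length + " bytes");
    }
}
```

For U+51E6, (a) and (b) give the same three bytes, while (c) gives six,
because Writer.write(int) treats each value as a character and the
underlying encoder converts it a second time. That is consistent with the
claim that the bit-shifting block belongs with an OutputStream and that a
plain writer.write(character) is enough when a Writer is used.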

Re: UTF8Encoder question...

Posted by Toshiyuki Kimura <to...@apache.org>.
Hi Jongjin,

Let me clarify ...
Is the switch only for the Admin Service and Client, global to the
application, or per application?

   From an i18n point of view, I hope Axis works correctly with
all languages under the default settings.

Thanks,
Toshi <to...@apache.org>

On Wed, 19 Jan 2005, Jongjin Choi wrote:

> Hi, Toshi and all.
> 
> I'd like to propose these for backward compatibility:
>   - keep the escaping as default
>   - make a runtime option (axis property in wsdd) for switching to
>     no-escaping.
> 
> The current behavior causes no problem for an application handling the
> SOAP message. I just pointed out that the message size can be somewhat
> larger with escaping.
> 
> But in this case, the admin client (AdminClient.java) seems to write
> the content of the SOAP body directly to the console. I think the switch can
> be applied to the Admin Service and Client.
>
> Any thoughts?
>
> /Jongjin
>

Re: UTF8Encoder question...

Posted by Jongjin Choi <gu...@hotmail.com>.
Hi, Toshi and all.

I'd like to propose these for backward compatibility:  
  - keep the escaping as default
  - make a runtime option (axis property in wsdd) for switching to no-escaping.

The current behavior causes no problem for an application handling the SOAP message.
I just pointed out that the message size can be somewhat larger with escaping.
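
To make the proposal concrete, here is a rough Java sketch of an encoder
with such a switch. It is not the Axis implementation: the class name, the
encode method, and the escapeNonAscii flag are all hypothetical, and the
flag merely stands in for whatever property would be read from the wsdd.
Unlike the branch removed in the diff, it iterates by code point so that
characters outside the BMP get a single reference; it also leaves out the
escaping of XML markup characters and the invalid-character check.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not part of Axis.
final class SwitchableEscapeSketch {

    /** Escapes code points above 0x7F as &#xHHHH; only when escapeNonAscii is true. */
    static String encode(String text, boolean escapeNonAscii) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (escapeNonAscii && cp > 0x7F) {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    /** Size of the text once serialized as UTF-8. */
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String message = "\u51E6\u7406";
        System.out.println("escaped: " + utf8Size(encode(message, true)) + " bytes");
        System.out.println("literal: " + utf8Size(encode(message, false)) + " bytes");
    }
}
```

For the single character U+51E6, the escaped form takes 8 bytes in UTF-8
and the literal form takes 3, which is the size difference mentioned above.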

But in this case, the admin client (AdminClient.java) seems to write the content of the SOAP body directly to the console. I think the switch can be applied to the Admin Service and Client.

Any thoughts?

/Jongjin
