Posted to java-dev@axis.apache.org by Jens Schumann <ml...@void.fm> on 2004/01/20 16:38:30 UTC

Another take on UTF-8/16 encoding - was: don't understand your patch on axis

Hi Dims,

More than a month ago Cedric and you fixed a problem with French characters,
and Cedric was wondering about the current implementation [1, 2]. Mainly he
didn't understand why we use a custom UTF encoder when we (now) use the
String class's UTF encoding anyway. Indeed, the source of AbstractXMLEncoder
looks weird. I was finally able to take a deeper look at this, and I believe
there is no easy way to avoid some of the current encoding overhead.


Right now the XMLEncoder fulfills 3 tasks on Strings:

1. Check for invalid characters (0x00,...)
2. Encode several XML Entities (&,<,>,...)
3. Encode the string depending on the given encoding (UTF-8/UTF-16)

My initial patch used a simple byte array to build the new UTF
representation in a single loop and returned the encoded bytes as a String
(new String(byte[])). But it turned out that creating the string this way
was not sufficient; new String(byte[], encoding) is required instead. With
the latest patch we finally got it working, but ended up with a solution
that UTF-encodes the String twice.
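To illustrate why the charset argument matters: new String(byte[]) decodes
with the platform default charset, so UTF-8 bytes only round-trip if that
default happens to be UTF-8. The following is a minimal sketch in current
Java, not code from the patch. The class and method names are made up for
illustration, and ISO-8859-1 merely stands in for a non-UTF-8 platform
default.

```java
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    // Decode with an explicit charset: independent of the platform default.
    static String decodeExplicit(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    // Decode as ISO-8859-1 to mimic what new String(byte[]) would do on a
    // platform whose default charset is not UTF-8 (an assumption for the demo).
    static String decodeAsLatin1(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an acute accent on the e
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8); // 5 bytes

        System.out.println(decodeExplicit(utf8).equals(original)); // true
        System.out.println(decodeAsLatin1(utf8).length());         // 5, one char per byte
    }
}
```

The accented character occupies two bytes in UTF-8, so the wrong charset
turns it into two unrelated characters.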

As Cedric pointed out, there is obviously no reason to encode the String
manually; we could just do 1 & 2 manually and use the String class for 3.
Indeed a valid point.
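A rough sketch of that split could look like the following. This is not the
Axis AbstractXMLEncoder; SimpleXmlEscaper and its methods are hypothetical
names, and the invalid-character check is deliberately simplified to control
characters below 0x20 other than tab, LF and CR.

```java
import java.nio.charset.StandardCharsets;

public class SimpleXmlEscaper {
    // Tasks 1 and 2: reject (some) invalid characters and escape XML entities.
    static String escape(String in) {
        StringBuilder out = new StringBuilder(in.length() + 16);
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:
                    // Simplified check: only low control characters are rejected.
                    if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                        throw new IllegalArgumentException(
                                "Invalid XML character 0x" + Integer.toHexString(c)
                                + " at index " + i);
                    }
                    out.append(c);
            }
        }
        return out.toString();
    }

    // Task 3: leave the actual byte encoding to the String class.
    static byte[] encodeUtf8(String in) {
        return escape(in).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(escape("a<b&c"));              // a&lt;b&amp;c
        System.out.println(encodeUtf8("\u00e9").length);  // 2
    }
}
```

The escaping loop stays hand-written, while the only encoding step is the
single getBytes call at the end.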

I ran a little test, and it turns out that for mid-size strings the simple
String.getBytes() solution is half as fast as the manual version.

Does it matter? 

Here are the numbers for the different encoding versions (average time per
call in ms over 100000 iterations), along with the totals for 10 and 40
calls during SOAP envelope serialization (40 calls with mid-size Strings
isn't a typical use case, I think, but 10 is).

Version           1 call (ms)   10 calls (ms)   40 calls (ms)
Manual encode     0.07611       0.7611          3.0444
String.getBytes   0.15559       1.5559          6.2236

So if you think the above numbers aren't that impressive and you don't
care about unnecessary String creation during the process, then just stop
reading here. I could send a small patch that drops most of the current
Encoder stuff, removes as much overhead as possible and simply returns a
String.

What makes me really nervous is that we create those Strings just to add
them, in most cases, to a Writer instance immediately. I personally don't
like creating temporary strings for nothing. That's why I went ahead and
tried to return a (byte/char) array instead of the String. And here the
trouble starts.

We use Writers basically everywhere to serialize SOAP envelopes. However,
Writers accept only char arrays; byte arrays can't be used. And to be
honest, writing an Encoder that creates char arrays is just a waste of
time. Since I don't know the reason for using Writers within Axis, I think I
am beating a dead horse anyway, and I personally believe we should go ahead
and use the simple version from Cedric. On the other hand, we could possibly
improve performance for larger SOAP envelopes a lot by using byte arrays and
Streams instead.
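The difference between the two sinks can be sketched as follows. This only
illustrates the java.io API shapes and is not how Axis is wired up; the
class and method names are hypothetical, and a ByteArrayOutputStream stands
in for the real transport stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriterVersusStream {
    // Writer path: the caller hands over chars, and the Writer encodes them
    // to bytes on its own while writing.
    static byte[] viaWriter(String text) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            w.write(text);
        }
        return sink.toByteArray();
    }

    // Stream path: bytes that an encoder already produced are passed
    // through as they are, with no intermediate String or char[].
    static byte[] viaStream(byte[] alreadyEncoded) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        sink.write(alreadyEncoded, 0, alreadyEncoded.length);
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String text = "<a>\u00e9</a>";
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(
                java.util.Arrays.equals(viaWriter(text), viaStream(encoded))); // true
    }
}
```

Both paths produce the same bytes; the open question is only where the
encoding work happens and how many temporary objects it costs.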

So far,

Jens



[1] http://marc.theaimsgroup.com/?l=axis-dev&m=106986163916928&w=2
[2] http://marc.theaimsgroup.com/?l=axis-dev&m=107022627103508&w=2