Posted to dev@poi.apache.org by Jason Height <jh...@minotaur.apache.org> on 2005/08/24 05:43:38 UTC

String encoding (again)

All,

Any idea why the following line from UnicodeRecord (current HEAD rev and 
previous) is actually required?
String unicodeString = new String(getString().getBytes("Unicode"), "Unicode");

If I remove it and use:
String unicodeString = getString();

1) All of the unit tests still pass, and
2) There is a 33x performance improvement with workbooks containing a 
large number of strings

I am tempted to apply a patch to use my approach. Any objections?
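
A rough way to compare the two variants outside POI is sketched below. It is 
not POI code: the class name, the workload of 200,000 short strings, and the 
use of StandardCharsets.UTF_16 in place of the "Unicode" charset name are all 
assumptions made for illustration. The timing is crude (no JIT warm-up), so 
it should not be expected to reproduce the 33x figure.

```java
import java.nio.charset.StandardCharsets;

public class RoundTripTiming {
    // Sums the lengths of all strings, optionally pushing each one through
    // a UTF-16 encode/decode round trip first (the pattern being questioned).
    public static long totalLength(String[] strings, boolean roundTrip) {
        long total = 0;
        for (String s : strings) {
            String u = roundTrip
                    ? new String(s.getBytes(StandardCharsets.UTF_16), StandardCharsets.UTF_16)
                    : s;
            total += u.length();
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical workload: many short cell-like strings.
        String[] strings = new String[200_000];
        for (int i = 0; i < strings.length; i++) {
            strings[i] = "cell value " + i;
        }
        for (boolean roundTrip : new boolean[] { true, false }) {
            long start = System.nanoTime();
            long total = totalLength(strings, roundTrip);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("roundTrip=" + roundTrip
                    + " total=" + total + " time=" + micros + "us");
        }
    }
}
```

Both passes should report the same total; only the elapsed time is expected 
to differ.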

Jason

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-dev-unsubscribe@jakarta.apache.org
Mailing List:    http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta POI Project: http://jakarta.apache.org/poi/


Re: String encoding (again)

Posted by Glen Stampoultzis <gs...@iinet.net.au>.
The no-argument getBytes() uses the platform's default character set, but 
the line in Jason's email uses the overload that explicitly names "Unicode" 
as the character set.  As far as I can tell, the code converts a Unicode 
string into a Unicode byte array and then decodes it straight back into a 
new Unicode string.  The net effect is a lot of work for no actual reason.
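
The round trip described above can be checked with a small standalone 
sketch. It is an illustration, not the POI code: the class and method names 
are made up, and it uses StandardCharsets.UTF_16 on the assumption that the 
JDK treats "Unicode" as an alias for UTF-16.

```java
import java.nio.charset.StandardCharsets;

public class RoundTripCheck {
    // Encodes to UTF-16 bytes and decodes them again, mirroring the
    // shape of the line under discussion.
    public static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_16), StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String[] samples = {
            "",                 // empty string
            "plain ASCII",
            "caf\u00e9",        // Latin-1 range
            "\u65e5\u672c\u8a9e", // CJK
            "\uD83D\uDE00"      // one code point stored as a surrogate pair
        };
        for (String s : samples) {
            System.out.println(s.length() + " chars, unchanged: " + roundTrip(s).equals(s));
        }
    }
}
```

For well-formed strings like these the result equals the input. A string 
containing an unpaired surrogate is the one case that may not survive the 
trip unchanged, which is worth keeping in mind before removing the call.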

-- glen

acoliver@jboss.org wrote:

> http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#getBytes()
>
> public byte[] getBytes()
>
>     Encodes this String into a sequence of bytes using the platform's 
> default charset, storing the result into a new byte array.
>
>     The behavior of this method when this string cannot be encoded in 
> the default charset is unspecified. The CharsetEncoder class should be 
> used when more control over the encoding process is required.
>
>     Returns:
>         The resultant byte array
>     Since:
>         JDK1.1
>
>
>
> Glen Stampoultzis wrote:
>
>>
>> Aren't Java strings always stored as 2 byte unicode as defined by the 
>> spec?
>>
>> acoliver@apache.org wrote:
>>
>>> Not all systems default to unicode.  Though that looks doofy to me. 
>>> Your code assumes they do.  You'd need a flag saying 
>>> "amIOnAnAS400()" or something ;-)
>>>
>>> -Andy
>>>
>>> Jason Height wrote:
>>>
>>>> All,
>>>>
>>>> Any idea why the following line from UnicodeRecord (current HEAD 
>>>> rev and previous) is actually required?
>>>> String unicodeString = new 
>>>> String(getString().getBytes("Unicode"),"Unicode");
>>>>
>>>> If I remove it and use:
>>>> String unicodeString = getString();
>>>>
>>>> 1) All of the unit tests still pass, and
>>>> 2) There is a 33x performance improvement with workbooks containing 
>>>> a large number of strings
>>>>
>>>> I am tempted to apply a patch to use my approach. Any objections?
>>>>
>>>> Jason


Re: String encoding (again)

Posted by ac...@jboss.org.
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#getBytes()

public byte[] getBytes()

     Encodes this String into a sequence of bytes using the platform's 
default charset, storing the result into a new byte array.

     The behavior of this method when this string cannot be encoded in 
the default charset is unspecified. The CharsetEncoder class should be 
used when more control over the encoding process is required.

     Returns:
         The resultant byte array
     Since:
         JDK1.1
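
To make the difference between the overloads concrete, here is a minimal 
sketch (class and method names are hypothetical). The no-argument getBytes() 
depends on whatever Charset.defaultCharset() reports on the machine, while 
the overloads that take a charset give the same bytes everywhere.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GetBytesVariants {
    // Number of bytes produced when encoding s with an explicit charset.
    public static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = "\u00e9"; // a single character: e with acute accent

        // Platform dependent: varies with the JVM's default charset.
        System.out.println("default charset: " + Charset.defaultCharset().name());
        System.out.println("getBytes() length: " + s.getBytes().length);

        // Platform independent: the charset is named explicitly.
        System.out.println("ISO-8859-1: " + encodedLength(s, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:      " + encodedLength(s, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE:   " + encodedLength(s, StandardCharsets.UTF_16BE));
        // UTF-16 without an explicit byte order also writes a 2-byte BOM.
        System.out.println("UTF-16:     " + encodedLength(s, StandardCharsets.UTF_16));
    }
}
```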



Glen Stampoultzis wrote:
> 
> Aren't Java strings always stored as 2 byte unicode as defined by the spec?
> 
> acoliver@apache.org wrote:
> 
>> Not all systems default to unicode.  Though that looks doofy to me. 
>> Your code assumes they do.  You'd need a flag saying "amIOnAnAS400()" 
>> or something ;-)
>>
>> -Andy
>>
>> Jason Height wrote:
>>
>>> All,
>>>
>>> Any idea why the following line from UnicodeRecord (current HEAD rev 
>>> and previous) is actually required?
>>> String unicodeString = new 
>>> String(getString().getBytes("Unicode"),"Unicode");
>>>
>>> If I remove it and use:
>>> String unicodeString = getString();
>>>
>>> 1) All of the unit tests still pass, and
>>> 2) There is a 33x performance improvement with workbooks containing a 
>>> large number of strings
>>>
>>> I am tempted to apply a patch to use my approach. Any objections?
>>>
>>> Jason


Re: String encoding (again)

Posted by Glen Stampoultzis <gs...@iinet.net.au>.
Aren't Java strings always stored as 2-byte Unicode, as defined by the spec?
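
A short sketch of what the String API guarantees here (the class and method 
names are made up for illustration). A char is a 16-bit UTF-16 code unit, so 
characters outside the Basic Multilingual Plane take two chars. Newer JVMs 
may store Latin-1-only strings more compactly internally, but that is not 
visible through the API.

```java
public class Utf16Units {
    // Number of UTF-16 code units (what String.length() counts).
    public static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (surrogate pairs count once).
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "caf\u00e9";            // all characters fit in one char each
        String supplementary = "\uD83D\uDE00"; // one code point, two chars

        System.out.println("bits per char: " + Character.SIZE);
        System.out.println("bmp: units=" + codeUnits(bmp) + " points=" + codePoints(bmp));
        System.out.println("supplementary: units=" + codeUnits(supplementary)
                + " points=" + codePoints(supplementary));
    }
}
```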

acoliver@apache.org wrote:

> Not all systems default to unicode.  Though that looks doofy to me. 
> Your code assumes they do.  You'd need a flag saying "amIOnAnAS400()" 
> or something ;-)
>
> -Andy
>
> Jason Height wrote:
>
>> All,
>>
>> Any idea why the following line from UnicodeRecord (current HEAD rev 
>> and previous) is actually required?
>> String unicodeString = new 
>> String(getString().getBytes("Unicode"),"Unicode");
>>
>> If I remove it and use:
>> String unicodeString = getString();
>>
>> 1) All of the unit tests still pass, and
>> 2) There is a 33x performance improvement with workbooks containing a 
>> large number of strings
>>
>> I am tempted to apply a patch to use my approach. Any objections?
>>
>> Jason


Re: String encoding (again)

Posted by ac...@apache.org.
Not all systems default to Unicode.  Though that looks doofy to me. 
Your code assumes they do.  You'd need a flag saying "amIOnAnAS400()" or 
something ;-)
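
The platform-default concern can be illustrated with a small sketch (names 
are hypothetical, and US-ASCII stands in for a non-Unicode default such as 
an EBCDIC code page). A round trip through a charset that cannot represent 
the characters is lossy, while a round trip through an explicitly named 
Unicode charset is not, whatever the platform default happens to be.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    // Encodes s with the given charset and decodes the bytes with the same one.
    public static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String s = "\u65e5\u672c"; // two CJK characters

        System.out.println("platform default: " + Charset.defaultCharset().name());

        // Stand-in for a platform whose default charset lacks these characters:
        // each unmappable character is replaced during encoding.
        System.out.println("US-ASCII preserved: "
                + roundTrip(s, StandardCharsets.US_ASCII).equals(s));

        // Explicit Unicode charset: independent of the platform default.
        System.out.println("UTF-16 preserved:   "
                + roundTrip(s, StandardCharsets.UTF_16).equals(s));
    }
}
```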

-Andy

Jason Height wrote:
> All,
> 
> Any idea why the following line from UnicodeRecord (current HEAD rev and 
> previous) is actually required?
> String unicodeString = new 
> String(getString().getBytes("Unicode"),"Unicode");
> 
> If I remove it and use:
> String unicodeString = getString();
> 
> 1) All of the unit tests still pass, and
> 2) There is a 33x performance improvement with workbooks containing a 
> large number of strings
> 
> I am tempted to apply a patch to use my approach. Any objections?
> 
> Jason


-- 
Andrew C. Oliver
SuperLink Software, Inc.

Java to Excel using POI
http://www.superlinksoftware.com/services/poi
Commercial support including features added/implemented, bugs fixed.

