Posted to dev@directory.apache.org by Emmanuel Lécharny <el...@gmail.com> on 2013/08/22 11:26:14 UTC

String vs byte[] for values in the server : some new ideas

Hi guys,

for years I have been thinking about using byte[] inside the server for
values. I have tried more than once to get rid of the Strings, with no
success so far : we are too dependent on Strings to get rid of them
(for instance, the PrepareString method works on String, not on byte[],
and the same goes for the various comparators, normalizers and syntax
checkers).

Bottom line, we have to keep the values as Strings.

But is this true for every value ?

In fact, we always store the received attribute's values in two
different formats :
- a normalized String (if it's an HR, i.e. human-readable, attribute),
produced by the attribute's normalizer
- a UP (user-provided) String, which is the value as it was provided by
the user, and which is left untouched.

Now, consider an add operation followed by a search operation, from the
point of view of a specific attribute (say, the 'description' AT)

User add :
----------
description:String ---> API ---> conversion to UTF-8 ---> Server

Server AddHandler :
-------------------
description:byte[] ---> decoder ---> conversion to String ---> creation
of the normValue ---> storage on disk ---> conversion of upValue and
NormValue to byte[]


User search :
-------------
send searchRequest
...
wait for response

Server SearchHandler :
----------------------
fetch the entry => deserialize the Up and Norm value of the description
AT (ie, byte[] to String conversion)
entry processing through the interceptors
write the SearchResultEntry ---> conversion of the description AT
UpValue to byte[] (we don't care about the normValue at this point)

User search :
-------------
...
convert the description Up value to String


As we can see, in both operations we are doing too much work : there is
no need to convert the UpValue to a String, as we end up doing a
byte[] -> String -> byte[] round trip on this UP value in the search.
For the add, the situation is slightly better (or less bad) : the only
thing we can avoid is a String --> byte[] conversion when storing the
value.
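
To make the round trip concrete, here is a minimal sketch, not actual server code. The class and method names (UpValueRoundTrip, viaString, direct) are invented for the illustration, and it assumes the stored bytes are valid UTF-8, which is what the add path described above produces.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of the search path for a UP value.
final class UpValueRoundTrip {

    // Path described above: deserialize to a String, then encode again
    // when the SearchResultEntry is written.
    static byte[] viaString(byte[] stored) {
        String up = new String(stored, StandardCharsets.UTF_8);
        return up.getBytes(StandardCharsets.UTF_8);
    }

    // Proposed path: the UP value stays a byte[] and is written as is.
    static byte[] direct(byte[] stored) {
        return stored;
    }

    public static void main(String[] args) {
        byte[] stored = "caf\u00e9 description".getBytes(StandardCharsets.UTF_8);

        // Both paths put the same bytes on the wire; the first one simply
        // allocates a String and a second byte[] along the way.
        System.out.println(Arrays.equals(viaString(stored), direct(stored)));
    }
}
```

For valid UTF-8 input the two paths return identical bytes, so the decode/encode pair on the search side does no useful work for the UP value.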


Making the UpValue a byte[] will save us a lot of wasted CPU, and
probably a bit of space on disk, as a String requires 2 bytes per char
to be serialized.
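
A rough sketch of what such a value could look like, under stated assumptions : the class HrValueSketch and its methods are hypothetical names, not the existing Value classes, and the normalization shown (trim + lower case) is only a placeholder for whatever the attribute's normalizer really does.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical holder: the UP value is kept as the received bytes,
// the normalized value is kept as a String.
final class HrValueSketch {
    private final byte[] upBytes;   // as provided by the user, untouched
    private final String normValue; // used by comparators, normalizers, ...
    private String upString;        // decoded lazily, only on demand

    // Add path: one decoding is still needed to compute the normalized value.
    HrValueSketch(byte[] receivedBytes) {
        this.upBytes = receivedBytes.clone();
        String decoded = new String(receivedBytes, StandardCharsets.UTF_8);
        this.normValue = placeholderNormalize(decoded);
    }

    // Search path: both values come back from disk, the UP value is not decoded.
    HrValueSketch(byte[] storedUpBytes, String storedNormValue) {
        this.upBytes = storedUpBytes.clone();
        this.normValue = storedNormValue;
    }

    // Placeholder only (assumption): the real normalization depends on the
    // attribute type and its matching rule.
    private static String placeholderNormalize(String value) {
        return value.trim().toLowerCase(Locale.ROOT);
    }

    byte[] getUpBytes() {
        return upBytes.clone();
    }

    String getNormValue() {
        return normValue;
    }

    // Only callers that really need a String pay for the conversion.
    String getUpString() {
        if (upString == null) {
            upString = new String(upBytes, StandardCharsets.UTF_8);
        }
        return upString;
    }
}
```

With this shape, writing a SearchResultEntry would only call getUpBytes(), and the String form of the UP value would be built only if some interceptor asks for it.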

This is something we have to work on before 2.0, as the underlying
database will be impacted : we will no longer serialize the UpValue as a
String but as a byte[].

Thoughts ?

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com 


Re: String vs byte[] for values in the server : some new ideas

Posted by Emmanuel Lécharny <el...@gmail.com>.
On 8/22/13 11:32 AM, Kiran Ayyagari wrote:
> On Thu, Aug 22, 2013 at 2:56 PM, Emmanuel Lécharny <el...@gmail.com> wrote:
>
>> Making the UpValue a byte[] will save us a lot of wasted CPU, and
>> probably a bit of space on disk, as a String requires 2 bytes per char
>> to be serialized.

> even the byte[] will be of same size, I don't see where the gain is
> am I missing something?


Strings are backed by a char[], and a char is 2 bytes. A UTF-8 encoded String uses 1 to 4 bytes per character (well, 1 to 3 in the real world, i.e. for anything in the Basic Multilingual Plane), and any char up to 0x7F is encoded as 1 byte.
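
A quick way to see the difference, as a small sketch (EncodedSize and its methods are names made up for this example, and the 2-bytes-per-char figure assumes the String is serialized char by char) :

```java
import java.nio.charset.StandardCharsets;

// Compares the size of a value stored as raw chars with its UTF-8 size.
final class EncodedSize {

    // 2 bytes per Java char when each char is written as is.
    static int charBytes(String s) {
        return s.length() * 2;
    }

    // Number of bytes once the value is encoded in UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Pure ASCII, one accented Latin char, and the euro sign (U+20AC).
        String[] samples = { "description", "caf\u00e9", "\u20ac" };

        for (String sample : samples) {
            System.out.println(sample.length() + " chars: "
                + charBytes(sample) + " bytes as chars, "
                + utf8Bytes(sample) + " bytes as UTF-8");
        }
    }
}
```

For the three samples this gives 22 vs 11 bytes, 8 vs 5 bytes and 2 vs 3 bytes : UTF-8 wins as long as most chars are below 0x80, and only loses for chars that need 3 bytes.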


-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com 


Re: String vs byte[] for values in the server : some new ideas

Posted by Kiran Ayyagari <ka...@apache.org>.
On Thu, Aug 22, 2013 at 2:56 PM, Emmanuel Lécharny <el...@gmail.com> wrote:

> Making the UpValue a byte[] will save us a lot of wasted CPU, and
> probably a bit of space on disk, as a String requires 2 bytes per char
> to be serialized.
>

even the byte[] will be of same size, I don't see where the gain is
am I missing something?



-- 
Kiran Ayyagari
http://keydap.com