Posted to user@avro.apache.org by Yang <te...@gmail.com> on 2011/08/12 01:36:09 UTC

why Utf8 (vs String)?

If I declare a field as "string", the generated Java implementation
uses avro......Utf8 for it.

I was wondering what the thinking behind this is, and what the
proper way to use the Utf8 value is.
Often in my logic I need to compare the value against other
Strings, or store it in other databases, which
of course do not know about Utf8, so I have to convert it
to a String. So Utf8 seems to require a lot of unnecessary
conversions.

Or am I not using it correctly?

Thanks
Yang

Re: why Utf8 (vs String)?

Posted by Scott Carey <sc...@apache.org>.
Also, Utf8 caches the result of toString(), so that if you call toString()
many times, it only allocates the String once.
It also implements the CharSequence interface, and many libraries in the
JRE accept CharSequence.
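
A minimal sketch of the two points above: decoding to a String at most once, and being usable wherever a CharSequence is accepted. The class CachedText and its decodeCount field are hypothetical stand-ins written for this illustration; this is not Avro's Utf8 source, and it uses only the JDK.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for illustration only; NOT Avro's Utf8 implementation.
final class CachedText implements CharSequence {
    private final byte[] bytes;   // UTF-8 encoded content
    private String cached;        // decoded lazily, at most once
    int decodeCount = 0;          // exposed only so the caching is observable

    CachedText(String s) {
        this.bytes = s.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public String toString() {
        if (cached == null) {
            // First call: decode the bytes and remember the result.
            cached = new String(bytes, StandardCharsets.UTF_8);
            decodeCount++;
        }
        return cached;
    }

    // The CharSequence methods delegate to the cached String.
    @Override public int length() { return toString().length(); }
    @Override public char charAt(int index) { return toString().charAt(index); }
    @Override public CharSequence subSequence(int start, int end) {
        return toString().subSequence(start, end);
    }
}
```

Because the value is a CharSequence, a comparison against a String can be written as "alice".contentEquals(value), and StringBuilder.append(value) works without an explicit conversion at the call site.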

Note that Utf8 is mutable and exposes its backing store (byte array).
String is immutable.  Be careful with how you use Utf8 objects if you hold
on to them for a long time or pass them to other code -- users should not
expect similar characteristics to String for general use.
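
A minimal sketch of the hazard described above, using the JDK's StringBuilder as a stand-in for a mutable, reused text object (an assumption made so the example runs without Avro on the classpath). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ReuseHazard {
    // Keeps references to one reused mutable buffer: every stored entry
    // ends up showing the last value written into the buffer.
    static List<CharSequence> keepReferences(String[] values) {
        StringBuilder reused = new StringBuilder(); // stand-in for a reused mutable value
        List<CharSequence> kept = new ArrayList<>();
        for (String v : values) {
            reused.setLength(0);   // overwrite the previous content in place
            reused.append(v);
            kept.add(reused);      // the same object is added each time
        }
        return kept;
    }

    // Copies to an immutable String before storing, so each entry is stable.
    static List<String> keepCopies(String[] values) {
        StringBuilder reused = new StringBuilder();
        List<String> kept = new ArrayList<>();
        for (String v : values) {
            reused.setLength(0);
            reused.append(v);
            kept.add(reused.toString()); // defensive copy
        }
        return kept;
    }
}
```

The same precaution applies to any mutable value that a reader may overwrite: copy it (for example with toString()) before putting it in a collection or handing it to code that keeps it.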






Re: why Utf8 (vs String)?

Posted by Yang <te...@gmail.com>.
Thanks a lot, Doug.


Re: why Utf8 (vs String)?

Posted by Doug Cutting <cu...@apache.org>.
This is for performance.

A Utf8 may be efficiently compared to other Utf8's, e.g., when sorting,
without decoding the UTF-8 bytes into characters.  A Utf8 may also be
reused, so when iterating through a large number of values (e.g., in a
MapReduce job) only a single instance need be allocated, while String
would require an allocation per iteration.
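
A minimal sketch of comparing two UTF-8 encoded values directly on their bytes, with no decoding into characters. The class Utf8Bytes and its methods are hypothetical names for this illustration; whether Avro uses this exact loop is an assumption, and only the JDK is used.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Bytes {
    // Lexicographic comparison that treats each byte as unsigned (0..255).
    // Returns a negative, zero, or positive value like Comparable.compareTo.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;   // mask so bytes >= 0x80 are not negative
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // All shared bytes are equal: the shorter array sorts first.
        return a.length - b.length;
    }

    // Helper used only to build test input for the sketch.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

The unsigned mask matters for non-ASCII text: the bytes of "\u00e9" start with 0xC3, which is negative as a signed Java byte, so a signed comparison would wrongly sort it before "z".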

Note that String may be used when writing data, but that data is
generally read as Utf8.  The toString() method may be called whenever a
String is required.  If only equality or ordering is needed, and not
substring operations, then leaving values as Utf8 is generally faster
than converting to String.

Doug
