Posted to user@hive.apache.org by Min Zhou <co...@gmail.com> on 2009/07/08 08:59:52 UTC

unicode supporting in hive

Hi all,
It seems that Hive can go wrong when storing Unicode strings. Hive uses
byte comparison to delimit the fields of a record (see
LazyStruct.java:92, a parse method).
If we use GBK or UTF-8 encoding, where a character may need more than 1
byte (possibly 2-3 bytes), then the separator used for delimiting fields
could by coincidence equal one of the bytes of a GBK/UTF-8 encoded
character, and things would go wrong.
Can Hive solve the problem above?
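
To make the concern concrete, here is a minimal Python sketch (an
illustration, not Hive code). It assumes a hypothetical table that uses
'|' as its field delimiter, and a hypothetical helper split_fields that
splits the raw bytes of a row the way a byte-wise parser would. In GBK the
second byte of a two-byte character may fall in the ASCII range 0x40-0x7E,
so it can equal the delimiter byte.

```python
# Sketch only: byte-wise field splitting, not Hive's actual parser.
def split_fields(row, sep):
    """Split a raw row on a separator byte, ignoring the text encoding."""
    return row.split(sep)

# A GBK character whose second byte is 0x7C, the same byte as ASCII '|'.
ch = b"\x81\x7c".decode("gbk")

# Two logical fields joined by the (assumed) '|' delimiter, stored as GBK.
row = "|".join(["id42", ch]).encode("gbk")
fields = split_fields(row, b"|")

print(len(fields))  # 3: the two-byte character was cut in half
```

The same row splits cleanly if the delimiter is a byte that can never be
part of a GBK or UTF-8 multi-byte character.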

Thanks,
Min
-- 
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com

Re: unicode supporting in hive

Posted by Zheng Shao <zs...@gmail.com>.

However, UTF-8 is hard-coded in a lot of places in Hive (and also in
Hadoop, see Text.java).

If you want to use a different encoding like GBK, we will probably
need to factor that hard-coded UTF-8 out of all the code.
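
A small Python sketch (an illustration, not Hadoop or Hive code) of why a
hard-coded UTF-8 decode matters for GBK data: the bytes survive delimiter
splitting untouched, but they are misread as soon as something interprets
them as UTF-8. The sample string is an arbitrary choice.

```python
text = "中文"
gbk_bytes = text.encode("gbk")

# Strict UTF-8 decoding rejects these GBK bytes outright.
try:
    gbk_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Lenient decoding substitutes U+FFFD and loses the original text.
lossy = gbk_bytes.decode("utf-8", errors="replace")

print(decoded_ok)                       # False
print(lossy == text)                    # False
print(gbk_bytes.decode("gbk") == text)  # True
```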

Zheng

On Wed, Jul 8, 2009 at 12:03 AM, Zheng Shao<zs...@gmail.com> wrote:
> Hi Min,
>
> The separators used in Hive are by default ^A, ^B, ^C ... (ASCII codes
> 1, 2, 3, etc.).
> These byte values never occur inside a multi-byte character in either
> UTF-8 or GBK.
>
> Please see these code maps for details:
> http://en.wikipedia.org/wiki/UTF-8
> http://en.wikipedia.org/wiki/GBK
>
>
> Zheng



-- 
Yours,
Zheng

Re: unicode supporting in hive

Posted by Zheng Shao <zs...@gmail.com>.

Hi Min,

The separators used in Hive are by default ^A, ^B, ^C ... (ASCII codes
1, 2, 3, etc.).
These byte values never occur inside a multi-byte character in either
UTF-8 or GBK.

Please see these code maps for details:
http://en.wikipedia.org/wiki/UTF-8
http://en.wikipedia.org/wiki/GBK
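
The claim can also be checked empirically. The following Python sketch
(the function name is made up for illustration) encodes every non-ASCII
character of the Basic Multilingual Plane in each encoding and looks for
the separator bytes 1, 2 and 3. For UTF-8 the result follows from the
encoding itself, since every byte of a multi-byte sequence is 0x80 or
higher; for GBK it relies on the codec that ships with Python.

```python
SEPARATORS = {0x01, 0x02, 0x03}  # ^A, ^B, ^C

def multibyte_chars_hit_separator(encoding):
    """Return True if any non-ASCII BMP character encodes to a separator byte."""
    for cp in range(0x80, 0x10000):
        try:
            encoded = chr(cp).encode(encoding)
        except UnicodeEncodeError:
            continue  # not representable in this encoding (or a surrogate)
        if SEPARATORS & set(encoded):
            return True
    return False

print(multibyte_chars_hit_separator("utf-8"))  # False
print(multibyte_chars_hit_separator("gbk"))    # False
```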


Zheng




-- 
Yours,
Zheng