Posted to user@hive.apache.org by bc Wong <bc...@cloudera.com> on 2010/08/02 05:31:07 UTC

Hive support for latin1

Hi all,

I'm trying to figure out how to query latin1-encoded data in Hive.

I created a file with 256 characters, Unicode code points 0-255,
encoded in latin1, and made a table out of it. But when I do a "select
*", Hive returns the rows for the upper half (128-255) as
'\xef\xbf\xbd', which is the replacement character '\ufffd' encoded
in UTF-8.
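
For readers who want to reproduce the symptom outside Hive, here is a minimal sketch in plain Java. It assumes nothing about Hive internals: the class and method names are made up for illustration. It only shows that decoding a non-ASCII latin1 byte as UTF-8 and re-encoding the result gives the three bytes quoted above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Hive or Hadoop.
final class ReplacementCharDemo {

    // Decode the bytes as UTF-8 and encode the result as UTF-8 again,
    // which is roughly what a UTF-8-only pipeline does to its input.
    static byte[] roundTripAsUtf8(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // Render bytes in the '\xNN' notation used in the message above.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("\\x%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 0xFF is a valid latin1 character but can never appear in UTF-8.
        byte[] upperHalf = {(byte) 0xFF};
        // 0x41 ('A') is plain ASCII and means the same thing in both encodings.
        byte[] ascii = {(byte) 0x41};

        System.out.println(hex(roundTripAsUtf8(upperHalf))); // prints \xef\xbf\xbd
        System.out.println(hex(roundTripAsUtf8(ascii)));     // prints \x41
    }
}
```

The ASCII byte survives because latin1 and UTF-8 agree on 0-127. The information in the upper-half byte is already gone once it has been decoded as UTF-8, so it cannot be recovered from the returned string.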

Does anyone know how to work with non-UTF8 data?

Cheers,
-- 
bc Wong
Cloudera Software Engineer

Re: Hive support for latin1

Posted by Edward Capriolo <ed...@gmail.com>.
On Mon, Aug 2, 2010 at 2:52 AM, Zheng Shao <zs...@gmail.com> wrote:
> It's not possible to do without changing code right now.
>
> Let's open a JIRA for Hive and get that fixed/make it configurable.
>
> Zheng
>
> On Sun, Aug 1, 2010 at 11:41 PM, bc Wong <bc...@cloudera.com> wrote:
>> On Sun, Aug 1, 2010 at 11:14 PM, Zheng Shao <zs...@gmail.com> wrote:
>>> Just change FetchTask.java: public boolean fetch(ArrayList<String> res)
>>>
>>>        res.add(((Text) mSerde.serialize(io.o, io.oi)).toString());
>>>
>>> Instead of using Text.toString(), use your own method to convert from
>>> raw bytes to unicode String.
>>
>> Thanks for the reply, Zheng. Is there another way to do it, outside of
>> changing Hive or writing my own serde?
>>
>> I'm working on adding i18n support for Beeswax
>> <https://issues.cloudera.org/browse/HUE-54>, the Hive UI on HUE. It'd
>> be nice if I can find an easy way for Hive to work with non-UTF8 data.
>> Or do people pretty much only use it on UTF8 data?
>>
>> Cheers,
>> --
>> bc Wong
>> Cloudera Software Engineer
>>
>
>
>
> --
> Yours,
> Zheng
> http://www.linkedin.com/in/zshao
>

There is a similar issue with the Cassandra storage handler. It would
be nice to have a 'byte[]' type in Hive.

Re: Hive support for latin1

Posted by bc Wong <bc...@cloudera.com>.
On Sun, Aug 1, 2010 at 11:52 PM, Zheng Shao <zs...@gmail.com> wrote:
> It's not possible to do without changing code right now.
>
> Let's open a JIRA for Hive and get that fixed/make it configurable.
>
> Zheng

Thanks! Filed <https://issues.apache.org/jira/browse/HIVE-1505>.

Cheers,
-- 
bc Wong
Cloudera Software Engineer

Re: Hive support for latin1

Posted by Zheng Shao <zs...@gmail.com>.
It's not possible to do without changing code right now.

Let's open a JIRA for Hive and get that fixed/make it configurable.

Zheng

On Sun, Aug 1, 2010 at 11:41 PM, bc Wong <bc...@cloudera.com> wrote:
> On Sun, Aug 1, 2010 at 11:14 PM, Zheng Shao <zs...@gmail.com> wrote:
>> Just change FetchTask.java: public boolean fetch(ArrayList<String> res)
>>
>>        res.add(((Text) mSerde.serialize(io.o, io.oi)).toString());
>>
>> Instead of using Text.toString(), use your own method to convert from
>> raw bytes to unicode String.
>
> Thanks for the reply, Zheng. Is there another way to do it, outside of
> changing Hive or writing my own serde?
>
> I'm working on adding i18n support for Beeswax
> <https://issues.cloudera.org/browse/HUE-54>, the Hive UI on HUE. It'd
> be nice if I can find an easy way for Hive to work with non-UTF8 data.
> Or do people pretty much only use it on UTF8 data?
>
> Cheers,
> --
> bc Wong
> Cloudera Software Engineer
>



-- 
Yours,
Zheng
http://www.linkedin.com/in/zshao

Re: Hive support for latin1

Posted by bc Wong <bc...@cloudera.com>.
On Sun, Aug 1, 2010 at 11:14 PM, Zheng Shao <zs...@gmail.com> wrote:
> Just change FetchTask.java: public boolean fetch(ArrayList<String> res)
>
>        res.add(((Text) mSerde.serialize(io.o, io.oi)).toString());
>
> Instead of using Text.toString(), use your own method to convert from
> raw bytes to unicode String.

Thanks for the reply, Zheng. Is there another way to do it, outside of
changing Hive or writing my own serde?

I'm working on adding i18n support for Beeswax
<https://issues.cloudera.org/browse/HUE-54>, the Hive UI on HUE. It'd
be nice if I can find an easy way for Hive to work with non-UTF8 data.
Or do people pretty much only use it on UTF8 data?

Cheers,
-- 
bc Wong
Cloudera Software Engineer

Re: Hive support for latin1

Posted by Zheng Shao <zs...@gmail.com>.
Just change this line in FetchTask.java, inside
public boolean fetch(ArrayList<String> res):

        res.add(((Text) mSerde.serialize(io.o, io.oi)).toString());

Instead of using Text.toString(), use your own method to convert from
raw bytes to unicode String.
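
A minimal sketch of what such a conversion method could look like, in plain Java. The class and method names are hypothetical and this has not been tested against Hive. It relies on the fact that Hadoop's Text exposes its raw bytes through getBytes(), where only the first getLength() bytes are valid, so the helper takes a buffer and a length rather than a whole array.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hive or Hadoop.
final class RawBytesDecoder {

    // Decode the first `length` bytes of `buffer` with the given charset.
    // Bytes past `length` are ignored, since a reused buffer may hold stale data.
    static String decode(byte[] buffer, int length, Charset charset) {
        return new String(buffer, 0, length, charset);
    }

    public static void main(String[] args) {
        // "caf" followed by 0xE9 (e-acute in latin1), plus two stale padding bytes.
        byte[] buffer = {0x63, 0x61, 0x66, (byte) 0xE9, 0, 0};

        String asLatin1 = decode(buffer, 4, StandardCharsets.ISO_8859_1);
        String asUtf8 = decode(buffer, 4, StandardCharsets.UTF_8);

        System.out.println(asLatin1.equals("caf\u00e9")); // prints true
        System.out.println(asUtf8.equals("caf\ufffd"));   // prints true
    }
}
```

In the fetch() line above, the call would then be something like decode(text.getBytes(), text.getLength(), charset) in place of text.toString(), with the charset coming from wherever it is made configurable. That wiring is an assumption, not existing Hive code.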


Zheng

On Sun, Aug 1, 2010 at 8:31 PM, bc Wong <bc...@cloudera.com> wrote:
> Hi all,
>
> I'm trying to figure out how to query Hive on latin1 encoded data.
>
> I created a file with 256 characters, with unicode value 0-255,
> encoded in latin1. I made a table out of it. But when I do a "select
> *", Hive returns the upper ascii rows as '\xef\xbf\xbd', which is the
> replacement character '\ufffd' encoded in UTF-8.
>
> Does anyone know how to work with non-UTF8 data?
>
> Cheers,
> --
> bc Wong
> Cloudera Software Engineer
>



-- 
Yours,
Zheng
http://www.linkedin.com/in/zshao