Posted to user@hive.apache.org by "Severance, Steve" <ss...@ebay.com> on 2011/03/17 23:48:35 UTC

Building Custom RCFiles

Hi,

I am working on building an MR job that generates RCFiles that will become partitions of a Hive table. I have most of it working; however, only strings (Text) are being deserialized inside of Hive. The Hive table is specified to use ColumnarSerDe, which I thought should allow the Writable types stored in the RCFile to be deserialized properly.

Currently all numeric types (IntWritable and LongWritable) come back as null.

Has anyone else seen anything like this, or does anyone have any ideas? I would rather not convert all my data to strings to use RCFile.

Thanks.

Steve
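[Editor's note: one plausible explanation for the nulls, offered here as an assumption rather than anything stated in the thread, is that ColumnarSerDe lazily parses each column's raw bytes as the *text* representation of the value, so columns written as the binary serialization of IntWritable/LongWritable fail to parse and surface as null. A toy Python illustration of the byte-level mismatch (the helper name is hypothetical):]

```python
import struct

def parse_as_hive_would(raw: bytes):
    """Toy model of a text-based lazy serde: treat the column's bytes as
    the decimal text of the value, yielding None (null) on parse failure."""
    try:
        return int(raw.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        return None

# What IntWritable.write() emits: a 4-byte big-endian integer.
binary_payload = struct.pack(">i", 42)
# What a text-based serde expects: the decimal digits as UTF-8 bytes.
text_payload = b"42"

print(parse_as_hive_would(binary_payload))  # binary bytes do not parse -> None
print(parse_as_hive_would(text_payload))    # text bytes parse -> 42
```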

RE: Building Custom RCFiles

Posted by "Severance, Steve" <ss...@ebay.com>.
Got it working using the ColumnarSerDe with the default separators.

Steve
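[Editor's note: for anyone finding this later, a sketch of what the fix amounts to; the helper and row values are illustrative, not from Steve's actual job. Each column value is encoded as its UTF-8 text representation before being handed to the RCFile writer (in the real MR job these byte arrays would populate a BytesRefArrayWritable for RCFileOutputFormat):]

```python
def encode_columns(row):
    """Encode every column value as its UTF-8 text representation,
    which a text-based columnar serde can lazily parse back into
    ints, longs, strings, etc. at read time."""
    return [str(v).encode("utf-8") for v in row]

row = [42, 123456789, "hello"]
cols = encode_columns(row)
assert cols == [b"42", b"123456789", b"hello"]
```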


Re: Building Custom RCFiles

Posted by yongqiang he <he...@gmail.com>.
what's your table definition?

http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Create_Table

See ROW FORMAT


Thanks
Yongqiang
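[Editor's note: a table definition along these lines would be where the row format gets spelled out. This is a hypothetical example (table and column names invented); '\002' and '\003' are octal escapes for Hive's default collection-item and map-key separators.]

```sql
-- Hypothetical RCFile-backed table with an explicit row format.
CREATE TABLE clicks (
  user_id BIGINT,
  page    STRING,
  params  MAP<STRING, STRING>
)
ROW FORMAT DELIMITED
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
STORED AS RCFILE;
```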

RE: Building Custom RCFiles

Posted by "Severance, Steve" <ss...@ebay.com>.
One more question. I have everything working except a Map<String,String>.

I understand that the whole Map will be physically stored as a single Text object in the RCFile.

I have had considerable trouble setting up the delimiters for this Map.

I want to have
	MAP KEYS TERMINATED BY '='
	COLLECTION ITEMS TERMINATED BY '&'

Hive doesn't seem to want to take that. I have also tried using the ASCII octal codes.

What do I need to set up to make this Map work?

Thanks.

Steve 
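[Editor's note: to make concrete what those delimiters would mean on the wire, here is a small sketch (my own illustration, not Hive code) of a MAP<STRING,STRING> column serialized with '&' between entries and '=' between key and value:]

```python
def encode_map(m, item_sep="&", kv_sep="="):
    """Serialize a map column the way a delimited text serde would:
    entries joined by the collection-item separator, each key and value
    joined by the map-key separator."""
    return item_sep.join(f"{k}{kv_sep}{v}" for k, v in m.items())

def decode_map(s, item_sep="&", kv_sep="="):
    """Parse the delimited form back into a dict; empty string -> empty map."""
    if not s:
        return {}
    return dict(entry.split(kv_sep, 1) for entry in s.split(item_sep))

encoded = encode_map({"q": "hive", "page": "2"})
assert decode_map(encoded) == {"q": "hive", "page": "2"}
```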


Re: Building Custom RCFiles

Posted by yongqiang he <he...@gmail.com>.
Yes. It is the same as with normal Hive tables.

thanks
yongqiang

RE: Building Custom RCFiles

Posted by "Severance, Steve" <ss...@ebay.com>.
Thanks Yongqiang.

So for more complex types like map, do I just set up a

ROW FORMAT DELIMITED KEYS TERMINATED BY '|' etc...

Thanks.

Steve


Re: Building Custom RCFiles

Posted by yongqiang he <he...@gmail.com>.
A side note: in Hive, we store all columns as Text internally (even
when the column's type is int, double, etc.). In some experiments,
strings turned out to be more compression-friendly, but they cost CPU
to decode back to the original type.

Thanks
Yongqiang
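[Editor's note: that trade-off can be sanity-checked in a few lines. This is a toy illustration, not a benchmark of Hive itself: decimal text of repetitive values tends to compress well, and the `int()` parse per value is the decode cost being discussed. '\x01' is Hive's default field separator.]

```python
import struct
import zlib

values = [7, 7, 42, 42, 42, 1000] * 100

# Text encoding: decimal digits joined by Hive's default field separator.
text = "\x01".join(str(v) for v in values).encode("utf-8")
# Binary encoding: fixed 8-byte big-endian longs (LongWritable-style).
binary = b"".join(struct.pack(">q", v) for v in values)

compressed_text = zlib.compress(text)
compressed_binary = zlib.compress(binary)
print(len(compressed_text), len(compressed_binary))  # compare for yourself

# Decoding the text form back to ints is the CPU cost of the text scheme.
decoded = [int(tok) for tok in text.decode("utf-8").split("\x01")]
assert decoded == values
```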

Re: Building Custom RCFiles

Posted by yongqiang he <he...@gmail.com>.
You need to customize Hive's ColumnarSerDe (maybe the functions in
LazySerDe): its serialize or deserialize function, depending on whether
you want to write or read. The main thing is that you need to use your
own type definitions (not LazyInt/LazyLong).

If your type is int or long (not double/float), casting it to string
only wastes some CPU, but can save you considerable space.

Thanks
Yongqiang
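[Editor's note: the "lazy" idea behind those classes can be sketched language-agnostically. This is a toy Python model with a hypothetical class name, not Hive's actual LazyInt/LazyLong: keep the column's raw bytes and parse them only when the field is actually accessed, so skipped columns cost nothing.]

```python
class LazyLongColumn:
    """Toy lazy column: store raw bytes, parse on first access only.
    A custom type definition, as suggested above, would swap out the
    parsing logic in get()."""

    def __init__(self, raw: bytes):
        self._raw = raw
        self._value = None
        self._parsed = False

    def get(self):
        if not self._parsed:
            try:
                self._value = int(self._raw.decode("utf-8"))
            except (UnicodeDecodeError, ValueError):
                self._value = None  # parse failures surface as NULL
            self._parsed = True
        return self._value

assert LazyLongColumn(b"123456789").get() == 123456789
assert LazyLongColumn(b"not a number").get() is None
```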