Posted to user@hive.apache.org by Andrey Pankov <ap...@iponweb.net> on 2009/11/05 17:20:24 UTC

key part of sequence files

Hi guys,

We have a lot of data stored in compressed SequenceFiles (SEQ). Since a
SEQ file is a sequence of (key, value) pairs, we store one set of
columns, joined by tabs, in the key part and another set of columns,
joined the same way, in the value part. So our SEQ files are of type
(Text, Text).
Hive does not read such files the way we need: by default it ignores
the key part of each record, while the value part is deserialized into
its set of columns successfully.
Can someone please point me to how to get Hive not to ignore the SEQ key?
Thanks.

-- 
Andrey Pankov

Re: key part of sequence files

Posted by Andrey Pankov <ap...@iponweb.net>.

Thanks Bobby, you saved me a lot of time.

On Thu, Nov 5, 2009 at 20:54, Bobby Rullo <bo...@metaweb.com> wrote:
> Andrey,
>
> Here you go:
>
> http://pastebin.com/m5724ce8a
>
> Bobby



-- 
Andrey Pankov

Re: key part of sequence files

Posted by Bobby Rullo <bo...@metaweb.com>.

Zheng,

Sure, but it is pretty hacky!

Bobby
On Nov 5, 2009, at 12:51 PM, Zheng Shao wrote:

> Hi Bobby,
>
> Can you open a JIRA issue and attach a patch?
> We can put that into contrib.
>
> Zheng
>
> -- 
> Sent from Gmail for mobile | mobile.google.com
>
> Yours,
> Zheng

Re: key part of sequence files

Posted by Zheng Shao <zs...@gmail.com>.

Hi Bobby,

Can you open a JIRA issue and attach a patch?
We can put that into contrib.

Zheng


On 11/5/09, Bobby Rullo <bo...@metaweb.com> wrote:
> Andrey,
>
> Here you go:
>
> http://pastebin.com/m5724ce8a
>
> Bobby

-- 
Sent from Gmail for mobile | mobile.google.com

Yours,
Zheng

Re: key part of sequence files

Posted by Bobby Rullo <bo...@metaweb.com>.

Andrey,

Here you go:

http://pastebin.com/m5724ce8a

Bobby
On Nov 5, 2009, at 8:59 AM, Andrey Pankov wrote:

> Thanks Bobby. Yeah, it would be nice to take a look at your class, just
> to get familiar with it. Could you please post it at pastebin.com? Thanks
> a lot!
>
> -- 
> Andrey Pankov

Re: key part of sequence files

Posted by Andrey Pankov <ap...@iponweb.net>.

Thanks Bobby. Yeah, it would be nice to take a look at your class, just
to get familiar with it. Could you please post it at pastebin.com? Thanks
a lot!

On Thu, Nov 5, 2009 at 18:56, Bobby Rullo <bo...@metaweb.com> wrote:
> I had the exact same question, and Zheng told me I had to implement a new
> FileInputFormat, so I extended SequenceFileInputFormat, and it worked out
> pretty well.
>
> If you like, I can post the source code somewhere (here?), but it was pretty
> easy.
>
> Bobby



-- 
Andrey Pankov

Re: key part of sequence files

Posted by Bobby Rullo <bo...@metaweb.com>.

I had the exact same question, and Zheng told me I had to implement a  
new FileInputFormat, so I extended SequenceFileInputFormat, and it  
worked out pretty well.

If you like, I can post the source code somewhere (here?), but it was  
pretty easy.
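
A minimal sketch of the general idea, for readers of the archive. This is
not the class posted to pastebin in this thread, and it does not use the
Hadoop classes at all: it only illustrates the step a custom record reader
would perform, namely folding the tab-joined key columns into the value so
that a deserializer that looks only at the value sees every column. The
names KeyIntoValueSketch, mergeRow and mergedRows are hypothetical, and
Map.Entry stands in for a (Text, Text) record. In a real implementation
the same join would presumably happen inside a RecordReader returned by a
subclass of SequenceFileInputFormat.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class KeyIntoValueSketch {

    // Join key columns and value columns into one tab-delimited row.
    // The key is always emitted first, so column positions stay stable.
    static String mergeRow(String key, String value) {
        return key + "\t" + value;
    }

    // Hypothetical stand-in for a wrapping record reader: it delegates to
    // the underlying (key, value) source and hands back merged rows only.
    static Iterator<String> mergedRows(final Iterator<Map.Entry<String, String>> records) {
        return new Iterator<String>() {
            @Override
            public boolean hasNext() {
                return records.hasNext();
            }

            @Override
            public String next() {
                Map.Entry<String, String> record = records.next();
                return mergeRow(record.getKey(), record.getValue());
            }
        };
    }

    public static void main(String[] args) {
        // Two made-up records, each with two key columns and two value columns.
        List<Map.Entry<String, String>> records = new ArrayList<Map.Entry<String, String>>();
        records.add(new SimpleEntry<String, String>("k1\tk2", "v1\tv2"));
        records.add(new SimpleEntry<String, String>("k3\tk4", "v3\tv4"));

        Iterator<String> rows = mergedRows(records.iterator());
        while (rows.hasNext()) {
            System.out.println(rows.next());
        }
    }
}
```

With rows merged this way, the Hive table would be declared with the key
columns first, followed by the value columns, all tab-delimited.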

Bobby
On Nov 5, 2009, at 8:20 AM, Andrey Pankov wrote:

> Hi guys,
>
> We have a lot of data stored in compressed SequenceFiles (SEQ). Since a
> SEQ file is a sequence of (key, value) pairs, we store one set of
> columns, joined by tabs, in the key part and another set of columns,
> joined the same way, in the value part. So our SEQ files are of type
> (Text, Text).
> Hive does not read such files the way we need: by default it ignores
> the key part of each record, while the value part is deserialized into
> its set of columns successfully.
> Can someone please point me to how to get Hive not to ignore the SEQ key?
> Thanks.
>
> -- 
> Andrey Pankov