Posted to user@hive.apache.org by Roberto Congiu <ro...@openx.org> on 2009/07/09 01:26:13 UTC

how to write a SerDe

Hi, I am writing a SerDe class to be able to query a proprietary format we
have from Hive.
The format is basically a sequence of records that are binary-encoded maps,
for which we have access libraries.
The file is also gzipped.

From what I understand, I need to:
1 - Write a FileInputFormat class to read the file and extract the individual
records as Writables (but I am not clear how to tell Hive to use this
file format, since all I can use is STORED AS SEQUENCEFILE/TEXTFILE. How do I
plug my format in there?)
2 - Write a SerDe (since I only need to read, I need just the deserializer
part) and an ObjectInspector to let Hive understand how to find a column.
Is there any info around for these, or somebody who's done something similar?
Thanks in advance,
Roberto

Re: how to write a SerDe

Posted by Zheng Shao <zs...@gmail.com>.
Sorry about the delay on this.


Here are several example SerDes that got added to the code base recently:

RegexSerDe: A SerDe for parsing text using a regex (with an example of
parsing Apache logs using a regex)
  https://issues.apache.org/jira/browse/HIVE-167
  contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java
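
To illustrate the idea behind a regex-based deserializer, here is a minimal plain-Java sketch in which each capture group becomes one string column. It does not use Hive's SerDe interfaces and is not the RegexSerDe code itself; the class name RegexRowParser, the simplified log pattern and the sample line are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexRowParser {
    private final Pattern pattern;

    public RegexRowParser(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    /** Returns one string per capture group, or null if the whole line does not match. */
    public List<String> parse(String line) {
        Matcher m = pattern.matcher(line);
        if (!m.matches()) {
            return null;
        }
        List<String> columns = new ArrayList<>(m.groupCount());
        for (int i = 1; i <= m.groupCount(); i++) {
            columns.add(m.group(i));
        }
        return columns;
    }

    public static void main(String[] args) {
        // Simplified pattern: host, identity, user, [time], "request", status, size
        RegexRowParser parser = new RegexRowParser(
            "(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        System.out.println(parser.parse(line));
    }
}
```

Returning null for a non-matching line is just one possible choice in this sketch; a caller could equally skip the line or raise an error.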

BinarySortableSerDe: A SerDe that serializes rows into a binary format
that keeps the relative order of rows.
  https://issues.apache.org/jira/browse/HIVE-553
  serde/src/java/org/apache/hadoop/hive/serde2/binarysortable/BinarySortableSerDe.java
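
What "keeps the relative order" means can be shown with a small plain-Java sketch: a signed int is encoded so that comparing the encoded bytes as unsigned values gives the same result as comparing the numbers. Flipping the sign bit and writing the bytes big-endian is a common way to do this; treat it as an assumption here rather than a description of the exact layout BinarySortableSerDe uses. The class and method names are hypothetical.

```java
import java.util.Arrays;

class SortableIntCodec {
    /** Encodes v as 4 big-endian bytes with the sign bit flipped. */
    public static byte[] encode(int v) {
        int u = v ^ 0x80000000;
        return new byte[] {
            (byte) (u >>> 24), (byte) (u >>> 16), (byte) (u >>> 8), (byte) u
        };
    }

    /** Reverses encode(). */
    public static int decode(byte[] b) {
        int u = ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
              | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
        return u ^ 0x80000000;
    }

    /** Compares byte arrays lexicographically, treating each byte as unsigned. */
    public static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        int[] values = {-5, 3, -100, 0, 42};
        byte[][] encoded = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            encoded[i] = encode(values[i]);
        }
        // Sorting the raw bytes should yield the values in numeric order.
        Arrays.sort(encoded, SortableIntCodec::compareBytes);
        for (byte[] e : encoded) {
            System.out.println(decode(e));
        }
    }
}
```

Without the sign-bit flip, negative numbers would sort after positive ones under unsigned byte comparison, because their top byte has the high bit set.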

We also have a ThriftDeserializer.java, which you can use as an
example for writing a ProtocolBufferSerDe.
I will also try to clean up ThriftDeserializer a bit.


Zheng




-- 
Yours,
Zheng

Re: how to write a SerDe

Posted by Zheng Shao <zs...@gmail.com>.
Hi Kevin,

Yes, I will work on a how-to tutorial on SerDe this week.

One important performance benefit of a Hive SerDe is that it can reuse
the same object to deserialize different rows, which means no new
object needs to be created for each row.
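
A minimal plain-Java sketch of that reuse pattern: the deserializer owns a single row object and overwrites its fields on every call instead of allocating a new row. This is not the Hive SerDe interface; ReusingDeserializer, Row and the "name,count" input format are hypothetical, and the substring call here still allocates a String, which a real implementation could avoid with mutable holders.

```java
class ReusingDeserializer {
    /** A mutable two-column row; one instance is reused for all records. */
    public static final class Row {
        public String name;
        public int count;
    }

    private final Row row = new Row();

    /** Parses "name,count" into the shared row and returns that same instance. */
    public Row deserialize(String record) {
        int comma = record.indexOf(',');
        row.name = record.substring(0, comma);
        row.count = Integer.parseInt(record.substring(comma + 1));
        return row;
    }

    public static void main(String[] args) {
        ReusingDeserializer d = new ReusingDeserializer();
        Row first = d.deserialize("a,1");
        Row second = d.deserialize("b,2");
        // Both references point at the same object, now holding the second record.
        System.out.println(first == second);
        System.out.println(first.name + " " + first.count);
    }
}
```

The trade-off is that a caller who needs to keep a row past the next deserialize call has to copy it first, since the returned object will be overwritten.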

Zheng




-- 
Yours,
Zheng

Re: how to write a SerDe

Posted by Kevin Weil <ke...@gmail.com>.
+1 to Roberto's question... I'd love some more examples here too.  I looked
into writing a protocol buffer SerDe a little while ago (the company I was
working for had data coming in as protobufs, and it seemed silly to convert
every piece to Thrift first) and was underwhelmed by the
documentation/explanations.  FWIW, and maybe to generate a little friendly
competition, I was able to write a Pig LoadFunc to load arbitrary protocol
buffers into Pig tuples without much trouble...
Kevin
