Posted to user@hive.apache.org by Bill Craig <bc...@gmail.com> on 2009/04/13 18:06:33 UTC

SerDe with a binary formatted file.

I am attempting to write a SerDe implementation to load a binary-formatted
file that consists of the following repeating form:

Integer (4 Bytes, length of binary block)
Binary block of data of variable length designated by the preceding
Integer value (This happens to be a protocol buffer).
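
For concreteness, reading one record of this layout in plain Java looks
roughly like the sketch below (I am assuming the length prefix is
big-endian, which is what DataInputStream.readInt expects):

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class RecordDump {
  public static void main(String[] args) throws IOException {
    DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
    try {
      while (in.available() > 0) {
        int len = in.readInt();    // 4-byte length of the next block
        byte[] block = new byte[len];
        in.readFully(block);       // the serialized protocol buffer
        System.out.println("read a record of " + len + " bytes");
      }
    } finally {
      in.close();
    }
  }
}

Getting Hive to drive that loop for me is the part I am missing.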

Deserializing the protocol buffer is fairly straightforward if Hive
hands me a Writable blob of the correct size. However, since the file
is binary, I do not see how to tell Hive where each record ends; there
is no way to specify a “row delimited by” clause. While this problem
involves Protocol Buffers, it should be the same for parsing any binary
file that requires sequential reading. I have been looking into
extending BytesWritable, which would work with a direct Hadoop read,
but I don’t know how to get Hive to read using that class.

I know I could write a Hadoop job to reformat these files, but there
is quite a lot of data and I would like to avoid doing that.

Am I missing something obvious?

Re: SerDe with a binary formatted file.

Posted by Zheng Shao <zs...@gmail.com>.
Hi Bill,

There are two missing pieces of code needed to make Hive read data like
this directly:

1. FileFormat: we need to write a class derived from Hadoop's
FileInputFormat to read this file format. The InputFormat tells Hive how
rows are stored in the file (see the first sketch below).
2. ProtocolBufferSerDe: we need to write a class that implements the SerDe
interface from Hive. The SerDe tells Hive the format of each row (see the
second sketch below).
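
A minimal sketch of the first piece, written against the old
org.apache.hadoop.mapred API that Hive uses, could look like the
following. The class and package names are placeholders, and I am
assuming the 4-byte length prefix is big-endian. Marking the files
non-splitable is the easy way out: the records carry no sync markers,
so a reader cannot start in the middle of a file.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class LengthPrefixedInputFormat
    extends FileInputFormat<LongWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    // No sync markers in the data, so a split cannot start mid-file.
    return false;
  }

  @Override
  public RecordReader<LongWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new LengthPrefixedRecordReader((FileSplit) split, job);
  }

  static class LengthPrefixedRecordReader
      implements RecordReader<LongWritable, BytesWritable> {

    private final FSDataInputStream in;
    private final long start;
    private final long end;
    private long pos;

    LengthPrefixedRecordReader(FileSplit split, JobConf job) throws IOException {
      Path path = split.getPath();
      FileSystem fs = path.getFileSystem(job);
      in = fs.open(path);
      start = split.getStart();
      end = start + split.getLength();
      in.seek(start);
      pos = start;
    }

    public boolean next(LongWritable key, BytesWritable value) throws IOException {
      if (pos >= end) {
        return false;
      }
      int len = in.readInt();     // 4-byte length prefix
      byte[] buf = new byte[len];
      in.readFully(buf);          // the serialized protocol buffer
      key.set(pos);
      value.set(buf, 0, len);
      pos += 4 + len;
      return true;
    }

    public LongWritable createKey() { return new LongWritable(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return pos; }
    public float getProgress() {
      return end == start ? 1.0f : (pos - start) / (float) (end - start);
    }
    public void close() throws IOException { in.close(); }
  }
}

The second piece is the SerDe. Here is a rough shape of the
deserialization side only; everything protobuf-specific in it is
hypothetical (MyMessage stands in for whatever generated protobuf class
you have, and the two columns are just for illustration):

import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.Deserializer;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Writable;

public class ProtocolBufferSerDe implements Deserializer {

  private ObjectInspector inspector;
  private final Object[] row = new Object[2];

  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    // Hard-coded two-column layout for illustration; a real SerDe would
    // derive this from the table properties or the protobuf descriptor.
    List<String> names = Arrays.asList("id", "name");
    List<ObjectInspector> inspectors = Arrays.asList(
        (ObjectInspector) PrimitiveObjectInspectorFactory.javaLongObjectInspector,
        PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    inspector = ObjectInspectorFactory.getStandardStructObjectInspector(names, inspectors);
  }

  public Object deserialize(Writable blob) throws SerDeException {
    // The RecordReader hands us exactly one length-delimited message.
    BytesWritable b = (BytesWritable) blob;
    // Copy only the valid region: the backing array may be longer.
    byte[] bytes = new byte[b.getLength()];
    System.arraycopy(b.getBytes(), 0, bytes, 0, bytes.length);
    try {
      // MyMessage is a placeholder for your generated protobuf class.
      MyMessage msg = MyMessage.parseFrom(bytes);
      row[0] = msg.getId();
      row[1] = msg.getName();
      return row;
    } catch (Exception e) {
      throw new SerDeException(e);
    }
  }

  public ObjectInspector getObjectInspector() throws SerDeException {
    return inspector;
  }
}

With both classes in a jar on Hive's classpath, the table DDL can then
name them with ROW FORMAT SERDE and STORED AS INPUTFORMAT / OUTPUTFORMAT.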

Let us know if you have more questions on this.

Zheng

On Mon, Apr 13, 2009 at 9:06 AM, Bill Craig <bc...@gmail.com> wrote:

> [original message snipped; quoted in full above]
>



-- 
Yours,
Zheng