Posted to user@hive.apache.org by Edward Capriolo <ed...@gmail.com> on 2012/07/14 17:18:25 UTC

[ANN] Hive-protobuf support

Hello all,

My employer, m6d.com, has given the thumbs up to open source our
latest hive tool, hive-protobuf. We created this because we work with
protobuf formats often and wanted to be able to directly log and query
these types without writing one-off User Defined Functions or Input
Formats.

https://github.com/edwardcapriolo/hive-protobuf

Hive-protobuf is much like the new avro support and the already
existing thrift support. Here is how it works:

If you have a sequence file with a serialized protobuf in the key and
a serialized protobuf in the value, a table can be created that
describes the data to Hive. The table only needs to be configured with
the protobuf-generated class names for the key and value, and it turns
the nested classes into nested structs.
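
As a rough illustration of the nested-class-to-nested-struct idea (this
is not the hive-protobuf code itself), the sketch below models a message
schema with plain Python dicts instead of real protobuf descriptors and
renders it as a Hive type string. The PROTO_TO_HIVE table, the hive_type
helper and the "person" schema are all hypothetical names made up for
this example; the actual type mapping used by the project may differ.

```python
# Assumed mapping from protobuf scalar type names to Hive type names.
# This table is illustrative only, not taken from hive-protobuf.
PROTO_TO_HIVE = {
    "string": "string",
    "int32": "int",
    "int64": "bigint",
    "bool": "boolean",
    "float": "float",
    "double": "double",
    "bytes": "binary",
}


def hive_type(field_type):
    """Render a schema node as a Hive type string.

    A dict stands in for a nested message class (-> struct),
    a one-element list stands in for a repeated field (-> array),
    and a plain string is a scalar protobuf type name.
    """
    if isinstance(field_type, dict):
        inner = ",".join(
            f"{name}:{hive_type(t)}" for name, t in field_type.items()
        )
        return f"struct<{inner}>"
    if isinstance(field_type, list):
        return f"array<{hive_type(field_type[0])}>"
    return PROTO_TO_HIVE[field_type]


# Hypothetical message with one nested message and one repeated field.
person = {
    "name": "string",
    "id": "int32",
    "address": {"city": "string", "zip": "int32"},
    "emails": ["string"],
}

print(hive_type(person))
```

Each nested dict becomes a nested struct, which mirrors how a nested
message class would surface as a nested struct column in the table.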

We will eventually migrate the project into core Hive, but we want to
let it incubate on GitHub for a time. (For example, there is no support
for union types at the moment, and there may be other kinks to work
out.) Please check out the project and send pull requests if you have
patches.

Thank you,
Edward

Re: [ANN] Hive-protobuf support

Posted by "kulkarni.swarnim@gmail.com" <ku...@gmail.com>.
Hi Edward,

This project looks really good.

Internally, we have also been working on similar changes, specifically
enhancing the existing Hive/HBase integration to support protobufs/thrifts
stored in HBase. Because of the need to specify explicit column mappings
and a number of issues [1] with getting the existing
ProtocolBuffersObjectInspector working with the latest protobuf 2.4.1, I
decided to write entirely new ObjectInspectors that cleanly deserialize
protobufs and thrifts, using the reflection APIs they provide to perform
deserialization and field extraction.

In short, some of the enhancements are:

1. Support thrift/protobuf stored in HBase using the new ObjectInspectors.
2. Auto-generate the columns and column types from the provided
deserializer class by translating them into nested structs. (HIVE-3211)

Some of this is still in the development/testing phase. Once that is
done, I can put a patch for this enhancement up for review.

[1]
http://mail-archives.apache.org/mod_mbox/hive-user/201205.mbox/%3CCAENxBwxaSOq1=0u+keaj6NG_s8Zh6=rZvLZ4P2YwGe-UQ+jNeQ@mail.gmail.com%3E

Thanks,


-- 
Swarnim