Posted to user@pig.apache.org by Dmitriy Ryaboy <dv...@gmail.com> on 2010/04/12 08:57:16 UTC

Any HbaseStorage users?

Hi folks, I just want to get a show of hands -- does anyone actually use the
current implementation of HBaseStorage in production?

Thanks,
-Dmitriy

Re: Any HbaseStorage users?

Posted by Dan Harvey <da...@mendeley.com>.
I'm doing the same for MySQL dumps in Pig right now, using a custom UDF to
extract the JSON from them, so that makes sense.
I'll try ProtobufBytesToTuple out with Pig here and see how that goes for
us.

Thanks for the help.

On 21 April 2010 23:51, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Dan,
> We have both Thrift and Protobufs in HBase -- with Pig, we simply pick them
> up from HBase as byte arrays, and pass them through a function which then
> interprets the binary data to produce a Pig tuple (via conversion to either
> Thrift or protobuf). By the way, the Thrift converter is also new in my
> branch of Elephant Bird. ProtobufBytesToTuple and the corresponding
> generator are in "official" Elephant Bird already.
>
> A handy trick is that our exporter extracts the whole row into a protobuf,
> but also puts some of the SQL columns into their own columns, so that we
> can filter or build secondary indexes on the rows by column values. This
> does duplicate the data, so we don't do this for all the columns, just the
> ones that will likely need this kind of treatment.
>
> -Dmitriy



-- 
Dan Harvey | Datamining Engineer
www.mendeley.com/profiles/dan-harvey

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015

Re: Any HbaseStorage users?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Dan,
We have both Thrift and Protobufs in HBase -- with Pig, we simply pick them
up from HBase as byte arrays, and pass them through a function which then
interprets the binary data to produce a Pig tuple (via conversion to either
Thrift or protobuf). By the way, the Thrift converter is also new in my
branch of Elephant Bird. ProtobufBytesToTuple and the corresponding generator
are in "official" Elephant Bird already.
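
The byte-array-to-tuple step described above can be sketched roughly as
follows. This is a minimal illustration in Python, not the Elephant Bird
implementation: the fixed binary layout (packed with the standard struct
module) is a hypothetical stand-in for a real protobuf or Thrift message, and
the names encode_record and bytes_to_tuple are made up for this example.

```python
import struct

# Hypothetical record layout standing in for a protobuf/Thrift message:
# 8-byte big-endian id, 4-byte big-endian counter, 2-byte name length,
# followed by the UTF-8 encoded name.
_HEADER = struct.Struct(">QIH")


def encode_record(user_id, followers, name):
    """Serialize one record to bytes, as an exporter might store it in a cell."""
    name_bytes = name.encode("utf-8")
    return _HEADER.pack(user_id, followers, len(name_bytes)) + name_bytes


def bytes_to_tuple(raw):
    """Interpret an opaque cell value as a (user_id, followers, name) tuple."""
    user_id, followers, name_len = _HEADER.unpack_from(raw, 0)
    start = _HEADER.size
    name = raw[start:start + name_len].decode("utf-8")
    return (user_id, followers, name)
```

The loader only ever sees bytes; all knowledge of the record structure lives
in the conversion function, so swapping the serialization format means
swapping that one function.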

A handy trick is that our exporter extracts the whole row into a protobuf,
but also puts some of the SQL columns into their own columns, so that we can
filter or build secondary indexes on the rows by column values. This does
duplicate the data, so we don't do this for all the columns, just the ones
that will likely need this kind of treatment.
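
A rough sketch of that layout, under stated assumptions: JSON stands in for
protobuf serialization, a plain dict stands in for an HBase row, and the
column names ("data:blob", "idx:...") and the functions export_row and
filter_rows are hypothetical, chosen only for illustration.

```python
import json


def export_row(sql_row, indexed_columns):
    """Build the cells for one row from one SQL row (given as a dict)."""
    # The whole row goes into a single blob cell (JSON here, protobuf in
    # the setup described above).
    cells = {"data:blob": json.dumps(sql_row, sort_keys=True).encode("utf-8")}
    # Selected columns are duplicated into their own cells so they can be
    # filtered on, or indexed, without decoding the blob.
    for col in indexed_columns:
        cells["idx:" + col] = str(sql_row[col]).encode("utf-8")
    return cells


def filter_rows(rows, column, value):
    """Select rows by a duplicated column, leaving the blob untouched."""
    wanted = str(value).encode("utf-8")
    return [r for r in rows if r.get("idx:" + column) == wanted]
```

Only the columns listed in indexed_columns pay the storage cost of
duplication, which matches the trade-off described above.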

-Dmitriy

On Wed, Apr 21, 2010 at 3:27 PM, Dan Harvey <da...@mendeley.com> wrote:

> Hey,
>
> I spoke with Kevin today at nosqleu about this so that's well timed!
>
> Sorry I didn't reply earlier about HBaseStorage; we've been evaluating how
> to use Pig with HBase, but we're not using it in production or testing.
>
> I work at Mendeley.com, and we are just moving our work from running Pig
> on MySQL dumps to HBase, so we're trying to figure out how to do the binary
> serialisation in a way that makes it easy to use Pig, which I guessed you
> would be doing at Twitter.
>
> We are currently comparing two options: using multiple columns in HBase
> with Java serialisation, or encoding the whole row with protocol buffers.
> Speaking to Kevin, I understood that you use a mix of both at Twitter but
> only protocol buffers in Pig, which seems different from what this code
> shows?
>
> I've had a quick look over your code and it seems that it supports
> de-serialisation of simple types to Java objects. Do you have code to use
> protocol buffers inside HBase as well?
>
> It would be interesting to see the details of what you are doing here and
> where you are heading, so that we can try to share the work we are doing. I
> would like to get involved with the code that links HBase and Pig, and over
> the next few weeks I will hopefully be able to start doing that more.
>
> Thanks,

Re: Any HbaseStorage users?

Posted by Dan Harvey <da...@mendeley.com>.
Hey,

I spoke with Kevin today at nosqleu about this so that's well timed!

Sorry I didn't reply earlier about HBaseStorage; we've been evaluating how
to use Pig with HBase, but we're not using it in production or testing.

I work at Mendeley.com, and we are just moving our work from running Pig
on MySQL dumps to HBase, so we're trying to figure out how to do the binary
serialisation in a way that makes it easy to use Pig, which I guessed you
would be doing at Twitter.

We are currently comparing two options: using multiple columns in HBase
with Java serialisation, or encoding the whole row with protocol buffers.
Speaking to Kevin, I understood that you use a mix of both at Twitter but
only protocol buffers in Pig, which seems different from what this code
shows?

I've had a quick look over your code and it seems that it supports
de-serialisation of simple types to Java objects. Do you have code to use
protocol buffers inside HBase as well?

It would be interesting to see the details of what you are doing here and
where you are heading, so that we can try to share the work we are doing. I
would like to get involved with the code that links HBase and Pig, and over
the next few weeks I will hopefully be able to start doing that more.

Thanks,

On 21 April 2010 21:36, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> (apparently, the answer to my earlier question is "no one").
>
> I posted a rough draft of our HbaseLoader (based on the existing
> HBaseStorage, but with quite a bit of the guts reworked) to my fork of
> Elephant Bird; it should work with Pig 0.6. It's not very battle-tested
> yet, but is being used for a few jobs here at Twitter, and seems to be
> performing reasonably well: http://github.com/dvryaboy/elephant-bird
>
> Would appreciate any feedback.
>
> It will probably make its way into the main branch of elephant-bird (on
> Kevin Weil's GitHub account) some time this week. The plan is to move this
> into Pig proper once we make it work with 0.7+, and possibly merge with
> Jeff Zhang's work on making it write as well as read. In addition, 0.7
> opens up a few nice possibilities (automatic pushing down of filters and
> projections, for example), which will be good additions.
>
> I would like to replace the current all-string HBaseStorage class with this
> binary-friendly approach in Pig 0.8. Please speak up if you have
> objections.
>
> -Dmitriy



-- 
Dan Harvey | Datamining Engineer
www.mendeley.com/profiles/dan-harvey

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015

Re: Any HbaseStorage users?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
(apparently, the answer to my earlier question is "no one").

I posted a rough draft of our HbaseLoader (based on the existing
HBaseStorage, but with quite a bit of the guts reworked) to my fork of
Elephant Bird; it should work with Pig 0.6. It's not very battle-tested yet,
but is being used for a few jobs here at Twitter, and seems to be performing
reasonably well: http://github.com/dvryaboy/elephant-bird

Would appreciate any feedback.

It will probably make its way into the main branch of elephant-bird (on
Kevin Weil's GitHub account) some time this week. The plan is to move this
into Pig proper once we make it work with 0.7+, and possibly merge with Jeff
Zhang's work on making it write as well as read. In addition, 0.7 opens up a
few nice possibilities (automatic pushing down of filters and projections,
for example), which will be good additions.

I would like to replace the current all-string HBaseStorage class with this
binary-friendly approach in Pig 0.8. Please speak up if you have objections.

-Dmitriy

On Sun, Apr 11, 2010 at 11:57 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Hi folks, I just want to get a show of hands -- does anyone actually use
> the current implementation of HBaseStorage in production?
>
> Thanks,
> -Dmitriy
>