Posted to user@pig.apache.org by Norbert Burger <no...@gmail.com> on 2011/08/15 18:19:38 UTC

push down filters for HbaseStorage

Hi folks,

We have a ~35 GB HBase table that's split across several hundred regions.
I'm using the Pig version bundled with CDH3u1, which is 0.8.1 plus a few
patches.  In particular, it includes PIG-1680.

With the push-down filters from PIG-1680, my expectation was that a LOAD/FILTER
combo like [1] would result in map tasks being created only for the regions
that overlap the requested key space (e.g., greater than '12344323413').
Instead, I see a map task being created for every region in the table.  Was
my assumption off?

FWIW, I see the same results if I use the -gte param to HBaseStorage.

Norbert

[1]
cvps = LOAD 'hbase://cvps' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('data:value','-loadKey') as
(rowkey:chararray, datavalue:chararray);
A = FILTER cvps BY rowkey > '12344323413';
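
[Editor's note] For comparison, a minimal sketch of the -gte form discussed in
this thread, assuming the same table and column as [1]. Note that -gte is an
inclusive bound (>=), whereas the FILTER above uses a strict >, so the two are
not exactly equivalent at the boundary key:

```pig
-- Push the lower bound into the loader via -gte so the scan range is set
-- up front and only regions overlapping [12344323413, end) get map tasks.
cvps = LOAD 'hbase://cvps' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'data:value', '-loadKey -gte 12344323413')
    AS (rowkey:chararray, datavalue:chararray);
```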

Re: push down filters for HbaseStorage

Posted by Bill Graham <bi...@gmail.com>.
Glad to hear you got it working. I don't know of a JIRA for that use case,
but it seems like a valid request so feel free to open one.

I'm not entirely sure, though, whether that could be supported under the
current LoadFunc design, but I'd be curious to hear if others have ideas for
how to do it.



Re: push down filters for HbaseStorage

Posted by Norbert Burger <no...@gmail.com>.
Bill -- thanks for your quick response.  I just tried to put together a
debug log for the -gte case to provide more info, and realized that it WAS
working as advertised (map tasks were created only for overlapping regions).
Sorry for the false alarm.

Out of curiosity, is there an open JIRA to track the FILTER version of
this?  PIG-1205 seems to be an umbrella ticket for all the changes.

Norbert


Re: push down filters for HbaseStorage

Posted by Bill Graham <bi...@gmail.com>.
I don't think the predicate push-down you're showing in [1] is currently
supported, but the -gte param in the constructor definitely is (see
HBaseTableInputFormat and PIG-1205). If that's not working, then it's a
bug. Is there anything helpful in the logs?


