Posted to user@pig.apache.org by Vincent Barat <vi...@gmail.com> on 2011/07/28 12:18:22 UTC

How to load a subset of an HBase table (timestamp based) ?

Hi,

I'd like to make Pig load only a subset of an HBase table, based on
the timestamp of the records or on the row key.

For example, I'd like to load only records that have a timestamp >
N, or a key > "something".

I know that HBase scanners are highly optimized for this kind of
restriction, and using them would greatly reduce the time needed to
load my data.

Is there any way to do this?
If not, is it planned to be added to the HBase loader?
If not, is it technically possible to do it?
If yes, can I contribute and propose a patch for that?

Thanks a lot!

Re: How to load a subset of an HBase table (timestamp based) ?

Posted by Bill Graham <bi...@gmail.com>.
FYI, timestamp-based querying is being handled in
https://issues.apache.org/jira/browse/PIG-2114.


Re: How to load a subset of an HBase table (timestamp based) ?

Posted by Norbert Burger <no...@gmail.com>.
[3] is titled with respect to storage, but if you read through the comments
of [2], Dmitriy mentions that it'll also include querying.

Norbert


Re: How to load a subset of an HBase table (timestamp based) ?

Posted by Vincent Barat <vi...@gmail.com>.
Thanks for the input. [3] is more about timestamp storage, but I
added my 2 cents to the issue concerning loading by timestamp.


Re: How to load a subset of an HBase table (timestamp based) ?

Posted by Norbert Burger <no...@gmail.com>.
You can instruct HBaseStorage to load a subset of the rows using its
"-gt" and "-lt" options, documented here [1].
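
As a minimal sketch of that row-key restriction: the table name
'events', the column family 'cf', the column names, and the key bounds
below are all hypothetical, and the exact option handling may differ
between Pig versions, so check it against the documentation in [1].

```pig
-- Load only the rows whose key is strictly greater than 'row_0100'
-- and strictly less than 'row_0200'. HBase orders row keys as byte
-- strings, so the bounds are compared lexicographically.
events = LOAD 'hbase://events'
         USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
             'cf:user cf:action',            -- columns to fetch (hypothetical)
             '-gt row_0100 -lt row_0200')    -- row-key bounds (hypothetical)
         AS (user:chararray, action:chararray);
```

Because the bounds are handed to the loader, the range is applied by
the HBase scan itself rather than by a FILTER statement after a full
table load.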

I don't believe querying by timestamp is currently supported in Pig, based
on the comments on [2].  A standalone JIRA has been created for it [3].

Norbert

[1]
http://ofps.oreilly.com/titles/9781449302641/community.html#hbase_options_table
[2] https://issues.apache.org/jira/browse/PIG-1782
[3] https://issues.apache.org/jira/browse/PIG-1832
