Posted to user@hbase.apache.org by James Young <br...@gmail.com> on 2012/02/14 18:45:46 UTC

multiple partial scans in the row

Hi there,

I am pretty new to HBase and I am trying to understand the best
practice for doing a scan based on two (or more) partial matches on the
row key.

For example, I have a row key like orderId-timeStamp-item. The
orderId has nothing to do with the timeStamp, and I have a requirement
to scan rows for certain orderIds (a range of orderIds) within a
certain time period. I am not sure if it is possible to perform two
partial scans: one for the orderId and another for the timeStamp.

Also, doing a regular expression on the row key might work, but it
is more expensive, so I am wondering what the best practice would be
for solving such a problem.


Thanks in advance,

James

Re: multiple partial scans in the row

Posted by NNever <nn...@gmail.com>.
Hi James, I'm new to HBase too.
How about this:

With "a range of orderIds", select the first id.
Step 1:   set this id as the startRow, then fetch the closest row (only
fetch one) to find the next orderId that actually exists.
Step 2:   with this fetched orderId, scan with
setStartRow(fetchedId-startTimestamp) and setStopRow(fetchedId-endTimestamp).
Step 3:   use the key just past this fetchedId as the new startRow, then
fetch the closest row again (only fetch one).
Then loop Steps 2 and 1 until you reach the end of the range of ids (a
rough sketch of this loop is below).

I think that without using a Filter this operation will be fast (it relies
only on the lexicographic ordering of the row keys). The only problem is
that there may be too many RPC calls. To solve that, you can use an
Endpoint coprocessor to run those scans on the RegionServer and combine
the results in a single RPC call.
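
A minimal sketch of that loop in Java with the old HTable/Scan client API,
assuming the row key is the string orderId + "-" + timestamp + "-" + item,
fixed-width orderIds so they sort correctly, and a made-up table setup:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class SkipScan {
    // Collect rows for orderIds in [startOrderId, endOrderId] whose key
    // timestamp lies in [startTs, endTs). Keys look like "orderId-ts-item".
    public static List<Result> skipScan(HTable table, String startOrderId,
        String endOrderId, String startTs, String endTs) throws IOException {
      List<Result> out = new ArrayList<Result>();
      String probe = startOrderId;               // where to look for the next id
      while (true) {
        // Step 1/3: fetch a single row at or after "probe" to find the next
        // orderId actually present in the table.
        Scan probeScan = new Scan(Bytes.toBytes(probe),
            Bytes.toBytes(endOrderId + "~"));    // '~' sorts after '-' and digits
        probeScan.setCaching(1);
        ResultScanner ps = table.getScanner(probeScan);
        Result first = ps.next();
        ps.close();
        if (first == null) {
          break;                                 // no more rows in the id range
        }
        String orderId = Bytes.toString(first.getRow()).split("-")[0];

        // Step 2: scan only the time window within this orderId.
        Scan timeScan = new Scan(Bytes.toBytes(orderId + "-" + startTs),
            Bytes.toBytes(orderId + "-" + endTs));
        ResultScanner ts = table.getScanner(timeScan);
        for (Result r : ts) {
          out.add(r);
        }
        ts.close();

        probe = orderId + "~";                   // skip past all rows of this id
      }
      return out;
    }
  }

Each iteration costs two scans (one probe, one time-range scan), which is
where the "too many RPC calls" concern above comes from.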


2012/2/15 James Young <br...@gmail.com>

> Thank you Ian! Yes, the orderIds are ordered.
>
> I might try the timestamp filter, but it still doesn't provide the
> early-out feature, and I'm not sure how it would perform. Do you think
> it might be worth writing a custom filter to do the two partial scans?
>
> Thanks again.
> James
>
> On Wed, Feb 15, 2012 at 2:01 AM, Ian Varley <iv...@salesforce.com>
> wrote:
> > James,
> >
> > Are your orderIds ordered? You say "a range of orderIds", which implies
> that (i.e. they're sequential numbers like 001, 002, etc, not hashes or
> random values). If so, then a single scan can hit the rows for multiple
> contiguous orderIds (you'd set the start and stop rows based on a prefix of
> the row key that's just the length of the orderid).
> >
> > Another question: are the time ranges you're scanning a big or small
> proportion of all the rows for each order id? If you generally expect to
> return a majority of the rows per each order, then a single scan (starting
> with the lowest orderId, and proceeding to the highest) is possibly still a
> good fit. You can also apply timestamp filters (which enables an
> optimization to exclude storefiles that couldn't possibly contain values in
> that timestamp range); that only works if the timestamps on your cells
> match the timestamp in the row key.
> >
> > Alternately, if you expect to return only a small portion of the records
> (i.e. you keep a lot of items with a wide range of timestamps in each
> orderId, but you only want to retrieve a small set of them), you might want
> to do one scan per orderId. You can choose how much parallelism to put into
> it by controlling that yourself (i.e. use a thread per scan on the client
> side); you could theoretically do a thread per order id, but of course, if
> you have a very large number of them, that could be harmful.
> >
> > A regular expression doesn't get you past the fundamental requirement,
> which is that at the server side, it has to look at every row (excepting
> special optimizations like the timestamp one I mentioned above).
> >
> > Your best bet is to implement it a couple ways, with real data, and see
> which ones seem to work the fastest.
> >
> > Ian
> >
> > On Feb 14, 2012, at 11:45 AM, James Young wrote:
> >
> > Hi there,
> >
> > I am pretty new to HBase and I am trying to understand the best
> > practice for doing a scan based on two (or more) partial matches on the
> > row key.
> >
> > For example, I have a row key like orderId-timeStamp-item. The
> > orderId has nothing to do with the timeStamp, and I have a requirement
> > to scan rows for certain orderIds (a range of orderIds) within a
> > certain time period. I am not sure if it is possible to perform two
> > partial scans: one for the orderId and another for the timeStamp.
> >
> > Also, doing a regular expression on the row key might work, but it
> > is more expensive, so I am wondering what the best practice would be
> > for solving such a problem.
> >
> >
> > Thanks in advance,
> >
> > James
> >
>

Re: multiple partial scans in the row

Posted by James Young <br...@gmail.com>.
Thank you Ian! Yes, the orderIds are ordered.

I might try the timestamp filter, but it still doesn't provide the
early-out feature, and I'm not sure how it would perform. Do you think
it might be worth writing a custom filter to do the two partial scans?

Thanks again.
James

On Wed, Feb 15, 2012 at 2:01 AM, Ian Varley <iv...@salesforce.com> wrote:
> James,
>
> Are your orderIds ordered? You say "a range of orderIds", which implies that (i.e. they're sequential numbers like 001, 002, etc, not hashes or random values). If so, then a single scan can hit the rows for multiple contiguous orderIds (you'd set the start and stop rows based on a prefix of the row key that's just the length of the orderid).
>
> Another question: are the time ranges you're scanning a big or small proportion of all the rows for each order id? If you generally expect to return a majority of the rows per each order, then a single scan (starting with the lowest orderId, and proceeding to the highest) is possibly still a good fit. You can also apply timestamp filters (which enables an optimization to exclude storefiles that couldn't possibly contain values in that timestamp range); that only works if the timestamps on your cells match the timestamp in the row key.
>
> Alternately, if you expect to return only a small portion of the records (i.e. you keep a lot of items with a wide range of timestamps in each orderId, but you only want to retrieve a small set of them), you might want to do one scan per orderId. You can choose how much parallelism to put into it by controlling that yourself (i.e. use a thread per scan on the client side); you could theoretically do a thread per order id, but of course, if you have a very large number of them, that could be harmful.
>
> A regular expression doesn't get you past the fundamental requirement, which is that at the server side, it has to look at every row (excepting special optimizations like the timestamp one I mentioned above).
>
> Your best bet is to implement it a couple ways, with real data, and see which ones seem to work the fastest.
>
> Ian
>
> On Feb 14, 2012, at 11:45 AM, James Young wrote:
>
> Hi there,
>
> I am pretty new to HBase and I am trying to understand the best
> practice for doing a scan based on two (or more) partial matches on the
> row key.
>
> For example, I have a row key like orderId-timeStamp-item. The
> orderId has nothing to do with the timeStamp, and I have a requirement
> to scan rows for certain orderIds (a range of orderIds) within a
> certain time period. I am not sure if it is possible to perform two
> partial scans: one for the orderId and another for the timeStamp.
>
> Also, doing a regular expression on the row key might work, but it
> is more expensive, so I am wondering what the best practice would be
> for solving such a problem.
>
>
> Thanks in advance,
>
> James
>

Re: multiple partial scans in the row

Posted by Ian Varley <iv...@salesforce.com>.
James,

Are your orderIds ordered? You say "a range of orderIds", which implies that (i.e. they're sequential numbers like 001, 002, etc, not hashes or random values). If so, then a single scan can hit the rows for multiple contiguous orderIds (you'd set the start and stop rows based on a prefix of the row key that's just the length of the orderid).

Another question: are the time ranges you're scanning a big or small proportion of all the rows for each order id? If you generally expect to return a majority of the rows per each order, then a single scan (starting with the lowest orderId, and proceeding to the highest) is possibly still a good fit. You can also apply timestamp filters (which enables an optimization to exclude storefiles that couldn't possibly contain values in that timestamp range); that only works if the timestamps on your cells match the timestamp in the row key.
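
For example (a minimal sketch, not the only way to do it), assuming
fixed-width string orderIds, cell timestamps that really do match the
timestamp in the row key, and made-up id/time values:

  import java.io.IOException;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class OrderRangeScan {
    // Hypothetical helper: one scan across the whole contiguous orderId
    // range, with a time-range hint so HBase can skip store files that are
    // entirely outside the window.
    public static ResultScanner scanOrderRange(HTable table,
        String firstOrderId, String lastOrderId,
        long startMillis, long endMillis) throws IOException {
      Scan scan = new Scan(Bytes.toBytes(firstOrderId),
          Bytes.toBytes(lastOrderId + "~"));  // '~' sorts just past the last id
      scan.setTimeRange(startMillis, endMillis);
      return table.getScanner(scan);
    }
  }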

Alternately, if you expect to return only a small portion of the records (i.e. you keep a lot of items with a wide range of timestamps in each orderId, but you only want to retrieve a small set of them), you might want to do one scan per orderId. You can choose how much parallelism to put into it by controlling that yourself (i.e. use a thread per scan on the client side); you could theoretically do a thread per order id, but of course, if you have a very large number of them, that could be harmful.
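
A rough sketch of that client-side parallelism (the pool size, table name,
and orderIds collection are hypothetical; startTs/endTs are assumed to be
final Strings in scope; the usual java.util.concurrent and HBase client
imports are omitted; note HTable is not thread-safe, so each task opens
its own handle):

  ExecutorService pool = Executors.newFixedThreadPool(8);
  List<Future<List<Result>>> futures = new ArrayList<Future<List<Result>>>();
  final Configuration conf = HBaseConfiguration.create();
  for (final String orderId : orderIds) {
    futures.add(pool.submit(new Callable<List<Result>>() {
      public List<Result> call() throws IOException {
        HTable table = new HTable(conf, "orders");   // one handle per task
        try {
          // scan only this orderId's time window
          Scan scan = new Scan(Bytes.toBytes(orderId + "-" + startTs),
              Bytes.toBytes(orderId + "-" + endTs));
          List<Result> rows = new ArrayList<Result>();
          ResultScanner scanner = table.getScanner(scan);
          try {
            for (Result r : scanner) {
              rows.add(r);
            }
          } finally {
            scanner.close();
          }
          return rows;
        } finally {
          table.close();
        }
      }
    }));
  }
  // then collect the Futures' results and shut the pool down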

A regular expression doesn't get you past the fundamental requirement, which is that at the server side, it has to look at every row (excepting special optimizations like the timestamp one I mentioned above).

Your best bet is to implement it a couple ways, with real data, and see which ones seem to work the fastest.

Ian

On Feb 14, 2012, at 11:45 AM, James Young wrote:

Hi there,

I am pretty new to HBase and I am trying to understand the best
practice for doing a scan based on two (or more) partial matches on the
row key.

For example, I have a row key like orderId-timeStamp-item. The
orderId has nothing to do with the timeStamp, and I have a requirement
to scan rows for certain orderIds (a range of orderIds) within a
certain time period. I am not sure if it is possible to perform two
partial scans: one for the orderId and another for the timeStamp.

Also, doing a regular expression on the row key might work, but it
is more expensive, so I am wondering what the best practice would be
for solving such a problem.


Thanks in advance,

James