Posted to user@hbase.apache.org by Kenneth Chan <ck...@gmail.com> on 2014/11/07 04:05:34 UTC

range scan based middle of rowkey

Hi,

If my rowkey is:  hash(id) | timestamp
and I want to read all data within a time range based on the timestamp
(for all ids), I can't use a range scan in this case, right?

What would be the best/most efficient way to retrieve all records within a
time range in this case?

Thanks for the help!
Kenneth

Re: range scan based middle of rowkey

Posted by Shahab Yunus <sh...@gmail.com>.

Yes, you have to define the number of buckets beforehand.

As for the number of buckets being equal to the number of regions: it is not
necessary, but if your number of buckets is less than the number of regions
then you will have to make sure yourself that the data is evenly spread. Of
course, this issue only applies if you are concerned about hotspotting to
begin with.
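
To illustrate why the number of buckets has to be fixed up front, here is a
minimal Python sketch. It assumes (this is not stated in the thread) that the
bucket prefix is computed as hash(id) mod N. The names bucket_of,
moved_fraction and bucket_counts are hypothetical helpers, and MD5 is used
only as a convenient stable hash.

```python
import hashlib
from collections import Counter


def bucket_of(record_id: str, num_buckets: int) -> int:
    """Map an id to a bucket number (assumed scheme: stable hash mod N)."""
    digest = hashlib.md5(record_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets


def moved_fraction(ids, old_buckets: int, new_buckets: int) -> float:
    """Fraction of ids whose bucket prefix changes when N changes."""
    moved = sum(
        1 for i in ids if bucket_of(i, old_buckets) != bucket_of(i, new_buckets)
    )
    return moved / len(ids)


def bucket_counts(ids, num_buckets: int) -> Counter:
    """Number of ids per bucket, to check how evenly the data is spread."""
    return Counter(bucket_of(i, num_buckets) for i in ids)


if __name__ == "__main__":
    sample_ids = [f"user{i}" for i in range(1000)]  # made-up ids
    print("rows per bucket with N=4:", dict(bucket_counts(sample_ids, 4)))
    print("fraction moved going from N=4 to N=5:",
          moved_fraction(sample_ids, 4, 5))
```

Because the bucket is part of the stored row key, a key computed with a new N
will generally not match the key a row was written under, which is why
changing N means rewriting the data. Adding region servers does not by itself
change N: HBase regions are split by row key range, so a single bucket prefix
can span several regions.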

Having said all that, my main point in referring you to the article was to
give you an idea of how to perform range scans when data is stored the way
you are storing it. You don't have to follow exactly the whole process
explained in that blog entry. At the end of the day, it all depends on your
use case.

Regards,
Shahab

On Thu, Nov 6, 2014 at 11:31 PM, Kenneth Chan <ck...@gmail.com> wrote:

> Hi, thanks for the link!
> With this approach, does it mean that I must pre-define the number of
> buckets ahead of time? If I add more region servers later on, do I need to
> re-import all the data again?
>
> Thanks!

Re: range scan based middle of rowkey

Posted by Kenneth Chan <ck...@gmail.com>.

Hi, thanks for the link!
With this approach, does it mean that I must pre-define the number of
buckets ahead of time? If I add more region servers later on, do I need to
re-import all the data again?

Thanks!


On Thursday, November 6, 2014, Shahab Yunus <sh...@gmail.com> wrote:

> I think you have to make multiple parallel queries and combine the results
> on the client side. HBaseWD does something like this in its implementation:
>
> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>
> Regards,
> Shahab

Re: range scan based middle of rowkey

Posted by Shahab Yunus <sh...@gmail.com>.

I think you have to make multiple parallel queries and combine the results
on the client side. HBaseWD does something like this in its implementation:
http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
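
A minimal sketch of that idea in Python, using a toy in-memory sorted table
rather than the real HBase client. It assumes a row key of the form
bucket | timestamp | id (a variation on the hash(id) | timestamp key from the
question), with a fixed NUM_BUCKETS. SortedTable, make_rowkey and
scan_time_range are hypothetical names used only for illustration and are not
part of HBase or HBaseWD.

```python
import bisect
import hashlib
from concurrent.futures import ThreadPoolExecutor

NUM_BUCKETS = 4  # assumption: fixed when the table is designed


def bucket_of(record_id: str) -> int:
    """Stable bucket number derived from the id (assumed scheme)."""
    digest = hashlib.md5(record_id.encode("utf-8")).digest()
    return digest[0] % NUM_BUCKETS


def make_rowkey(record_id: str, timestamp: int) -> str:
    """bucket | zero-padded timestamp | id, so keys sort by time per bucket."""
    return f"{bucket_of(record_id):02d}|{timestamp:013d}|{record_id}"


class SortedTable:
    """Toy stand-in for a table whose rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start_key, stop_key):
        """Return rows with start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


def scan_time_range(table, start_ts: int, stop_ts: int):
    """Run one range scan per bucket in parallel and merge on the client."""

    def scan_bucket(bucket):
        start = f"{bucket:02d}|{start_ts:013d}|"
        stop = f"{bucket:02d}|{stop_ts:013d}|"
        return table.scan(start, stop)

    with ThreadPoolExecutor(max_workers=NUM_BUCKETS) as pool:
        per_bucket = list(pool.map(scan_bucket, range(NUM_BUCKETS)))

    merged = [row for rows in per_bucket for row in rows]
    # Drop the bucket prefix when ordering, so results come back by time.
    merged.sort(key=lambda row: row[0].split("|", 1)[1])
    return merged
```

For example, after writing rows with table.put(make_rowkey(some_id, ts), value),
a call such as scan_time_range(table, start_ts, stop_ts) returns every row
with start_ts <= timestamp < stop_ts across all ids, at the cost of
NUM_BUCKETS scans instead of one.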

Regards,
Shahab

On Thu, Nov 6, 2014 at 10:05 PM, Kenneth Chan <ck...@gmail.com> wrote:

> Hi,
>
> If my rowkey is:  hash(id) | timestamp
> and I want to read all data within a time range based on the timestamp
> (for all ids), I can't use a range scan in this case, right?
>
> What would be the best/most efficient way to retrieve all records within a
> time range in this case?
>
> thanks for the help!
> Kenneth
>