Posted to user@hbase.apache.org by Nanheng Wu <na...@gmail.com> on 2011/01/28 05:54:06 UTC

Use loadtable.rb with compressed data?

Hi,

I am using HBase 0.20.6. Is it possible for the loadtable.rb script
to create the table from compressed output? I have an MR job where the
reducer outputs Gzip-compressed HFiles. When I ran loadtable.rb it
didn't complain and seemed to update the .META. table correctly, but
when I tried to query the table no data came back (scans returned zero
rows, etc.). Does anyone know if this is possible? Or, if I must create
tables from compressed HFiles directly, what other options do I have
besides the script? Thanks!

Re: Row Keys

Posted by Dani Rayan <da...@gmail.com>.
In HBase the concept of "column qualifiers" is interesting: they can be
created on the fly for a "column-family", so it is as good as tagging the
data. Hence, you can get all rows belonging to a particular tag/qualifier
using a row scan. I'm not sure if this answers your query.

I know they are always sorted but if they are how do you know which row key
> belong to which data? Currently I use a row key of ID|Date
>



-Thanks,
Dani Rayan.
http://www.cc.gatech.edu/~iar3/

P.S. I missed "column-family" in the previous email.

On Sat, Jan 29, 2011 at 1:07 AM, Dani Rayan <da...@gmail.com> wrote:

> Hey can explain your query with example ?
>
>
> I know they are always sorted but if they are how do you know which row key
>> belong to which data? Currently I use a row key of ID|Date
>>
>
> > I don't clearly understand "which data", there are few things like
> getFamilyMap etc. which allows you to get more info about the table.
>
> In HBase the concept of "column qualifiers" is interesting, it can be
> created on fly for a "column-qualifier" So it is as good as tagging the
> data. Hence, you can get all rows belonging to particular tag/qualifier
> using rowscan. I'm not sure if this answers your query.
>
> -Thanks,
> Dani Rayan.
> http://www.cc.gatech.edu/~iar3/
>
> On Fri, Jan 28, 2011 at 3:45 PM, Peter Haidinyak <ph...@local.com>wrote:
>
>> I know they are always sorted but if they are how do you know which row
>> key belong to which data? Currently I use a row key of ID|Date so I always
>> know what the startrow and endrow should be. I know I'm missing something
>> really fundamental here. :-(
>>
>> Thanks
>>
>> -Pete
>>
>> -----Original Message-----
>> From: tsuna [mailto:tsunanet@gmail.com]
>> Sent: Friday, January 28, 2011 12:14 PM
>> To: user@hbase.apache.org
>> Subject: Re: Row Keys
>>
>> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <ph...@local.com>
>> wrote:
>> >        This is going to seem like a dumb question but it is recommended
>> that you use a random key to spread the insert/read load among your region
>> servers. My question is if I am using a scan with startrow and endrow  how
>> does that work with random row keys?
>>
>> The keys are always sorted.  So if you generate random keys, you'll
>> get your data back in a random order.
>> What is recommended depends on the specific problem you're trying to
>> solve.  But generally, one of the strengths of HBase is that the rows
>> are sorted, so sequential scanning is efficient (thanks to data
>> locality).
>>
>> --
>> Benoit "tsuna" Sigoure
>> Software Engineer @ www.StumbleUpon.com
>>
>
>

Re: Row Keys

Posted by Dani Rayan <da...@gmail.com>.
Hey, can you explain your query with an example?

I know they are always sorted but if they are how do you know which row key
> belong to which data? Currently I use a row key of ID|Date
>

> I don't clearly understand "which data"; there are a few things like
getFamilyMap etc. which allow you to get more info about the table.

In HBase the concept of "column qualifiers" is interesting: they can be
created on the fly for a "column-qualifier", so it is as good as tagging the
data. Hence, you can get all rows belonging to a particular tag/qualifier
using a row scan. I'm not sure if this answers your query.

-Thanks,
Dani Rayan.
http://www.cc.gatech.edu/~iar3/
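
A minimal sketch of the on-the-fly qualifier idea described above (a sketch
only; the table, family, and "tag" names are placeholders, the 0.90-style
client API is assumed, and conf is an existing configuration):

HTable table = new HTable(conf, "mytable");
Put put = new Put(Bytes.toBytes("row1"));
// the qualifier "some-tag" is created on the fly; no schema change needed
put.add(Bytes.toBytes("cf"), Bytes.toBytes("some-tag"), Bytes.toBytes("value"));
table.put(put);

// later, restrict a scan to cells carrying that qualifier/"tag"
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("some-tag"));
ResultScanner rs = table.getScanner(scan);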

On Fri, Jan 28, 2011 at 3:45 PM, Peter Haidinyak <ph...@local.com>wrote:

> I know they are always sorted but if they are how do you know which row key
> belong to which data? Currently I use a row key of ID|Date so I always know
> what the startrow and endrow should be. I know I'm missing something really
> fundamental here. :-(
>
> Thanks
>
> -Pete
>
> -----Original Message-----
> From: tsuna [mailto:tsunanet@gmail.com]
> Sent: Friday, January 28, 2011 12:14 PM
> To: user@hbase.apache.org
> Subject: Re: Row Keys
>
> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <ph...@local.com>
> wrote:
> >        This is going to seem like a dumb question but it is recommended
> that you use a random key to spread the insert/read load among your region
> servers. My question is if I am using a scan with startrow and endrow  how
> does that work with random row keys?
>
> The keys are always sorted.  So if you generate random keys, you'll
> get your data back in a random order.
> What is recommended depends on the specific problem you're trying to
> solve.  But generally, one of the strengths of HBase is that the rows
> are sorted, so sequential scanning is efficient (thanks to data
> locality).
>
> --
> Benoit "tsuna" Sigoure
> Software Engineer @ www.StumbleUpon.com
>

RE: Row Keys

Posted by Peter Haidinyak <ph...@local.com>.
Great stuff, thanks.

-Pete

-----Original Message-----
From: Lars George [mailto:lars.george@gmail.com] 
Sent: Sunday, January 30, 2011 10:07 PM
To: user@hbase.apache.org
Subject: Re: Row Keys

Hi Pete,

Look into the Mozilla Socorro project
(http://code.google.com/p/socorro/) for how to "salt" the keys to get
better load balancing across sequential keys. The principle is to add
a salt, in this case a number reflecting the number of servers
available (or some multiple of that, to allow for growth), and then
prefix the sequential key with it so that writes are spread across all
servers, for example "<salt>-<yyyyMMddhhmm>". When reading, you need to
open N scanners, where N is the number of distinct salt values, scan
each subset with them, and eventually combine the results in client
code. Assuming you want to scan all values in January and you have
salts 0, 1, 2, 3, 4, and 5, you have a scanner for "0-201101010000" to
"0-201102010000", then another for "1-201101010000" to
"1-201102010000", and so on. Then do the scans (multithreaded, for
example) and combine the results client side. The Socorro code shows
one way to implement this.

Lars


On Mon, Jan 31, 2011 at 6:20 AM, Pete Haidinyak <ja...@cox.net> wrote:
> Sorry my id is '<date>|<id>' with date being in the format 'YYYY-MM-DD'
> My start row is '<date>|<id>| ' (with a space ascii 32) and end row is
> '<date>|<id>|~' (tilde character) and this has worked for my data set.
> Unfortunately the key is not distributed very well. That is why I was
> wondering how you do a scan (using start and end row) with a random row key.
>
> Thanks
>
> -Pete
>
> PS. I use <date>|<id> since the id is variable length and this was my first
> attempt. I know have a months worth of data and for my next phase I will
> probably reverse the <date> <id> order since it will work either way.
>
>
> On Sat, 29 Jan 2011 21:50:16 -0800, Ryan Rawson <ry...@gmail.com> wrote:
>
>> Hey,
>>
>> So variable length keys and lexographical sorting makes it a little
>> tricky to do Scans and get exactly what you want.  This has a lot to
>> do with the ascii table too, and the numerical values.  Let consult
>> (http://www.asciitable.com/) while we work this example through:
>>
>> Take a separation character of | as your code uses.  This is decimal
>> 124, placing it way above both the lower and upper case letters AND
>> numbers, that is good.
>>
>> Now you have something like this:
>>
>> 1234|a_string
>> 1234|other_string
>>
>> now we want to find all rows "belonging to" 1234, so we do a start row
>> of '1234|', but what for the end key? Well, let's try... '1234}', that
>> might work, oh wait, here is another key:
>>
>> 12345|foo
>>
>> ok so '5' < '|' so it should short like so:
>> 1234|a_string
>> 1234|other_string
>> 12345|foo
>>
>> hmm well how does our end row compare? well '5' < '}' so '1234}' is
>> still "larger" than '12345|foo' so that row would be incorrectly
>> included in the scan results assuming we only want '1234' related
>> rows.
>>
>> Ok, well maybe a better solution is to pick a lower ascii?  Well
>> outside of the control characters, space is the lowest character at
>> 32, 33 is '!' so perhaps ! would be a better choice.  So you could
>> choose an end double quote as in '1234"' to define your 'stop row'.
>> Now you would be prohibited from using any character smaller than '33'
>> in your strings, which is kind of a non ideal solution.
>>
>> This is all pretty clumsy, and doesnt work great in these variable
>> length separated strings.
>>
>> The ultimate solution is to use the PrefixFilter, which is configured as
>> such:
>> byte[] start_row = Bytes.toBytes("1234|");
>> Scan s = new Scan(start_row);
>> s.setFilter(new PrefixFilter(start_row));
>> // do scan.
>>
>> that way no matter what sortability your separator is, you will get
>> the answer you want every time.
>>
>>
>>
>> Another way to do compound keys is to go pure-binary.  For example I
>> want a key that is 2 integers, so I can do this:
>> int part1 = ... ;
>> int part2 = ... ;
>> byte[] row_key = Bytes.add(Bytes.toBytes(part1), Bytes.toBytes(part2));
>>
>> Now you can also search for all rows starting with 'target' like such:
>> int target = ... ;
>> // start key is 'target', stop key is 'target+1'
>> Scan s = new Scan(Bytes.toBytes(target), Bytes.toBytes(target+1));
>>
>> And you get exactly what you want, nothing more or less (all rows
>> starting with 'target').
>>
>> The lexicographic comparison is very tricky sometimes. One quick tip
>> is that if your numbers (longs, ints) are big endian encoded (all the
>> utilities in Bytes.java do so), then the lexicographic sorting is
>> equal to the numeric sorting.  Otherwise if you do strings you end up
>> with:
>> 1
>> 11
>> 2
>> 3
>>
>> and things are 'out of order'... if that is important, you can pad it
>> with 0s - dont forget to use the proper amount, which is 10 digits for
>> ints, and 19 for longs.  Or consider using binary encoding as above.
>>
>> -ryan
>>
>> On Sat, Jan 29, 2011 at 12:50 AM, Tatsuya Kawano <ta...@gmail.com>
>> wrote:
>>>
>>> Hi Pete,
>>>
>>> You're right. If you use random keys, you will never know the start /
>>> end keys for scan. What you really want to do is to deign the key that
>>> will distribute well for writes but also has the certain locality for
>>> scans.
>>>
>>> You probably have the ideal key already (ID|Date). If you don't make
>>> entire key to be random but just the ID part, you could get a good
>>> distribution at write time because writes for different IDs will be
>>> distributed across the regions, and you also could get a good scan
>>> performance when you scan between certain dates for a specific ID
>>> because rows for the ID will be stored together in one region.
>>>
>>> Thanks,
>>> Tatsuya
>>>
>>>
>>> 2011/1/29 Peter Haidinyak <ph...@local.com>:
>>>>
>>>> I know they are always sorted but if they are how do you know which row
>>>> key belong to which data? Currently I use a row key of ID|Date so I always
>>>> know what the startrow and endrow should be. I know I'm missing something
>>>> really fundamental here. :-(
>>>>
>>>> Thanks
>>>>
>>>> -Pete
>>>>
>>>> -----Original Message-----
>>>> From: tsuna [mailto:tsunanet@gmail.com]
>>>> Sent: Friday, January 28, 2011 12:14 PM
>>>> To: user@hbase.apache.org
>>>> Subject: Re: Row Keys
>>>>
>>>> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <ph...@local.com>
>>>> wrote:
>>>>>
>>>>>       This is going to seem like a dumb question but it is recommended
>>>>> that you use a random key to spread the insert/read load among your region
>>>>> servers. My question is if I am using a scan with startrow and endrow  how
>>>>> does that work with random row keys?
>>>>
>>>> The keys are always sorted.  So if you generate random keys, you'll
>>>> get your data back in a random order.
>>>> What is recommended depends on the specific problem you're trying to
>>>> solve.  But generally, one of the strengths of HBase is that the rows
>>>> are sorted, so sequential scanning is efficient (thanks to data
>>>> locality).
>>>>
>>>> --
>>>> Benoit "tsuna" Sigoure
>>>> Software Engineer @ www.StumbleUpon.com
>>>>
>>>
>>>
>>>
>>> --
>>> 河野 達也
>>> Tatsuya Kawano (Mr.)
>>> Tokyo, Japan
>>>
>>> twitter: http://twitter.com/tatsuya6502
>>>
>
>

Re: Row Keys

Posted by Pete Haidinyak <ja...@cox.net>.
I want to do a scan of a subset of the data using startrow and endrow. If
the keys are random I can't set a startrow/endrow, as far as I know. If I
reverse the order of <date>|<id> for the row key I will get a better
distribution. Unfortunately, a large share of the data comes from just two
IDs.

-Pete

On Sun, 30 Jan 2011 22:10:07 -0800, Ryan Rawson <ry...@gmail.com> wrote:

> Hey,
>
> I don't understand the 'random scan' question... if you want to scan a
> random key, just scan! For example:
>
> byte [] random_key = generateRandomKeyUsingRandomNumberGenerator();
> Scan s = new Scan(random_key);
>
> But you must mean something else... perhaps you could illuminate me?
>
> -ryan
>
> On Sun, Jan 30, 2011 at 10:06 PM, Lars George <la...@gmail.com>  
> wrote:
>> Hi Pete,
>>
>> Look into the Mozilla Socorro project
>> (http://code.google.com/p/socorro/) for how to "salt" the keys to get
>> better load balancing across sequential keys. The principle is to add
>> a salt, in this case a number reflecting the number of servers
>> available (some multiple of that to allow for growth) and then prefix
>> the sequential key with it so that writes are spread across all
>> servers. For example "<salt>-<yyyyMMddhhmm>". When reading you need to
>> open N scanners where N is the number of distinct salt values and scan
>> each subset with them while eventually combining the result in client
>> code. Assuming you want to scan all values in January and you have a
>> salt 0, 1, 2, 3, 4, and 5 you have scanner for "0-201101010000" to
>> "0-201102010000", then another for "1-201101010000" to
>> "1-201102010000" and so on. Then do the scans (multithreaded for
>> example) and combine the results client side. The Socorro code shows
>> one way to implement this.
>>
>> Lars
>>
>>
>> On Mon, Jan 31, 2011 at 6:20 AM, Pete Haidinyak <ja...@cox.net>  
>> wrote:
>>> Sorry my id is '<date>|<id>' with date being in the format 'YYYY-MM-DD'
>>> My start row is '<date>|<id>| ' (with a space ascii 32) and end row is
>>> '<date>|<id>|~' (tilde character) and this has worked for my data set.
>>> Unfortunately the key is not distributed very well. That is why I was
>>> wondering how you do a scan (using start and end row) with a random  
>>> row key.
>>>
>>> Thanks
>>>
>>> -Pete
>>>
>>> PS. I use <date>|<id> since the id is variable length and this was my  
>>> first
>>> attempt. I know have a months worth of data and for my next phase I  
>>> will
>>> probably reverse the <date> <id> order since it will work either way.
>>>
>>>
>>> On Sat, 29 Jan 2011 21:50:16 -0800, Ryan Rawson <ry...@gmail.com>  
>>> wrote:
>>>
>>>> Hey,
>>>>
>>>> So variable length keys and lexographical sorting makes it a little
>>>> tricky to do Scans and get exactly what you want.  This has a lot to
>>>> do with the ascii table too, and the numerical values.  Let consult
>>>> (http://www.asciitable.com/) while we work this example through:
>>>>
>>>> Take a separation character of | as your code uses.  This is decimal
>>>> 124, placing it way above both the lower and upper case letters AND
>>>> numbers, that is good.
>>>>
>>>> Now you have something like this:
>>>>
>>>> 1234|a_string
>>>> 1234|other_string
>>>>
>>>> now we want to find all rows "belonging to" 1234, so we do a start row
>>>> of '1234|', but what for the end key? Well, let's try... '1234}', that
>>>> might work, oh wait, here is another key:
>>>>
>>>> 12345|foo
>>>>
>>>> ok so '5' < '|' so it should short like so:
>>>> 1234|a_string
>>>> 1234|other_string
>>>> 12345|foo
>>>>
>>>> hmm well how does our end row compare? well '5' < '}' so '1234}' is
>>>> still "larger" than '12345|foo' so that row would be incorrectly
>>>> included in the scan results assuming we only want '1234' related
>>>> rows.
>>>>
>>>> Ok, well maybe a better solution is to pick a lower ascii?  Well
>>>> outside of the control characters, space is the lowest character at
>>>> 32, 33 is '!' so perhaps ! would be a better choice.  So you could
>>>> choose an end double quote as in '1234"' to define your 'stop row'.
>>>> Now you would be prohibited from using any character smaller than '33'
>>>> in your strings, which is kind of a non ideal solution.
>>>>
>>>> This is all pretty clumsy, and doesnt work great in these variable
>>>> length separated strings.
>>>>
>>>> The ultimate solution is to use the PrefixFilter, which is configured  
>>>> as
>>>> such:
>>>> byte[] start_row = Bytes.toBytes("1234|");
>>>> Scan s = new Scan(start_row);
>>>> s.setFilter(new PrefixFilter(start_row));
>>>> // do scan.
>>>>
>>>> that way no matter what sortability your separator is, you will get
>>>> the answer you want every time.
>>>>
>>>>
>>>>
>>>> Another way to do compound keys is to go pure-binary.  For example I
>>>> want a key that is 2 integers, so I can do this:
>>>> int part1 = ... ;
>>>> int part2 = ... ;
>>>> byte[] row_key = Bytes.add(Bytes.toBytes(part1),  
>>>> Bytes.toBytes(part2));
>>>>
>>>> Now you can also search for all rows starting with 'target' like such:
>>>> int target = ... ;
>>>> // start key is 'target', stop key is 'target+1'
>>>> Scan s = new Scan(Bytes.toBytes(target), Bytes.toBytes(target+1));
>>>>
>>>> And you get exactly what you want, nothing more or less (all rows
>>>> starting with 'target').
>>>>
>>>> The lexicographic comparison is very tricky sometimes. One quick tip
>>>> is that if your numbers (longs, ints) are big endian encoded (all the
>>>> utilities in Bytes.java do so), then the lexicographic sorting is
>>>> equal to the numeric sorting.  Otherwise if you do strings you end up
>>>> with:
>>>> 1
>>>> 11
>>>> 2
>>>> 3
>>>>
>>>> and things are 'out of order'... if that is important, you can pad it
>>>> with 0s - dont forget to use the proper amount, which is 10 digits for
>>>> ints, and 19 for longs.  Or consider using binary encoding as above.
>>>>
>>>> -ryan
>>>>
>>>> On Sat, Jan 29, 2011 at 12:50 AM, Tatsuya Kawano  
>>>> <ta...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Hi Pete,
>>>>>
>>>>> You're right. If you use random keys, you will never know the start /
>>>>> end keys for scan. What you really want to do is to deign the key  
>>>>> that
>>>>> will distribute well for writes but also has the certain locality for
>>>>> scans.
>>>>>
>>>>> You probably have the ideal key already (ID|Date). If you don't make
>>>>> entire key to be random but just the ID part, you could get a good
>>>>> distribution at write time because writes for different IDs will be
>>>>> distributed across the regions, and you also could get a good scan
>>>>> performance when you scan between certain dates for a specific ID
>>>>> because rows for the ID will be stored together in one region.
>>>>>
>>>>> Thanks,
>>>>> Tatsuya
>>>>>
>>>>>
>>>>> 2011/1/29 Peter Haidinyak <ph...@local.com>:
>>>>>>
>>>>>> I know they are always sorted but if they are how do you know which  
>>>>>> row
>>>>>> key belong to which data? Currently I use a row key of ID|Date so I  
>>>>>> always
>>>>>> know what the startrow and endrow should be. I know I'm missing  
>>>>>> something
>>>>>> really fundamental here. :-(
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> -Pete
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: tsuna [mailto:tsunanet@gmail.com]
>>>>>> Sent: Friday, January 28, 2011 12:14 PM
>>>>>> To: user@hbase.apache.org
>>>>>> Subject: Re: Row Keys
>>>>>>
>>>>>> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak  
>>>>>> <ph...@local.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>       This is going to seem like a dumb question but it is  
>>>>>>> recommended
>>>>>>> that you use a random key to spread the insert/read load among  
>>>>>>> your region
>>>>>>> servers. My question is if I am using a scan with startrow and  
>>>>>>> endrow  how
>>>>>>> does that work with random row keys?
>>>>>>
>>>>>> The keys are always sorted.  So if you generate random keys, you'll
>>>>>> get your data back in a random order.
>>>>>> What is recommended depends on the specific problem you're trying to
>>>>>> solve.  But generally, one of the strengths of HBase is that the  
>>>>>> rows
>>>>>> are sorted, so sequential scanning is efficient (thanks to data
>>>>>> locality).
>>>>>>
>>>>>> --
>>>>>> Benoit "tsuna" Sigoure
>>>>>> Software Engineer @ www.StumbleUpon.com
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> 河野 達也
>>>>> Tatsuya Kawano (Mr.)
>>>>> Tokyo, Japan
>>>>>
>>>>> twitter: http://twitter.com/tatsuya6502
>>>>>
>>>
>>>
>>


Re: Row Keys

Posted by Ryan Rawson <ry...@gmail.com>.
Hey,

I don't understand the 'random scan' question... if you want to scan a
random key, just scan! For example:

byte [] random_key = generateRandomKeyUsingRandomNumberGenerator();
Scan s = new Scan(random_key);

But you must mean something else... perhaps you could illuminate me?

-ryan

On Sun, Jan 30, 2011 at 10:06 PM, Lars George <la...@gmail.com> wrote:
> Hi Pete,
>
> Look into the Mozilla Socorro project
> (http://code.google.com/p/socorro/) for how to "salt" the keys to get
> better load balancing across sequential keys. The principle is to add
> a salt, in this case a number reflecting the number of servers
> available (some multiple of that to allow for growth) and then prefix
> the sequential key with it so that writes are spread across all
> servers. For example "<salt>-<yyyyMMddhhmm>". When reading you need to
> open N scanners where N is the number of distinct salt values and scan
> each subset with them while eventually combining the result in client
> code. Assuming you want to scan all values in January and you have a
> salt 0, 1, 2, 3, 4, and 5 you have scanner for "0-201101010000" to
> "0-201102010000", then another for "1-201101010000" to
> "1-201102010000" and so on. Then do the scans (multithreaded for
> example) and combine the results client side. The Socorro code shows
> one way to implement this.
>
> Lars
>
>
> On Mon, Jan 31, 2011 at 6:20 AM, Pete Haidinyak <ja...@cox.net> wrote:
>> Sorry my id is '<date>|<id>' with date being in the format 'YYYY-MM-DD'
>> My start row is '<date>|<id>| ' (with a space ascii 32) and end row is
>> '<date>|<id>|~' (tilde character) and this has worked for my data set.
>> Unfortunately the key is not distributed very well. That is why I was
>> wondering how you do a scan (using start and end row) with a random row key.
>>
>> Thanks
>>
>> -Pete
>>
>> PS. I use <date>|<id> since the id is variable length and this was my first
>> attempt. I know have a months worth of data and for my next phase I will
>> probably reverse the <date> <id> order since it will work either way.
>>
>>
>> On Sat, 29 Jan 2011 21:50:16 -0800, Ryan Rawson <ry...@gmail.com> wrote:
>>
>>> Hey,
>>>
>>> So variable length keys and lexographical sorting makes it a little
>>> tricky to do Scans and get exactly what you want.  This has a lot to
>>> do with the ascii table too, and the numerical values.  Let consult
>>> (http://www.asciitable.com/) while we work this example through:
>>>
>>> Take a separation character of | as your code uses.  This is decimal
>>> 124, placing it way above both the lower and upper case letters AND
>>> numbers, that is good.
>>>
>>> Now you have something like this:
>>>
>>> 1234|a_string
>>> 1234|other_string
>>>
>>> now we want to find all rows "belonging to" 1234, so we do a start row
>>> of '1234|', but what for the end key? Well, let's try... '1234}', that
>>> might work, oh wait, here is another key:
>>>
>>> 12345|foo
>>>
>>> ok so '5' < '|' so it should short like so:
>>> 1234|a_string
>>> 1234|other_string
>>> 12345|foo
>>>
>>> hmm well how does our end row compare? well '5' < '}' so '1234}' is
>>> still "larger" than '12345|foo' so that row would be incorrectly
>>> included in the scan results assuming we only want '1234' related
>>> rows.
>>>
>>> Ok, well maybe a better solution is to pick a lower ascii?  Well
>>> outside of the control characters, space is the lowest character at
>>> 32, 33 is '!' so perhaps ! would be a better choice.  So you could
>>> choose an end double quote as in '1234"' to define your 'stop row'.
>>> Now you would be prohibited from using any character smaller than '33'
>>> in your strings, which is kind of a non ideal solution.
>>>
>>> This is all pretty clumsy, and doesnt work great in these variable
>>> length separated strings.
>>>
>>> The ultimate solution is to use the PrefixFilter, which is configured as
>>> such:
>>> byte[] start_row = Bytes.toBytes("1234|");
>>> Scan s = new Scan(start_row);
>>> s.setFilter(new PrefixFilter(start_row));
>>> // do scan.
>>>
>>> that way no matter what sortability your separator is, you will get
>>> the answer you want every time.
>>>
>>>
>>>
>>> Another way to do compound keys is to go pure-binary.  For example I
>>> want a key that is 2 integers, so I can do this:
>>> int part1 = ... ;
>>> int part2 = ... ;
>>> byte[] row_key = Bytes.add(Bytes.toBytes(part1), Bytes.toBytes(part2));
>>>
>>> Now you can also search for all rows starting with 'target' like such:
>>> int target = ... ;
>>> // start key is 'target', stop key is 'target+1'
>>> Scan s = new Scan(Bytes.toBytes(target), Bytes.toBytes(target+1));
>>>
>>> And you get exactly what you want, nothing more or less (all rows
>>> starting with 'target').
>>>
>>> The lexicographic comparison is very tricky sometimes. One quick tip
>>> is that if your numbers (longs, ints) are big endian encoded (all the
>>> utilities in Bytes.java do so), then the lexicographic sorting is
>>> equal to the numeric sorting.  Otherwise if you do strings you end up
>>> with:
>>> 1
>>> 11
>>> 2
>>> 3
>>>
>>> and things are 'out of order'... if that is important, you can pad it
>>> with 0s - dont forget to use the proper amount, which is 10 digits for
>>> ints, and 19 for longs.  Or consider using binary encoding as above.
>>>
>>> -ryan
>>>
>>> On Sat, Jan 29, 2011 at 12:50 AM, Tatsuya Kawano <ta...@gmail.com>
>>> wrote:
>>>>
>>>> Hi Pete,
>>>>
>>>> You're right. If you use random keys, you will never know the start /
>>>> end keys for scan. What you really want to do is to deign the key that
>>>> will distribute well for writes but also has the certain locality for
>>>> scans.
>>>>
>>>> You probably have the ideal key already (ID|Date). If you don't make
>>>> entire key to be random but just the ID part, you could get a good
>>>> distribution at write time because writes for different IDs will be
>>>> distributed across the regions, and you also could get a good scan
>>>> performance when you scan between certain dates for a specific ID
>>>> because rows for the ID will be stored together in one region.
>>>>
>>>> Thanks,
>>>> Tatsuya
>>>>
>>>>
>>>> 2011/1/29 Peter Haidinyak <ph...@local.com>:
>>>>>
>>>>> I know they are always sorted but if they are how do you know which row
>>>>> key belong to which data? Currently I use a row key of ID|Date so I always
>>>>> know what the startrow and endrow should be. I know I'm missing something
>>>>> really fundamental here. :-(
>>>>>
>>>>> Thanks
>>>>>
>>>>> -Pete
>>>>>
>>>>> -----Original Message-----
>>>>> From: tsuna [mailto:tsunanet@gmail.com]
>>>>> Sent: Friday, January 28, 2011 12:14 PM
>>>>> To: user@hbase.apache.org
>>>>> Subject: Re: Row Keys
>>>>>
>>>>> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <ph...@local.com>
>>>>> wrote:
>>>>>>
>>>>>>       This is going to seem like a dumb question but it is recommended
>>>>>> that you use a random key to spread the insert/read load among your region
>>>>>> servers. My question is if I am using a scan with startrow and endrow  how
>>>>>> does that work with random row keys?
>>>>>
>>>>> The keys are always sorted.  So if you generate random keys, you'll
>>>>> get your data back in a random order.
>>>>> What is recommended depends on the specific problem you're trying to
>>>>> solve.  But generally, one of the strengths of HBase is that the rows
>>>>> are sorted, so sequential scanning is efficient (thanks to data
>>>>> locality).
>>>>>
>>>>> --
>>>>> Benoit "tsuna" Sigoure
>>>>> Software Engineer @ www.StumbleUpon.com
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> 河野 達也
>>>> Tatsuya Kawano (Mr.)
>>>> Tokyo, Japan
>>>>
>>>> twitter: http://twitter.com/tatsuya6502
>>>>
>>
>>
>

Re: Row Keys

Posted by Lars George <la...@gmail.com>.
Hi Pete,

Look into the Mozilla Socorro project
(http://code.google.com/p/socorro/) for how to "salt" the keys to get
better load balancing across sequential keys. The principle is to add
a salt, in this case a number reflecting the number of servers
available (or some multiple of that, to allow for growth), and then
prefix the sequential key with it so that writes are spread across all
servers, for example "<salt>-<yyyyMMddhhmm>". When reading, you need to
open N scanners, where N is the number of distinct salt values, scan
each subset with them, and eventually combine the results in client
code. Assuming you want to scan all values in January and you have
salts 0, 1, 2, 3, 4, and 5, you have a scanner for "0-201101010000" to
"0-201102010000", then another for "1-201101010000" to
"1-201102010000", and so on. Then do the scans (multithreaded, for
example) and combine the results client side. The Socorro code shows
one way to implement this.

Lars
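
A minimal sketch of that salted-scan pattern (a sketch only; the salt
count, table name, and key range are placeholders, the 0.90-style client
API is assumed, and conf is an existing configuration):

int numSalts = 6;                                  // distinct salt values
HTable table = new HTable(conf, "mytable");
List<Result> combined = new ArrayList<Result>();
for (int salt = 0; salt < numSalts; salt++) {
  byte[] start = Bytes.toBytes(salt + "-201101010000");
  byte[] stop  = Bytes.toBytes(salt + "-201102010000");
  ResultScanner scanner = table.getScanner(new Scan(start, stop));
  try {
    for (Result r : scanner) {
      combined.add(r);                             // combine results client side
    }
  } finally {
    scanner.close();
  }
}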


On Mon, Jan 31, 2011 at 6:20 AM, Pete Haidinyak <ja...@cox.net> wrote:
> Sorry my id is '<date>|<id>' with date being in the format 'YYYY-MM-DD'
> My start row is '<date>|<id>| ' (with a space ascii 32) and end row is
> '<date>|<id>|~' (tilde character) and this has worked for my data set.
> Unfortunately the key is not distributed very well. That is why I was
> wondering how you do a scan (using start and end row) with a random row key.
>
> Thanks
>
> -Pete
>
> PS. I use <date>|<id> since the id is variable length and this was my first
> attempt. I know have a months worth of data and for my next phase I will
> probably reverse the <date> <id> order since it will work either way.
>
>
> On Sat, 29 Jan 2011 21:50:16 -0800, Ryan Rawson <ry...@gmail.com> wrote:
>
>> Hey,
>>
>> So variable length keys and lexographical sorting makes it a little
>> tricky to do Scans and get exactly what you want.  This has a lot to
>> do with the ascii table too, and the numerical values.  Let consult
>> (http://www.asciitable.com/) while we work this example through:
>>
>> Take a separation character of | as your code uses.  This is decimal
>> 124, placing it way above both the lower and upper case letters AND
>> numbers, that is good.
>>
>> Now you have something like this:
>>
>> 1234|a_string
>> 1234|other_string
>>
>> now we want to find all rows "belonging to" 1234, so we do a start row
>> of '1234|', but what for the end key? Well, let's try... '1234}', that
>> might work, oh wait, here is another key:
>>
>> 12345|foo
>>
>> ok so '5' < '|' so it should short like so:
>> 1234|a_string
>> 1234|other_string
>> 12345|foo
>>
>> hmm well how does our end row compare? well '5' < '}' so '1234}' is
>> still "larger" than '12345|foo' so that row would be incorrectly
>> included in the scan results assuming we only want '1234' related
>> rows.
>>
>> Ok, well maybe a better solution is to pick a lower ascii?  Well
>> outside of the control characters, space is the lowest character at
>> 32, 33 is '!' so perhaps ! would be a better choice.  So you could
>> choose an end double quote as in '1234"' to define your 'stop row'.
>> Now you would be prohibited from using any character smaller than '33'
>> in your strings, which is kind of a non ideal solution.
>>
>> This is all pretty clumsy, and doesnt work great in these variable
>> length separated strings.
>>
>> The ultimate solution is to use the PrefixFilter, which is configured as
>> such:
>> byte[] start_row = Bytes.toBytes("1234|");
>> Scan s = new Scan(start_row);
>> s.setFilter(new PrefixFilter(start_row));
>> // do scan.
>>
>> that way no matter what sortability your separator is, you will get
>> the answer you want every time.
>>
>>
>>
>> Another way to do compound keys is to go pure-binary.  For example I
>> want a key that is 2 integers, so I can do this:
>> int part1 = ... ;
>> int part2 = ... ;
>> byte[] row_key = Bytes.add(Bytes.toBytes(part1), Bytes.toBytes(part2));
>>
>> Now you can also search for all rows starting with 'target' like such:
>> int target = ... ;
>> // start key is 'target', stop key is 'target+1'
>> Scan s = new Scan(Bytes.toBytes(target), Bytes.toBytes(target+1));
>>
>> And you get exactly what you want, nothing more or less (all rows
>> starting with 'target').
>>
>> The lexicographic comparison is very tricky sometimes. One quick tip
>> is that if your numbers (longs, ints) are big endian encoded (all the
>> utilities in Bytes.java do so), then the lexicographic sorting is
>> equal to the numeric sorting.  Otherwise if you do strings you end up
>> with:
>> 1
>> 11
>> 2
>> 3
>>
>> and things are 'out of order'... if that is important, you can pad it
>> with 0s - dont forget to use the proper amount, which is 10 digits for
>> ints, and 19 for longs.  Or consider using binary encoding as above.
>>
>> -ryan
>>
>> On Sat, Jan 29, 2011 at 12:50 AM, Tatsuya Kawano <ta...@gmail.com>
>> wrote:
>>>
>>> Hi Pete,
>>>
>>> You're right. If you use random keys, you will never know the start /
>>> end keys for scan. What you really want to do is to deign the key that
>>> will distribute well for writes but also has the certain locality for
>>> scans.
>>>
>>> You probably have the ideal key already (ID|Date). If you don't make
>>> entire key to be random but just the ID part, you could get a good
>>> distribution at write time because writes for different IDs will be
>>> distributed across the regions, and you also could get a good scan
>>> performance when you scan between certain dates for a specific ID
>>> because rows for the ID will be stored together in one region.
>>>
>>> Thanks,
>>> Tatsuya
>>>
>>>
>>> 2011/1/29 Peter Haidinyak <ph...@local.com>:
>>>>
>>>> I know they are always sorted but if they are how do you know which row
>>>> key belong to which data? Currently I use a row key of ID|Date so I always
>>>> know what the startrow and endrow should be. I know I'm missing something
>>>> really fundamental here. :-(
>>>>
>>>> Thanks
>>>>
>>>> -Pete
>>>>
>>>> -----Original Message-----
>>>> From: tsuna [mailto:tsunanet@gmail.com]
>>>> Sent: Friday, January 28, 2011 12:14 PM
>>>> To: user@hbase.apache.org
>>>> Subject: Re: Row Keys
>>>>
>>>> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <ph...@local.com>
>>>> wrote:
>>>>>
>>>>>       This is going to seem like a dumb question but it is recommended
>>>>> that you use a random key to spread the insert/read load among your region
>>>>> servers. My question is if I am using a scan with startrow and endrow  how
>>>>> does that work with random row keys?
>>>>
>>>> The keys are always sorted.  So if you generate random keys, you'll
>>>> get your data back in a random order.
>>>> What is recommended depends on the specific problem you're trying to
>>>> solve.  But generally, one of the strengths of HBase is that the rows
>>>> are sorted, so sequential scanning is efficient (thanks to data
>>>> locality).
>>>>
>>>> --
>>>> Benoit "tsuna" Sigoure
>>>> Software Engineer @ www.StumbleUpon.com
>>>>
>>>
>>>
>>>
>>> --
>>> 河野 達也
>>> Tatsuya Kawano (Mr.)
>>> Tokyo, Japan
>>>
>>> twitter: http://twitter.com/tatsuya6502
>>>
>
>

Re: Row Keys

Posted by Pete Haidinyak <ja...@cox.net>.
Sorry, my key is '<date>|<id>', with the date in the format 'YYYY-MM-DD'.
My start row is '<date>|<id>| ' (with a space, ASCII 32) and my end row is
'<date>|<id>|~' (a tilde, ASCII 126), and this has worked for my data set.
Unfortunately the key is not distributed very well. That is why I was
wondering how you do a scan (using start and end row) with a random row
key.

Thanks

-Pete

PS. I use <date>|<id> since the id is variable length and this was my
first attempt. I now have a month's worth of data, and for my next phase I
will probably reverse the <date> and <id> order, since it will work either
way.
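
For reference, a minimal sketch of the scan bounds described above (the
date and id values are placeholders):

String date = "2011-01-28";
String id = "1234";
byte[] startRow = Bytes.toBytes(date + "|" + id + "| ");  // space, ASCII 32
byte[] stopRow  = Bytes.toBytes(date + "|" + id + "|~");  // tilde, ASCII 126
Scan s = new Scan(startRow, stopRow);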


On Sat, 29 Jan 2011 21:50:16 -0800, Ryan Rawson <ry...@gmail.com> wrote:

> Hey,
>
> So variable length keys and lexographical sorting makes it a little
> tricky to do Scans and get exactly what you want.  This has a lot to
> do with the ascii table too, and the numerical values.  Let consult
> (http://www.asciitable.com/) while we work this example through:
>
> Take a separation character of | as your code uses.  This is decimal
> 124, placing it way above both the lower and upper case letters AND
> numbers, that is good.
>
> Now you have something like this:
>
> 1234|a_string
> 1234|other_string
>
> now we want to find all rows "belonging to" 1234, so we do a start row
> of '1234|', but what for the end key? Well, let's try... '1234}', that
> might work, oh wait, here is another key:
>
> 12345|foo
>
> ok so '5' < '|' so it should short like so:
> 1234|a_string
> 1234|other_string
> 12345|foo
>
> hmm well how does our end row compare? well '5' < '}' so '1234}' is
> still "larger" than '12345|foo' so that row would be incorrectly
> included in the scan results assuming we only want '1234' related
> rows.
>
> Ok, well maybe a better solution is to pick a lower ascii?  Well
> outside of the control characters, space is the lowest character at
> 32, 33 is '!' so perhaps ! would be a better choice.  So you could
> choose an end double quote as in '1234"' to define your 'stop row'.
> Now you would be prohibited from using any character smaller than '33'
> in your strings, which is kind of a non ideal solution.
>
> This is all pretty clumsy, and doesnt work great in these variable
> length separated strings.
>
> The ultimate solution is to use the PrefixFilter, which is configured as  
> such:
> byte[] start_row = Bytes.toBytes("1234|");
> Scan s = new Scan(start_row);
> s.setFilter(new PrefixFilter(start_row));
> // do scan.
>
> that way no matter what sortability your separator is, you will get
> the answer you want every time.
>
>
>
> Another way to do compound keys is to go pure-binary.  For example I
> want a key that is 2 integers, so I can do this:
> int part1 = ... ;
> int part2 = ... ;
> byte[] row_key = Bytes.add(Bytes.toBytes(part1), Bytes.toBytes(part2));
>
> Now you can also search for all rows starting with 'target' like such:
> int target = ... ;
> // start key is 'target', stop key is 'target+1'
> Scan s = new Scan(Bytes.toBytes(target), Bytes.toBytes(target+1));
>
> And you get exactly what you want, nothing more or less (all rows
> starting with 'target').
>
> The lexicographic comparison is very tricky sometimes. One quick tip
> is that if your numbers (longs, ints) are big endian encoded (all the
> utilities in Bytes.java do so), then the lexicographic sorting is
> equal to the numeric sorting.  Otherwise if you do strings you end up
> with:
> 1
> 11
> 2
> 3
>
> and things are 'out of order'... if that is important, you can pad it
> with 0s - dont forget to use the proper amount, which is 10 digits for
> ints, and 19 for longs.  Or consider using binary encoding as above.
>
> -ryan
>
> On Sat, Jan 29, 2011 at 12:50 AM, Tatsuya Kawano <ta...@gmail.com>  
> wrote:
>> Hi Pete,
>>
>> You're right. If you use random keys, you will never know the start /
>> end keys for scan. What you really want to do is to deign the key that
>> will distribute well for writes but also has the certain locality for
>> scans.
>>
>> You probably have the ideal key already (ID|Date). If you don't make
>> entire key to be random but just the ID part, you could get a good
>> distribution at write time because writes for different IDs will be
>> distributed across the regions, and you also could get a good scan
>> performance when you scan between certain dates for a specific ID
>> because rows for the ID will be stored together in one region.
>>
>> Thanks,
>> Tatsuya
>>
>>
>> 2011/1/29 Peter Haidinyak <ph...@local.com>:
>>> I know they are always sorted but if they are how do you know which  
>>> row key belong to which data? Currently I use a row key of ID|Date so  
>>> I always know what the startrow and endrow should be. I know I'm  
>>> missing something really fundamental here. :-(
>>>
>>> Thanks
>>>
>>> -Pete
>>>
>>> -----Original Message-----
>>> From: tsuna [mailto:tsunanet@gmail.com]
>>> Sent: Friday, January 28, 2011 12:14 PM
>>> To: user@hbase.apache.org
>>> Subject: Re: Row Keys
>>>
>>> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak  
>>> <ph...@local.com> wrote:
>>>>        This is going to seem like a dumb question but it is  
>>>> recommended that you use a random key to spread the insert/read load  
>>>> among your region servers. My question is if I am using a scan with  
>>>> startrow and endrow  how does that work with random row keys?
>>>
>>> The keys are always sorted.  So if you generate random keys, you'll
>>> get your data back in a random order.
>>> What is recommended depends on the specific problem you're trying to
>>> solve.  But generally, one of the strengths of HBase is that the rows
>>> are sorted, so sequential scanning is efficient (thanks to data
>>> locality).
>>>
>>> --
>>> Benoit "tsuna" Sigoure
>>> Software Engineer @ www.StumbleUpon.com
>>>
>>
>>
>>
>> --
>> 河野 達也
>> Tatsuya Kawano (Mr.)
>> Tokyo, Japan
>>
>> twitter: http://twitter.com/tatsuya6502
>>


Re: Row Keys

Posted by Ryan Rawson <ry...@gmail.com>.
Hey,

So variable-length keys and lexicographical sorting make it a little
tricky to do Scans and get exactly what you want.  This has a lot to
do with the ASCII table and the numerical values of the characters.
Let's consult the ASCII table (http://www.asciitable.com/) while we
work through this example:

Take a separator character of '|', as your code uses.  This is decimal
124, placing it well above both the lower- and upper-case letters AND
the digits, which is good.

Now you have something like this:

1234|a_string
1234|other_string

Now we want to find all rows "belonging to" 1234, so we use a start row
of '1234|', but what about the end key? Well, let's try '1234}'. That
might work... oh wait, here is another key:

12345|foo

OK, so '5' < '|', so it should sort like so:
1234|a_string
1234|other_string
12345|foo

Hmm, how does our end row compare? Well, '5' < '}', so '1234}' is
still "larger" than '12345|foo', and that row would be incorrectly
included in the scan results, assuming we only want rows related to
'1234'.

OK, maybe a better solution is to pick a lower ASCII separator?  Well,
outside of the control characters, space is the lowest character at
32, and 33 is '!', so perhaps '!' would be a better choice.  With '!'
as the separator you could use the next character, a double quote
(ASCII 34), as in '1234"', to define your 'stop row'.  Now you would
be prohibited from using any character smaller than 33 in your
strings, which is a non-ideal solution.

This is all pretty clumsy, and doesn't work well with these
variable-length, separator-delimited strings.

The ultimate solution is to use the PrefixFilter, which is configured as such:
byte[] start_row = Bytes.toBytes("1234|");
Scan s = new Scan(start_row);
s.setFilter(new PrefixFilter(start_row));
// do scan.

That way, no matter how your separator sorts, you will get the answer
you want every time.



Another way to do compound keys is to go pure-binary.  For example I
want a key that is 2 integers, so I can do this:
int part1 = ... ;
int part2 = ... ;
byte[] row_key = Bytes.add(Bytes.toBytes(part1), Bytes.toBytes(part2));

Now you can also search for all rows starting with 'target' like such:
int target = ... ;
// start key is 'target', stop key is 'target+1'
Scan s = new Scan(Bytes.toBytes(target), Bytes.toBytes(target+1));

And you get exactly what you want, nothing more or less (all rows
starting with 'target').

Lexicographic comparison is very tricky sometimes. One quick tip:
if your numbers (longs, ints) are big-endian encoded (all the
utilities in Bytes.java do this), then the lexicographic sorting is
equal to the numeric sorting.  Otherwise, if you use strings you end
up with:
1
11
2
3

and things are 'out of order'... if that is important, you can pad them
with zeros; don't forget to use the proper width, which is 10 digits for
ints and 19 for longs.  Or consider using binary encoding as above.

-ryan
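
A minimal sketch of the zero-padding idea above (a sketch only; the
values are placeholders):

int id = 42;
long ts = 1296190800000L;
String intKey  = String.format("%010d", id);  // "0000000042"
String longKey = String.format("%019d", ts);  // 19 digits, zero-padded
byte[] row = Bytes.toBytes(intKey + "|" + longKey);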

On Sat, Jan 29, 2011 at 12:50 AM, Tatsuya Kawano <ta...@gmail.com> wrote:
> Hi Pete,
>
> You're right. If you use random keys, you will never know the start /
> end keys for scan. What you really want to do is to deign the key that
> will distribute well for writes but also has the certain locality for
> scans.
>
> You probably have the ideal key already (ID|Date). If you don't make
> entire key to be random but just the ID part, you could get a good
> distribution at write time because writes for different IDs will be
> distributed across the regions, and you also could get a good scan
> performance when you scan between certain dates for a specific ID
> because rows for the ID will be stored together in one region.
>
> Thanks,
> Tatsuya
>
>
> 2011/1/29 Peter Haidinyak <ph...@local.com>:
>> I know they are always sorted but if they are how do you know which row key belong to which data? Currently I use a row key of ID|Date so I always know what the startrow and endrow should be. I know I'm missing something really fundamental here. :-(
>>
>> Thanks
>>
>> -Pete
>>
>> -----Original Message-----
>> From: tsuna [mailto:tsunanet@gmail.com]
>> Sent: Friday, January 28, 2011 12:14 PM
>> To: user@hbase.apache.org
>> Subject: Re: Row Keys
>>
>> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <ph...@local.com> wrote:
>>>        This is going to seem like a dumb question but it is recommended that you use a random key to spread the insert/read load among your region servers. My question is if I am using a scan with startrow and endrow  how does that work with random row keys?
>>
>> The keys are always sorted.  So if you generate random keys, you'll
>> get your data back in a random order.
>> What is recommended depends on the specific problem you're trying to
>> solve.  But generally, one of the strengths of HBase is that the rows
>> are sorted, so sequential scanning is efficient (thanks to data
>> locality).
>>
>> --
>> Benoit "tsuna" Sigoure
>> Software Engineer @ www.StumbleUpon.com
>>
>
>
>
> --
> 河野 達也
> Tatsuya Kawano (Mr.)
> Tokyo, Japan
>
> twitter: http://twitter.com/tatsuya6502
>

Re: Row Keys

Posted by Tatsuya Kawano <ta...@gmail.com>.
Hi Pete,

You're right. If you use random keys, you will never know the start /
end keys for a scan. What you really want to do is to design a key that
distributes well for writes but also has a certain locality for scans.

You probably have the ideal key already (ID|Date). If you don't make
the entire key random but just the ID part, you could get a good
distribution at write time, because writes for different IDs will be
distributed across the regions, and you could also get good scan
performance when you scan between certain dates for a specific ID,
because rows for that ID will be stored together in one region.

Thanks,
Tatsuya
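
For example, with an ID|Date key a date-range scan for one ID might look
like this (a minimal sketch; the ID, dates, and table name are
placeholders, the 0.90-style client API is assumed, and conf is an
existing configuration):

String id = "1234";
byte[] start = Bytes.toBytes(id + "|2011-01-01");
byte[] stop  = Bytes.toBytes(id + "|2011-02-01");  // stop row is exclusive
ResultScanner scanner =
    new HTable(conf, "mytable").getScanner(new Scan(start, stop));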


2011/1/29 Peter Haidinyak <ph...@local.com>:
> I know they are always sorted but if they are how do you know which row key belong to which data? Currently I use a row key of ID|Date so I always know what the startrow and endrow should be. I know I'm missing something really fundamental here. :-(
>
> Thanks
>
> -Pete
>
> -----Original Message-----
> From: tsuna [mailto:tsunanet@gmail.com]
> Sent: Friday, January 28, 2011 12:14 PM
> To: user@hbase.apache.org
> Subject: Re: Row Keys
>
> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <ph...@local.com> wrote:
>>        This is going to seem like a dumb question but it is recommended that you use a random key to spread the insert/read load among your region servers. My question is if I am using a scan with startrow and endrow  how does that work with random row keys?
>
> The keys are always sorted.  So if you generate random keys, you'll
> get your data back in a random order.
> What is recommended depends on the specific problem you're trying to
> solve.  But generally, one of the strengths of HBase is that the rows
> are sorted, so sequential scanning is efficient (thanks to data
> locality).
>
> --
> Benoit "tsuna" Sigoure
> Software Engineer @ www.StumbleUpon.com
>



-- 
河野 達也
Tatsuya Kawano (Mr.)
Tokyo, Japan

twitter: http://twitter.com/tatsuya6502

RE: Row Keys

Posted by Peter Haidinyak <ph...@local.com>.
I know they are always sorted, but if they are, how do you know which row key belongs to which data? Currently I use a row key of ID|Date, so I always know what the startrow and endrow should be. I know I'm missing something really fundamental here. :-(

Thanks

-Pete

-----Original Message-----
From: tsuna [mailto:tsunanet@gmail.com] 
Sent: Friday, January 28, 2011 12:14 PM
To: user@hbase.apache.org
Subject: Re: Row Keys

On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <ph...@local.com> wrote:
>        This is going to seem like a dumb question but it is recommended that you use a random key to spread the insert/read load among your region servers. My question is if I am using a scan with startrow and endrow  how does that work with random row keys?

The keys are always sorted.  So if you generate random keys, you'll
get your data back in a random order.
What is recommended depends on the specific problem you're trying to
solve.  But generally, one of the strengths of HBase is that the rows
are sorted, so sequential scanning is efficient (thanks to data
locality).

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

Re: Row Keys

Posted by tsuna <ts...@gmail.com>.
On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <ph...@local.com> wrote:
>        This is going to seem like a dumb question but it is recommended that you use a random key to spread the insert/read load among your region servers. My question is if I am using a scan with startrow and endrow  how does that work with random row keys?

The keys are always sorted.  So if you generate random keys, you'll
get your data back in a random order.
What is recommended depends on the specific problem you're trying to
solve.  But generally, one of the strengths of HBase is that the rows
are sorted, so sequential scanning is efficient (thanks to data
locality).

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

Row Keys

Posted by Peter Haidinyak <ph...@local.com>.
Hi, 
	This is going to seem like a dumb question, but it is recommended that you use a random key to spread the insert/read load among your region servers. My question is: if I am using a scan with startrow and endrow, how does that work with random row keys?

Thanks

-Pete 

Re: Use loadtable.rb with compressed data?

Posted by Stack <st...@duboce.net>.
So, seems like in 0.20.6, we're not doing compression right.
St.Ack

On Fri, Jan 28, 2011 at 11:23 AM, Nanheng Wu <na...@gmail.com> wrote:
> Ah, sorry I should've read the usage. I ran it just now and the meta
> data dump threw the same error "Not in GZIP format"
>
> On Fri, Jan 28, 2011 at 10:51 AM, Stack <st...@duboce.net> wrote:
>> hfile metadata, the -m option?
>> St.Ack
>>
>> On Fri, Jan 28, 2011 at 10:41 AM, Nanheng Wu <na...@gmail.com> wrote:
>>> Sorry, by dumping the metadata did you mean running the same HFile
>>> tool on ".region" file in each region?
>>>
>>> On Fri, Jan 28, 2011 at 10:25 AM, Stack <st...@duboce.net> wrote:
>>>> If you dump the metadata, does it claim GZIP compressor?  If so, yeah,
>>>> seems to be mismatch between what data is and what metadata is.
>>>> St.Ack
>>>>
>>>> On Fri, Jan 28, 2011 at 9:58 AM, Nanheng Wu <na...@gmail.com> wrote:
>>>>> Awesome. I ran it on one of the hfiles and got this:
>>>>> 11/01/28 09:57:15 INFO compress.CodecPool: Got brand-new decompressor
>>>>> java.io.IOException: Not in GZIP format
>>>>>        at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:137)
>>>>>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
>>>>>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
>>>>>        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
>>>>>        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
>>>>>        at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
>>>>>        at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
>>>>>        at org.apache.hadoop.hbase.io.hfile.Compression$Algorithm.createDecompressionStream(Compression.java:168)
>>>>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1013)
>>>>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:966)
>>>>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1291)
>>>>>        at org.apache.hadoop.hbase.io.hfile.HFile.main(HFile.java:1740)
>>>>>
>>>>> So the problem could be that HFile writer is not writing properly
>>>>> gzipped outputs?
>>>>>
>>>>>
>>>>> On Fri, Jan 28, 2011 at 9:41 AM, Stack <st...@duboce.net> wrote:
>>>>>> The section in 0.90 book on hfile tool should apply to 0.20.6:
>>>>>> http://hbase.apache.org/ch08s02.html#hfile_tool  It might help you w/
>>>>>> your explorations.
>>>>>>
>>>>>> St.Ack
>>>>>>
>>>>>> On Fri, Jan 28, 2011 at 9:38 AM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>> Hi Stack,
>>>>>>>
>>>>>>>  Get doesn't work either. It was a fresh table created by
>>>>>>> loadtable.rb. Finally, the uncompressed version had the same number of
>>>>>>> regions (8 total). I totally understand you guys shouldn't be patching
>>>>>>> the older version, upgrading for me is an option but will be pretty
>>>>>>> painful. I wonder if I can figure something out by comparing the two
>>>>>>> version's Hfile. Thanks again!
>>>>>>>
>>>>>>> On Fri, Jan 28, 2011 at 9:14 AM, Stack <st...@duboce.net> wrote:
>>>>>>>> On Thu, Jan 27, 2011 at 9:35 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>>> In the compressed case, there are 8 regions and the region start/end
>>>>>>>>> keys do line up. Which actually is confusing to me, how can hbase read
>>>>>>>>> the files if they are compressed? does each hfile have some metadata
>>>>>>>>> in it that has compression info?
>>>>>>>>
>>>>>>>> You got it.
>>>>>>>>
>>>>>>>>> Anyway, the regions are the same
>>>>>>>>> (numbers and boundaries are same) in both compressed and uncompressed
>>>>>>>>> version. So what else should I look into to fix this? Thanks again!
>>>>>>>>
>>>>>>>> You can't scan. Can you Get from the table at all?  Try getting start
>>>>>>>> key from a few of the regions you see in .META.
>>>>>>>>
>>>>>>>> Did this table preexist or was this a fresh creation?
>>>>>>>>
>>>>>>>> When you created this table uncompressed, how many regions was it?
>>>>>>>>
>>>>>>>> How about just running uncompressed while you are on 0.20.6?  We'd
>>>>>>>> rather be fixing bugs in the new stuff, not the version that we are
>>>>>>>> leaving behind?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> St.Ack
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Use loadtable.rb with compressed data?

Posted by Nanheng Wu <na...@gmail.com>.
Ah, sorry, I should've read the usage. I ran it just now and the metadata
dump threw the same error, "Not in GZIP format".

On Fri, Jan 28, 2011 at 10:51 AM, Stack <st...@duboce.net> wrote:
> hfile metadata, the -m option?
> St.Ack
>
> On Fri, Jan 28, 2011 at 10:41 AM, Nanheng Wu <na...@gmail.com> wrote:
>> Sorry, by dumping the metadata did you mean running the same HFile
>> tool on ".region" file in each region?
>>
>> On Fri, Jan 28, 2011 at 10:25 AM, Stack <st...@duboce.net> wrote:
>>> If you dump the metadata, does it claim GZIP compressor?  If so, yeah,
>>> seems to be mismatch between what data is and what metadata is.
>>> St.Ack
>>>
>>> On Fri, Jan 28, 2011 at 9:58 AM, Nanheng Wu <na...@gmail.com> wrote:
>>>> Awesome. I ran it on one of the hfiles and got this:
>>>> 11/01/28 09:57:15 INFO compress.CodecPool: Got brand-new decompressor
>>>> java.io.IOException: Not in GZIP format
>>>>        at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:137)
>>>>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
>>>>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
>>>>        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
>>>>        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
>>>>        at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
>>>>        at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
>>>>        at org.apache.hadoop.hbase.io.hfile.Compression$Algorithm.createDecompressionStream(Compression.java:168)
>>>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1013)
>>>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:966)
>>>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1291)
>>>>        at org.apache.hadoop.hbase.io.hfile.HFile.main(HFile.java:1740)
>>>>
>>>> So the problem could be that HFile writer is not writing properly
>>>> gzipped outputs?
>>>>
>>>>
>>>> On Fri, Jan 28, 2011 at 9:41 AM, Stack <st...@duboce.net> wrote:
>>>>> The section in 0.90 book on hfile tool should apply to 0.20.6:
>>>>> http://hbase.apache.org/ch08s02.html#hfile_tool  It might help you w/
>>>>> your explorations.
>>>>>
>>>>> St.Ack
>>>>>
>>>>> On Fri, Jan 28, 2011 at 9:38 AM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>> Hi Stack,
>>>>>>
>>>>>>  Get doesn't work either. It was a fresh table created by
>>>>>> loadtable.rb. Finally, the uncompressed version had the same number of
>>>>>> regions (8 total). I totally understand you guys shouldn't be patching
>>>>>> the older version, upgrading for me is an option but will be pretty
>>>>>> painful. I wonder if I can figure something out by comparing the two
>>>>>> version's Hfile. Thanks again!
>>>>>>
>>>>>> On Fri, Jan 28, 2011 at 9:14 AM, Stack <st...@duboce.net> wrote:
>>>>>>> On Thu, Jan 27, 2011 at 9:35 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>> In the compressed case, there are 8 regions and the region start/end
>>>>>>>> keys do line up. Which actually is confusing to me, how can hbase read
>>>>>>>> the files if they are compressed? does each hfile have some metadata
>>>>>>>> in it that has compression info?
>>>>>>>
>>>>>>> You got it.
>>>>>>>
>>>>>>>> Anyway, the regions are the same
>>>>>>>> (numbers and boundaries are same) in both compressed and uncompressed
>>>>>>>> version. So what else should I look into to fix this? Thanks again!
>>>>>>>
>>>>>>> You can't scan. Can you Get from the table at all?  Try getting start
>>>>>>> key from a few of the regions you see in .META.
>>>>>>>
>>>>>>> Did this table preexist or was this a fresh creation?
>>>>>>>
>>>>>>> When you created this table uncompressed, how many regions was it?
>>>>>>>
>>>>>>> How about just running uncompressed while you are on 0.20.6?  We'd
>>>>>>> rather be fixing bugs in the new stuff, not the version that we are
>>>>>>> leaving behind?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> St.Ack
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Use loadtable.rb with compressed data?

Posted by Stack <st...@duboce.net>.
hfile metadata, the -m option?
St.Ack

On Fri, Jan 28, 2011 at 10:41 AM, Nanheng Wu <na...@gmail.com> wrote:
> Sorry, by dumping the metadata did you mean running the same HFile
> tool on ".region" file in each region?
>
> On Fri, Jan 28, 2011 at 10:25 AM, Stack <st...@duboce.net> wrote:
>> If you dump the metadata, does it claim GZIP compressor?  If so, yeah,
>> seems to be mismatch between what data is and what metadata is.
>> St.Ack
>>
>> On Fri, Jan 28, 2011 at 9:58 AM, Nanheng Wu <na...@gmail.com> wrote:
>>> Awesome. I ran it on one of the hfiles and got this:
>>> 11/01/28 09:57:15 INFO compress.CodecPool: Got brand-new decompressor
>>> java.io.IOException: Not in GZIP format
>>>        at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:137)
>>>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
>>>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
>>>        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
>>>        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
>>>        at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
>>>        at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
>>>        at org.apache.hadoop.hbase.io.hfile.Compression$Algorithm.createDecompressionStream(Compression.java:168)
>>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1013)
>>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:966)
>>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1291)
>>>        at org.apache.hadoop.hbase.io.hfile.HFile.main(HFile.java:1740)
>>>
>>> So the problem could be that HFile writer is not writing properly
>>> gzipped outputs?
>>>
>>>
>>> On Fri, Jan 28, 2011 at 9:41 AM, Stack <st...@duboce.net> wrote:
>>>> The section in 0.90 book on hfile tool should apply to 0.20.6:
>>>> http://hbase.apache.org/ch08s02.html#hfile_tool  It might help you w/
>>>> your explorations.
>>>>
>>>> St.Ack
>>>>
>>>> On Fri, Jan 28, 2011 at 9:38 AM, Nanheng Wu <na...@gmail.com> wrote:
>>>>> Hi Stack,
>>>>>
>>>>>  Get doesn't work either. It was a fresh table created by
>>>>> loadtable.rb. Finally, the uncompressed version had the same number of
>>>>> regions (8 total). I totally understand you guys shouldn't be patching
>>>>> the older version, upgrading for me is an option but will be pretty
>>>>> painful. I wonder if I can figure something out by comparing the two
>>>>> version's Hfile. Thanks again!
>>>>>
>>>>> On Fri, Jan 28, 2011 at 9:14 AM, Stack <st...@duboce.net> wrote:
>>>>>> On Thu, Jan 27, 2011 at 9:35 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>> In the compressed case, there are 8 regions and the region start/end
>>>>>>> keys do line up. Which actually is confusing to me, how can hbase read
>>>>>>> the files if they are compressed? does each hfile have some metadata
>>>>>>> in it that has compression info?
>>>>>>
>>>>>> You got it.
>>>>>>
>>>>>>> Anyway, the regions are the same
>>>>>>> (numbers and boundaries are same) in both compressed and uncompressed
>>>>>>> version. So what else should I look into to fix this? Thanks again!
>>>>>>
>>>>>> You can't scan. Can you Get from the table at all?  Try getting start
>>>>>> key from a few of the regions you see in .META.
>>>>>>
>>>>>> Did this table preexist or was this a fresh creation?
>>>>>>
>>>>>> When you created this table uncompressed, how many regions was it?
>>>>>>
>>>>>> How about just running uncompressed while you are on 0.20.6?  We'd
>>>>>> rather be fixing bugs in the new stuff, not the version that we are
>>>>>> leaving behind?
>>>>>>
>>>>>> Thanks,
>>>>>> St.Ack
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Use loadtable.rb with compressed data?

Posted by Nanheng Wu <na...@gmail.com>.
Sorry, by dumping the metadata did you mean running the same HFile
tool on the ".region" file in each region?

On Fri, Jan 28, 2011 at 10:25 AM, Stack <st...@duboce.net> wrote:
> If you dump the metadata, does it claim GZIP compressor?  If so, yeah,
> seems to be mismatch between what data is and what metadata is.
> St.Ack
>
> On Fri, Jan 28, 2011 at 9:58 AM, Nanheng Wu <na...@gmail.com> wrote:
>> Awesome. I ran it on one of the hfiles and got this:
>> 11/01/28 09:57:15 INFO compress.CodecPool: Got brand-new decompressor
>> java.io.IOException: Not in GZIP format
>>        at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:137)
>>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
>>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
>>        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
>>        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
>>        at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
>>        at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
>>        at org.apache.hadoop.hbase.io.hfile.Compression$Algorithm.createDecompressionStream(Compression.java:168)
>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1013)
>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:966)
>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1291)
>>        at org.apache.hadoop.hbase.io.hfile.HFile.main(HFile.java:1740)
>>
>> So the problem could be that HFile writer is not writing properly
>> gzipped outputs?
>>
>>
>> On Fri, Jan 28, 2011 at 9:41 AM, Stack <st...@duboce.net> wrote:
>>> The section in 0.90 book on hfile tool should apply to 0.20.6:
>>> http://hbase.apache.org/ch08s02.html#hfile_tool  It might help you w/
>>> your explorations.
>>>
>>> St.Ack
>>>
>>> On Fri, Jan 28, 2011 at 9:38 AM, Nanheng Wu <na...@gmail.com> wrote:
>>>> Hi Stack,
>>>>
>>>>  Get doesn't work either. It was a fresh table created by
>>>> loadtable.rb. Finally, the uncompressed version had the same number of
>>>> regions (8 total). I totally understand you guys shouldn't be patching
>>>> the older version, upgrading for me is an option but will be pretty
>>>> painful. I wonder if I can figure something out by comparing the two
>>>> version's Hfile. Thanks again!
>>>>
>>>> On Fri, Jan 28, 2011 at 9:14 AM, Stack <st...@duboce.net> wrote:
>>>>> On Thu, Jan 27, 2011 at 9:35 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>> In the compressed case, there are 8 regions and the region start/end
>>>>>> keys do line up. Which actually is confusing to me, how can hbase read
>>>>>> the files if they are compressed? does each hfile have some metadata
>>>>>> in it that has compression info?
>>>>>
>>>>> You got it.
>>>>>
>>>>>> Anyway, the regions are the same
>>>>>> (numbers and boundaries are same) in both compressed and uncompressed
>>>>>> version. So what else should I look into to fix this? Thanks again!
>>>>>
>>>>> You can't scan. Can you Get from the table at all?  Try getting start
>>>>> key from a few of the regions you see in .META.
>>>>>
>>>>> Did this table preexist or was this a fresh creation?
>>>>>
>>>>> When you created this table uncompressed, how many regions was it?
>>>>>
>>>>> How about just running uncompressed while you are on 0.20.6?  We'd
>>>>> rather be fixing bugs in the new stuff, not the version that we are
>>>>> leaving behind?
>>>>>
>>>>> Thanks,
>>>>> St.Ack
>>>>>
>>>>
>>>
>>
>

Re: Use loadtable.rb with compressed data?

Posted by Stack <st...@duboce.net>.
If you dump the metadata, does it claim the GZIP compressor?  If so,
yeah, it seems to be a mismatch between what the data is and what the
metadata says.
St.Ack

On Fri, Jan 28, 2011 at 9:58 AM, Nanheng Wu <na...@gmail.com> wrote:
> Awesome. I ran it on one of the hfiles and got this:
> 11/01/28 09:57:15 INFO compress.CodecPool: Got brand-new decompressor
> java.io.IOException: Not in GZIP format
>        at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:137)
>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
>        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
>        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
>        at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
>        at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
>        at org.apache.hadoop.hbase.io.hfile.Compression$Algorithm.createDecompressionStream(Compression.java:168)
>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1013)
>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:966)
>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1291)
>        at org.apache.hadoop.hbase.io.hfile.HFile.main(HFile.java:1740)
>
> So the problem could be that HFile writer is not writing properly
> gzipped outputs?
>
>
> On Fri, Jan 28, 2011 at 9:41 AM, Stack <st...@duboce.net> wrote:
>> The section in 0.90 book on hfile tool should apply to 0.20.6:
>> http://hbase.apache.org/ch08s02.html#hfile_tool  It might help you w/
>> your explorations.
>>
>> St.Ack
>>
>> On Fri, Jan 28, 2011 at 9:38 AM, Nanheng Wu <na...@gmail.com> wrote:
>>> Hi Stack,
>>>
>>>  Get doesn't work either. It was a fresh table created by
>>> loadtable.rb. Finally, the uncompressed version had the same number of
>>> regions (8 total). I totally understand you guys shouldn't be patching
>>> the older version, upgrading for me is an option but will be pretty
>>> painful. I wonder if I can figure something out by comparing the two
>>> version's Hfile. Thanks again!
>>>
>>> On Fri, Jan 28, 2011 at 9:14 AM, Stack <st...@duboce.net> wrote:
>>>> On Thu, Jan 27, 2011 at 9:35 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>> In the compressed case, there are 8 regions and the region start/end
>>>>> keys do line up. Which actually is confusing to me, how can hbase read
>>>>> the files if they are compressed? does each hfile have some metadata
>>>>> in it that has compression info?
>>>>
>>>> You got it.
>>>>
>>>>> Anyway, the regions are the same
>>>>> (numbers and boundaries are same) in both compressed and uncompressed
>>>>> version. So what else should I look into to fix this? Thanks again!
>>>>
>>>> You can't scan. Can you Get from the table at all?  Try getting start
>>>> key from a few of the regions you see in .META.
>>>>
>>>> Did this table preexist or was this a fresh creation?
>>>>
>>>> When you created this table uncompressed, how many regions was it?
>>>>
>>>> How about just running uncompressed while you are on 0.20.6?  We'd
>>>> rather be fixing bugs in the new stuff, not the version that we are
>>>> leaving behind?
>>>>
>>>> Thanks,
>>>> St.Ack
>>>>
>>>
>>
>

Re: Use loadtable.rb with compressed data?

Posted by Nanheng Wu <na...@gmail.com>.
Awesome. I ran it on one of the hfiles and got this:
11/01/28 09:57:15 INFO compress.CodecPool: Got brand-new decompressor
java.io.IOException: Not in GZIP format
	at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:137)
	at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
	at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
	at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
	at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
	at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
	at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
	at org.apache.hadoop.hbase.io.hfile.Compression$Algorithm.createDecompressionStream(Compression.java:168)
	at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1013)
	at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:966)
	at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1291)
	at org.apache.hadoop.hbase.io.hfile.HFile.main(HFile.java:1740)

So the problem could be that the HFile writer is not writing properly
gzipped output?


On Fri, Jan 28, 2011 at 9:41 AM, Stack <st...@duboce.net> wrote:
> The section in 0.90 book on hfile tool should apply to 0.20.6:
> http://hbase.apache.org/ch08s02.html#hfile_tool  It might help you w/
> your explorations.
>
> St.Ack
>
> On Fri, Jan 28, 2011 at 9:38 AM, Nanheng Wu <na...@gmail.com> wrote:
>> Hi Stack,
>>
>>  Get doesn't work either. It was a fresh table created by
>> loadtable.rb. Finally, the uncompressed version had the same number of
>> regions (8 total). I totally understand you guys shouldn't be patching
>> the older version, upgrading for me is an option but will be pretty
>> painful. I wonder if I can figure something out by comparing the two
>> version's Hfile. Thanks again!
>>
>> On Fri, Jan 28, 2011 at 9:14 AM, Stack <st...@duboce.net> wrote:
>>> On Thu, Jan 27, 2011 at 9:35 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>> In the compressed case, there are 8 regions and the region start/end
>>>> keys do line up. Which actually is confusing to me, how can hbase read
>>>> the files if they are compressed? does each hfile have some metadata
>>>> in it that has compression info?
>>>
>>> You got it.
>>>
>>>> Anyway, the regions are the same
>>>> (numbers and boundaries are same) in both compressed and uncompressed
>>>> version. So what else should I look into to fix this? Thanks again!
>>>
>>> You can't scan. Can you Get from the table at all?  Try getting start
>>> key from a few of the regions you see in .META.
>>>
>>> Did this table preexist or was this a fresh creation?
>>>
>>> When you created this table uncompressed, how many regions was it?
>>>
>>> How about just running uncompressed while you are on 0.20.6?  We'd
>>> rather be fixing bugs in the new stuff, not the version that we are
>>> leaving behind?
>>>
>>> Thanks,
>>> St.Ack
>>>
>>
>
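
To make the writer-side question above concrete: below is a minimal
sketch of writing a tiny HFile with the "gz" codec outside of the MR
job, so the resulting file can be checked with the HFile tool (-m) on
its own. The constructor arguments (block size, compression name,
comparator) are my reading of the 0.20-era HFile.Writer API, so treat
this as a sketch rather than a definitive recipe; the path is just a
placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.util.Bytes;

public class GzHFileWriteCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Placeholder output path; point it anywhere writable.
    Path path = new Path("/tmp/gz-check.hfile");
    // 64KB blocks, "gz" compression requested by name,
    // null = use the default raw byte-array key comparator.
    HFile.Writer writer = new HFile.Writer(fs, path, 64 * 1024, "gz", null);
    try {
      writer.append(Bytes.toBytes("row1"), Bytes.toBytes("value1"));
    } finally {
      writer.close();
    }
    // Now dump /tmp/gz-check.hfile with the HFile tool's -m flag: the
    // trailer should report the gz codec and the block should decompress.
  }
}

If a file written this way dumps cleanly but the MR job's output does
not, the mismatch is on the job side rather than in HFile itself.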

Re: Use loadtable.rb with compressed data?

Posted by Stack <st...@duboce.net>.
The section in the 0.90 book on the hfile tool should apply to 0.20.6:
http://hbase.apache.org/ch08s02.html#hfile_tool  It might help you with
your explorations.

St.Ack

On Fri, Jan 28, 2011 at 9:38 AM, Nanheng Wu <na...@gmail.com> wrote:
> Hi Stack,
>
>  Get doesn't work either. It was a fresh table created by
> loadtable.rb. Finally, the uncompressed version had the same number of
> regions (8 total). I totally understand you guys shouldn't be patching
> the older version, upgrading for me is an option but will be pretty
> painful. I wonder if I can figure something out by comparing the two
> version's Hfile. Thanks again!
>
> On Fri, Jan 28, 2011 at 9:14 AM, Stack <st...@duboce.net> wrote:
>> On Thu, Jan 27, 2011 at 9:35 PM, Nanheng Wu <na...@gmail.com> wrote:
>>> In the compressed case, there are 8 regions and the region start/end
>>> keys do line up. Which actually is confusing to me, how can hbase read
>>> the files if they are compressed? does each hfile have some metadata
>>> in it that has compression info?
>>
>> You got it.
>>
>>> Anyway, the regions are the same
>>> (numbers and boundaries are same) in both compressed and uncompressed
>>> version. So what else should I look into to fix this? Thanks again!
>>
>> You can't scan. Can you Get from the table at all?  Try getting start
>> key from a few of the regions you see in .META.
>>
>> Did this table preexist or was this a fresh creation?
>>
>> When you created this table uncompressed, how many regions was it?
>>
>> How about just running uncompressed while you are on 0.20.6?  We'd
>> rather be fixing bugs in the new stuff, not the version that we are
>> leaving behind?
>>
>> Thanks,
>> St.Ack
>>
>

Re: Use loadtable.rb with compressed data?

Posted by Nanheng Wu <na...@gmail.com>.
Hi Stack,

  Get doesn't work either. It was a fresh table created by
loadtable.rb. Finally, the uncompressed version had the same number of
regions (8 total). I totally understand you guys shouldn't be patching
the older version; upgrading is an option for me but will be pretty
painful. I wonder if I can figure something out by comparing the two
versions' HFiles. Thanks again!

On Fri, Jan 28, 2011 at 9:14 AM, Stack <st...@duboce.net> wrote:
> On Thu, Jan 27, 2011 at 9:35 PM, Nanheng Wu <na...@gmail.com> wrote:
>> In the compressed case, there are 8 regions and the region start/end
>> keys do line up. Which actually is confusing to me, how can hbase read
>> the files if they are compressed? does each hfile have some metadata
>> in it that has compression info?
>
> You got it.
>
>> Anyway, the regions are the same
>> (numbers and boundaries are same) in both compressed and uncompressed
>> version. So what else should I look into to fix this? Thanks again!
>
> You can't scan. Can you Get from the table at all?  Try getting start
> key from a few of the regions you see in .META.
>
> Did this table preexist or was this a fresh creation?
>
> When you created this table uncompressed, how many regions was it?
>
> How about just running uncompressed while you are on 0.20.6?  We'd
> rather be fixing bugs in the new stuff, not the version that we are
> leaving behind?
>
> Thanks,
> St.Ack
>

Re: Use loadtable.rb with compressed data?

Posted by Stack <st...@duboce.net>.
On Thu, Jan 27, 2011 at 9:35 PM, Nanheng Wu <na...@gmail.com> wrote:
> In the compressed case, there are 8 regions and the region start/end
> keys do line up. Which actually is confusing to me, how can hbase read
> the files if they are compressed? does each hfile have some metadata
> in it that has compression info?

You got it.

> Anyway, the regions are the same
> (numbers and boundaries are same) in both compressed and uncompressed
> version. So what else should I look into to fix this? Thanks again!

You can't scan. Can you Get from the table at all?  Try getting the
start key from a few of the regions you see in .META.

Did this table preexist or was this a fresh creation?

When you created this table uncompressed, how many regions did it have?

How about just running uncompressed while you are on 0.20.6?  We'd
rather be fixing bugs in the new stuff, not in the version that we are
leaving behind.

Thanks,
St.Ack
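
As a concrete way to run that check, here is a rough sketch of probing
the table with Gets at a few region start keys copied out of .META.,
using the 0.20 client API. The table name and keys below are
placeholders, not the actual ones from this thread:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ProbeRegionStartKeys {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    // Substitute the bulk-loaded table and real start keys from .META.
    HTable table = new HTable(conf, "mytable");
    String[] startKeys = { "startkey-of-region-1", "startkey-of-region-2" };
    for (String key : startKeys) {
      Result r = table.get(new Get(Bytes.toBytes(key)));
      System.out.println(key + " -> " + (r.isEmpty() ? "no data" : r));
    }
  }
}

If even these rows come back empty, the problem is in the files or the
load itself, not in how the scan was set up.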

Re: Use loadtable.rb with compressed data?

Posted by Nanheng Wu <na...@gmail.com>.
In the compressed case, there are 8 regions and the region start/end
keys do line up. That is actually confusing to me: how can HBase read
the files if they are compressed? Does each hfile have some metadata
in it that has the compression info? Anyway, the regions are the same
(numbers and boundaries are the same) in both the compressed and
uncompressed versions. So what else should I look into to fix this?
Thanks again!

On Thu, Jan 27, 2011 at 9:24 PM, Stack <st...@duboce.net> wrote:
> On Thu, Jan 27, 2011 at 9:08 PM, Nanheng Wu <na...@gmail.com> wrote:
>> Hi Stack, thanks for the answers! I am reasonably sure the
>> partitioning is OK because I just ran the same MR job with compression
>> turned off and everything works. I'd like to move to 0.90 but for the
>> short term I am stuck with 0.20. Is there anything I can do, maybe
>> copy some files from the 0.90 branch and tweak them to run on 0.20?
>> Please advice. thank you!
>>
>
> Don't try backporting.  You'll end up really hating us if you try to do that.
>
> I was off in my first answer.  We read metadata from the files.  Maybe
> when stuff is compressed we are doing something dumb in loadtable.rb
> though we're reading metadata, not keyvalues.   Do the regions look
> right?  The ones in .META.?    Do endkey and startkeys match up as you
> move from one region to the next?  How many regions are there?  If
> same data and it worked previously -- did the previous run have same
> amount of data?  It didn't all fit into one region when you ran it
> uncompressed? -- then it would seem to point at a issue w/ our loading
> compressed files.
>
> St.Ack
>

Re: Use loadtable.rb with compressed data?

Posted by Stack <st...@duboce.net>.
On Thu, Jan 27, 2011 at 9:08 PM, Nanheng Wu <na...@gmail.com> wrote:
> Hi Stack, thanks for the answers! I am reasonably sure the
> partitioning is OK because I just ran the same MR job with compression
> turned off and everything works. I'd like to move to 0.90 but for the
> short term I am stuck with 0.20. Is there anything I can do, maybe
> copy some files from the 0.90 branch and tweak them to run on 0.20?
> Please advice. thank you!
>

Don't try backporting.  You'll end up really hating us if you try to do that.

I was off in my first answer.  We read metadata from the files.  Maybe
when stuff is compressed we are doing something dumb in loadtable.rb,
though we're reading metadata, not KeyValues.  Do the regions look
right?  The ones in .META.?  Do the end keys and start keys match up as
you move from one region to the next?  How many regions are there?  If
it's the same data and it worked previously -- did the previous run have
the same amount of data?  It didn't all fit into one region when you ran
it uncompressed? -- then it would seem to point at an issue w/ our
loading of compressed files.

St.Ack
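
One way to eyeball the start/end key question without reading .META. by
hand is to ask the client for the region boundaries. This assumes
HTable.getStartEndKeys() is available in 0.20.6 (the exact method may
differ), and the table name is a placeholder, so take it as a sketch:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

public class PrintRegionBoundaries {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "mytable");  // placeholder table name
    // First element: region start keys; second element: region end keys.
    Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
    byte[][] startKeys = keys.getFirst();
    byte[][] endKeys = keys.getSecond();
    for (int i = 0; i < startKeys.length; i++) {
      System.out.println("region " + i + ": ["
          + Bytes.toString(startKeys[i]) + ", "
          + Bytes.toString(endKeys[i]) + ")");
    }
    // Each region's end key should equal the next region's start key.
  }
}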

Re: Use loadtable.rb with compressed data?

Posted by Nanheng Wu <na...@gmail.com>.
Hi Stack, thanks for the answers! I am reasonably sure the
partitioning is OK because I just ran the same MR job with compression
turned off and everything works. I'd like to move to 0.90, but for the
short term I am stuck with 0.20. Is there anything I can do, maybe
copy some files from the 0.90 branch and tweak them to run on 0.20?
Please advise. Thank you!

On Thu, Jan 27, 2011 at 9:04 PM, Stack <st...@duboce.net> wrote:
> loadtable.rb doesn't care about file content; it just moves files and
> updates .META.
>
> You sure you did the partitioning correctly?  Not seeing anything
> would come of incorrectly done partitioner.  There may also have been
> a bug in partitioner around this time.  Can you move to 0.90.0?  Bulk
> uploader is much improved there (It was rewritten between 0.20.6 and
> 0.90.0 and the new implementation has been given a much better
> airing).
>
> Yours,
> St.Ack
>
> On Thu, Jan 27, 2011 at 8:54 PM, Nanheng Wu <na...@gmail.com> wrote:
>> Hi,
>>
>> I am using hbase 0.20.6.  Is it possible for the loadtable.rb script
>> to create the table from compressed output? I have a MR job where the
>> reducer outputs Gzip compressed HFiles. When I ran loadtable.rb it
>> didn't have any complaints and seemed to update the meta data table
>> correctly. But when I tried to query against the table no data would
>> come back (scan show 0 zero etc). Does anyone know if it's possible?
>> or If I must create tables from compressed HFIles directly, what other
>> options do I have besides the script? Thanks!
>>
>
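
For comparison's sake, the only intended difference between the two
runs should be the compression setting handed to HFileOutputFormat.
Below is a hedged sketch of the job setup, assuming compression is
switched on through the "hfile.compression" property that the 0.20-era
HFileOutputFormat reads; if the job enables gzip some other way, that
difference is itself worth checking. Job name and the elided pieces are
placeholders:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class GzBulkLoadJob {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    // "gz" requests gzip-compressed HFile blocks; leave unset for none.
    conf.set("hfile.compression", "gz");
    Job job = new Job(conf, "gz-bulk-load");
    job.setOutputFormatClass(HFileOutputFormat.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(KeyValue.class);
    // ... mapper/reducer classes, input path and HFile output dir go
    // here, exactly as in the existing job ...
    job.waitForCompletion(true);
  }
}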

Re: Use loadtable.rb with compressed data?

Posted by Stack <st...@duboce.net>.
loadtable.rb doesn't care about file content; it just moves files and
updates .META.

You sure you did the partitioning correctly?  Not seeing anything is
what would come of an incorrectly done partitioner.  There may also
have been a bug in the partitioner around this time.  Can you move to
0.90.0?  The bulk uploader is much improved there (it was rewritten
between 0.20.6 and 0.90.0 and the new implementation has been given a
much better airing).

Yours,
St.Ack

On Thu, Jan 27, 2011 at 8:54 PM, Nanheng Wu <na...@gmail.com> wrote:
> Hi,
>
> I am using hbase 0.20.6.  Is it possible for the loadtable.rb script
> to create the table from compressed output? I have a MR job where the
> reducer outputs Gzip compressed HFiles. When I ran loadtable.rb it
> didn't have any complaints and seemed to update the meta data table
> correctly. But when I tried to query against the table no data would
> come back (scan show 0 zero etc). Does anyone know if it's possible?
> or If I must create tables from compressed HFIles directly, what other
> options do I have besides the script? Thanks!
>
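
Since the symptom in the quoted question is a scan that returns
nothing, it can also help to scan one region's key range at a time and
see whether any region serves data at all. Below is a rough sketch with
the 0.20 client; the table name and boundaries are placeholders to be
replaced with values copied from .META.:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanOneRegion {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "mytable");  // placeholder table name
    // Use one region's [start, end) boundaries as listed in .META.
    Scan scan = new Scan(Bytes.toBytes("region-start-key"),
                         Bytes.toBytes("region-end-key"));
    ResultScanner scanner = table.getScanner(scan);
    try {
      int rows = 0;
      for (Result r : scanner) {
        rows++;
      }
      System.out.println("rows seen: " + rows);
    } finally {
      scanner.close();
    }
  }
}

If every per-region scan comes back empty while the HFile tool can read
the same files, the data is there but the region servers are not seeing
it; if the tool also fails (as with the "Not in GZIP format" error
above), the files themselves are suspect.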