Posted to user@hbase.apache.org by Nanheng Wu <na...@gmail.com> on 2011/01/28 05:54:06 UTC

Use loadtable.rb with compressed data?

Hi,

I am using HBase 0.20.6. Is it possible for the loadtable.rb script
to create the table from compressed output? I have an MR job where the
reducer outputs Gzip-compressed HFiles. When I ran loadtable.rb it
didn't complain and seemed to update the .META. table correctly, but
when I tried to query the table no data came back (scans returned zero
rows, etc.). Does anyone know if this is possible? Or, if I must create
tables from compressed HFiles directly, what other options do I have
besides the script? Thanks!

Re: Row Keys

Posted by Dani Rayan <da...@gmail.com>.
In HBase the concept of "column qualifiers" is interesting: they can be
created on the fly for a "column-family", so it is as good as tagging the
data. Hence, you can get all rows belonging to a particular tag/qualifier
using a row scan. I'm not sure if this answers your query.

I know they are always sorted but if they are how do you know which row key
> belong to which data? Currently I use a row key of ID|Date
>



-Thanks,
Dani Rayan.
http://www.cc.gatech.edu/~iar3/

P.S. I missed "column-family" in the previous email.

On Sat, Jan 29, 2011 at 1:07 AM, Dani Rayan <da...@gmail.com> wrote:

> Hey can explain your query with example ?
>
>
> I know they are always sorted but if they are how do you know which row key
>> belong to which data? Currently I use a row key of ID|Date
>>
>
> > I don't clearly understand "which data", there are few things like
> getFamilyMap etc. which allows you to get more info about the table.
>
> In HBase the concept of "column qualifiers" is interesting, it can be
> created on fly for a "column-qualifier" So it is as good as tagging the
> data. Hence, you can get all rows belonging to particular tag/qualifier
> using rowscan. I'm not sure if this answers your query.
>
> -Thanks,
> Dani Rayan.
> http://www.cc.gatech.edu/~iar3/
>
> On Fri, Jan 28, 2011 at 3:45 PM, Peter Haidinyak <ph...@local.com>wrote:
>
>> I know they are always sorted but if they are how do you know which row
>> key belong to which data? Currently I use a row key of ID|Date so I always
>> know what the startrow and endrow should be. I know I'm missing something
>> really fundamental here. :-(
>>
>> Thanks
>>
>> -Pete
>>
>> -----Original Message-----
>> From: tsuna [mailto:tsunanet@gmail.com]
>> Sent: Friday, January 28, 2011 12:14 PM
>> To: user@hbase.apache.org
>> Subject: Re: Row Keys
>>
>> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <ph...@local.com>
>> wrote:
>> >        This is going to seem like a dumb question but it is recommended
>> that you use a random key to spread the insert/read load among your region
>> servers. My question is if I am using a scan with startrow and endrow  how
>> does that work with random row keys?
>>
>> The keys are always sorted.  So if you generate random keys, you'll
>> get your data back in a random order.
>> What is recommended depends on the specific problem you're trying to
>> solve.  But generally, one of the strengths of HBase is that the rows
>> are sorted, so sequential scanning is efficient (thanks to data
>> locality).
>>
>> --
>> Benoit "tsuna" Sigoure
>> Software Engineer @ www.StumbleUpon.com
>>
>
>

Re: Row Keys

Posted by Dani Rayan <da...@gmail.com>.
Hey, can you explain your query with an example?

I know they are always sorted but if they are how do you know which row key
> belong to which data? Currently I use a row key of ID|Date
>

> I don't clearly understand "which data"; there are a few things like
getFamilyMap etc. which allow you to get more info about the table.

In HBase the concept of "column qualifiers" is interesting: they can be
created on the fly for a "column-qualifier", so it is as good as tagging the
data. Hence, you can get all rows belonging to a particular tag/qualifier
using a row scan. I'm not sure if this answers your query.

-Thanks,
Dani Rayan.
http://www.cc.gatech.edu/~iar3/
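
A minimal sketch of the on-the-fly qualifier idea described above (a sketch
only; the table, family, and "tag" names are placeholders, the 0.90-style
client API is assumed, and conf is an existing configuration):

HTable table = new HTable(conf, "mytable");
Put put = new Put(Bytes.toBytes("row1"));
// the qualifier "some-tag" is created on the fly; no schema change needed
put.add(Bytes.toBytes("cf"), Bytes.toBytes("some-tag"), Bytes.toBytes("value"));
table.put(put);

// later, restrict a scan to cells carrying that qualifier/"tag"
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("some-tag"));
ResultScanner rs = table.getScanner(scan);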

On Fri, Jan 28, 2011 at 3:45 PM, Peter Haidinyak <ph...@local.com>wrote:

> I know they are always sorted but if they are how do you know which row key
> belong to which data? Currently I use a row key of ID|Date so I always know
> what the startrow and endrow should be. I know I'm missing something really
> fundamental here. :-(
>
> Thanks
>
> -Pete
>
> -----Original Message-----
> From: tsuna [mailto:tsunanet@gmail.com]
> Sent: Friday, January 28, 2011 12:14 PM
> To: user@hbase.apache.org
> Subject: Re: Row Keys
>
> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <ph...@local.com>
> wrote:
> >        This is going to seem like a dumb question but it is recommended
> that you use a random key to spread the insert/read load among your region
> servers. My question is if I am using a scan with startrow and endrow  how
> does that work with random row keys?
>
> The keys are always sorted.  So if you generate random keys, you'll
> get your data back in a random order.
> What is recommended depends on the specific problem you're trying to
> solve.  But generally, one of the strengths of HBase is that the rows
> are sorted, so sequential scanning is efficient (thanks to data
> locality).
>
> --
> Benoit "tsuna" Sigoure
> Software Engineer @ www.StumbleUpon.com
>

RE: Row Keys

Posted by Peter Haidinyak <ph...@local.com>.
Great stuff, thanks.

-Pete

-----Original Message-----
From: Lars George [mailto:lars.george@gmail.com] 
Sent: Sunday, January 30, 2011 10:07 PM
To: user@hbase.apache.org
Subject: Re: Row Keys

Hi Pete,

Look into the Mozilla Socorro project
(http://code.google.com/p/socorro/) for how to "salt" the keys to get
better load balancing across sequential keys. The principle is to add
a salt, in this case a number reflecting the number of servers
available (or some multiple of that, to allow for growth), and then
prefix the sequential key with it so that writes are spread across all
servers, for example "<salt>-<yyyyMMddhhmm>". When reading, you need to
open N scanners, where N is the number of distinct salt values, scan
each subset with them, and eventually combine the results in client
code. Assuming you want to scan all values in January and you have
salts 0, 1, 2, 3, 4, and 5, you have a scanner for "0-201101010000" to
"0-201102010000", then another for "1-201101010000" to
"1-201102010000", and so on. Then do the scans (multithreaded, for
example) and combine the results client side. The Socorro code shows
one way to implement this.

Lars


On Mon, Jan 31, 2011 at 6:20 AM, Pete Haidinyak <ja...@cox.net> wrote:
> Sorry my id is '<date>|<id>' with date being in the format 'YYYY-MM-DD'
> My start row is '<date>|<id>| ' (with a space ascii 32) and end row is
> '<date>|<id>|~' (tilde character) and this has worked for my data set.
> Unfortunately the key is not distributed very well. That is why I was
> wondering how you do a scan (using start and end row) with a random row key.
>
> Thanks
>
> -Pete
>
> PS. I use <date>|<id> since the id is variable length and this was my first
> attempt. I know have a months worth of data and for my next phase I will
> probably reverse the <date> <id> order since it will work either way.
>
>
> On Sat, 29 Jan 2011 21:50:16 -0800, Ryan Rawson <ry...@gmail.com> wrote:
>
>> Hey,
>>
>> So variable length keys and lexographical sorting makes it a little
>> tricky to do Scans and get exactly what you want.  This has a lot to
>> do with the ascii table too, and the numerical values.  Let consult
>> (http://www.asciitable.com/) while we work this example through:
>>
>> Take a separation character of | as your code uses.  This is decimal
>> 124, placing it way above both the lower and upper case letters AND
>> numbers, that is good.
>>
>> Now you have something like this:
>>
>> 1234|a_string
>> 1234|other_string
>>
>> now we want to find all rows "belonging to" 1234, so we do a start row
>> of '1234|', but what for the end key? Well, let's try... '1234}', that
>> might work, oh wait, here is another key:
>>
>> 12345|foo
>>
>> ok so '5' < '|' so it should short like so:
>> 1234|a_string
>> 1234|other_string
>> 12345|foo
>>
>> hmm well how does our end row compare? well '5' < '}' so '1234}' is
>> still "larger" than '12345|foo' so that row would be incorrectly
>> included in the scan results assuming we only want '1234' related
>> rows.
>>
>> Ok, well maybe a better solution is to pick a lower ascii?  Well
>> outside of the control characters, space is the lowest character at
>> 32, 33 is '!' so perhaps ! would be a better choice.  So you could
>> choose an end double quote as in '1234"' to define your 'stop row'.
>> Now you would be prohibited from using any character smaller than '33'
>> in your strings, which is kind of a non ideal solution.
>>
>> This is all pretty clumsy, and doesnt work great in these variable
>> length separated strings.
>>
>> The ultimate solution is to use the PrefixFilter, which is configured as
>> such:
>> byte[] start_row = Bytes.toBytes("1234|");
>> Scan s = new Scan(start_row);
>> s.setFilter(new PrefixFilter(start_row));
>> // do scan.
>>
>> that way no matter what sortability your separator is, you will get
>> the answer you want every time.
>>
>>
>>
>> Another way to do compound keys is to go pure-binary.  For example I
>> want a key that is 2 integers, so I can do this:
>> int part1 = ... ;
>> int part2 = ... ;
>> byte[] row_key = Bytes.add(Bytes.toBytes(part1), Bytes.toBytes(part2));
>>
>> Now you can also search for all rows starting with 'target' like such:
>> int target = ... ;
>> // start key is 'target', stop key is 'target+1'
>> Scan s = new Scan(Bytes.toBytes(target), Bytes.toBytes(target+1));
>>
>> And you get exactly what you want, nothing more or less (all rows
>> starting with 'target').
>>
>> The lexicographic comparison is very tricky sometimes. One quick tip
>> is that if your numbers (longs, ints) are big endian encoded (all the
>> utilities in Bytes.java do so), then the lexicographic sorting is
>> equal to the numeric sorting.  Otherwise if you do strings you end up
>> with:
>> 1
>> 11
>> 2
>> 3
>>
>> and things are 'out of order'... if that is important, you can pad it
>> with 0s - dont forget to use the proper amount, which is 10 digits for
>> ints, and 19 for longs.  Or consider using binary encoding as above.
>>
>> -ryan
>>
>> On Sat, Jan 29, 2011 at 12:50 AM, Tatsuya Kawano <ta...@gmail.com>
>> wrote:
>>>
>>> Hi Pete,
>>>
>>> You're right. If you use random keys, you will never know the start /
>>> end keys for scan. What you really want to do is to deign the key that
>>> will distribute well for writes but also has the certain locality for
>>> scans.
>>>
>>> You probably have the ideal key already (ID|Date). If you don't make
>>> entire key to be random but just the ID part, you could get a good
>>> distribution at write time because writes for different IDs will be
>>> distributed across the regions, and you also could get a good scan
>>> performance when you scan between certain dates for a specific ID
>>> because rows for the ID will be stored together in one region.
>>>
>>> Thanks,
>>> Tatsuya
>>>
>>>
>>> 2011/1/29 Peter Haidinyak <ph...@local.com>:
>>>>
>>>> I know they are always sorted but if they are how do you know which row
>>>> key belong to which data? Currently I use a row key of ID|Date so I always
>>>> know what the startrow and endrow should be. I know I'm missing something
>>>> really fundamental here. :-(
>>>>
>>>> Thanks
>>>>
>>>> -Pete
>>>>
>>>> -----Original Message-----
>>>> From: tsuna [mailto:tsunanet@gmail.com]
>>>> Sent: Friday, January 28, 2011 12:14 PM
>>>> To: user@hbase.apache.org
>>>> Subject: Re: Row Keys
>>>>
>>>> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <ph...@local.com>
>>>> wrote:
>>>>>
>>>>>       This is going to seem like a dumb question but it is recommended
>>>>> that you use a random key to spread the insert/read load among your region
>>>>> servers. My question is if I am using a scan with startrow and endrow  how
>>>>> does that work with random row keys?
>>>>
>>>> The keys are always sorted.  So if you generate random keys, you'll
>>>> get your data back in a random order.
>>>> What is recommended depends on the specific problem you're trying to
>>>> solve.  But generally, one of the strengths of HBase is that the rows
>>>> are sorted, so sequential scanning is efficient (thanks to data
>>>> locality).
>>>>
>>>> --
>>>> Benoit "tsuna" Sigoure
>>>> Software Engineer @ www.StumbleUpon.com
>>>>
>>>
>>>
>>>
>>> --
>>> 河野 達也
>>> Tatsuya Kawano (Mr.)
>>> Tokyo, Japan
>>>
>>> twitter: http://twitter.com/tatsuya6502
>>>
>
>

Re: Row Keys

Posted by Pete Haidinyak <ja...@cox.net>.
I want to do a scan of a subset of the data using startrow and endrow. If
the keys are random I can't set a startrow/endrow, as far as I know. If I
reverse the order of <date>|<id> for the row key I will get a better
distribution. Unfortunately, a large share of the data comes from just two
IDs.

-Pete

On Sun, 30 Jan 2011 22:10:07 -0800, Ryan Rawson <ry...@gmail.com> wrote:

> Hey,
>
> I don't understand the 'random scan' question... if you want to scan a
> random key, just scan! For example:
>
> byte [] random_key = generateRandomKeyUsingRandomNumberGenerator();
> Scan s = new Scan(random_key);
>
> But you must mean something else... perhaps you could illuminate me?
>
> -ryan
>
> On Sun, Jan 30, 2011 at 10:06 PM, Lars George <la...@gmail.com>  
> wrote:
>> Hi Pete,
>>
>> Look into the Mozilla Socorro project
>> (http://code.google.com/p/socorro/) for how to "salt" the keys to get
>> better load balancing across sequential keys. The principle is to add
>> a salt, in this case a number reflecting the number of servers
>> available (some multiple of that to allow for growth) and then prefix
>> the sequential key with it so that writes are spread across all
>> servers. For example "<salt>-<yyyyMMddhhmm>". When reading you need to
>> open N scanners where N is the number of distinct salt values and scan
>> each subset with them while eventually combining the result in client
>> code. Assuming you want to scan all values in January and you have a
>> salt 0, 1, 2, 3, 4, and 5 you have scanner for "0-201101010000" to
>> "0-201102010000", then another for "1-201101010000" to
>> "1-201102010000" and so on. Then do the scans (multithreaded for
>> example) and combine the results client side. The Socorro code shows
>> one way to implement this.
>>
>> Lars
>>
>>
>> On Mon, Jan 31, 2011 at 6:20 AM, Pete Haidinyak <ja...@cox.net>  
>> wrote:
>>> Sorry my id is '<date>|<id>' with date being in the format 'YYYY-MM-DD'
>>> My start row is '<date>|<id>| ' (with a space ascii 32) and end row is
>>> '<date>|<id>|~' (tilde character) and this has worked for my data set.
>>> Unfortunately the key is not distributed very well. That is why I was
>>> wondering how you do a scan (using start and end row) with a random  
>>> row key.
>>>
>>> Thanks
>>>
>>> -Pete
>>>
>>> PS. I use <date>|<id> since the id is variable length and this was my  
>>> first
>>> attempt. I know have a months worth of data and for my next phase I  
>>> will
>>> probably reverse the <date> <id> order since it will work either way.
>>>
>>>
>>> On Sat, 29 Jan 2011 21:50:16 -0800, Ryan Rawson <ry...@gmail.com>  
>>> wrote:
>>>
>>>> Hey,
>>>>
>>>> So variable length keys and lexographical sorting makes it a little
>>>> tricky to do Scans and get exactly what you want.  This has a lot to
>>>> do with the ascii table too, and the numerical values.  Let consult
>>>> (http://www.asciitable.com/) while we work this example through:
>>>>
>>>> Take a separation character of | as your code uses.  This is decimal
>>>> 124, placing it way above both the lower and upper case letters AND
>>>> numbers, that is good.
>>>>
>>>> Now you have something like this:
>>>>
>>>> 1234|a_string
>>>> 1234|other_string
>>>>
>>>> now we want to find all rows "belonging to" 1234, so we do a start row
>>>> of '1234|', but what for the end key? Well, let's try... '1234}', that
>>>> might work, oh wait, here is another key:
>>>>
>>>> 12345|foo
>>>>
>>>> ok so '5' < '|' so it should short like so:
>>>> 1234|a_string
>>>> 1234|other_string
>>>> 12345|foo
>>>>
>>>> hmm well how does our end row compare? well '5' < '}' so '1234}' is
>>>> still "larger" than '12345|foo' so that row would be incorrectly
>>>> included in the scan results assuming we only want '1234' related
>>>> rows.
>>>>
>>>> Ok, well maybe a better solution is to pick a lower ascii?  Well
>>>> outside of the control characters, space is the lowest character at
>>>> 32, 33 is '!' so perhaps ! would be a better choice.  So you could
>>>> choose an end double quote as in '1234"' to define your 'stop row'.
>>>> Now you would be prohibited from using any character smaller than '33'
>>>> in your strings, which is kind of a non ideal solution.
>>>>
>>>> This is all pretty clumsy, and doesnt work great in these variable
>>>> length separated strings.
>>>>
>>>> The ultimate solution is to use the PrefixFilter, which is configured  
>>>> as
>>>> such:
>>>> byte[] start_row = Bytes.toBytes("1234|");
>>>> Scan s = new Scan(start_row);
>>>> s.setFilter(new PrefixFilter(start_row));
>>>> // do scan.
>>>>
>>>> that way no matter what sortability your separator is, you will get
>>>> the answer you want every time.
>>>>
>>>>
>>>>
>>>> Another way to do compound keys is to go pure-binary.  For example I
>>>> want a key that is 2 integers, so I can do this:
>>>> int part1 = ... ;
>>>> int part2 = ... ;
>>>> byte[] row_key = Bytes.add(Bytes.toBytes(part1),  
>>>> Bytes.toBytes(part2));
>>>>
>>>> Now you can also search for all rows starting with 'target' like such:
>>>> int target = ... ;
>>>> // start key is 'target', stop key is 'target+1'
>>>> Scan s = new Scan(Bytes.toBytes(target), Bytes.toBytes(target+1));
>>>>
>>>> And you get exactly what you want, nothing more or less (all rows
>>>> starting with 'target').
>>>>
>>>> The lexicographic comparison is very tricky sometimes. One quick tip
>>>> is that if your numbers (longs, ints) are big endian encoded (all the
>>>> utilities in Bytes.java do so), then the lexicographic sorting is
>>>> equal to the numeric sorting.  Otherwise if you do strings you end up
>>>> with:
>>>> 1
>>>> 11
>>>> 2
>>>> 3
>>>>
>>>> and things are 'out of order'... if that is important, you can pad it
>>>> with 0s - dont forget to use the proper amount, which is 10 digits for
>>>> ints, and 19 for longs.  Or consider using binary encoding as above.
>>>>
>>>> -ryan
>>>>
>>>> On Sat, Jan 29, 2011 at 12:50 AM, Tatsuya Kawano  
>>>> <ta...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Hi Pete,
>>>>>
>>>>> You're right. If you use random keys, you will never know the start /
>>>>> end keys for scan. What you really want to do is to deign the key  
>>>>> that
>>>>> will distribute well for writes but also has the certain locality for
>>>>> scans.
>>>>>
>>>>> You probably have the ideal key already (ID|Date). If you don't make
>>>>> entire key to be random but just the ID part, you could get a good
>>>>> distribution at write time because writes for different IDs will be
>>>>> distributed across the regions, and you also could get a good scan
>>>>> performance when you scan between certain dates for a specific ID
>>>>> because rows for the ID will be stored together in one region.
>>>>>
>>>>> Thanks,
>>>>> Tatsuya
>>>>>
>>>>>
>>>>> 2011/1/29 Peter Haidinyak <ph...@local.com>:
>>>>>>
>>>>>> I know they are always sorted but if they are how do you know which  
>>>>>> row
>>>>>> key belong to which data? Currently I use a row key of ID|Date so I  
>>>>>> always
>>>>>> know what the startrow and endrow should be. I know I'm missing  
>>>>>> something
>>>>>> really fundamental here. :-(
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> -Pete
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: tsuna [mailto:tsunanet@gmail.com]
>>>>>> Sent: Friday, January 28, 2011 12:14 PM
>>>>>> To: user@hbase.apache.org
>>>>>> Subject: Re: Row Keys
>>>>>>
>>>>>> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak  
>>>>>> <ph...@local.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>       This is going to seem like a dumb question but it is  
>>>>>>> recommended
>>>>>>> that you use a random key to spread the insert/read load among  
>>>>>>> your region
>>>>>>> servers. My question is if I am using a scan with startrow and  
>>>>>>> endrow  how
>>>>>>> does that work with random row keys?
>>>>>>
>>>>>> The keys are always sorted.  So if you generate random keys, you'll
>>>>>> get your data back in a random order.
>>>>>> What is recommended depends on the specific problem you're trying to
>>>>>> solve.  But generally, one of the strengths of HBase is that the  
>>>>>> rows
>>>>>> are sorted, so sequential scanning is efficient (thanks to data
>>>>>> locality).
>>>>>>
>>>>>> --
>>>>>> Benoit "tsuna" Sigoure
>>>>>> Software Engineer @ www.StumbleUpon.com
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> 河野 達也
>>>>> Tatsuya Kawano (Mr.)
>>>>> Tokyo, Japan
>>>>>
>>>>> twitter: http://twitter.com/tatsuya6502
>>>>>
>>>
>>>
>>


Re: Row Keys

Posted by Ryan Rawson <ry...@gmail.com>.
Hey,

I don't understand the 'random scan' question... if you want to scan a
random key, just scan! For example:

byte [] random_key = generateRandomKeyUsingRandomNumberGenerator();
Scan s = new Scan(random_key);

But you must mean something else... perhaps you could illuminate me?

-ryan

On Sun, Jan 30, 2011 at 10:06 PM, Lars George <la...@gmail.com> wrote:
> Hi Pete,
>
> Look into the Mozilla Socorro project
> (http://code.google.com/p/socorro/) for how to "salt" the keys to get
> better load balancing across sequential keys. The principle is to add
> a salt, in this case a number reflecting the number of servers
> available (some multiple of that to allow for growth) and then prefix
> the sequential key with it so that writes are spread across all
> servers. For example "<salt>-<yyyyMMddhhmm>". When reading you need to
> open N scanners where N is the number of distinct salt values and scan
> each subset with them while eventually combining the result in client
> code. Assuming you want to scan all values in January and you have a
> salt 0, 1, 2, 3, 4, and 5 you have scanner for "0-201101010000" to
> "0-201102010000", then another for "1-201101010000" to
> "1-201102010000" and so on. Then do the scans (multithreaded for
> example) and combine the results client side. The Socorro code shows
> one way to implement this.
>
> Lars
>
>
> On Mon, Jan 31, 2011 at 6:20 AM, Pete Haidinyak <ja...@cox.net> wrote:
>> Sorry my id is '<date>|<id>' with date being in the format 'YYYY-MM-DD'
>> My start row is '<date>|<id>| ' (with a space ascii 32) and end row is
>> '<date>|<id>|~' (tilde character) and this has worked for my data set.
>> Unfortunately the key is not distributed very well. That is why I was
>> wondering how you do a scan (using start and end row) with a random row key.
>>
>> Thanks
>>
>> -Pete
>>
>> PS. I use <date>|<id> since the id is variable length and this was my first
>> attempt. I know have a months worth of data and for my next phase I will
>> probably reverse the <date> <id> order since it will work either way.
>>
>>
>> On Sat, 29 Jan 2011 21:50:16 -0800, Ryan Rawson <ry...@gmail.com> wrote:
>>
>>> Hey,
>>>
>>> So variable length keys and lexographical sorting makes it a little
>>> tricky to do Scans and get exactly what you want.  This has a lot to
>>> do with the ascii table too, and the numerical values.  Let consult
>>> (http://www.asciitable.com/) while we work this example through:
>>>
>>> Take a separation character of | as your code uses.  This is decimal
>>> 124, placing it way above both the lower and upper case letters AND
>>> numbers, that is good.
>>>
>>> Now you have something like this:
>>>
>>> 1234|a_string
>>> 1234|other_string
>>>
>>> now we want to find all rows "belonging to" 1234, so we do a start row
>>> of '1234|', but what for the end key? Well, let's try... '1234}', that
>>> might work, oh wait, here is another key:
>>>
>>> 12345|foo
>>>
>>> ok so '5' < '|' so it should short like so:
>>> 1234|a_string
>>> 1234|other_string
>>> 12345|foo
>>>
>>> hmm well how does our end row compare? well '5' < '}' so '1234}' is
>>> still "larger" than '12345|foo' so that row would be incorrectly
>>> included in the scan results assuming we only want '1234' related
>>> rows.
>>>
>>> Ok, well maybe a better solution is to pick a lower ascii?  Well
>>> outside of the control characters, space is the lowest character at
>>> 32, 33 is '!' so perhaps ! would be a better choice.  So you could
>>> choose an end double quote as in '1234"' to define your 'stop row'.
>>> Now you would be prohibited from using any character smaller than '33'
>>> in your strings, which is kind of a non ideal solution.
>>>
>>> This is all pretty clumsy, and doesnt work great in these variable
>>> length separated strings.
>>>
>>> The ultimate solution is to use the PrefixFilter, which is configured as
>>> such:
>>> byte[] start_row = Bytes.toBytes("1234|");
>>> Scan s = new Scan(start_row);
>>> s.setFilter(new PrefixFilter(start_row));
>>> // do scan.
>>>
>>> that way no matter what sortability your separator is, you will get
>>> the answer you want every time.
>>>
>>>
>>>
>>> Another way to do compound keys is to go pure-binary.  For example I
>>> want a key that is 2 integers, so I can do this:
>>> int part1 = ... ;
>>> int part2 = ... ;
>>> byte[] row_key = Bytes.add(Bytes.toBytes(part1), Bytes.toBytes(part2));
>>>
>>> Now you can also search for all rows starting with 'target' like such:
>>> int target = ... ;
>>> // start key is 'target', stop key is 'target+1'
>>> Scan s = new Scan(Bytes.toBytes(target), Bytes.toBytes(target+1));
>>>
>>> And you get exactly what you want, nothing more or less (all rows
>>> starting with 'target').
>>>
>>> The lexicographic comparison is very tricky sometimes. One quick tip
>>> is that if your numbers (longs, ints) are big endian encoded (all the
>>> utilities in Bytes.java do so), then the lexicographic sorting is
>>> equal to the numeric sorting.  Otherwise if you do strings you end up
>>> with:
>>> 1
>>> 11
>>> 2
>>> 3
>>>
>>> and things are 'out of order'... if that is important, you can pad it
>>> with 0s - dont forget to use the proper amount, which is 10 digits for
>>> ints, and 19 for longs.  Or consider using binary encoding as above.
>>>
>>> -ryan
>>>
>>> On Sat, Jan 29, 2011 at 12:50 AM, Tatsuya Kawano <ta...@gmail.com>
>>> wrote:
>>>>
>>>> Hi Pete,
>>>>
>>>> You're right. If you use random keys, you will never know the start /
>>>> end keys for scan. What you really want to do is to deign the key that
>>>> will distribute well for writes but also has the certain locality for
>>>> scans.
>>>>
>>>> You probably have the ideal key already (ID|Date). If you don't make
>>>> entire key to be random but just the ID part, you could get a good
>>>> distribution at write time because writes for different IDs will be
>>>> distributed across the regions, and you also could get a good scan
>>>> performance when you scan between certain dates for a specific ID
>>>> because rows for the ID will be stored together in one region.
>>>>
>>>> Thanks,
>>>> Tatsuya
>>>>
>>>>
>>>> 2011/1/29 Peter Haidinyak <ph...@local.com>:
>>>>>
>>>>> I know they are always sorted but if they are how do you know which row
>>>>> key belong to which data? Currently I use a row key of ID|Date so I always
>>>>> know what the startrow and endrow should be. I know I'm missing something
>>>>> really fundamental here. :-(
>>>>>
>>>>> Thanks
>>>>>
>>>>> -Pete
>>>>>
>>>>> -----Original Message-----
>>>>> From: tsuna [mailto:tsunanet@gmail.com]
>>>>> Sent: Friday, January 28, 2011 12:14 PM
>>>>> To: user@hbase.apache.org
>>>>> Subject: Re: Row Keys
>>>>>
>>>>> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <ph...@local.com>
>>>>> wrote:
>>>>>>
>>>>>>       This is going to seem like a dumb question but it is recommended
>>>>>> that you use a random key to spread the insert/read load among your region
>>>>>> servers. My question is if I am using a scan with startrow and endrow  how
>>>>>> does that work with random row keys?
>>>>>
>>>>> The keys are always sorted.  So if you generate random keys, you'll
>>>>> get your data back in a random order.
>>>>> What is recommended depends on the specific problem you're trying to
>>>>> solve.  But generally, one of the strengths of HBase is that the rows
>>>>> are sorted, so sequential scanning is efficient (thanks to data
>>>>> locality).
>>>>>
>>>>> --
>>>>> Benoit "tsuna" Sigoure
>>>>> Software Engineer @ www.StumbleUpon.com
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> 河野 達也
>>>> Tatsuya Kawano (Mr.)
>>>> Tokyo, Japan
>>>>
>>>> twitter: http://twitter.com/tatsuya6502
>>>>
>>
>>
>

Re: Row Keys

Posted by Lars George <la...@gmail.com>.
Hi Pete,

Look into the Mozilla Socorro project
(http://code.google.com/p/socorro/) for how to "salt" the keys to get
better load balancing across sequential keys. The principle is to add
a salt, in this case a number reflecting the number of servers
available (or some multiple of that, to allow for growth), and then
prefix the sequential key with it so that writes are spread across all
servers, for example "<salt>-<yyyyMMddhhmm>". When reading, you need to
open N scanners, where N is the number of distinct salt values, scan
each subset with them, and eventually combine the results in client
code. Assuming you want to scan all values in January and you have
salts 0, 1, 2, 3, 4, and 5, you have a scanner for "0-201101010000" to
"0-201102010000", then another for "1-201101010000" to
"1-201102010000", and so on. Then do the scans (multithreaded, for
example) and combine the results client side. The Socorro code shows
one way to implement this.

Lars
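
A minimal sketch of that salted-scan pattern (a sketch only; the salt
count, table name, and key range are placeholders, the 0.90-style client
API is assumed, and conf is an existing configuration):

int numSalts = 6;                                  // distinct salt values
HTable table = new HTable(conf, "mytable");
List<Result> combined = new ArrayList<Result>();
for (int salt = 0; salt < numSalts; salt++) {
  byte[] start = Bytes.toBytes(salt + "-201101010000");
  byte[] stop  = Bytes.toBytes(salt + "-201102010000");
  ResultScanner scanner = table.getScanner(new Scan(start, stop));
  try {
    for (Result r : scanner) {
      combined.add(r);                             // combine results client side
    }
  } finally {
    scanner.close();
  }
}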


On Mon, Jan 31, 2011 at 6:20 AM, Pete Haidinyak <ja...@cox.net> wrote:
> Sorry my id is '<date>|<id>' with date being in the format 'YYYY-MM-DD'
> My start row is '<date>|<id>| ' (with a space ascii 32) and end row is
> '<date>|<id>|~' (tilde character) and this has worked for my data set.
> Unfortunately the key is not distributed very well. That is why I was
> wondering how you do a scan (using start and end row) with a random row key.
>
> Thanks
>
> -Pete
>
> PS. I use <date>|<id> since the id is variable length and this was my first
> attempt. I know have a months worth of data and for my next phase I will
> probably reverse the <date> <id> order since it will work either way.
>
>
> On Sat, 29 Jan 2011 21:50:16 -0800, Ryan Rawson <ry...@gmail.com> wrote:
>
>> Hey,
>>
>> So variable length keys and lexographical sorting makes it a little
>> tricky to do Scans and get exactly what you want.  This has a lot to
>> do with the ascii table too, and the numerical values.  Let consult
>> (http://www.asciitable.com/) while we work this example through:
>>
>> Take a separation character of | as your code uses.  This is decimal
>> 124, placing it way above both the lower and upper case letters AND
>> numbers, that is good.
>>
>> Now you have something like this:
>>
>> 1234|a_string
>> 1234|other_string
>>
>> now we want to find all rows "belonging to" 1234, so we do a start row
>> of '1234|', but what for the end key? Well, let's try... '1234}', that
>> might work, oh wait, here is another key:
>>
>> 12345|foo
>>
>> ok so '5' < '|' so it should short like so:
>> 1234|a_string
>> 1234|other_string
>> 12345|foo
>>
>> hmm well how does our end row compare? well '5' < '}' so '1234}' is
>> still "larger" than '12345|foo' so that row would be incorrectly
>> included in the scan results assuming we only want '1234' related
>> rows.
>>
>> Ok, well maybe a better solution is to pick a lower ascii?  Well
>> outside of the control characters, space is the lowest character at
>> 32, 33 is '!' so perhaps ! would be a better choice.  So you could
>> choose an end double quote as in '1234"' to define your 'stop row'.
>> Now you would be prohibited from using any character smaller than '33'
>> in your strings, which is kind of a non ideal solution.
>>
>> This is all pretty clumsy, and doesnt work great in these variable
>> length separated strings.
>>
>> The ultimate solution is to use the PrefixFilter, which is configured as
>> such:
>> byte[] start_row = Bytes.toBytes("1234|");
>> Scan s = new Scan(start_row);
>> s.setFilter(new PrefixFilter(start_row));
>> // do scan.
>>
>> that way no matter what sortability your separator is, you will get
>> the answer you want every time.
>>
>>
>>
>> Another way to do compound keys is to go pure-binary.  For example I
>> want a key that is 2 integers, so I can do this:
>> int part1 = ... ;
>> int part2 = ... ;
>> byte[] row_key = Bytes.add(Bytes.toBytes(part1), Bytes.toBytes(part2));
>>
>> Now you can also search for all rows starting with 'target' like such:
>> int target = ... ;
>> // start key is 'target', stop key is 'target+1'
>> Scan s = new Scan(Bytes.toBytes(target), Bytes.toBytes(target+1));
>>
>> And you get exactly what you want, nothing more or less (all rows
>> starting with 'target').
>>
>> The lexicographic comparison is very tricky sometimes. One quick tip
>> is that if your numbers (longs, ints) are big endian encoded (all the
>> utilities in Bytes.java do so), then the lexicographic sorting is
>> equal to the numeric sorting.  Otherwise if you do strings you end up
>> with:
>> 1
>> 11
>> 2
>> 3
>>
>> and things are 'out of order'... if that is important, you can pad it
>> with 0s - dont forget to use the proper amount, which is 10 digits for
>> ints, and 19 for longs.  Or consider using binary encoding as above.
>>
>> -ryan
>>
>> On Sat, Jan 29, 2011 at 12:50 AM, Tatsuya Kawano <ta...@gmail.com>
>> wrote:
>>>
>>> Hi Pete,
>>>
>>> You're right. If you use random keys, you will never know the start /
>>> end keys for scan. What you really want to do is to deign the key that
>>> will distribute well for writes but also has the certain locality for
>>> scans.
>>>
>>> You probably have the ideal key already (ID|Date). If you don't make
>>> entire key to be random but just the ID part, you could get a good
>>> distribution at write time because writes for different IDs will be
>>> distributed across the regions, and you also could get a good scan
>>> performance when you scan between certain dates for a specific ID
>>> because rows for the ID will be stored together in one region.
>>>
>>> Thanks,
>>> Tatsuya
>>>
>>>
>>> 2011/1/29 Peter Haidinyak <ph...@local.com>:
>>>>
>>>> I know they are always sorted but if they are how do you know which row
>>>> key belong to which data? Currently I use a row key of ID|Date so I always
>>>> know what the startrow and endrow should be. I know I'm missing something
>>>> really fundamental here. :-(
>>>>
>>>> Thanks
>>>>
>>>> -Pete
>>>>
>>>> -----Original Message-----
>>>> From: tsuna [mailto:tsunanet@gmail.com]
>>>> Sent: Friday, January 28, 2011 12:14 PM
>>>> To: user@hbase.apache.org
>>>> Subject: Re: Row Keys
>>>>
>>>> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <ph...@local.com>
>>>> wrote:
>>>>>
>>>>>       This is going to seem like a dumb question but it is recommended
>>>>> that you use a random key to spread the insert/read load among your region
>>>>> servers. My question is if I am using a scan with startrow and endrow  how
>>>>> does that work with random row keys?
>>>>
>>>> The keys are always sorted.  So if you generate random keys, you'll
>>>> get your data back in a random order.
>>>> What is recommended depends on the specific problem you're trying to
>>>> solve.  But generally, one of the strengths of HBase is that the rows
>>>> are sorted, so sequential scanning is efficient (thanks to data
>>>> locality).
>>>>
>>>> --
>>>> Benoit "tsuna" Sigoure
>>>> Software Engineer @ www.StumbleUpon.com
>>>>
>>>
>>>
>>>
>>> --
>>> 河野 達也
>>> Tatsuya Kawano (Mr.)
>>> Tokyo, Japan
>>>
>>> twitter: http://twitter.com/tatsuya6502
>>>
>
>

Re: Row Keys

Posted by Pete Haidinyak <ja...@cox.net>.
Sorry, my key is '<date>|<id>', with the date in the format 'YYYY-MM-DD'.
My start row is '<date>|<id>| ' (with a space, ASCII 32) and my end row is
'<date>|<id>|~' (a tilde, ASCII 126), and this has worked for my data set.
Unfortunately the key is not distributed very well. That is why I was
wondering how you do a scan (using start and end row) with a random row
key.

Thanks

-Pete

PS. I use <date>|<id> since the id is variable length and this was my
first attempt. I now have a month's worth of data, and for my next phase I
will probably reverse the <date> and <id> order, since it will work either
way.
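
For reference, a minimal sketch of the scan bounds described above (the
date and id values are placeholders):

String date = "2011-01-28";
String id = "1234";
byte[] startRow = Bytes.toBytes(date + "|" + id + "| ");  // space, ASCII 32
byte[] stopRow  = Bytes.toBytes(date + "|" + id + "|~");  // tilde, ASCII 126
Scan s = new Scan(startRow, stopRow);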


On Sat, 29 Jan 2011 21:50:16 -0800, Ryan Rawson <ry...@gmail.com> wrote:

> Hey,
>
> So variable length keys and lexographical sorting makes it a little
> tricky to do Scans and get exactly what you want.  This has a lot to
> do with the ascii table too, and the numerical values.  Let consult
> (http://www.asciitable.com/) while we work this example through:
>
> Take a separation character of | as your code uses.  This is decimal
> 124, placing it way above both the lower and upper case letters AND
> numbers, that is good.
>
> Now you have something like this:
>
> 1234|a_string
> 1234|other_string
>
> now we want to find all rows "belonging to" 1234, so we do a start row
> of '1234|', but what for the end key? Well, let's try... '1234}', that
> might work, oh wait, here is another key:
>
> 12345|foo
>
> ok so '5' < '|' so it should short like so:
> 1234|a_string
> 1234|other_string
> 12345|foo
>
> hmm well how does our end row compare? well '5' < '}' so '1234}' is
> still "larger" than '12345|foo' so that row would be incorrectly
> included in the scan results assuming we only want '1234' related
> rows.
>
> Ok, well maybe a better solution is to pick a lower ascii?  Well
> outside of the control characters, space is the lowest character at
> 32, 33 is '!' so perhaps ! would be a better choice.  So you could
> choose an end double quote as in '1234"' to define your 'stop row'.
> Now you would be prohibited from using any character smaller than '33'
> in your strings, which is kind of a non ideal solution.
>
> This is all pretty clumsy, and doesnt work great in these variable
> length separated strings.
>
> The ultimate solution is to use the PrefixFilter, which is configured as  
> such:
> byte[] start_row = Bytes.toBytes("1234|");
> Scan s = new Scan(start_row);
> s.setFilter(new PrefixFilter(start_row));
> // do scan.
>
> that way no matter what sortability your separator is, you will get
> the answer you want every time.
>
>
>
> Another way to do compound keys is to go pure-binary.  For example I
> want a key that is 2 integers, so I can do this:
> int part1 = ... ;
> int part2 = ... ;
> byte[] row_key = Bytes.add(Bytes.toBytes(part1), Bytes.toBytes(part2));
>
> Now you can also search for all rows starting with 'target' like such:
> int target = ... ;
> // start key is 'target', stop key is 'target+1'
> Scan s = new Scan(Bytes.toBytes(target), Bytes.toBytes(target+1));
>
> And you get exactly what you want, nothing more or less (all rows
> starting with 'target').
>
> The lexicographic comparison is very tricky sometimes. One quick tip
> is that if your numbers (longs, ints) are big endian encoded (all the
> utilities in Bytes.java do so), then the lexicographic sorting is
> equal to the numeric sorting.  Otherwise if you do strings you end up
> with:
> 1
> 11
> 2
> 3
>
> and things are 'out of order'... if that is important, you can pad it
> with 0s - dont forget to use the proper amount, which is 10 digits for
> ints, and 19 for longs.  Or consider using binary encoding as above.
>
> -ryan
>
> On Sat, Jan 29, 2011 at 12:50 AM, Tatsuya Kawano <ta...@gmail.com>  
> wrote:
>> Hi Pete,
>>
>> You're right. If you use random keys, you will never know the start /
>> end keys for scan. What you really want to do is to deign the key that
>> will distribute well for writes but also has the certain locality for
>> scans.
>>
>> You probably have the ideal key already (ID|Date). If you don't make
>> entire key to be random but just the ID part, you could get a good
>> distribution at write time because writes for different IDs will be
>> distributed across the regions, and you also could get a good scan
>> performance when you scan between certain dates for a specific ID
>> because rows for the ID will be stored together in one region.
>>
>> Thanks,
>> Tatsuya
>>
>>
>> 2011/1/29 Peter Haidinyak <ph...@local.com>:
>>> I know they are always sorted but if they are how do you know which  
>>> row key belong to which data? Currently I use a row key of ID|Date so  
>>> I always know what the startrow and endrow should be. I know I'm  
>>> missing something really fundamental here. :-(
>>>
>>> Thanks
>>>
>>> -Pete
>>>
>>> -----Original Message-----
>>> From: tsuna [mailto:tsunanet@gmail.com]
>>> Sent: Friday, January 28, 2011 12:14 PM
>>> To: user@hbase.apache.org
>>> Subject: Re: Row Keys
>>>
>>> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak  
>>> <ph...@local.com> wrote:
>>>>        This is going to seem like a dumb question but it is  
>>>> recommended that you use a random key to spread the insert/read load  
>>>> among your region servers. My question is if I am using a scan with  
>>>> startrow and endrow  how does that work with random row keys?
>>>
>>> The keys are always sorted.  So if you generate random keys, you'll
>>> get your data back in a random order.
>>> What is recommended depends on the specific problem you're trying to
>>> solve.  But generally, one of the strengths of HBase is that the rows
>>> are sorted, so sequential scanning is efficient (thanks to data
>>> locality).
>>>
>>> --
>>> Benoit "tsuna" Sigoure
>>> Software Engineer @ www.StumbleUpon.com
>>>
>>
>>
>>
>> --
>> 河野 達也
>> Tatsuya Kawano (Mr.)
>> Tokyo, Japan
>>
>> twitter: http://twitter.com/tatsuya6502
>>


Re: Row Keys

Posted by Ryan Rawson <ry...@gmail.com>.
Hey,

So variable-length keys and lexicographical sorting make it a little
tricky to do Scans and get exactly what you want.  This has a lot to
do with the ASCII table and the numerical values of the characters.
Let's consult the ASCII table (http://www.asciitable.com/) while we
work through this example:

Take a separator character of '|', as your code uses.  This is decimal
124, placing it well above both the lower- and upper-case letters AND
the digits, which is good.

Now you have something like this:

1234|a_string
1234|other_string

Now we want to find all rows "belonging to" 1234, so we use a start row
of '1234|', but what about the end key? Well, let's try '1234}'. That
might work... oh wait, here is another key:

12345|foo

OK, so '5' < '|', so it should sort like so:
1234|a_string
1234|other_string
12345|foo

Hmm, how does our end row compare? Well, '5' < '}', so '1234}' is
still "larger" than '12345|foo', and that row would be incorrectly
included in the scan results, assuming we only want rows related to
'1234'.

OK, maybe a better solution is to pick a lower ASCII separator?  Well,
outside of the control characters, space is the lowest character at
32, and 33 is '!', so perhaps '!' would be a better choice.  With '!'
as the separator you could use the next character, a double quote
(ASCII 34), as in '1234"', to define your 'stop row'.  Now you would
be prohibited from using any character smaller than 33 in your
strings, which is a non-ideal solution.

This is all pretty clumsy, and doesn't work well with these
variable-length, separator-delimited strings.

The ultimate solution is to use the PrefixFilter, which is configured as such:
byte[] start_row = Bytes.toBytes("1234|");
Scan s = new Scan(start_row);
s.setFilter(new PrefixFilter(start_row));
// do scan.

That way, no matter how your separator sorts, you will get the answer
you want every time.



Another way to do compound keys is to go pure-binary.  For example I
want a key that is 2 integers, so I can do this:
int part1 = ... ;
int part2 = ... ;
byte[] row_key = Bytes.add(Bytes.toBytes(part1), Bytes.toBytes(part2));

Now you can also search for all rows starting with 'target' like such:
int target = ... ;
// start key is 'target', stop key is 'target+1'
Scan s = new Scan(Bytes.toBytes(target), Bytes.toBytes(target+1));

And you get exactly what you want, nothing more or less (all rows
starting with 'target').

Lexicographic comparison is very tricky sometimes. One quick tip:
if your numbers (longs, ints) are big-endian encoded (all the
utilities in Bytes.java do this), then the lexicographic sorting is
equal to the numeric sorting.  Otherwise, if you use strings you end
up with:
1
11
2
3

and things are 'out of order'... if that is important, you can pad them
with zeros; don't forget to use the proper width, which is 10 digits for
ints and 19 for longs.  Or consider using binary encoding as above.

-ryan
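
A minimal sketch of the zero-padding idea above (a sketch only; the
values are placeholders):

int id = 42;
long ts = 1296190800000L;
String intKey  = String.format("%010d", id);  // "0000000042"
String longKey = String.format("%019d", ts);  // 19 digits, zero-padded
byte[] row = Bytes.toBytes(intKey + "|" + longKey);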

On Sat, Jan 29, 2011 at 12:50 AM, Tatsuya Kawano <ta...@gmail.com> wrote:
> Hi Pete,
>
> You're right. If you use random keys, you will never know the start /
> end keys for scan. What you really want to do is to deign the key that
> will distribute well for writes but also has the certain locality for
> scans.
>
> You probably have the ideal key already (ID|Date). If you don't make
> entire key to be random but just the ID part, you could get a good
> distribution at write time because writes for different IDs will be
> distributed across the regions, and you also could get a good scan
> performance when you scan between certain dates for a specific ID
> because rows for the ID will be stored together in one region.
>
> Thanks,
> Tatsuya
>
>
> 2011/1/29 Peter Haidinyak <ph...@local.com>:
>> I know they are always sorted but if they are how do you know which row key belong to which data? Currently I use a row key of ID|Date so I always know what the startrow and endrow should be. I know I'm missing something really fundamental here. :-(
>>
>> Thanks
>>
>> -Pete
>>
>> -----Original Message-----
>> From: tsuna [mailto:tsunanet@gmail.com]
>> Sent: Friday, January 28, 2011 12:14 PM
>> To: user@hbase.apache.org
>> Subject: Re: Row Keys
>>
>> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <ph...@local.com> wrote:
>>>        This is going to seem like a dumb question but it is recommended that you use a random key to spread the insert/read load among your region servers. My question is if I am using a scan with startrow and endrow  how does that work with random row keys?
>>
>> The keys are always sorted.  So if you generate random keys, you'll
>> get your data back in a random order.
>> What is recommended depends on the specific problem you're trying to
>> solve.  But generally, one of the strengths of HBase is that the rows
>> are sorted, so sequential scanning is efficient (thanks to data
>> locality).
>>
>> --
>> Benoit "tsuna" Sigoure
>> Software Engineer @ www.StumbleUpon.com
>>
>
>
>
> --
> 河野 達也
> Tatsuya Kawano (Mr.)
> Tokyo, Japan
>
> twitter: http://twitter.com/tatsuya6502
>

Re: Row Keys

Posted by Tatsuya Kawano <ta...@gmail.com>.
Hi Pete,

You're right. If you use random keys, you will never know the start /
end keys for a scan. What you really want to do is to design a key that
distributes well for writes but also has a certain locality for scans.

You probably have the ideal key already (ID|Date). If you don't make
the entire key random but just the ID part, you could get a good
distribution at write time, because writes for different IDs will be
distributed across the regions, and you could also get good scan
performance when you scan between certain dates for a specific ID,
because rows for that ID will be stored together in one region.

Thanks,
Tatsuya
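
For example, with an ID|Date key a date-range scan for one ID might look
like this (a minimal sketch; the ID, dates, and table name are
placeholders, the 0.90-style client API is assumed, and conf is an
existing configuration):

String id = "1234";
byte[] start = Bytes.toBytes(id + "|2011-01-01");
byte[] stop  = Bytes.toBytes(id + "|2011-02-01");  // stop row is exclusive
ResultScanner scanner =
    new HTable(conf, "mytable").getScanner(new Scan(start, stop));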


2011/1/29 Peter Haidinyak <ph...@local.com>:
> I know they are always sorted but if they are how do you know which row key belong to which data? Currently I use a row key of ID|Date so I always know what the startrow and endrow should be. I know I'm missing something really fundamental here. :-(
>
> Thanks
>
> -Pete
>
> -----Original Message-----
> From: tsuna [mailto:tsunanet@gmail.com]
> Sent: Friday, January 28, 2011 12:14 PM
> To: user@hbase.apache.org
> Subject: Re: Row Keys
>
> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <ph...@local.com> wrote:
>>        This is going to seem like a dumb question but it is recommended that you use a random key to spread the insert/read load among your region servers. My question is if I am using a scan with startrow and endrow  how does that work with random row keys?
>
> The keys are always sorted.  So if you generate random keys, you'll
> get your data back in a random order.
> What is recommended depends on the specific problem you're trying to
> solve.  But generally, one of the strengths of HBase is that the rows
> are sorted, so sequential scanning is efficient (thanks to data
> locality).
>
> --
> Benoit "tsuna" Sigoure
> Software Engineer @ www.StumbleUpon.com
>



-- 
河野 達也
Tatsuya Kawano (Mr.)
Tokyo, Japan

twitter: http://twitter.com/tatsuya6502

RE: Row Keys

Posted by Peter Haidinyak <ph...@local.com>.
I know they are always sorted, but if they are, how do you know which row key belongs to which data? Currently I use a row key of ID|Date, so I always know what the startrow and endrow should be. I know I'm missing something really fundamental here. :-(

Thanks

-Pete

-----Original Message-----
From: tsuna [mailto:tsunanet@gmail.com] 
Sent: Friday, January 28, 2011 12:14 PM
To: user@hbase.apache.org
Subject: Re: Row Keys

On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <ph...@local.com> wrote:
>        This is going to seem like a dumb question but it is recommended that you use a random key to spread the insert/read load among your region servers. My question is if I am using a scan with startrow and endrow  how does that work with random row keys?

The keys are always sorted.  So if you generate random keys, you'll
get your data back in a random order.
What is recommended depends on the specific problem you're trying to
solve.  But generally, one of the strengths of HBase is that the rows
are sorted, so sequential scanning is efficient (thanks to data
locality).

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

Re: Row Keys

Posted by tsuna <ts...@gmail.com>.
On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <ph...@local.com> wrote:
>        This is going to seem like a dumb question but it is recommended that you use a random key to spread the insert/read load among your region servers. My question is if I am using a scan with startrow and endrow  how does that work with random row keys?

The keys are always sorted.  So if you generate random keys, you'll
get your data back in a random order.
What is recommended depends on the specific problem you're trying to
solve.  But generally, one of the strengths of HBase is that the rows
are sorted, so sequential scanning is efficient (thanks to data
locality).

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

Row Keys

Posted by Peter Haidinyak <ph...@local.com>.
Hi, 
	This is going to seem like a dumb question, but it is recommended that you use a random key to spread the insert/read load among your region servers. My question is: if I am using a scan with startrow and endrow, how does that work with random row keys?

Thanks

-Pete 

Re: Use loadtable.rb with compressed data?

Posted by Stack <st...@duboce.net>.
So, seems like in 0.20.6, we're not doing compression right.
St.Ack

On Fri, Jan 28, 2011 at 11:23 AM, Nanheng Wu <na...@gmail.com> wrote:
> Ah, sorry I should've read the usage. I ran it just now and the meta
> data dump threw the same error "Not in GZIP format"
>
> On Fri, Jan 28, 2011 at 10:51 AM, Stack <st...@duboce.net> wrote:
>> hfile metadata, the -m option?
>> St.Ack
>>
>> On Fri, Jan 28, 2011 at 10:41 AM, Nanheng Wu <na...@gmail.com> wrote:
>>> Sorry, by dumping the metadata did you mean running the same HFile
>>> tool on ".region" file in each region?
>>>
>>> On Fri, Jan 28, 2011 at 10:25 AM, Stack <st...@duboce.net> wrote:
>>>> If you dump the metadata, does it claim GZIP compressor?  If so, yeah,
>>>> seems to be mismatch between what data is and what metadata is.
>>>> St.Ack
>>>>
>>>> On Fri, Jan 28, 2011 at 9:58 AM, Nanheng Wu <na...@gmail.com> wrote:
>>>>> Awesome. I ran it on one of the hfiles and got this:
>>>>> 11/01/28 09:57:15 INFO compress.CodecPool: Got brand-new decompressor
>>>>> java.io.IOException: Not in GZIP format
>>>>>        at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:137)
>>>>>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
>>>>>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
>>>>>        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
>>>>>        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
>>>>>        at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
>>>>>        at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
>>>>>        at org.apache.hadoop.hbase.io.hfile.Compression$Algorithm.createDecompressionStream(Compression.java:168)
>>>>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1013)
>>>>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:966)
>>>>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1291)
>>>>>        at org.apache.hadoop.hbase.io.hfile.HFile.main(HFile.java:1740)
>>>>>
>>>>> So the problem could be that HFile writer is not writing properly
>>>>> gzipped outputs?
>>>>>
>>>>>
>>>>> On Fri, Jan 28, 2011 at 9:41 AM, Stack <st...@duboce.net> wrote:
>>>>>> The section in 0.90 book on hfile tool should apply to 0.20.6:
>>>>>> http://hbase.apache.org/ch08s02.html#hfile_tool  It might help you w/
>>>>>> your explorations.
>>>>>>
>>>>>> St.Ack
>>>>>>
>>>>>> On Fri, Jan 28, 2011 at 9:38 AM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>> Hi Stack,
>>>>>>>
>>>>>>>  Get doesn't work either. It was a fresh table created by
>>>>>>> loadtable.rb. Finally, the uncompressed version had the same number of
>>>>>>> regions (8 total). I totally understand you guys shouldn't be patching
>>>>>>> the older version, upgrading for me is an option but will be pretty
>>>>>>> painful. I wonder if I can figure something out by comparing the two
>>>>>>> version's Hfile. Thanks again!
>>>>>>>
>>>>>>> On Fri, Jan 28, 2011 at 9:14 AM, Stack <st...@duboce.net> wrote:
>>>>>>>> On Thu, Jan 27, 2011 at 9:35 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>>> In the compressed case, there are 8 regions and the region start/end
>>>>>>>>> keys do line up. Which actually is confusing to me, how can hbase read
>>>>>>>>> the files if they are compressed? does each hfile have some metadata
>>>>>>>>> in it that has compression info?
>>>>>>>>
>>>>>>>> You got it.
>>>>>>>>
>>>>>>>>> Anyway, the regions are the same
>>>>>>>>> (numbers and boundaries are same) in both compressed and uncompressed
>>>>>>>>> version. So what else should I look into to fix this? Thanks again!
>>>>>>>>
>>>>>>>> You can't scan. Can you Get from the table at all?  Try getting start
>>>>>>>> key from a few of the regions you see in .META.
>>>>>>>>
>>>>>>>> Did this table preexist or was this a fresh creation?
>>>>>>>>
>>>>>>>> When you created this table uncompressed, how many regions was it?
>>>>>>>>
>>>>>>>> How about just running uncompressed while you are on 0.20.6?  We'd
>>>>>>>> rather be fixing bugs in the new stuff, not the version that we are
>>>>>>>> leaving behind?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> St.Ack
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Use loadtable.rb with compressed data?

Posted by Nanheng Wu <na...@gmail.com>.
Ah, sorry, I should've read the usage. I ran it just now and the metadata
dump threw the same error, "Not in GZIP format".

On Fri, Jan 28, 2011 at 10:51 AM, Stack <st...@duboce.net> wrote:
> hfile metadata, the -m option?
> St.Ack
>
> On Fri, Jan 28, 2011 at 10:41 AM, Nanheng Wu <na...@gmail.com> wrote:
>> Sorry, by dumping the metadata did you mean running the same HFile
>> tool on ".region" file in each region?
>>
>> On Fri, Jan 28, 2011 at 10:25 AM, Stack <st...@duboce.net> wrote:
>>> If you dump the metadata, does it claim GZIP compressor?  If so, yeah,
>>> seems to be mismatch between what data is and what metadata is.
>>> St.Ack
>>>
>>> On Fri, Jan 28, 2011 at 9:58 AM, Nanheng Wu <na...@gmail.com> wrote:
>>>> Awesome. I ran it on one of the hfiles and got this:
>>>> 11/01/28 09:57:15 INFO compress.CodecPool: Got brand-new decompressor
>>>> java.io.IOException: Not in GZIP format
>>>>        at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:137)
>>>>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
>>>>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
>>>>        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
>>>>        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
>>>>        at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
>>>>        at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
>>>>        at org.apache.hadoop.hbase.io.hfile.Compression$Algorithm.createDecompressionStream(Compression.java:168)
>>>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1013)
>>>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:966)
>>>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1291)
>>>>        at org.apache.hadoop.hbase.io.hfile.HFile.main(HFile.java:1740)
>>>>
>>>> So the problem could be that HFile writer is not writing properly
>>>> gzipped outputs?
>>>>
>>>>
>>>> On Fri, Jan 28, 2011 at 9:41 AM, Stack <st...@duboce.net> wrote:
>>>>> The section in 0.90 book on hfile tool should apply to 0.20.6:
>>>>> http://hbase.apache.org/ch08s02.html#hfile_tool  It might help you w/
>>>>> your explorations.
>>>>>
>>>>> St.Ack
>>>>>
>>>>> On Fri, Jan 28, 2011 at 9:38 AM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>> Hi Stack,
>>>>>>
>>>>>>  Get doesn't work either. It was a fresh table created by
>>>>>> loadtable.rb. Finally, the uncompressed version had the same number of
>>>>>> regions (8 total). I totally understand you guys shouldn't be patching
>>>>>> the older version, upgrading for me is an option but will be pretty
>>>>>> painful. I wonder if I can figure something out by comparing the two
>>>>>> version's Hfile. Thanks again!
>>>>>>
>>>>>> On Fri, Jan 28, 2011 at 9:14 AM, Stack <st...@duboce.net> wrote:
>>>>>>> On Thu, Jan 27, 2011 at 9:35 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>> In the compressed case, there are 8 regions and the region start/end
>>>>>>>> keys do line up. Which actually is confusing to me, how can hbase read
>>>>>>>> the files if they are compressed? does each hfile have some metadata
>>>>>>>> in it that has compression info?
>>>>>>>
>>>>>>> You got it.
>>>>>>>
>>>>>>>> Anyway, the regions are the same
>>>>>>>> (numbers and boundaries are same) in both compressed and uncompressed
>>>>>>>> version. So what else should I look into to fix this? Thanks again!
>>>>>>>
>>>>>>> You can't scan. Can you Get from the table at all?  Try getting start
>>>>>>> key from a few of the regions you see in .META.
>>>>>>>
>>>>>>> Did this table preexist or was this a fresh creation?
>>>>>>>
>>>>>>> When you created this table uncompressed, how many regions was it?
>>>>>>>
>>>>>>> How about just running uncompressed while you are on 0.20.6?  We'd
>>>>>>> rather be fixing bugs in the new stuff, not the version that we are
>>>>>>> leaving behind?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> St.Ack
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Use loadtable.rb with compressed data?

Posted by Stack <st...@duboce.net>.
hfile metadata, the -m option?
St.Ack

On Fri, Jan 28, 2011 at 10:41 AM, Nanheng Wu <na...@gmail.com> wrote:
> Sorry, by dumping the metadata did you mean running the same HFile
> tool on ".region" file in each region?
>
> On Fri, Jan 28, 2011 at 10:25 AM, Stack <st...@duboce.net> wrote:
>> If you dump the metadata, does it claim GZIP compressor?  If so, yeah,
>> seems to be mismatch between what data is and what metadata is.
>> St.Ack
>>
>> On Fri, Jan 28, 2011 at 9:58 AM, Nanheng Wu <na...@gmail.com> wrote:
>>> Awesome. I ran it on one of the hfiles and got this:
>>> 11/01/28 09:57:15 INFO compress.CodecPool: Got brand-new decompressor
>>> java.io.IOException: Not in GZIP format
>>>        at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:137)
>>>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
>>>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
>>>        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
>>>        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
>>>        at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
>>>        at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
>>>        at org.apache.hadoop.hbase.io.hfile.Compression$Algorithm.createDecompressionStream(Compression.java:168)
>>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1013)
>>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:966)
>>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1291)
>>>        at org.apache.hadoop.hbase.io.hfile.HFile.main(HFile.java:1740)
>>>
>>> So the problem could be that HFile writer is not writing properly
>>> gzipped outputs?
>>>
>>>
>>> On Fri, Jan 28, 2011 at 9:41 AM, Stack <st...@duboce.net> wrote:
>>>> The section in 0.90 book on hfile tool should apply to 0.20.6:
>>>> http://hbase.apache.org/ch08s02.html#hfile_tool  It might help you w/
>>>> your explorations.
>>>>
>>>> St.Ack
>>>>
>>>> On Fri, Jan 28, 2011 at 9:38 AM, Nanheng Wu <na...@gmail.com> wrote:
>>>>> Hi Stack,
>>>>>
>>>>>  Get doesn't work either. It was a fresh table created by
>>>>> loadtable.rb. Finally, the uncompressed version had the same number of
>>>>> regions (8 total). I totally understand you guys shouldn't be patching
>>>>> the older version, upgrading for me is an option but will be pretty
>>>>> painful. I wonder if I can figure something out by comparing the two
>>>>> version's Hfile. Thanks again!
>>>>>
>>>>> On Fri, Jan 28, 2011 at 9:14 AM, Stack <st...@duboce.net> wrote:
>>>>>> On Thu, Jan 27, 2011 at 9:35 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>> In the compressed case, there are 8 regions and the region start/end
>>>>>>> keys do line up. Which actually is confusing to me, how can hbase read
>>>>>>> the files if they are compressed? does each hfile have some metadata
>>>>>>> in it that has compression info?
>>>>>>
>>>>>> You got it.
>>>>>>
>>>>>>> Anyway, the regions are the same
>>>>>>> (numbers and boundaries are same) in both compressed and uncompressed
>>>>>>> version. So what else should I look into to fix this? Thanks again!
>>>>>>
>>>>>> You can't scan. Can you Get from the table at all?  Try getting start
>>>>>> key from a few of the regions you see in .META.
>>>>>>
>>>>>> Did this table preexist or was this a fresh creation?
>>>>>>
>>>>>> When you created this table uncompressed, how many regions was it?
>>>>>>
>>>>>> How about just running uncompressed while you are on 0.20.6?  We'd
>>>>>> rather be fixing bugs in the new stuff, not the version that we are
>>>>>> leaving behind?
>>>>>>
>>>>>> Thanks,
>>>>>> St.Ack
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Use loadtable.rb with compressed data?

Posted by Nanheng Wu <na...@gmail.com>.
Sorry, by dumping the metadata did you mean running the same HFile
tool on the ".region" file in each region?

On Fri, Jan 28, 2011 at 10:25 AM, Stack <st...@duboce.net> wrote:
> If you dump the metadata, does it claim GZIP compressor?  If so, yeah,
> seems to be mismatch between what data is and what metadata is.
> St.Ack
>
> On Fri, Jan 28, 2011 at 9:58 AM, Nanheng Wu <na...@gmail.com> wrote:
>> Awesome. I ran it on one of the hfiles and got this:
>> 11/01/28 09:57:15 INFO compress.CodecPool: Got brand-new decompressor
>> java.io.IOException: Not in GZIP format
>>        at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:137)
>>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
>>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
>>        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
>>        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
>>        at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
>>        at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
>>        at org.apache.hadoop.hbase.io.hfile.Compression$Algorithm.createDecompressionStream(Compression.java:168)
>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1013)
>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:966)
>>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1291)
>>        at org.apache.hadoop.hbase.io.hfile.HFile.main(HFile.java:1740)
>>
>> So the problem could be that HFile writer is not writing properly
>> gzipped outputs?
>>
>>
>> On Fri, Jan 28, 2011 at 9:41 AM, Stack <st...@duboce.net> wrote:
>>> The section in 0.90 book on hfile tool should apply to 0.20.6:
>>> http://hbase.apache.org/ch08s02.html#hfile_tool  It might help you w/
>>> your explorations.
>>>
>>> St.Ack
>>>
>>> On Fri, Jan 28, 2011 at 9:38 AM, Nanheng Wu <na...@gmail.com> wrote:
>>>> Hi Stack,
>>>>
>>>>  Get doesn't work either. It was a fresh table created by
>>>> loadtable.rb. Finally, the uncompressed version had the same number of
>>>> regions (8 total). I totally understand you guys shouldn't be patching
>>>> the older version, upgrading for me is an option but will be pretty
>>>> painful. I wonder if I can figure something out by comparing the two
>>>> version's Hfile. Thanks again!
>>>>
>>>> On Fri, Jan 28, 2011 at 9:14 AM, Stack <st...@duboce.net> wrote:
>>>>> On Thu, Jan 27, 2011 at 9:35 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>> In the compressed case, there are 8 regions and the region start/end
>>>>>> keys do line up. Which actually is confusing to me, how can hbase read
>>>>>> the files if they are compressed? does each hfile have some metadata
>>>>>> in it that has compression info?
>>>>>
>>>>> You got it.
>>>>>
>>>>>> Anyway, the regions are the same
>>>>>> (numbers and boundaries are same) in both compressed and uncompressed
>>>>>> version. So what else should I look into to fix this? Thanks again!
>>>>>
>>>>> You can't scan. Can you Get from the table at all?  Try getting start
>>>>> key from a few of the regions you see in .META.
>>>>>
>>>>> Did this table preexist or was this a fresh creation?
>>>>>
>>>>> When you created this table uncompressed, how many regions was it?
>>>>>
>>>>> How about just running uncompressed while you are on 0.20.6?  We'd
>>>>> rather be fixing bugs in the new stuff, not the version that we are
>>>>> leaving behind?
>>>>>
>>>>> Thanks,
>>>>> St.Ack
>>>>>
>>>>
>>>
>>
>

Re: Use loadtable.rb with compressed data?

Posted by Stack <st...@duboce.net>.
If you dump the metadata, does it claim the GZIP compressor?  If so,
yeah, it seems to be a mismatch between what the data is and what the
metadata says.
St.Ack

On Fri, Jan 28, 2011 at 9:58 AM, Nanheng Wu <na...@gmail.com> wrote:
> Awesome. I ran it on one of the hfiles and got this:
> 11/01/28 09:57:15 INFO compress.CodecPool: Got brand-new decompressor
> java.io.IOException: Not in GZIP format
>        at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:137)
>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
>        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
>        at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
>        at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
>        at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
>        at org.apache.hadoop.hbase.io.hfile.Compression$Algorithm.createDecompressionStream(Compression.java:168)
>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1013)
>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:966)
>        at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1291)
>        at org.apache.hadoop.hbase.io.hfile.HFile.main(HFile.java:1740)
>
> So the problem could be that HFile writer is not writing properly
> gzipped outputs?
>
>
> On Fri, Jan 28, 2011 at 9:41 AM, Stack <st...@duboce.net> wrote:
>> The section in 0.90 book on hfile tool should apply to 0.20.6:
>> http://hbase.apache.org/ch08s02.html#hfile_tool  It might help you w/
>> your explorations.
>>
>> St.Ack
>>
>> On Fri, Jan 28, 2011 at 9:38 AM, Nanheng Wu <na...@gmail.com> wrote:
>>> Hi Stack,
>>>
>>>  Get doesn't work either. It was a fresh table created by
>>> loadtable.rb. Finally, the uncompressed version had the same number of
>>> regions (8 total). I totally understand you guys shouldn't be patching
>>> the older version, upgrading for me is an option but will be pretty
>>> painful. I wonder if I can figure something out by comparing the two
>>> version's Hfile. Thanks again!
>>>
>>> On Fri, Jan 28, 2011 at 9:14 AM, Stack <st...@duboce.net> wrote:
>>>> On Thu, Jan 27, 2011 at 9:35 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>> In the compressed case, there are 8 regions and the region start/end
>>>>> keys do line up. Which actually is confusing to me, how can hbase read
>>>>> the files if they are compressed? does each hfile have some metadata
>>>>> in it that has compression info?
>>>>
>>>> You got it.
>>>>
>>>>> Anyway, the regions are the same
>>>>> (numbers and boundaries are same) in both compressed and uncompressed
>>>>> version. So what else should I look into to fix this? Thanks again!
>>>>
>>>> You can't scan. Can you Get from the table at all?  Try getting start
>>>> key from a few of the regions you see in .META.
>>>>
>>>> Did this table preexist or was this a fresh creation?
>>>>
>>>> When you created this table uncompressed, how many regions was it?
>>>>
>>>> How about just running uncompressed while you are on 0.20.6?  We'd
>>>> rather be fixing bugs in the new stuff, not the version that we are
>>>> leaving behind?
>>>>
>>>> Thanks,
>>>> St.Ack
>>>>
>>>
>>
>

Re: Use loadtable.rb with compressed data?

Posted by Nanheng Wu <na...@gmail.com>.
Awesome. I ran it on one of the hfiles and got this:
11/01/28 09:57:15 INFO compress.CodecPool: Got brand-new decompressor
java.io.IOException: Not in GZIP format
	at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:137)
	at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
	at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
	at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
	at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
	at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
	at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
	at org.apache.hadoop.hbase.io.hfile.Compression$Algorithm.createDecompressionStream(Compression.java:168)
	at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1013)
	at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:966)
	at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1291)
	at org.apache.hadoop.hbase.io.hfile.HFile.main(HFile.java:1740)

So the problem could be that the HFile writer is not writing properly
gzipped output?


On Fri, Jan 28, 2011 at 9:41 AM, Stack <st...@duboce.net> wrote:
> The section in 0.90 book on hfile tool should apply to 0.20.6:
> http://hbase.apache.org/ch08s02.html#hfile_tool  It might help you w/
> your explorations.
>
> St.Ack
>
> On Fri, Jan 28, 2011 at 9:38 AM, Nanheng Wu <na...@gmail.com> wrote:
>> Hi Stack,
>>
>>  Get doesn't work either. It was a fresh table created by
>> loadtable.rb. Finally, the uncompressed version had the same number of
>> regions (8 total). I totally understand you guys shouldn't be patching
>> the older version, upgrading for me is an option but will be pretty
>> painful. I wonder if I can figure something out by comparing the two
>> version's Hfile. Thanks again!
>>
>> On Fri, Jan 28, 2011 at 9:14 AM, Stack <st...@duboce.net> wrote:
>>> On Thu, Jan 27, 2011 at 9:35 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>> In the compressed case, there are 8 regions and the region start/end
>>>> keys do line up. Which actually is confusing to me, how can hbase read
>>>> the files if they are compressed? does each hfile have some metadata
>>>> in it that has compression info?
>>>
>>> You got it.
>>>
>>>> Anyway, the regions are the same
>>>> (numbers and boundaries are same) in both compressed and uncompressed
>>>> version. So what else should I look into to fix this? Thanks again!
>>>
>>> You can't scan. Can you Get from the table at all?  Try getting start
>>> key from a few of the regions you see in .META.
>>>
>>> Did this table preexist or was this a fresh creation?
>>>
>>> When you created this table uncompressed, how many regions was it?
>>>
>>> How about just running uncompressed while you are on 0.20.6?  We'd
>>> rather be fixing bugs in the new stuff, not the version that we are
>>> leaving behind?
>>>
>>> Thanks,
>>> St.Ack
>>>
>>
>
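
To make the writer-side question above concrete: below is a minimal
sketch of writing a tiny HFile with the "gz" codec outside of the MR
job, so the resulting file can be checked with the HFile tool (-m) on
its own. The constructor arguments (block size, compression name,
comparator) are my reading of the 0.20-era HFile.Writer API, so treat
this as a sketch rather than a definitive recipe; the path is just a
placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.util.Bytes;

public class GzHFileWriteCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Placeholder output path; point it anywhere writable.
    Path path = new Path("/tmp/gz-check.hfile");
    // 64KB blocks, "gz" compression requested by name,
    // null = use the default raw byte-array key comparator.
    HFile.Writer writer = new HFile.Writer(fs, path, 64 * 1024, "gz", null);
    try {
      writer.append(Bytes.toBytes("row1"), Bytes.toBytes("value1"));
    } finally {
      writer.close();
    }
    // Now dump /tmp/gz-check.hfile with the HFile tool's -m flag: the
    // trailer should report the gz codec and the block should decompress.
  }
}

If a file written this way dumps cleanly but the MR job's output does
not, the mismatch is on the job side rather than in HFile itself.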

Re: Use loadtable.rb with compressed data?

Posted by Stack <st...@duboce.net>.
The section in the 0.90 book on the hfile tool should apply to 0.20.6:
http://hbase.apache.org/ch08s02.html#hfile_tool  It might help you with
your explorations.

St.Ack

On Fri, Jan 28, 2011 at 9:38 AM, Nanheng Wu <na...@gmail.com> wrote:
> Hi Stack,
>
>  Get doesn't work either. It was a fresh table created by
> loadtable.rb. Finally, the uncompressed version had the same number of
> regions (8 total). I totally understand you guys shouldn't be patching
> the older version, upgrading for me is an option but will be pretty
> painful. I wonder if I can figure something out by comparing the two
> version's Hfile. Thanks again!
>
> On Fri, Jan 28, 2011 at 9:14 AM, Stack <st...@duboce.net> wrote:
>> On Thu, Jan 27, 2011 at 9:35 PM, Nanheng Wu <na...@gmail.com> wrote:
>>> In the compressed case, there are 8 regions and the region start/end
>>> keys do line up. Which actually is confusing to me, how can hbase read
>>> the files if they are compressed? does each hfile have some metadata
>>> in it that has compression info?
>>
>> You got it.
>>
>>> Anyway, the regions are the same
>>> (numbers and boundaries are same) in both compressed and uncompressed
>>> version. So what else should I look into to fix this? Thanks again!
>>
>> You can't scan. Can you Get from the table at all?  Try getting start
>> key from a few of the regions you see in .META.
>>
>> Did this table preexist or was this a fresh creation?
>>
>> When you created this table uncompressed, how many regions was it?
>>
>> How about just running uncompressed while you are on 0.20.6?  We'd
>> rather be fixing bugs in the new stuff, not the version that we are
>> leaving behind?
>>
>> Thanks,
>> St.Ack
>>
>

Re: Use loadtable.rb with compressed data?

Posted by Nanheng Wu <na...@gmail.com>.
Hi Stack,

  Get doesn't work either. It was a fresh table created by
loadtable.rb. Finally, the uncompressed version had the same number of
regions (8 total). I totally understand you guys shouldn't be patching
the older version; upgrading is an option for me but will be pretty
painful. I wonder if I can figure something out by comparing the two
versions' HFiles. Thanks again!

On Fri, Jan 28, 2011 at 9:14 AM, Stack <st...@duboce.net> wrote:
> On Thu, Jan 27, 2011 at 9:35 PM, Nanheng Wu <na...@gmail.com> wrote:
>> In the compressed case, there are 8 regions and the region start/end
>> keys do line up. Which actually is confusing to me, how can hbase read
>> the files if they are compressed? does each hfile have some metadata
>> in it that has compression info?
>
> You got it.
>
>> Anyway, the regions are the same
>> (numbers and boundaries are same) in both compressed and uncompressed
>> version. So what else should I look into to fix this? Thanks again!
>
> You can't scan. Can you Get from the table at all?  Try getting start
> key from a few of the regions you see in .META.
>
> Did this table preexist or was this a fresh creation?
>
> When you created this table uncompressed, how many regions was it?
>
> How about just running uncompressed while you are on 0.20.6?  We'd
> rather be fixing bugs in the new stuff, not the version that we are
> leaving behind?
>
> Thanks,
> St.Ack
>

Re: Use loadtable.rb with compressed data?

Posted by Stack <st...@duboce.net>.
On Thu, Jan 27, 2011 at 9:35 PM, Nanheng Wu <na...@gmail.com> wrote:
> In the compressed case, there are 8 regions and the region start/end
> keys do line up. Which actually is confusing to me, how can hbase read
> the files if they are compressed? does each hfile have some metadata
> in it that has compression info?

You got it.

> Anyway, the regions are the same
> (numbers and boundaries are same) in both compressed and uncompressed
> version. So what else should I look into to fix this? Thanks again!

You can't scan. Can you Get from the table at all?  Try getting the
start key from a few of the regions you see in .META.

Did this table preexist or was this a fresh creation?

When you created this table uncompressed, how many regions did it have?

How about just running uncompressed while you are on 0.20.6?  We'd
rather be fixing bugs in the new stuff, not in the version that we are
leaving behind.

Thanks,
St.Ack
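
As a concrete way to run that check, here is a rough sketch of probing
the table with Gets at a few region start keys copied out of .META.,
using the 0.20 client API. The table name and keys below are
placeholders, not the actual ones from this thread:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ProbeRegionStartKeys {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    // Substitute the bulk-loaded table and real start keys from .META.
    HTable table = new HTable(conf, "mytable");
    String[] startKeys = { "startkey-of-region-1", "startkey-of-region-2" };
    for (String key : startKeys) {
      Result r = table.get(new Get(Bytes.toBytes(key)));
      System.out.println(key + " -> " + (r.isEmpty() ? "no data" : r));
    }
  }
}

If even these rows come back empty, the problem is in the files or the
load itself, not in how the scan was set up.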

Re: Use loadtable.rb with compressed data?

Posted by Nanheng Wu <na...@gmail.com>.
In the compressed case, there are 8 regions and the region start/end
keys do line up. That is actually confusing to me: how can HBase read
the files if they are compressed? Does each hfile have some metadata
in it that has the compression info? Anyway, the regions are the same
(numbers and boundaries are the same) in both the compressed and
uncompressed versions. So what else should I look into to fix this?
Thanks again!

On Thu, Jan 27, 2011 at 9:24 PM, Stack <st...@duboce.net> wrote:
> On Thu, Jan 27, 2011 at 9:08 PM, Nanheng Wu <na...@gmail.com> wrote:
>> Hi Stack, thanks for the answers! I am reasonably sure the
>> partitioning is OK because I just ran the same MR job with compression
>> turned off and everything works. I'd like to move to 0.90 but for the
>> short term I am stuck with 0.20. Is there anything I can do, maybe
>> copy some files from the 0.90 branch and tweak them to run on 0.20?
>> Please advice. thank you!
>>
>
> Don't try backporting.  You'll end up really hating us if you try to do that.
>
> I was off in my first answer.  We read metadata from the files.  Maybe
> when stuff is compressed we are doing something dumb in loadtable.rb
> though we're reading metadata, not keyvalues.   Do the regions look
> right?  The ones in .META.?    Do endkey and startkeys match up as you
> move from one region to the next?  How many regions are there?  If
> same data and it worked previously -- did the previous run have same
> amount of data?  It didn't all fit into one region when you ran it
> uncompressed? -- then it would seem to point at a issue w/ our loading
> compressed files.
>
> St.Ack
>

Re: Use loadtable.rb with compressed data?

Posted by Stack <st...@duboce.net>.
On Thu, Jan 27, 2011 at 9:08 PM, Nanheng Wu <na...@gmail.com> wrote:
> Hi Stack, thanks for the answers! I am reasonably sure the
> partitioning is OK because I just ran the same MR job with compression
> turned off and everything works. I'd like to move to 0.90 but for the
> short term I am stuck with 0.20. Is there anything I can do, maybe
> copy some files from the 0.90 branch and tweak them to run on 0.20?
> Please advice. thank you!
>

Don't try backporting.  You'll end up really hating us if you try to do that.

I was off in my first answer.  We read metadata from the files.  Maybe
when stuff is compressed we are doing something dumb in loadtable.rb,
though we're reading metadata, not KeyValues.  Do the regions look
right?  The ones in .META.?  Do the end keys and start keys match up as
you move from one region to the next?  How many regions are there?  If
it's the same data and it worked previously -- did the previous run have
the same amount of data?  It didn't all fit into one region when you ran
it uncompressed? -- then it would seem to point at an issue w/ our
loading of compressed files.

St.Ack
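
One way to eyeball the start/end key question without reading .META. by
hand is to ask the client for the region boundaries. This assumes
HTable.getStartEndKeys() is available in 0.20.6 (the exact method may
differ), and the table name is a placeholder, so take it as a sketch:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

public class PrintRegionBoundaries {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "mytable");  // placeholder table name
    // First element: region start keys; second element: region end keys.
    Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
    byte[][] startKeys = keys.getFirst();
    byte[][] endKeys = keys.getSecond();
    for (int i = 0; i < startKeys.length; i++) {
      System.out.println("region " + i + ": ["
          + Bytes.toString(startKeys[i]) + ", "
          + Bytes.toString(endKeys[i]) + ")");
    }
    // Each region's end key should equal the next region's start key.
  }
}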

Re: Use loadtable.rb with compressed data?

Posted by Nanheng Wu <na...@gmail.com>.
Hi Stack, thanks for the answers! I am reasonably sure the
partitioning is OK because I just ran the same MR job with compression
turned off and everything works. I'd like to move to 0.90, but for the
short term I am stuck with 0.20. Is there anything I can do, maybe
copy some files from the 0.90 branch and tweak them to run on 0.20?
Please advise. Thank you!

On Thu, Jan 27, 2011 at 9:04 PM, Stack <st...@duboce.net> wrote:
> loadtable.rb doesn't care about file content; it just moves files and
> updates .META.
>
> You sure you did the partitioning correctly?  Not seeing anything
> would come of incorrectly done partitioner.  There may also have been
> a bug in partitioner around this time.  Can you move to 0.90.0?  Bulk
> uploader is much improved there (It was rewritten between 0.20.6 and
> 0.90.0 and the new implementation has been given a much better
> airing).
>
> Yours,
> St.Ack
>
> On Thu, Jan 27, 2011 at 8:54 PM, Nanheng Wu <na...@gmail.com> wrote:
>> Hi,
>>
>> I am using hbase 0.20.6.  Is it possible for the loadtable.rb script
>> to create the table from compressed output? I have a MR job where the
>> reducer outputs Gzip compressed HFiles. When I ran loadtable.rb it
>> didn't have any complaints and seemed to update the meta data table
>> correctly. But when I tried to query against the table no data would
>> come back (scan show 0 zero etc). Does anyone know if it's possible?
>> or If I must create tables from compressed HFIles directly, what other
>> options do I have besides the script? Thanks!
>>
>
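
For comparison's sake, the only intended difference between the two
runs should be the compression setting handed to HFileOutputFormat.
Below is a hedged sketch of the job setup, assuming compression is
switched on through the "hfile.compression" property that the 0.20-era
HFileOutputFormat reads; if the job enables gzip some other way, that
difference is itself worth checking. Job name and the elided pieces are
placeholders:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class GzBulkLoadJob {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    // "gz" requests gzip-compressed HFile blocks; leave unset for none.
    conf.set("hfile.compression", "gz");
    Job job = new Job(conf, "gz-bulk-load");
    job.setOutputFormatClass(HFileOutputFormat.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(KeyValue.class);
    // ... mapper/reducer classes, input path and HFile output dir go
    // here, exactly as in the existing job ...
    job.waitForCompletion(true);
  }
}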

Re: Use loadtable.rb with compressed data?

Posted by Stack <st...@duboce.net>.
loadtable.rb doesn't care about file content; it just moves files and
updates .META.

You sure you did the partitioning correctly?  Not seeing anything is
what would come of an incorrectly done partitioner.  There may also
have been a bug in the partitioner around this time.  Can you move to
0.90.0?  The bulk uploader is much improved there (it was rewritten
between 0.20.6 and 0.90.0 and the new implementation has been given a
much better airing).

Yours,
St.Ack

On Thu, Jan 27, 2011 at 8:54 PM, Nanheng Wu <na...@gmail.com> wrote:
> Hi,
>
> I am using hbase 0.20.6.  Is it possible for the loadtable.rb script
> to create the table from compressed output? I have a MR job where the
> reducer outputs Gzip compressed HFiles. When I ran loadtable.rb it
> didn't have any complaints and seemed to update the meta data table
> correctly. But when I tried to query against the table no data would
> come back (scan show 0 zero etc). Does anyone know if it's possible?
> or If I must create tables from compressed HFIles directly, what other
> options do I have besides the script? Thanks!
>
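
Since the symptom in the quoted question is a scan that returns
nothing, it can also help to scan one region's key range at a time and
see whether any region serves data at all. Below is a rough sketch with
the 0.20 client; the table name and boundaries are placeholders to be
replaced with values copied from .META.:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanOneRegion {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "mytable");  // placeholder table name
    // Use one region's [start, end) boundaries as listed in .META.
    Scan scan = new Scan(Bytes.toBytes("region-start-key"),
                         Bytes.toBytes("region-end-key"));
    ResultScanner scanner = table.getScanner(scan);
    try {
      int rows = 0;
      for (Result r : scanner) {
        rows++;
      }
      System.out.println("rows seen: " + rows);
    } finally {
      scanner.close();
    }
  }
}

If every per-region scan comes back empty while the HFile tool can read
the same files, the data is there but the region servers are not seeing
it; if the tool also fails (as with the "Not in GZIP format" error
above), the files themselves are suspect.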