Posted to user@hbase.apache.org by Pankaj Misra <pa...@impetus.co.in> on 2012/09/25 08:15:48 UTC

HBase BatchMutations - HOT Region Problem

Dear All,

I am using HBase 0.94.1 with Hadoop 0.23.1. I have written a multi-threaded Thrift client that loads data into HBase using BatchMutations. Each batch is 1000 rows, and the table in HBase is split into 10 regions. The row keys increase incrementally (0...999999), with an offset applied for each thread (0...99999, 100000...199999, 200000...299999, ...), so in theory every thread should write to a different region. The individual regions are wide, i.e. every region is expected to store about 100000 rows, for a total of 1000000 rows across all the regions.

I am using thrift server/client and only 1 region server as per the default HBase setup.

So when I spawn 10 threads with the offsets applied accordingly, I expected the regions to fill up in parallel, which does not seem to be the case. All the inserts pile into the same region, which makes the writes inefficient because frequent compaction cycles block all the threads. If the threads were writing to different regions, this problem would be much smaller.

I am not sure if I am missing out on anything, any ideas would be very helpful.

Thanks and Regards
Pankaj Misra

________________________________

Impetus Ranked in the Top 50 India's Best Companies to Work For 2012.

Impetus webcast 'Designing a Test Automation Framework for Multi-vendor Interoperable Systems' available at http://lf1.me/0E/.


NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference.

Multiple ColumnPrefixFilter

Posted by Shagun Agarwal <sh...@yahoo-inc.com>.
Hi All,

HBase has a filter called MultipleColumnPrefixFilter which behaves like ColumnPrefixFilter but allows specifying multiple prefixes.
Example: Find all columns in a row and family that start with "abc" or "xyz".
However, I could not find any filter that returns all columns in a row that start with "abc" and "xyz".
Do I need to do in-memory processing, or does HBase have any built-in mechanism to achieve this?

Thanks
Shagun

RE: HBase BatchMutations - HOT Region Problem

Posted by "Ramkrishna.S.Vasudevan" <ra...@huawei.com>.
Hi Pankaj

If your threads are generating data in this format (0..99999, 100000...199999, 200000...299999, ...), your splits should also be like
0.....10000
10000...20000
20000...30000 and so on, right? Maybe I am missing something here.

But do the data generation that creates the row key and the pre-split regions' start and end keys use the same format?

That is, if the start and end keys are created using Bytes.toBytes("xxx"), then the row key should also be in that format. If it is Bytes.toBytes(int), then the internal byte representation may be different.
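
A small sketch of the difference described above, using only the JDK. ByteBuffer.putInt is used here as a stand-in for a 4-byte big-endian int encoding, which is what Bytes.toBytes(int) is expected to produce; the class and method names are hypothetical and not part of HBase.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class KeyEncodingSketch {

    // The number written as decimal text, one byte per digit.
    static byte[] asText(int n) {
        return String.valueOf(n).getBytes(StandardCharsets.US_ASCII);
    }

    // The number written as a 4-byte big-endian int (assumed stand-in for Bytes.toBytes(int)).
    static byte[] asBinary(int n) {
        return ByteBuffer.allocate(4).putInt(n).array();
    }

    // Hex dump helper so the two encodings can be compared by eye.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int key = 199999;
        System.out.println("text:   " + hex(asText(key)));   // 313939393939
        System.out.println("binary: " + hex(asBinary(key))); // 00030d3f
    }
}
```

The same number yields two unrelated byte sequences, so a row key written one way will not fall between split keys written the other way.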

Regards
Ram



RE: HBase BatchMutations - HOT Region Problem

Posted by Pankaj Misra <pa...@impetus.co.in>.
Hi Anoop,

Thanks.

I am creating the splits using the hex split example in the HBase documentation, and I am explicitly passing the splits during table creation. The leading zeros were lost from some of the key ranges when pasting, because the spreadsheet treated those values as numbers while treating the others as text.

All key ranges have a consistent size with leading zeros. I am pasting them again, taking care not to lose the leading zeros this time.

StartKey                        EndKey
                                0000000000199999
0000000000199999        0000000000333332
0000000000333332        00000000004ccccb
00000000004ccccb        0000000000666664
0000000000666664        00000000007ffffd
00000000007ffffd        0000000000999996
0000000000999996        0000000000b3332f
0000000000b3332f        0000000000ccccc8
0000000000ccccc8        0000000000e66661
0000000000e66661


import java.math.BigInteger;

// Builds numRegions-1 evenly spaced split keys between startKey and endKey (both hex strings).
public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
  byte[][] splits = new byte[numRegions - 1][];
  BigInteger lowestKey = new BigInteger(startKey, 16);
  BigInteger highestKey = new BigInteger(endKey, 16);
  BigInteger range = highestKey.subtract(lowestKey);
  BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
  // The first split point sits one increment above the start of the range.
  lowestKey = lowestKey.add(regionIncrement);
  for (int i = 0; i < numRegions - 1; i++) {
    BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
    // Zero-pad to 16 hex characters so that all split keys have the same width.
    byte[] b = String.format("%016x", key).getBytes();
    splits[i] = b;
  }
  return splits;
}

After some more investigation I realized, as Ramkrishna indicated, that the formatting of the keys was causing this behavior. The format used for the splits must be consistent with the format used for key generation. In this particular case the splits are based on the hex value of the key, with additional formatting that pads it with zeros to make a 16-byte start/end key. The same formatting has to be applied when generating the row keys during record insertion. If the formatting is not consistent, the row keys compare differently against the region boundaries, hence I was not getting what I was expecting. With the changes made, I was able to get distribution across the regions.
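
To illustrate the point above, here is a minimal sketch, using only the JDK, of how a row key is matched to a region by unsigned byte-wise comparison against the split keys. The class RegionLookupSketch, the method regionFor, and the key range 0 to f4240 (hex for 1000000) are assumptions made for illustration; they are not part of the original client or of the HBase API, and HBase performs the actual region lookup internally.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

class RegionLookupSketch {

    // Same idea as getHexSplits above: evenly spaced, zero-padded hex split keys.
    static byte[][] hexSplits(String startKey, String endKey, int numRegions) {
        BigInteger low = new BigInteger(startKey, 16);
        BigInteger high = new BigInteger(endKey, 16);
        BigInteger step = high.subtract(low).divide(BigInteger.valueOf(numRegions));
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 0; i < splits.length; i++) {
            BigInteger key = low.add(step.multiply(BigInteger.valueOf(i + 1)));
            splits[i] = String.format("%016x", key).getBytes(StandardCharsets.US_ASCII);
        }
        return splits;
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Region index = number of split keys that are <= rowKey (start keys are inclusive).
    static int regionFor(byte[] rowKey, byte[][] splits) {
        int region = 0;
        while (region < splits.length && compare(rowKey, splits[region]) >= 0) {
            region++;
        }
        return region;
    }

    public static void main(String[] args) {
        byte[][] splits = hexSplits("0", "f4240", 10); // assumed range, 10 regions
        long n = 500000;
        byte[] plain = String.valueOf(n).getBytes(StandardCharsets.US_ASCII);
        byte[] padded = String.format("%016x", n).getBytes(StandardCharsets.US_ASCII);
        System.out.println("plain  -> region " + regionFor(plain, splits));  // region 9
        System.out.println("padded -> region " + regionFor(padded, splits)); // region 5
    }
}
```

With these assumed splits, the unpadded decimal string "500000" sorts after every split key because its first byte '5' is greater than '0', so it lands in the last region, which matches the symptom described earlier in the thread. The padded key falls into the region whose range actually contains it.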

Thank you all for the help, it is much appreciated.

Thanks and Regards
Pankaj Misra


RE: HBase BatchMutations - HOT Region Problem

Posted by Anoop Sam John <an...@huawei.com>.
Hi
There is a utility class Bytes available in HBase, and it has toBytes(int), which you can use to convert an int to a byte[].
Why do some of the region split keys have leading zeros? How did you make the splits? Did you pass the splits explicitly, or was the split key creation done by HBase code? How did you convert the byte[] keys into hex format to paste them below?

-Anoop-


RE: HBase BatchMutations - HOT Region Problem

Posted by Pankaj Misra <pa...@impetus.co.in>.
Please find attached the table split and the snapshot below.

Start Key                       End Key
                                199999
199999                  333332
333332                  00000000004ccccb
00000000004ccccb        666664
666664                  00000000007ffffd
00000000007ffffd        999996
999996                  0000000000b3332f
0000000000b3332f        0000000000ccccc8
0000000000ccccc8        0000000000e66661
0000000000e66661

As can be seen from the snapshot, the last region alone is being filled with all the data, including keys that do not belong to that range.

One doubt I do have, however, is about the way the keys are generated on the client side. The keys are generated incrementally per thread and added to the offset. Each key is then converted to its string representation and written as a ByteBuffer. Could converting an integer key to its String form and then writing it as a ByteBuffer be a problem?


Thanks and Regards
Pankaj Misra



RE: HBase BatchMutations - HOT Region Problem

Posted by Anoop Sam John <an...@huawei.com>.
Your table is presplit. Can you give the splitkeys that you have used?

-Anoop-