Posted to user@hbase.apache.org by Udo Offermann <ud...@zfabrik.de> on 2021/04/28 11:56:27 UTC

Regionserver reports RegionTooBusyException on import

Hello everybody

We are migrating from HBase 1.0 to HBase 2.2.5 and observe a problem importing data to the new HBase 2 cluster. The HBase clusters are connected to a SAN.
For the import we are using the standard HBase Import (i.e. no bulk import).

We tested the import several times at the HBase 1.0 cluster and never faced any problems.

The problem we observe is: org.apache.hadoop.hbase.RegionTooBusyException
In the log files of the region servers we found
 
regionserver.MemStoreFlusher: ... has too many store files

It seems that other people have faced similar problems, as described in this blog post: https://gbif.blogspot.com/2012/07/optimizing-writes-in-hbase.html
However, the solution provided there (especially increasing hbase.hstore.blockingStoreFiles) does not help in our case.

In fact, the overall problem seems to be that the Import mappers are too fast for the region servers, so that they cannot flush and compact the HFiles in time, even though they stop accepting further writes once
the value of hbase.hstore.blockingStoreFiles is exceeded.

Increasing hbase.hstore.blockingStoreFiles means that the region server is allowed to keep more HFiles, but as long as the write throughput of the mappers stays that high, the region server will never be able to flush and compact the written data in time, so in the end the region servers are too busy and are finally treated as crashed!

IMHO it simply comes down to: incoming rate (mapper write operations) > processing rate (writing to MemStore, flushes and compactions), which always leads to disaster - if I remember my queueing theory lecture at university correctly ;-)
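
To put some made-up numbers on it (just to illustrate the argument, the rates are invented and the class is purely hypothetical):

public class BacklogSketch {
    public static void main(String[] args) {
        // Invented numbers: mappers push 150 MB/s at a region server that can
        // flush and compact only 100 MB/s - the backlog grows without bound.
        double incomingMBps = 150;
        double processingMBps = 100;
        double backlogMB = 0;
        for (int second = 1; second <= 600; second++) {
            backlogMB += incomingMBps - processingMBps; // +50 MB every second
        }
        System.out.printf("Backlog after 10 minutes: %.0f MB%n", backlogMB); // 30000 MB
    }
}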

We also found lots of "Slow sync cost" messages in the logs, so we also turned off the WAL for the import:

yarn jar $HBASE_HOME/lib/hbase-mapreduce-2.2.5.jar import -Dimport.wal.durability=SKIP_WAL …
which eliminated the "Slow sync cost" messages but it didn't solve our overall problem.
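
(For completeness: when we write from our own client code instead of the Import job, the same effect can be had per mutation via the client API - a rough sketch, table and column names are made up:)

import java.io.IOException;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SkipWalPutExample {
    // Per-mutation equivalent of -Dimport.wal.durability=SKIP_WAL: the edit only
    // goes to the MemStore, never to the WAL, so it is lost if the region server
    // crashes before the next flush.
    static void putWithoutWal(Table table) throws IOException {
        Put put = new Put(Bytes.toBytes("row-0001"));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
        put.setDurability(Durability.SKIP_WAL);
        table.put(put);
    }
}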

So my question is: isn't there a way to somehow slow down the Import mappers so that the incoming rate < the region server's processing rate?
Are there other possibilities that we can try? One thing that might help (at least for the import scenario) is using bulk import, but the question is whether other scenarios with a high write load would lead to similar problems!

Best regards
Udo







Re: Regionserver reports RegionTooBusyException on import

Posted by Udo Offermann <ud...@zfabrik.de>.
> HBase is essentially doing what you're asking. By throwing the RegionTooBusyException, the client is pushed into a retry loop. The client will pause before it retries, increase the amount of time it waits the next time (by some function, I forget exactly what), and then retry the same operation.
Just for your information - I think I found it:

org.apache.hadoop.hbase.client.AsyncRequestFutureImpl:

/**
 * Log as much info as possible, and, if there is something to replay,
 * submit it again after a back off sleep.
 */
private void resubmit(ServerName oldServer, List<Action> toReplay,
                      int numAttempt, int failureCount, Throwable throwable) {
…


Udo Offermann 

ZFabrik Software GmbH & Co. KG
Lammstrasse 2, 69190 Walldorf

tel:      +49 6227 3984 255
fax:     +49 6227 3984 254
email: udo.offermann@zfabrik.de
www:  z2-environment.net

> On 07.05.2021, at 01:39, Josh Elser <el...@apache.org> wrote:
> 
>>> You were able to work around the durability concerns by skipping the WAL (never forget that this means your data in HBase is *not* guaranteed to be there).
>> We’re already doing this. This is actually not a problem for us, because we verify the data after the import (using our own restore-test mapreduce report).
> 
> Yes, I was summarizing what you had said to then make sure you understood the implications of what you had done. Good to hear you are verifying this.
> 
>>> Of course, you can also change your application (the Import m/r job) such that you can inject sleeps, but I assume you don't want to do that. We don't expose an option in that job (to my knowledge) that would inject slowdowns.
>> That's funny - I was just talking about this with my colleague, more in jest. But would it be possible for the MemStore to realize that the incoming write rate is higher than the flushing rate and slow down the write requests a little bit?
>> That means putting the "sleep" into the MemStore as a kind of adaptive congestion control: the MemStore could measure the incoming rate and the flushing rate and add some sleeps on demand...
> 
> HBase is essentially doing what you're asking. By throwing the RegionTooBusyException, the client is pushed into a retry loop. The client will pause before it retries, increase the amount of time it waits the next time (by some function, I forget exactly what), and then retry the same operation.
> 
> The problem you're facing is that the default configuration is insufficient for the load and/or hardware that you're throwing at HBase.
> 
> The other thing you should be asking yourself is if you have a hotspot in your table design which is causing the load to not be evenly spread across all RegionServers.


Re: Regionserver reports RegionTooBusyException on import

Posted by Josh Elser <el...@apache.org>.
>> You were able to work around the durability concerns by skipping the WAL (never forget that this means your data in HBase is *not* guaranteed to be there).
> We’re already doing this. This is actually not a problem for us, because we verify the data after the import (using our own restore-test mapreduce report).

Yes, I was summarizing what you had said to then make sure you 
understood the implications of what you had done. Good to hear you are 
verifying this.

>> Of course, you can also change your application (the Import m/r job) such that you can inject sleeps, but I assume you don't want to do that. We don't expose an option in that job (to my knowledge) that would inject slowdowns.
> 
> That's funny - I was just talking about this with my colleague, more in jest. But would it be possible for the MemStore to realize that the incoming write rate is higher than the flushing rate and slow down the write requests a little bit?
> That means putting the "sleep" into the MemStore as a kind of adaptive congestion control: the MemStore could measure the incoming rate and the flushing rate and add some sleeps on demand...

HBase is essentially doing what you're asking. By throwing the 
RegionTooBusyException, the client is pushed into a retry loop. The 
client will pause before it retries, increase the amount of time it 
waits the next time (by some function, I forget exactly what), and then 
retry the same operation.
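
If memory serves, the wait grows roughly like the base hbase.client.pause multiplied by a fixed per-attempt backoff table, plus a little jitter. A paraphrased sketch (not the exact source, multipliers quoted from memory):

import java.util.concurrent.ThreadLocalRandom;

public class RetryBackoffSketch {
    // Roughly what the client does between retries: base pause * per-attempt
    // multiplier, capped at the last entry, plus ~1% random jitter.
    private static final int[] RETRY_BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};

    static long pauseTimeMs(long basePauseMs, int attempt) {
        int idx = Math.min(Math.max(attempt, 0), RETRY_BACKOFF.length - 1);
        long pause = basePauseMs * RETRY_BACKOFF[idx];
        long jitter = (long) (pause * ThreadLocalRandom.current().nextFloat() * 0.01f);
        return pause + jitter;
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 8; attempt++) {
            System.out.printf("attempt %d -> wait ~%d ms%n", attempt, pauseTimeMs(100, attempt));
        }
    }
}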

The problem you're facing is that the default configuration is 
insufficient for the load and/or hardware that you're throwing at HBase.

The other thing you should be asking yourself is if you have a hotspot 
in your table design which is causing the load to not be evenly spread 
across all RegionServers.
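
If it does turn out to be a hotspot caused by sequential keys, the usual trick is to salt the row key so that writes spread across regions - a generic sketch, not specific to your schema (the bucket count is arbitrary, and remember that scans then have to fan out over all buckets):

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKey {
    private static final int BUCKETS = 16; // e.g. match the number of pre-split regions

    // Prepends a one-byte salt derived from the original key so that rows which
    // would otherwise pile onto one "hot" region are spread over BUCKETS regions.
    static byte[] salt(byte[] originalKey) {
        byte bucket = (byte) ((Bytes.hashCode(originalKey) & 0x7fffffff) % BUCKETS);
        byte[] salted = new byte[originalKey.length + 1];
        salted[0] = bucket;
        System.arraycopy(originalKey, 0, salted, 1, originalKey.length);
        return salted;
    }
}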

Re: Regionserver reports RegionTooBusyException on import

Posted by Udo Offermann <ud...@zfabrik.de>.
Thank you, Josh, for your valuable advice.

> You were able to work around the durability concerns by skipping the WAL (never forget that this means your data in HBase is *not* guaranteed to be there).
We’re already doing this. This is actually not a problem for us, because we verify the data after the import (using our own restore-test mapreduce report).

Today we ran another test with the default settings but limited the maximum number of mappers to 50% of what YARN usually uses. The first portion was uploaded successfully, but we can still see org.apache.hadoop.hbase.RegionTooBusyException in the log files. However, no mappers have been killed so far. Tomorrow I will get the results.
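
(If I'm not mistaken, the same cap can also be set per job via mapreduce.job.running.map.limit instead of shrinking the whole YARN pool - a rough driver-side sketch, the value is just an example; for the stock Import job it should also be possible to pass the property as a -D option:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CappedImportDriver {
    // Sketch: cap the number of concurrently running map tasks so that the write
    // rate into the region servers stays below their flush/compaction rate.
    static Job configure(Configuration conf) throws Exception {
        conf.setInt("mapreduce.job.running.map.limit", 20); // made-up cap, 0 means unlimited
        return Job.getInstance(conf, "hbase-import-throttled");
    }
}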

Tuning the compaction parameters is another good idea that I will check out. 

> Of course, you can also change your application (the Import m/r job) such that you can inject sleeps, but I assume you don't want to do that. We don't expose an option in that job (to my knowledge) that would inject slowdowns.

That's funny - I was just talking about this with my colleague, more in jest. But would it be possible for the MemStore to realize that the incoming write rate is higher than the flushing rate and slow down the write requests a little bit?
That means putting the "sleep" into the MemStore as a kind of adaptive congestion control: the MemStore could measure the incoming rate and the flushing rate and add some sleeps on demand...



> On 06.05.2021, at 18:00, Josh Elser <el...@apache.org> wrote:
> 
> Your analysis seems pretty accurate so far. Ultimately, it sounds like your SAN is the bottleneck here.
> 
> You were able to work around the durability concerns by skipping the WAL (never forget that this means your data in HBase is *not* guaranteed to be there).
> 
> It sounds like compactions are the next bottleneck for you. Specifically, your compactions can't complete fast enough to drive down the number of storefiles you have.
> 
> You have two straightforward approaches to try:
> 1. Increase the number of compaction threads inside your regionserver. hbase.regionserver.thread.compaction.small is likely the one you want to increase. Eventually, you may need to also increase hbase.regionserver.thread.compaction.large
> 
> 2. Increase the hbase.client.retries.number to a larger value and/or increase hbase.client.pause so that the client will retry more times before giving up or wait longer in-between retry attempts
> 
> Of course, you can also change your application (the Import m/r job) such that you can inject sleeps, but I assume you don't want to do that. We don't expose an option in that job (to my knowledge) that would inject slowdowns.
> 
> On 4/28/21 7:56 AM, Udo Offermann wrote:
>> Hello everybody
>> We are migrating from HBase 1.0 to HBase 2.2.5 and observe a problem importing data to the new HBase 2 cluster. The HBase clusters are connected to a SAN.
>> For the import we are using the standard HBase Import (i.e. no bulk import).
>> We tested the import several times at the HBase 1.0 cluster and never faced any problems.
>> The problem we observe is: org.apache.hadoop.hbase.RegionTooBusyException
>> In the log files of the region servers we found
>>  regionserver.MemStoreFlusher: ... has too many store files
>> It seems that other people faced similar problems like described in this blog post: https://gbif.blogspot.com/2012/07/optimizing-writes-in-hbase.html
>> However the provided solution does not help in our case (especially increasing hbase.hstore.blockingStoreFiles).
>> In fact, the overall problem seems to be that the Import mappers are too fast for the region servers, so that they cannot flush and compact the HFiles in time, even though they stop accepting further writes once
>> the value of hbase.hstore.blockingStoreFiles is exceeded.
>> Increasing hbase.hstore.blockingStoreFiles means that the region server is allowed to keep more HFiles, but as long as the write throughput of the mappers stays that high, the region server will never be able to flush and compact the written data in time, so in the end the region servers are too busy and are finally treated as crashed!
>> IMHO it simply comes down to: incoming rate (mapper write operations) > processing rate (writing to MemStore, flushes and compactions), which always leads to disaster - if I remember my queueing theory lecture at university correctly ;-)
>> We also found lots of "Slow sync cost" messages in the logs, so we also turned off the WAL for the import:
>> yarn jar $HBASE_HOME/lib/hbase-mapreduce-2.2.5.jar import -Dimport.wal.durability=SKIP_WAL …
>> which eliminated the "Slow sync cost" messages but it didn't solve our overall problem.
>> So my question is: isn't there a way to somehow slow down the Import mappers so that the incoming rate < the region server's processing rate?
>> Are there other possibilities that we can try? One thing that might help (at least for the import scenario) is using bulk import, but the question is whether other scenarios with a high write load would lead to similar problems!
>> Best regards
>> Udo


Re: Regionserver reports RegionTooBusyException on import

Posted by Josh Elser <el...@apache.org>.
Your analysis seems pretty accurate so far. Ultimately, it sounds like 
your SAN is the bottleneck here.

You were able to work around the durability concerns by skipping the WAL 
(never forget that this means your data in HBase is *not* guaranteed to 
be there).

It sounds like compactions are the next bottleneck for you. 
Specifically, your compactions can't complete fast enough to drive down 
the number of storefiles you have.

You have two straightforward approaches to try:
1. Increase the number of compaction threads inside your regionserver. 
hbase.regionserver.thread.compaction.small is likely the one you want to 
increase. Eventually, you may need to also increase 
hbase.regionserver.thread.compaction.large

2. Increase the hbase.client.retries.number to a larger value and/or 
increase hbase.client.pause so that the client will retry more times 
before giving up or wait longer in-between retry attempts
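
A rough sketch of where those knobs live (values are examples only, and the defaults quoted in the comments are from memory): the compaction thread counts belong in hbase-site.xml on the regionservers, while the retry settings are read from the client's configuration, e.g.:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class PatientClientSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // Client side: retry more often and wait longer between attempts
        // (defaults in 2.x are, I believe, 15 retries and a 100 ms base pause).
        conf.setInt("hbase.client.retries.number", 35);
        conf.setLong("hbase.client.pause", 200);
        // Server side (hbase-site.xml on each regionserver, not settable from here):
        //   hbase.regionserver.thread.compaction.small
        //   hbase.regionserver.thread.compaction.large
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            // ... run the writes (or the Import job client) against this connection ...
        }
    }
}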

Of course, you can also change your application (the Import m/r job) 
such that you can inject sleeps, but I assume you don't want to do that. 
We don't expose an option in that job (to my knowledge) that would 
inject slowdowns.

On 4/28/21 7:56 AM, Udo Offermann wrote:
> Hello everybody
> 
> We are migrating from HBase 1.0 to HBase 2.2.5 and observe a problem importing data to the new HBase 2 cluster. The HBase clusters are connected to a SAN.
> For the import we are using the standard HBase Import (i.e. no bulk import).
> 
> We tested the import several times at the HBase 1.0 cluster and never faced any problems.
> 
> The problem we observe is: org.apache.hadoop.hbase.RegionTooBusyException
> In the log files of the region servers we found
>   
> regionserver.MemStoreFlusher: ... has too many store files
> 
> It seems that other people faced similar problems like described in this blog post: https://gbif.blogspot.com/2012/07/optimizing-writes-in-hbase.html
> However the provided solution does not help in our case (especially increasing hbase.hstore.blockingStoreFiles).
> 
> In fact, the overall problem seems to be that the Import mappers are too fast for the region servers, so that they cannot flush and compact the HFiles in time, even though they stop accepting further writes once
> the value of hbase.hstore.blockingStoreFiles is exceeded.
> 
> Increasing hbase.hstore.blockingStoreFiles means that the region server is allowed to keep more HFiles, but as long as the write throughput of the mappers stays that high, the region server will never be able to flush and compact the written data in time, so in the end the region servers are too busy and are finally treated as crashed!
> 
> IMHO it simply comes down to: incoming rate (mapper write operations) > processing rate (writing to MemStore, flushes and compactions), which always leads to disaster - if I remember my queueing theory lecture at university correctly ;-)
> 
> We also found lots of "Slow sync cost" messages in the logs, so we also turned off the WAL for the import:
> 
> yarn jar $HBASE_HOME/lib/hbase-mapreduce-2.2.5.jar import -Dimport.wal.durability=SKIP_WAL …
> which eliminated the "Slow sync cost" messages but it didn't solve our overall problem.
> 
> So my question is: isn't there a way to somehow slow down the Import mappers so that the incoming rate < the region server's processing rate?
> Are there other possibilities that we can try? One thing that might help (at least for the import scenario) is using bulk import, but the question is whether other scenarios with a high write load would lead to similar problems!
> 
> Best regards
> Udo