You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hbase.apache.org by "Michael Stack (Jira)" <ji...@apache.org> on 2020/05/31 04:12:00 UTC
[jira] [Resolved] (HBASE-24454) BucketCache disabled instantly before error duration toleration is reached due to timing issue

     [ https://issues.apache.org/jira/browse/HBASE-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Stack resolved HBASE-24454.
-----------------------------------
    Fix Version/s: 2.2.6
                   1.7.0
                   2.3.0
                   3.0.0-alpha-1
     Hadoop Flags: Reviewed
         Assignee: Michael Stack  (was: Jacob LeBlanc)
       Resolution: Fixed

Pushed on branch-2.2+ and to branch-1.



> BucketCache disabled instantly before error duration toleration is reached due to timing issue
> ----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-24454
>                 URL: https://issues.apache.org/jira/browse/HBASE-24454
>             Project: HBase
>          Issue Type: Bug
>          Components: BucketCache
>    Affects Versions: 1.4.10
>         Environment: We saw this in HBase 1.4.10 (EMR 5.28.1), but I believe the newest code still has this problem.
>            Reporter: Jacob LeBlanc
>            Assignee: Michael Stack
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.3.0, 1.7.0, 2.2.6
>
>
> We saw an instance where BucketCache was disabled after only two IO error were encountered at nearly the same time. The following shows all errors from a region server log for the 2020-05-26 17:00 hour. Notice that there are no other errors until the 17:14:50 at which time the BucketCache is disabled because it claims duration time has exceeded 60000 ms:
> $ grep ERROR hbase-hbase-regionserver-ip-172-20-113-147.log.2020-05-26-17
>  2020-05-26 17:14:50,396 ERROR [hfile-prefetch-1589117924884] bucket.BucketCache: Failed reading block 9adfad3a603047cfa0c98b16da0b0974_13786 from bucket cache
>  2020-05-26 17:14:50,397 ERROR [regionserver/ip-172-20-113-147.us-west-2.compute.internal/172.20.113.147:16020-BucketCacheWriter-0] bucket.BucketCache: Failed syncing IO engine
>  2020-05-26 17:14:50,400 ERROR [hfile-prefetch-1589117924884] bucket.BucketCache: Failed reading block 9adfad3a603047cfa0c98b16da0b0974_13786 from bucket cache
>  2020-05-26 17:14:50,401 ERROR [hfile-prefetch-1589117924884] bucket.BucketCache: IO errors duration time has exceeded 60000ms, disabling cache, please check your IOEngine
> The region server is very busy so it should be constantly getting successful reads and writes in the bucket cache (it is not as though there was some long ago error and then no successful IO to clear the error flag).
> We are running a busy EMR cluster backed by S3. A bucketcache getting disabled is a huge performance issue because all reads must go to S3.
> Looking at the code, I believe I've found a timing issue. Here is the code for checking and setting the ioErrorStartTime (taken from BucketCache.java):
>  
> {code:java}
>   /**
>    * Check whether we tolerate IO error this time. If the duration of IOEngine
>    * throwing errors exceeds ioErrorsDurationTimeTolerated, we will disable the
>    * cache
>    */
>   private void checkIOErrorIsTolerated() {
>     long now = EnvironmentEdgeManager.currentTime();
>     if (this.ioErrorStartTime > 0) {
>       if (cacheEnabled && (now - ioErrorStartTime) > this.ioErrorsTolerationDuration) {
>         LOG.error("IO errors duration time has exceeded " + ioErrorsTolerationDuration +
>           "ms, disabling cache, please check your IOEngine");
>         disableCache();
>       }
>     } else {
>       this.ioErrorStartTime = now;
>     }
>   }
> {code}
>  
> And here is the code for clearing the ioErrorStartTime when a successful read or write is done:
> {code:java}
>   if (this.ioErrorStartTime > 0) {
>     ioErrorStartTime = -1;
>   }
> {code}
> Notice that that if ioErrorStartTime is set to -1 after the first if statement in checkIOErrorIsTolerated but before (now - ioErrorStartTime), then (now - ioErrorStartTime) will evaluate to (now + 1) resulting in the code thinking it has exceeded ioErrorsTolerationDuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)