Posted to issues@hbase.apache.org by GitBox <gi...@apache.org> on 2020/05/27 04:11:01 UTC

[GitHub] [hbase] VicoWu commented on a change in pull request #1764: HBASE-24420 Avoid Meaningless Retry Attempts in Unrecoverable Failure

VicoWu commented on a change in pull request #1764:
URL: https://github.com/apache/hbase/pull/1764#discussion_r430826279



##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/tool/BulkLoadHFilesTool.java
##########
@@ -879,13 +881,21 @@ public void bulkHFile(ColumnFamilyDescriptorBuilder builder, FileStatus hfileSta
       }
 
       int maxRetries = getConf().getInt(HConstants.BULKLOAD_MAX_RETRIES_NUMBER, 10);
-      maxRetries = Math.max(maxRetries, startEndKeys.size() + 1);
+
+      /**
+       * For the first attempt, we make maxRetries with the configured maximum retry number
+       * As long as we find that region number changed, we setup maxRetries to region number
+       * But if we find that the region is not changed, then the maxRetries should be still
+       * be configured BULKLOAD_MAX_RETRIES_NUMBER to avoid meaningless retry attempts
+       */
+      if(count != 0 && previousRegionNum != startEndKeys.size() )

Review comment:
       @anoopsjohn 
   cc @saintstack 
   Yes, I agree with you; my solution above is not generic.
   After fully reviewing the whole calling path, from 1) my business code, to 2) the HBase client retry logic, and finally to 3) the server-side processing logic, I think things are clear now and the fix becomes much easier.
   
   -----
   First, on the RegionServer side, a critical unrecoverable exception is swallowed, which I think is incorrect:
   [SecureBulkLoadManager.java#L297](https://github.com/apache/hbase/blob/a9205f8f4d98ee672c0c6aa9cafa5ef2afc6aab5/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SecureBulkLoadManager.java#L297) eats every exception thrown from `region.bulkLoadHFiles`, which is clearly wrong. I checked the commit history of this `catch` clause and it has been this way for a long time. Whether or not the exception is rethrown directly affects the client-side retry logic: only when the exception propagates can the client see the error information (refer to [BulkLoadHFilesTool.java#L396](https://github.com/apache/hbase/blob/a9205f8f4d98ee672c0c6aa9cafa5ef2afc6aab5/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/BulkLoadHFilesTool.java#L397)) and choose its retry policy.
   
   -----
   Second, the exception in my incident is not handled correctly: [SecureBulkLoadManager.java#L395](https://github.com/apache/hbase/blob/a9205f8f4d98ee672c0c6aa9cafa5ef2afc6aab5/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SecureBulkLoadManager.java#L395)
   The exception was:
   `java.lang.IllegalArgumentException: Wrong FS: hdfs://warehousestore/tmp/mdm_user_segments_2798fd21-0961-43ff-8bd2-dcf0180c6918/h/d7a380a0cf1f4adbb459e9f0d0cf66c`
   So I think every HDFS-related failure in that method should be caught and rethrown as an IOException, not just a failed rename().
   
   -----
   Once the above two issues are fixed, the client side will receive the error correctly at [BulkLoadHFilesTool.java#L396](https://github.com/apache/hbase/blob/a9205f8f4d98ee672c0c6aa9cafa5ef2afc6aab5/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/BulkLoadHFilesTool.java#L397), and from the code we can see that the user can then control the retry behavior through the configurations `hbase.client.retries.number` and `hbase.bulkload.retries.retryOnIOException`, instead of falling into a retry disaster.
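   For illustration, a minimal client-side sketch of bounding the retries once the server surfaces the failure. The two config keys are the real ones mentioned above; the table name and staging directory are made up for the example:
   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.hbase.HBaseConfiguration;
   import org.apache.hadoop.hbase.TableName;
   import org.apache.hadoop.hbase.tool.BulkLoadHFiles;
   
   Configuration conf = HBaseConfiguration.create();
   // Bound the number of client-side retries instead of retrying indefinitely.
   conf.setInt("hbase.client.retries.number", 3);
   // Keep the default: fail fast when the server reports an IOException,
   // rather than retrying an unrecoverable failure.
   conf.setBoolean("hbase.bulkload.retries.retryOnIOException", false);
   
   // Made-up table name and staging dir, for illustration only.
   BulkLoadHFiles loader = BulkLoadHFiles.create(conf);
   loader.bulkLoad(TableName.valueOf("my_table"), new Path("/staging/hfiles"));
   ```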
   

##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SecureBulkLoadManager.java
##########
@@ -382,22 +382,27 @@ public String prepareBulkLoad(final byte[] family, final String srcPath, boolean
         throw new IOException("Path does not reference a file: " + p);
       }
 
-      // Check to see if the source and target filesystems are the same
-      if (!FSUtils.isSameHdfs(conf, srcFs, fs)) {
-        LOG.debug("Bulk-load file " + srcPath + " is on different filesystem than " +
+      try {

Review comment:
       Any error from the HDFS-related code in this method should surface as an IOException, not only a failed rename();
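   A minimal sketch of the shape I mean (`stageFile` is a hypothetical stand-in for the existing same-filesystem check / copy / rename logic, not a real method in `SecureBulkLoadManager`; the variable names mirror the locals in the diff above):
   ```java
   try {
     // All HDFS interaction for staging the source file lives inside this block:
     // the same-filesystem check plus the copy or rename into the staging dir.
     stageFile(srcFs, fs, srcPath, stageP);
   } catch (IOException ioe) {
     throw ioe; // already the type the caller and the client retry logic expect
   } catch (RuntimeException e) {
     // e.g. the IllegalArgumentException "Wrong FS: ..." from HDFS path validation
     throw new IOException("Failed to stage bulk load file " + srcPath, e);
   }
   ```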

##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SecureBulkLoadManager.java
##########
@@ -291,10 +291,10 @@ private boolean isUserReferenced(UserGroupInformation ugi) {
             //We call bulkLoadHFiles as requesting user
             //To enable access prior to staging
             return region.bulkLoadHFiles(familyPaths, true,
-                new SecureBulkLoadListener(fs, bulkToken, conf), request.getCopyFile(),
+              new SecureBulkLoadListener(fs, bulkToken, conf), request.getCopyFile(),
               clusterIds, request.getReplicate());
           } catch (Exception e) {
-            LOG.error("Failed to complete bulk load", e);
+            throw new IOException(e);

Review comment:
       The exception cannot simply be swallowed; swallowing it can leave the client stuck in a near-endless retry loop.
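   
   A possible refinement, sketched here only as a suggestion: keep a log line for the server operator, and avoid double-wrapping when the cause is already an IOException so the client sees the original error type:
   ```java
   } catch (IOException ioe) {
     // Propagate unchanged so the client-side retry logic sees the original error.
     throw ioe;
   } catch (Exception e) {
     LOG.error("Failed to complete bulk load", e);
     throw new IOException("Failed to complete bulk load", e);
   }
   ```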




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org