You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2020/07/10 04:34:04 UTC

[GitHub] [hadoop-ozone] runzhiwang opened a new pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

runzhiwang opened a new pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185


   ## What changes were proposed in this pull request?
   
   **What's problem ?**
   Datanode creates more than 20K Datanode State Machine Thread, then OOM happened.
   ![image](https://user-images.githubusercontent.com/51938049/87116288-fdb44c00-c2a7-11ea-80b0-e3f77e1fe3ab.png)
   
   **What's the reason ?**
   20K Datanode State Machine Thread were created by newCachedThreadPool
   ![image](https://user-images.githubusercontent.com/51938049/87116364-2fc5ae00-c2a8-11ea-98c9-604aa027c349.png)
   
   Almost all of them were wait lock.
   ![image](https://user-images.githubusercontent.com/51938049/87116509-8cc16400-c2a8-11ea-82d5-c23dfa5438f0.png)
   
   Only one Datanode State Machine Thread got the lock, and block when submitRequest. Because this thread was blocked and can not free the lock, newCachedThreadPool will create new thread infinitely.
   ![image](https://user-images.githubusercontent.com/51938049/87116744-2ab52e80-c2a9-11ea-8ba1-b57e1cade46d.png)
   
   **How to fix ?**
   1. Avoid use newCachedThreadPool, because it will create new thread infinitely, if no thread available in pool.
   2. Cancel future when task time out.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-3933
   
   ## How was this patch tested?
   
   Existed UT.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] codecov-commenter commented on pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#issuecomment-660814028


   # [Codecov](https://codecov.io/gh/apache/hadoop-ozone/pull/1185?src=pr&el=h1) Report
   > Merging [#1185](https://codecov.io/gh/apache/hadoop-ozone/pull/1185?src=pr&el=desc) into [master](https://codecov.io/gh/apache/hadoop-ozone/commit/f8fcc4760b227308c4a72f06f2e35864e69a1f22&el=desc) will **increase** coverage by `2.93%`.
   > The diff coverage is `74.96%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hadoop-ozone/pull/1185/graphs/tree.svg?width=650&height=150&src=pr&token=5YeeptJMby)](https://codecov.io/gh/apache/hadoop-ozone/pull/1185?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #1185      +/-   ##
   ============================================
   + Coverage     70.56%   73.49%   +2.93%     
   - Complexity     9427    10048     +621     
   ============================================
     Files           965      978      +13     
     Lines         49063    49918     +855     
     Branches       4803     4848      +45     
   ============================================
   + Hits          34620    36689    +2069     
   + Misses        12137    10925    -1212     
   + Partials       2306     2304       -2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/hadoop-ozone/pull/1185?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../apache/hadoop/hdds/conf/HddsPrometheusConfig.java](https://codecov.io/gh/apache/hadoop-ozone/pull/1185/diff?src=pr&el=tree#diff-aGFkb29wLWhkZHMvY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9oYWRvb3AvaGRkcy9jb25mL0hkZHNQcm9tZXRoZXVzQ29uZmlnLmphdmE=) | `50.00% <ø> (ø)` | `2.00 <0.00> (ø)` | |
   | [...op/hdds/utils/LegacyHadoopConfigurationSource.java](https://codecov.io/gh/apache/hadoop-ozone/pull/1185/diff?src=pr&el=tree#diff-aGFkb29wLWhkZHMvY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9oYWRvb3AvaGRkcy91dGlscy9MZWdhY3lIYWRvb3BDb25maWd1cmF0aW9uU291cmNlLmphdmE=) | `50.00% <ø> (ø)` | `5.00 <0.00> (ø)` | |
   | [.../java/org/apache/hadoop/ozone/OzoneConfigKeys.java](https://codecov.io/gh/apache/hadoop-ozone/pull/1185/diff?src=pr&el=tree#diff-aGFkb29wLWhkZHMvY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9oYWRvb3Avb3pvbmUvT3pvbmVDb25maWdLZXlzLmphdmE=) | `100.00% <ø> (ø)` | `1.00 <0.00> (ø)` | |
   | [...main/java/org/apache/hadoop/ozone/OzoneConsts.java](https://codecov.io/gh/apache/hadoop-ozone/pull/1185/diff?src=pr&el=tree#diff-aGFkb29wLWhkZHMvY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9oYWRvb3Avb3pvbmUvT3pvbmVDb25zdHMuamF2YQ==) | `85.71% <ø> (+1.50%)` | `1.00 <0.00> (ø)` | |
   | [...g/apache/hadoop/hdds/conf/ConfigurationSource.java](https://codecov.io/gh/apache/hadoop-ozone/pull/1185/diff?src=pr&el=tree#diff-aGFkb29wLWhkZHMvY29uZmlnL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9oYWRvb3AvaGRkcy9jb25mL0NvbmZpZ3VyYXRpb25Tb3VyY2UuamF2YQ==) | `22.72% <ø> (ø)` | `7.00 <0.00> (ø)` | |
   | [...rg/apache/hadoop/ozone/HddsDatanodeHttpServer.java](https://codecov.io/gh/apache/hadoop-ozone/pull/1185/diff?src=pr&el=tree#diff-aGFkb29wLWhkZHMvY29udGFpbmVyLXNlcnZpY2Uvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2hhZG9vcC9vem9uZS9IZGRzRGF0YW5vZGVIdHRwU2VydmVyLmphdmE=) | `100.00% <ø> (+28.57%)` | `13.00 <0.00> (+4.00)` | |
   | [...mon/transport/server/ratis/XceiverServerRatis.java](https://codecov.io/gh/apache/hadoop-ozone/pull/1185/diff?src=pr&el=tree#diff-aGFkb29wLWhkZHMvY29udGFpbmVyLXNlcnZpY2Uvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2hhZG9vcC9vem9uZS9jb250YWluZXIvY29tbW9uL3RyYW5zcG9ydC9zZXJ2ZXIvcmF0aXMvWGNlaXZlclNlcnZlclJhdGlzLmphdmE=) | `88.10% <0.00%> (-4.50%)` | `62.00 <0.00> (+5.00)` | :arrow_down: |
   | [...iner/ozoneimpl/ContainerScrubberConfiguration.java](https://codecov.io/gh/apache/hadoop-ozone/pull/1185/diff?src=pr&el=tree#diff-aGFkb29wLWhkZHMvY29udGFpbmVyLXNlcnZpY2Uvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2hhZG9vcC9vem9uZS9jb250YWluZXIvb3pvbmVpbXBsL0NvbnRhaW5lclNjcnViYmVyQ29uZmlndXJhdGlvbi5qYXZh) | `81.81% <ø> (-18.19%)` | `7.00 <0.00> (-1.00)` | |
   | [...org/apache/hadoop/hdds/server/http/HttpConfig.java](https://codecov.io/gh/apache/hadoop-ozone/pull/1185/diff?src=pr&el=tree#diff-aGFkb29wLWhkZHMvZnJhbWV3b3JrL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9oYWRvb3AvaGRkcy9zZXJ2ZXIvaHR0cC9IdHRwQ29uZmlnLmphdmE=) | `76.47% <0.00%> (ø)` | `1.00 <0.00> (ø)` | |
   | [.../hadoop/hdds/scm/container/ReplicationManager.java](https://codecov.io/gh/apache/hadoop-ozone/pull/1185/diff?src=pr&el=tree#diff-aGFkb29wLWhkZHMvc2VydmVyLXNjbS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaGFkb29wL2hkZHMvc2NtL2NvbnRhaW5lci9SZXBsaWNhdGlvbk1hbmFnZXIuamF2YQ==) | `87.35% <0.00%> (-0.70%)` | `102.00 <0.00> (ø)` | |
   | ... and [453 more](https://codecov.io/gh/apache/hadoop-ozone/pull/1185/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/hadoop-ozone/pull/1185?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/hadoop-ozone/pull/1185?src=pr&el=footer). Last update [25d3e52...2c381fa](https://codecov.io/gh/apache/hadoop-ozone/pull/1185?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] ChenSammi commented on pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#issuecomment-657437337


   Don't find any unit tests here. Would you please add some?  @runzhiwang 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] ChenSammi commented on a change in pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on a change in pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#discussion_r456200241



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/StateContext.java
##########
@@ -415,7 +431,14 @@ public void execute(ExecutorService service, long time, TimeUnit unit)
       if (this.isEntering()) {
         task.onEnter();
       }
-      task.execute(service);
+
+      if (isThreadPoolAvailable(service)) {
+        task.execute(service);
+      } else {
+        LOG.warn("No available thread in pool, duration:" +

Review comment:
       There is a  EndpointStateMachine#logIfNeeded function.   We can leverage this function to reduce the same warning logs logged.  @runzhiwang




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] ChenSammi removed a comment on pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

Posted by GitBox <gi...@apache.org>.
ChenSammi removed a comment on pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#issuecomment-657437337


   Don't find any unit tests here. Would you please add some new unit test?  @runzhiwang 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] ChenSammi commented on a change in pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on a change in pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#discussion_r454191790



##########
File path: hadoop-hdds/common/src/main/resources/ozone-default.xml
##########
@@ -268,6 +268,13 @@
       datanode periodically send node report to SCM. Unit could be
       defined with postfix (ns,ms,s,m,h,d)</description>
   </property>
+  <property>
+    <name>hdds.statemachine.endpoint.task.thread.count</name>
+    <value>2</value>
+    <tag>OZONE, DATANODE, MANAGEMENT</tag>
+    <description>Maximum number of threads in the thread pool that Datanode
+      will use for GETVERSION, REGISTER, HEARTBEAT to SCMs.</description>
+  </property>

Review comment:
       Can we add more details about what's the proper value for this proerpty if user want to change it? 

##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/states/datanode/RunningDatanodeState.java
##########
@@ -200,22 +203,31 @@ public void execute(ExecutorService executor) {
   @Override
   public DatanodeStateMachine.DatanodeStates
       await(long duration, TimeUnit timeUnit)
-      throws InterruptedException, ExecutionException, TimeoutException {
+      throws InterruptedException {
     int count = connectionManager.getValues().size();
     int returned = 0;
-    long timeLeft = timeUnit.toMillis(duration);
+    long durationMS = timeUnit.toMillis(duration);
     long startTime = Time.monotonicNow();
-    List<Future<EndPointStates>> results = new LinkedList<>();
-
-    while (returned < count && timeLeft > 0) {
-      Future<EndPointStates> result =
-          ecs.poll(timeLeft, TimeUnit.MILLISECONDS);
-      if (result != null) {
-        results.add(result);
-        returned++;
+    Set<Future<EndPointStates>> results = new HashSet<>();
+
+    while (returned < count
+        && (durationMS - (Time.monotonicNow() - startTime))> 0) {
+      for (Future<EndPointStates> future : futures) {
+        if (future != null && future.isDone() && !results.contains(future)) {
+          results.add(future);
+          returned++;
+        }
       }
-      timeLeft = timeLeft - (Time.monotonicNow() - startTime);
+
+      Thread.sleep(durationMS / 10);
     }
+
+    for (Future<EndPointStates> future : futures) {
+      if (future != null && !future.isDone()) {
+        future.cancel(true);

Review comment:
       Instead of cancel the future, I would suggest to skip the next time execution, and let the current future finish it.  Also, please add LOG if this case happens. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] ChenSammi edited a comment on pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

Posted by GitBox <gi...@apache.org>.
ChenSammi edited a comment on pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#issuecomment-657437337


   Don't find any unit tests here. Would you please add some new unit test?  @runzhiwang 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] runzhiwang commented on pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#issuecomment-656509377


   I will check the failed ut.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#discussion_r458571774



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/StateContext.java
##########
@@ -434,9 +438,14 @@ public void execute(ExecutorService service, long time, TimeUnit unit)
 
       if (isThreadPoolAvailable(service)) {
         task.execute(service);
+        threadPoolNotAvailableCount.set(0);
       } else {
-        LOG.warn("No available thread in pool, duration:" +
-            unit.toMillis(time));
+        threadPoolNotAvailableCount.incrementAndGet();

Review comment:
       Updated




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#discussion_r457074883



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/StateContext.java
##########
@@ -415,7 +431,14 @@ public void execute(ExecutorService service, long time, TimeUnit unit)
       if (this.isEntering()) {
         task.onEnter();
       }
-      task.execute(service);
+
+      if (isThreadPoolAvailable(service)) {
+        task.execute(service);
+      } else {
+        LOG.warn("No available thread in pool, duration:" +

Review comment:
       @ChenSammi @xiaoyuyao  Thanks for review. I have updated the PR, because  EndpointStateMachine#logIfNeeded can not be accessed there, and it's can not be refactored to use, so I did not use it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#discussion_r454900546



##########
File path: hadoop-hdds/common/src/main/resources/ozone-default.xml
##########
@@ -268,6 +268,13 @@
       datanode periodically send node report to SCM. Unit could be
       defined with postfix (ns,ms,s,m,h,d)</description>
   </property>
+  <property>
+    <name>hdds.statemachine.endpoint.task.thread.count</name>
+    <value>2</value>
+    <tag>OZONE, DATANODE, MANAGEMENT</tag>
+    <description>Maximum number of threads in the thread pool that Datanode
+      will use for GETVERSION, REGISTER, HEARTBEAT to SCMs.</description>
+  </property>

Review comment:
       I remove this config, and get thread pool size by add number of scm and recon.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#discussion_r454194380



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/states/datanode/RunningDatanodeState.java
##########
@@ -200,22 +203,31 @@ public void execute(ExecutorService executor) {
   @Override
   public DatanodeStateMachine.DatanodeStates
       await(long duration, TimeUnit timeUnit)
-      throws InterruptedException, ExecutionException, TimeoutException {
+      throws InterruptedException {
     int count = connectionManager.getValues().size();
     int returned = 0;
-    long timeLeft = timeUnit.toMillis(duration);
+    long durationMS = timeUnit.toMillis(duration);
     long startTime = Time.monotonicNow();
-    List<Future<EndPointStates>> results = new LinkedList<>();
-
-    while (returned < count && timeLeft > 0) {
-      Future<EndPointStates> result =
-          ecs.poll(timeLeft, TimeUnit.MILLISECONDS);
-      if (result != null) {
-        results.add(result);
-        returned++;
+    Set<Future<EndPointStates>> results = new HashSet<>();
+
+    while (returned < count
+        && (durationMS - (Time.monotonicNow() - startTime))> 0) {
+      for (Future<EndPointStates> future : futures) {
+        if (future != null && future.isDone() && !results.contains(future)) {
+          results.add(future);
+          returned++;
+        }
       }
-      timeLeft = timeLeft - (Time.monotonicNow() - startTime);
+
+      Thread.sleep(durationMS / 10);
     }
+
+    for (Future<EndPointStates> future : futures) {
+      if (future != null && !future.isDone()) {
+        future.cancel(true);

Review comment:
       Maybe the future can not finish forever, because the rpc call blocked forever, as we found in production cluster.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#discussion_r454194380



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/states/datanode/RunningDatanodeState.java
##########
@@ -200,22 +203,31 @@ public void execute(ExecutorService executor) {
   @Override
   public DatanodeStateMachine.DatanodeStates
       await(long duration, TimeUnit timeUnit)
-      throws InterruptedException, ExecutionException, TimeoutException {
+      throws InterruptedException {
     int count = connectionManager.getValues().size();
     int returned = 0;
-    long timeLeft = timeUnit.toMillis(duration);
+    long durationMS = timeUnit.toMillis(duration);
     long startTime = Time.monotonicNow();
-    List<Future<EndPointStates>> results = new LinkedList<>();
-
-    while (returned < count && timeLeft > 0) {
-      Future<EndPointStates> result =
-          ecs.poll(timeLeft, TimeUnit.MILLISECONDS);
-      if (result != null) {
-        results.add(result);
-        returned++;
+    Set<Future<EndPointStates>> results = new HashSet<>();
+
+    while (returned < count
+        && (durationMS - (Time.monotonicNow() - startTime))> 0) {
+      for (Future<EndPointStates> future : futures) {
+        if (future != null && future.isDone() && !results.contains(future)) {
+          results.add(future);
+          returned++;
+        }
       }
-      timeLeft = timeLeft - (Time.monotonicNow() - startTime);
+
+      Thread.sleep(durationMS / 10);
     }
+
+    for (Future<EndPointStates> future : futures) {
+      if (future != null && !future.isDone()) {
+        future.cancel(true);

Review comment:
       Maybe the future can not finish forever, because the rpc call blocked forever, as we found in production cluster.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] xiaoyuyao commented on a change in pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

Posted by GitBox <gi...@apache.org>.
xiaoyuyao commented on a change in pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#discussion_r456153310



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/StateContext.java
##########
@@ -415,7 +431,14 @@ public void execute(ExecutorService service, long time, TimeUnit unit)
       if (this.isEntering()) {
         task.onEnter();
       }
-      task.execute(service);
+
+      if (isThreadPoolAvailable(service)) {
+        task.execute(service);
+      } else {
+        LOG.warn("No available thread in pool, duration:" +

Review comment:
       In a similar case of endpoint task locked up, will we end up with flooding of warning logs here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] ChenSammi commented on a change in pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on a change in pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#discussion_r458560904



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/StateContext.java
##########
@@ -434,9 +438,14 @@ public void execute(ExecutorService service, long time, TimeUnit unit)
 
       if (isThreadPoolAvailable(service)) {
         task.execute(service);
+        threadPoolNotAvailableCount.set(0);
       } else {
-        LOG.warn("No available thread in pool, duration:" +
-            unit.toMillis(time));
+        threadPoolNotAvailableCount.incrementAndGet();

Review comment:
       Can we switch theset two statement? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] ChenSammi merged pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

Posted by GitBox <gi...@apache.org>.
ChenSammi merged pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on a change in pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#discussion_r456195255



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/StateContext.java
##########
@@ -415,7 +431,14 @@ public void execute(ExecutorService service, long time, TimeUnit unit)
       if (this.isEntering()) {
         task.onEnter();
       }
-      task.execute(service);
+
+      if (isThreadPoolAvailable(service)) {
+        task.execute(service);
+      } else {
+        LOG.warn("No available thread in pool, duration:" +

Review comment:
       @xiaoyuyao Thanks for review and reasonable suggestion. How about add metrics for thread pool available and unavailable times ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] ChenSammi commented on pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

Posted by GitBox <gi...@apache.org>.
ChenSammi commented on pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#issuecomment-662429401


   LGTM + 1.  Thanks @runzhiwang for the contribution and @xiaoyuyao for the review.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org