You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by "adoroszlai (via GitHub)" <gi...@apache.org> on 2023/02/11 11:07:42 UTC

[GitHub] [ozone] adoroszlai opened a new pull request, #4269: HDDS-7931. EC: ManagedChannelImpl not cleaned up properly

adoroszlai opened a new pull request, #4269:
URL: https://github.com/apache/ozone/pull/4269

   ## What changes were proposed in this pull request?
   
   Attempt to fix `ManagedChannelImpl` leaks related to EC.
   
   1. Revert parts of the previous attempt (#4215).  Rationale: Invalidating the client removes it from the cache.  Doing so upon `close()` defeats the purpose of the cache.
      * However, keep `cleanup` call in `BlockOutputStream#close` regardless of `bufferPool` is empty or not.
   2. Fixes in prod. code:
      * `ECReconstructionCoordinator` did not close output streams if error happened during write (`try-finally` only covered `putBlock`, not `write`)
      * `ECKeyOutputStream#markStreamClosed` (called on exceptions) cleared the list of output stream entries without closing them.  It also set the `closed` flag, which prevented the real `close()` from doing any cleanup.
      * `BlockOutputStreamEntry#cleanup` initializes stream if needed (`checkStream`), then cleans it up.  `ECBlockOutputStreamEntry` overrides `checkStream` to create multiple streams but cleanup in parent class still happens for only one stream.  Fix it by overriding `cleanup`.
      * `BlockInputStream#acquireClient` did not check if client is already created, hence may leak previous one if called multiple times.
      * `BlockInputStream#getChunkInfos` released client only on failure, delaying release until `close()` on happy path.  Since the client is no longer needed in any case, it can be released early.
      * Fix the mismatch in `BlockInputStream`, where the client was acquired for "read", but released for "write".  Since `invalidateClient` is false, this does not make any real difference.
   3. Some fixes in test code to close streams/clients (part of that is from the previous patch).  The warnings due to these unclosed streams/clients may hide real problems.  There are many-many more, filed HDDS-7958 to keep this patch manageable in size.
   
   https://issues.apache.org/jira/browse/HDDS-7931
   
   ## How was this patch tested?
   
   Ran several tests:
   
   ```
   mvn clean test -am -pl :ozone-integration-test -Dtest='Test*OutputStream*,Test*InputStream*,TestECContainerRecovery,TestOzoneRpcClient,TestSecureOzoneRpcClient,TestContainerCommandsEC,TestOzoneFileSystem,TestRootedOzoneFileSystem'
   ```
   
   Verified no shutdown warning about `ManagedChannelImpl` related to `XceiverClientGrpc` appears in the logs.
   
   Regular CI:
   https://github.com/adoroszlai/hadoop-ozone/actions/runs/4143279616


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] kerneltime commented on pull request #4269: HDDS-7931. EC: ManagedChannelImpl not cleaned up properly

Posted by "kerneltime (via GitHub)" <gi...@apache.org>.
kerneltime commented on PR #4269:
URL: https://github.com/apache/ozone/pull/4269#issuecomment-1426962546

   Change looks good, need to go over hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/io/ECKeyOutputStream.java before giving it a +1


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] umamaheswararao commented on pull request #4269: HDDS-7931. EC: ManagedChannelImpl not cleaned up properly

Posted by "umamaheswararao (via GitHub)" <gi...@apache.org>.
umamaheswararao commented on PR #4269:
URL: https://github.com/apache/ozone/pull/4269#issuecomment-1433192028

   Thanks for working on this and cleanup @adoroszlai 
   I spent some time on this to review, LGTM. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a diff in pull request #4269: HDDS-7931. EC: ManagedChannelImpl not cleaned up properly

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on code in PR #4269:
URL: https://github.com/apache/ozone/pull/4269#discussion_r1104363279


##########
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/io/ECKeyOutputStream.java:
##########
@@ -548,7 +553,7 @@ private void addStripeToQueue(ECChunkBuffers stripe) throws IOException {
   private boolean flushStripeFromQueue() throws IOException {
     try {
       ECChunkBuffers stripe = ecStripeQueue.take();
-      while (!(stripe instanceof EOFDummyStripe)) {
+      while (!closing && !(stripe instanceof EOFDummyStripe)) {

Review Comment:
   It guess it can only be closing=true at this point if there was a previous write exception that got thrown back to the caller. At that point, the caller needs to call close(). If they try to write again, they should see errors as the stream is "closing".
   
   However I do wonder if we should check for and flush queued stripes in the close() method when closing = true.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sumitagrawl commented on a diff in pull request #4269: HDDS-7931. EC: ManagedChannelImpl not cleaned up properly

Posted by "sumitagrawl (via GitHub)" <gi...@apache.org>.
sumitagrawl commented on code in PR #4269:
URL: https://github.com/apache/ozone/pull/4269#discussion_r1105345807


##########
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/io/KeyOutputStream.java:
##########
@@ -335,7 +335,7 @@ private void handleException(BlockOutputStreamEntry streamEntry,
       excludeList.addPipeline(pipelineId);
     }
     // just clean up the current stream.
-    streamEntry.cleanup(!retryFailure);
+    streamEntry.cleanup(retryFailure);

Review Comment:
   retryFailure is used to invalidate client in cache, recently changed for (HDDS-7838. gRPC channel created block input/output stream not shutdown properly),  so do this revert change will also work for both case and do need invalidate stream in cache for this ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] adoroszlai merged pull request #4269: HDDS-7931. EC: ManagedChannelImpl not cleaned up properly

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai merged PR #4269:
URL: https://github.com/apache/ozone/pull/4269


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] adoroszlai commented on a diff in pull request #4269: HDDS-7931. EC: ManagedChannelImpl not cleaned up properly

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai commented on code in PR #4269:
URL: https://github.com/apache/ozone/pull/4269#discussion_r1105473968


##########
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/io/KeyOutputStream.java:
##########
@@ -335,7 +335,7 @@ private void handleException(BlockOutputStreamEntry streamEntry,
       excludeList.addPipeline(pipelineId);
     }
     // just clean up the current stream.
-    streamEntry.cleanup(!retryFailure);
+    streamEntry.cleanup(retryFailure);

Review Comment:
   Yes, this part of the revert is intentional.  `retryFailure` name has two possible meanings: "retry the failure" or "retry has failed".  I had assumed it is the first one, but closer look at the code revelead it is the second one.  So the original code was OK, we should invalidate the client on `retryFailure` (HDDS-959).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] adoroszlai commented on pull request #4269: HDDS-7931. EC: ManagedChannelImpl not cleaned up properly

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai commented on PR #4269:
URL: https://github.com/apache/ozone/pull/4269#issuecomment-1433004156

   Thanks @kaijchen, @kerneltime, @sodonnel, @sumitagrawl for the review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a diff in pull request #4269: HDDS-7931. EC: ManagedChannelImpl not cleaned up properly

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on code in PR #4269:
URL: https://github.com/apache/ozone/pull/4269#discussion_r1104368480


##########
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/io/ECKeyOutputStream.java:
##########
@@ -487,23 +490,23 @@ public void close() throws IOException {
     }
     closed = true;
     try {
-      // If stripe buffer is not empty, encode and flush the stripe.
-      if (ecChunkBufferCache.getFirstDataCell().position() > 0) {
-        generateParityCells();
-        addStripeToQueue(ecChunkBufferCache);
-      }
-      // Send EOF mark to flush thread.
-      addStripeToQueue(new EOFDummyStripe());
+      if (!closing) {

Review Comment:
   I wonder if we should still at least try to flush the "stripeQueue" even if the state is closing = true?
   
   When we do writes, the stripe gets buffered in the queue, so there might be some buffered data ready to write out. However closing = true is an error state, indicating something unexpected happened during a previous write, so the completeness of the current stripe buffer may not be known or good.
   
   I am not 100% sure what the best approach is - trying to write out the buffered data could run into problems too, due to the same exception that caused the stream to go into the closing state.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] kaijchen commented on a diff in pull request #4269: HDDS-7931. EC: ManagedChannelImpl not cleaned up properly

Posted by "kaijchen (via GitHub)" <gi...@apache.org>.
kaijchen commented on code in PR #4269:
URL: https://github.com/apache/ozone/pull/4269#discussion_r1104054335


##########
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/io/ECKeyOutputStream.java:
##########
@@ -548,7 +553,7 @@ private void addStripeToQueue(ECChunkBuffers stripe) throws IOException {
   private boolean flushStripeFromQueue() throws IOException {
     try {
       ECChunkBuffers stripe = ecStripeQueue.take();
-      while (!(stripe instanceof EOFDummyStripe)) {
+      while (!closing && !(stripe instanceof EOFDummyStripe)) {

Review Comment:
   There might be some stripes buffered in `ecStripeQueue`, and `EOFDummyStripe` is used as signal of there is no more stripes left. Is it intended to drop the buffered stripes when `closing`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] kerneltime commented on pull request #4269: HDDS-7931. EC: ManagedChannelImpl not cleaned up properly

Posted by "kerneltime (via GitHub)" <gi...@apache.org>.
kerneltime commented on PR #4269:
URL: https://github.com/apache/ozone/pull/4269#issuecomment-1428361288

   cc @sumitagrawl can you please take a look


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org