You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@celeborn.apache.org by "FMX (via GitHub)" <gi...@apache.org> on 2023/02/23 12:51:20 UTC

[GitHub] [incubator-celeborn] FMX opened a new pull request, #1270: [CELEBORN-333][FLINK] Fix memory leak.

FMX opened a new pull request, #1270:
URL: https://github.com/apache/incubator-celeborn/pull/1270

   ### What changes were proposed in this pull request?
   1. Solve buffer dispatcher memory leak in worker.
   2. Solve transportFrameDecoder memory leak in client.
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   NO.
   
   
   ### How was this patch tested?
   UT and cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] RexXiong commented on a diff in pull request #1270: [CELEBORN-335][FLINK] Fix memory leak.

Posted by "RexXiong (via GitHub)" <gi...@apache.org>.
RexXiong commented on code in PR #1270:
URL: https://github.com/apache/incubator-celeborn/pull/1270#discussion_r1115752825


##########
client-flink/common/src/main/java/org/apache/celeborn/plugin/flink/network/TransportFrameDecoderWithBufferSupplier.java:
##########
@@ -144,6 +148,9 @@ public void channelRead(ChannelHandlerContext ctx, Object data) {
           }
         }
       }
+    } catch (Exception e) {

Review Comment:
   Do this need to notify the server?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] FMX closed pull request #1270: [CELEBORN-335][FLINK] Fix memory leak.

Posted by "FMX (via GitHub)" <gi...@apache.org>.
FMX closed pull request #1270: [CELEBORN-335][FLINK] Fix memory leak.
URL: https://github.com/apache/incubator-celeborn/pull/1270


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] RexXiong commented on a diff in pull request #1270: [CELEBORN-335][FLINK] Fix memory leak.

Posted by "RexXiong (via GitHub)" <gi...@apache.org>.
RexXiong commented on code in PR #1270:
URL: https://github.com/apache/incubator-celeborn/pull/1270#discussion_r1117927255


##########
common/src/main/java/org/apache/celeborn/common/network/server/memory/ReadBufferDispatcher.java:
##########
@@ -48,7 +48,7 @@ public void addBufferRequest(ReadBufferRequest request) {
 
   public void recycle(ByteBuf buf) {
     int bufferSize = buf.capacity();
-    buf.release();
+    buf.release(buf.refCnt());
     memoryManager.changeReadBufferCounter(-1 * bufferSize);

Review Comment:
   We can delete buf.retain() when allocators new buffer as we always retain it when we use the buffer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] FMX commented on a diff in pull request #1270: [CELEBORN-335][FLINK] Fix memory leak.

Posted by "FMX (via GitHub)" <gi...@apache.org>.
FMX commented on code in PR #1270:
URL: https://github.com/apache/incubator-celeborn/pull/1270#discussion_r1116852226


##########
client-flink/common/src/main/java/org/apache/celeborn/plugin/flink/network/TransportFrameDecoderWithBufferSupplier.java:
##########
@@ -144,6 +148,9 @@ public void channelRead(ChannelHandlerContext ctx, Object data) {
           }
         }
       }
+    } catch (Exception e) {
+      logger.error(
+          "decode msg failed type {} curMsg {} size {}", curType, curMsg, nettyBuf.readableBytes());

Review Comment:
   Yes,it should.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] FMX commented on a diff in pull request #1270: [CELEBORN-335][FLINK] Fix memory leak.

Posted by "FMX (via GitHub)" <gi...@apache.org>.
FMX commented on code in PR #1270:
URL: https://github.com/apache/incubator-celeborn/pull/1270#discussion_r1116853303


##########
client-flink/common/src/main/java/org/apache/celeborn/plugin/flink/network/TransportFrameDecoderWithBufferSupplier.java:
##########
@@ -144,6 +148,9 @@ public void channelRead(ChannelHandlerContext ctx, Object data) {
           }
         }
       }
+    } catch (Exception e) {

Review Comment:
   NO. Because read data won't expect rpc response. And other RPC decode failure means that the data is broken. We might can not send a response to the server.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] codecov[bot] commented on pull request #1270: [CELEBORN-335][FLINK] Fix memory leak.

Posted by "codecov[bot] (via GitHub)" <gi...@apache.org>.
codecov[bot] commented on PR #1270:
URL: https://github.com/apache/incubator-celeborn/pull/1270#issuecomment-1441736573

   # [Codecov](https://codecov.io/gh/apache/incubator-celeborn/pull/1270?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1270](https://codecov.io/gh/apache/incubator-celeborn/pull/1270?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (e64f639) into [main](https://codecov.io/gh/apache/incubator-celeborn/commit/af9e8366c917c9093b47b2fd8230e366f777af1f?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (af9e836) will **increase** coverage by `0.03%`.
   > The diff coverage is `0.00%`.
   
   > :exclamation: Current head e64f639 differs from pull request most recent head a200f7d. Consider uploading reports for the commit a200f7d to get more accurate results
   
   ```diff
   @@             Coverage Diff              @@
   ##               main    #1270      +/-   ##
   ============================================
   + Coverage     27.08%   27.11%   +0.03%     
   - Complexity      810      813       +3     
   ============================================
     Files           215      215              
     Lines         18368    18368              
     Branches       1999     1999              
   ============================================
   + Hits           4973     4978       +5     
   + Misses        13067    13065       -2     
   + Partials        328      325       -3     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-celeborn/pull/1270?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...on/network/server/memory/ReadBufferDispatcher.java](https://codecov.io/gh/apache/incubator-celeborn/pull/1270?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Y29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jZWxlYm9ybi9jb21tb24vbmV0d29yay9zZXJ2ZXIvbWVtb3J5L1JlYWRCdWZmZXJEaXNwYXRjaGVyLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...oy/worker/congestcontrol/CongestionController.java](https://codecov.io/gh/apache/incubator-celeborn/pull/1270?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-d29ya2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jZWxlYm9ybi9zZXJ2aWNlL2RlcGxveS93b3JrZXIvY29uZ2VzdGNvbnRyb2wvQ29uZ2VzdGlvbkNvbnRyb2xsZXIuamF2YQ==) | `76.86% <0.00%> (-0.92%)` | :arrow_down: |
   | [...celeborn/service/deploy/master/SlotsAllocator.java](https://codecov.io/gh/apache/incubator-celeborn/pull/1270?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-bWFzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jZWxlYm9ybi9zZXJ2aWNlL2RlcGxveS9tYXN0ZXIvU2xvdHNBbGxvY2F0b3IuamF2YQ==) | `71.73% <0.00%> (+2.46%)` | :arrow_up: |
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] zhongqiangczq commented on a diff in pull request #1270: [CELEBORN-335][FLINK] Fix memory leak.

Posted by "zhongqiangczq (via GitHub)" <gi...@apache.org>.
zhongqiangczq commented on code in PR #1270:
URL: https://github.com/apache/incubator-celeborn/pull/1270#discussion_r1116781258


##########
client-flink/common/src/main/java/org/apache/celeborn/plugin/flink/network/TransportFrameDecoderWithBufferSupplier.java:
##########
@@ -144,6 +148,9 @@ public void channelRead(ChannelHandlerContext ctx, Object data) {
           }
         }
       }
+    } catch (Exception e) {
+      logger.error(
+          "decode msg failed type {} curMsg {} size {}", curType, curMsg, nettyBuf.readableBytes());

Review Comment:
   logger.error("", nettyBuf.readableBytes() => logger.error("", nettyBuf.readableBytes(),e)?, we can see the exception cause



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org