Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/12/31 11:40:19 UTC

[GitHub] [spark] 012huang opened a new pull request #27060: [SPARK-30246][CORE][SHUFFLE]fix spark external shuffle memory leak

012huang opened a new pull request #27060: [SPARK-30246][CORE][SHUFFLE]fix spark external shuffle memory leak
URL: https://github.com/apache/spark/pull/27060
 
 
   ### What changes were proposed in this pull request?
   An application that finishes abnormally can sometimes cause a memory leak in the external shuffle service. In one of our production cases, the app failed because its stages were cancelled after the SparkContext had already shut down. Strangely, requests to fetch shuffle data still arrived afterwards and caused errors on the server side, as shown below:
   ```
   2019-12-08 22:23:33,375 ERROR server.TransportRequestHandler (TransportRequestHandler.java:processFetchRequest(132)) - Error opening block StreamChunkId{streamId=1902064894814, chunkIndex=0} for request from /10.221.115.175:38582
   java.lang.RuntimeException: Executor is not registered (appId=application_1574499669561_954327, execId=4514)
   ```
   The client side also shows a corresponding log entry:
   ```
   org.apache.spark.shuffle.FetchFailedException: Failure while fetching StreamChunkId{streamId=1902064894814, chunkIndex=0}: java.lang.RuntimeException: Executor is not registered (appId=application_1574499669561_954327, execId=4514)
   ```
   In some cases, an `OpenBlocks` request is still in flight. `ExternalShuffleBlockHandler#handleMessage` registers a `StreamState` in `OneForOneStreamManager#streams` and then unconditionally replies to the client with a success response. The client receives the response and sends a `ChunkFetchRequest` to fetch the chunk, but by this time the `APPLICATION_STOP` event for the app has been received and `ExternalShuffleService#applicationRemoved` has run to clean up the app's `ExecutorShuffleInfo`. This is what triggers the `Executor is not registered` error. When the client channel closes, `TransportRequestHandler#channelInactive` is called to clean up the `StreamState` associated with that channel, but cleaning up the `StreamState` buffers also looks up each `ManagedBuffer` using the `appId` and `execId`, which have already been removed from the `executors` object. The log message `StreamManager connectionTerminated() callback failed` also appears in the NM's log file.
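
The sequence described above can be reproduced with a small toy model. This is only a sketch: `LeakSketch` and its methods are hypothetical names that mimic the roles of `ExternalShuffleBlockResolver#executors` and `OneForOneStreamManager#streams`, and none of it is Spark's actual code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the race; every name here is hypothetical, not a real Spark class.
class LeakSketch {
    // Plays the role of ExternalShuffleBlockResolver#executors ("appId/execId" keys).
    final Set<String> executors = new HashSet<>();
    // Plays the role of OneForOneStreamManager#streams (streamId -> "appId/execId").
    final Map<Long, String> streams = new HashMap<>();
    private long nextStreamId = 0;

    void registerExecutor(String appId, String execId) {
        executors.add(appId + "/" + execId);
    }

    // Unguarded OpenBlocks: always registers a stream and reports success.
    long openBlocks(String appId, String execId) {
        long id = nextStreamId++;
        streams.put(id, appId + "/" + execId);
        return id;
    }

    // APPLICATION_STOP: drops the executor info but leaves the streams untouched.
    void applicationRemoved(String appId) {
        executors.removeIf(key -> key.startsWith(appId + "/"));
    }

    // Chunk fetch: fails once the executor info is gone, yet the stream entry stays.
    String fetchChunk(long streamId) {
        String key = streams.get(streamId);
        if (key == null || !executors.contains(key)) {
            throw new RuntimeException("Executor is not registered (" + key + ")");
        }
        return "chunk-of-" + key;
    }

    public static void main(String[] args) {
        LeakSketch svc = new LeakSketch();
        svc.registerExecutor("app-1", "4514");
        long id = svc.openBlocks("app-1", "4514"); // OpenBlocks succeeds
        svc.applicationRemoved("app-1");           // app stops before the fetch
        try {
            svc.fetchChunk(id);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
        System.out.println("streams left behind: " + svc.streams.size());
    }
}
```

In this model the fetch fails while the stream entry is never removed, which is the shape of the leak reported here.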
   
   So, when an `OpenBlocks` request arrives, we should look up `ExternalShuffleBlockResolver#executors`. If the related app has exited, we should not register a `StreamState` and should just close the client connection (or reply with a special message and handle it on the client side). And when `APPLICATION_STOP` is received for an app and `applicationRemoved` is called, we should clean up the related `StreamState` entries before the `ExecutorShuffleInfo` is cleaned. This is what the PR changes, and it prevents the shuffle service memory leak.
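
A minimal sketch of the two rules above, under stated assumptions: `GuardedShuffleSketch`, its methods, and the `-1` rejection value are hypothetical and only illustrate the ordering. They are not the code in this PR or Spark's API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the proposed behaviour; every name here is hypothetical.
class GuardedShuffleSketch {
    // Plays the role of ExternalShuffleBlockResolver#executors ("appId/execId" keys).
    final Set<String> executors = new HashSet<>();
    // Plays the role of OneForOneStreamManager#streams (streamId -> "appId/execId").
    final Map<Long, String> streams = new HashMap<>();
    private long nextStreamId = 0;

    synchronized void registerExecutor(String appId, String execId) {
        executors.add(appId + "/" + execId);
    }

    // Rule 1: only register a stream if the executor is still known.
    // Returns the new stream id, or -1 as a stand-in for rejecting the request.
    synchronized long openBlocks(String appId, String execId) {
        String key = appId + "/" + execId;
        if (!executors.contains(key)) {
            return -1L;
        }
        long id = nextStreamId++;
        streams.put(id, key);
        return id;
    }

    // Rule 2: on APPLICATION_STOP, drop the app's streams before its executor info.
    synchronized void applicationRemoved(String appId) {
        String prefix = appId + "/";
        streams.values().removeIf(key -> key.startsWith(prefix));
        executors.removeIf(key -> key.startsWith(prefix));
    }

    public static void main(String[] args) {
        GuardedShuffleSketch svc = new GuardedShuffleSketch();
        svc.registerExecutor("app-1", "4514");
        svc.openBlocks("app-1", "4514");
        svc.applicationRemoved("app-1");
        long late = svc.openBlocks("app-1", "4514"); // late request is rejected
        System.out.println("late request id: " + late);
        System.out.println("streams left behind: " + svc.streams.size());
    }
}
```

The methods are `synchronized` in this sketch so that the registration check and the cleanup cannot interleave. How the real handler should achieve that is a separate design question.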
   
   ### Why are the changes needed?
   The external shuffle service memory leak has a great impact on clusters with dynamic on and may cause the NM to crash.
   
   ### Does this PR introduce any user-facing change?
   No
   
   ### How was this patch tested?
   With existing unit tests (UT).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27060: [SPARK-30246][CORE][SHUFFLE]fix spark external shuffle memory leak

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27060: [SPARK-30246][CORE][SHUFFLE]fix spark external shuffle memory leak
URL: https://github.com/apache/spark/pull/27060#issuecomment-569916413
 
 
   Can one of the admins verify this patch?


[GitHub] [spark] 012huang commented on issue #27060: [SPARK-30246][CORE][SHUFFLE]fix spark external shuffle memory leak

Posted by GitBox <gi...@apache.org>.
012huang commented on issue #27060: [SPARK-30246][CORE][SHUFFLE]fix spark external shuffle memory leak
URL: https://github.com/apache/spark/pull/27060#issuecomment-571666482
 
 
   The shuffle service still has a memory leak in 2.4.3, and I have been told that some other users are facing the same problem.
   The shuffle module has changed significantly since 3.0 and I haven't gone through it completely. I will take time to work against master for that. Thank you for your reply.


[GitHub] [spark] vanzin commented on issue #27060: [SPARK-30246][CORE][SHUFFLE]fix spark external shuffle memory leak

Posted by GitBox <gi...@apache.org>.
vanzin commented on issue #27060: [SPARK-30246][CORE][SHUFFLE]fix spark external shuffle memory leak
URL: https://github.com/apache/spark/pull/27060#issuecomment-571251176
 
 
   Why is this against 2.4 and not master? If the problem does not exist in master, please explain why, and why you're not backporting whatever fixed the issue in master instead.
   
   This sounds similar to SPARK-26604, which says it is fixed in 2.4.1, but the bug you filed says 2.4.3.


[GitHub] [spark] 012huang commented on issue #27060: [SPARK-30246][CORE][SHUFFLE]fix spark external shuffle memory leak

Posted by GitBox <gi...@apache.org>.
012huang commented on issue #27060: [SPARK-30246][CORE][SHUFFLE]fix spark external shuffle memory leak
URL: https://github.com/apache/spark/pull/27060#issuecomment-571005574
 
 
   cc @viirya @dongjoon-hyun, can you help review this? Thanks.


[GitHub] [spark] AmplabJenkins commented on issue #27060: [SPARK-30246][CORE][SHUFFLE]fix spark external shuffle memory leak

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27060: [SPARK-30246][CORE][SHUFFLE]fix spark external shuffle memory leak
URL: https://github.com/apache/spark/pull/27060#issuecomment-569916599
 
 
   Can one of the admins verify this patch?


[GitHub] [spark] fbrams commented on issue #27060: [SPARK-30246][CORE][SHUFFLE]fix spark external shuffle memory leak

Posted by GitBox <gi...@apache.org>.
fbrams commented on issue #27060: [SPARK-30246][CORE][SHUFFLE]fix spark external shuffle memory leak
URL: https://github.com/apache/spark/pull/27060#issuecomment-586567758
 
 
   Just for a better understanding, if we are currently dealing with that bug in our environment:
   * What is NM?
   * "dynamics on" - are you referring to "spark.dynamicAllocation.enabled"?


[GitHub] [spark] AmplabJenkins commented on issue #27060: [SPARK-30246][CORE][SHUFFLE]fix spark external shuffle memory leak

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27060: [SPARK-30246][CORE][SHUFFLE]fix spark external shuffle memory leak
URL: https://github.com/apache/spark/pull/27060#issuecomment-569916413
 
 
   Can one of the admins verify this patch?


[GitHub] [spark] 012huang closed pull request #27060: [SPARK-30246][CORE][SHUFFLE]fix spark external shuffle memory leak

Posted by GitBox <gi...@apache.org>.
012huang closed pull request #27060: [SPARK-30246][CORE][SHUFFLE]fix spark external shuffle memory leak
URL: https://github.com/apache/spark/pull/27060
 
 
   
