Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2023/03/14 09:27:00 UTC

[jira] [Assigned] (SPARK-42784) Fix the problem of incomplete creation of subdirectories in push merged localDir

     [ https://issues.apache.org/jira/browse/SPARK-42784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42784:
------------------------------------

    Assignee: Apache Spark

> Fix the problem of incomplete creation of subdirectories in push merged localDir
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-42784
>                 URL: https://issues.apache.org/jira/browse/SPARK-42784
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 3.3.2
>            Reporter: Fencheng Mei
>            Assignee: Apache Spark
>            Priority: Major
>
> After we enabled push-based shuffle widely in our production environment, warning messages began to appear in the server-side logs.
> The warnings look like this:
> ShuffleBlockPusher: Pushing block shufflePush_3_0_5352_935 to BlockManagerId(shuffle-push-merger, zw06-data-hdp-dn08251.mt, 7337, None) failed.
> java.lang.RuntimeException: java.lang.RuntimeException: Cannot initialize merged shuffle partition for appId application_1671244879475_44020960 shuffleId 3 shuffleMergeId 0 reduceId 935.
> After investigation, we identified what triggers the bug.
> The driver requested two different containers on the same physical machine. While the first container (container_1) was creating the 'push-merged' directory, it created mergeDir first and then began creating the sub-directories (subDir), whose number is set by the "spark.diskStore.subDirectories" parameter. However, container_1's resources were preempted while the sub-directories were being created, so only some of them were created. Because mergeDir already existed, the second container (container_2) assumed that all directories had already been created and did not create the missing sub-directories.
>  
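The sequence described in the report can be reproduced with a small, self-contained sketch. This is not Spark's code: the function names, the two-digit hex sub-directory names, the count of 64, and the `fail_after` switch that stands in for container preemption are all assumptions made for illustration. The second function shows one possible remedy (creating every missing sub-directory regardless of whether the parent exists), which may differ from the fix eventually adopted for SPARK-42784.

```python
import os
import tempfile


def create_merge_dirs_naive(merge_dir, sub_dirs, fail_after=None):
    """Hypothetical model of the reported behaviour: if merge_dir already
    exists, assume every sub-directory exists too and do nothing."""
    if os.path.exists(merge_dir):
        return
    os.mkdir(merge_dir)
    for i in range(sub_dirs):
        # fail_after simulates the container being preempted part-way through.
        if fail_after is not None and i >= fail_after:
            raise RuntimeError("container preempted")
        os.mkdir(os.path.join(merge_dir, "%02x" % i))


def create_merge_dirs_idempotent(merge_dir, sub_dirs):
    """One possible remedy (an assumption, not necessarily the actual patch):
    always walk the full list and create whatever is missing."""
    os.makedirs(merge_dir, exist_ok=True)
    for i in range(sub_dirs):
        os.makedirs(os.path.join(merge_dir, "%02x" % i), exist_ok=True)


def simulate(sub_dirs=64, preempt_at=10):
    """Return the sub-directory count after each of the three steps."""
    with tempfile.TemporaryDirectory() as root:
        merge_dir = os.path.join(root, "merge_dir")  # hypothetical name

        # container_1 is preempted after creating only some sub-directories.
        try:
            create_merge_dirs_naive(merge_dir, sub_dirs, fail_after=preempt_at)
        except RuntimeError:
            pass
        after_preempt = len(os.listdir(merge_dir))

        # container_2 sees merge_dir and skips creation entirely.
        create_merge_dirs_naive(merge_dir, sub_dirs)
        after_naive = len(os.listdir(merge_dir))

        # An idempotent pass fills in the missing sub-directories.
        create_merge_dirs_idempotent(merge_dir, sub_dirs)
        after_fix = len(os.listdir(merge_dir))

        return after_preempt, after_naive, after_fix


if __name__ == "__main__":
    print(simulate())
```

In this model the second container leaves the directory tree incomplete, which matches the report's explanation of why later attempts to initialize a merged shuffle partition can fail for reduce IDs that map to a missing sub-directory.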



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
