Posted to issues@spark.apache.org by "Chandni Singh (Jira)" <ji...@apache.org> on 2022/06/30 21:07:00 UTC

[jira] [Created] (SPARK-39647) Block push fails with java.lang.IllegalArgumentException: Active local dirs list has not been updated by any executor registration even when NodeManager hasn't been restarted

Chandni Singh created SPARK-39647:
-------------------------------------

             Summary: Block push fails with java.lang.IllegalArgumentException: Active local dirs list has not been updated by any executor registration even when NodeManager hasn't been restarted
                 Key: SPARK-39647
                 URL: https://issues.apache.org/jira/browse/SPARK-39647
             Project: Spark
          Issue Type: Bug
          Components: Shuffle
    Affects Versions: 3.2.0
            Reporter: Chandni Singh


We saw these exceptions during block push:
{code:java}
22/06/24 13:29:14 ERROR RetryingBlockFetcher: Failed to fetch block shuffle_170_568_174, and will not retry (0 retries)
org.apache.spark.network.shuffle.BlockPushException: !application_1653753500486_3193550shuffle_170_568_174java.lang.IllegalArgumentException: Active local dirs list has not been updated by any executor registration
	at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:92)
	at org.apache.spark.network.shuffle.RemoteBlockPushResolver.getActiveLocalDirs(RemoteBlockPushResolver.java:300)
	at org.apache.spark.network.shuffle.RemoteBlockPushResolver.getFile(RemoteBlockPushResolver.java:290)
	at org.apache.spark.network.shuffle.RemoteBlockPushResolver.getMergedShuffleFile(RemoteBlockPushResolver.java:312)
	at org.apache.spark.network.shuffle.RemoteBlockPushResolver.lambda$getOrCreateAppShufflePartitionInfo$1(RemoteBlockPushResolver.java:168)

22/06/24 13:29:14 WARN UnsafeShuffleWriter: Pushing block shuffle_170_568_174 to BlockManagerId(, node-x, 7337, None) failed.
{code}
Note: The NodeManager on node-x (the node on which this exception was seen) was not restarted.

This happened because the executor registers its block manager with {{BlockManagerMaster}} before it registers with the external shuffle service (ESS). In push-based shuffle, the driver selects a block manager as a merger for the shuffle push. However, the ESS on that node can merge pushed blocks only after it has received the metadata about the merged directories from the local executor, which is sent when the local executor registers with the ESS. If this local executor registration is delayed but the ESS host has already been picked as a merger, the ESS fails to merge the blocks pushed to it, which is what happened here.
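The ordering problem described above can be illustrated with a small, self-contained model. This is only a sketch and not Spark code: {{ToyShuffleService}}, {{ToyDriver}}, {{pushSucceeds}} and the application id {{app-1}} are hypothetical names made up for illustration. The only detail taken from the report is the exception message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergerRaceSketch {

    /** Hypothetical stand-in for the shuffle service on one node (not the real ESS). */
    static class ToyShuffleService {
        // Local dirs per application, populated only by executor registration.
        private final Map<String, List<String>> activeLocalDirs = new HashMap<>();

        void registerExecutor(String appId, List<String> localDirs) {
            activeLocalDirs.put(appId, new ArrayList<>(localDirs));
        }

        void pushBlock(String appId, String blockId) {
            // A push can only be merged once the dirs for this app are known.
            if (!activeLocalDirs.containsKey(appId)) {
                throw new IllegalArgumentException(
                    "Active local dirs list has not been updated by any executor registration");
            }
        }
    }

    /** Hypothetical stand-in for the driver's view of registered block managers. */
    static class ToyDriver {
        private final List<String> mergerHosts = new ArrayList<>();

        void registerBlockManager(String host) {
            mergerHosts.add(host);
        }

        boolean isMergerCandidate(String host) {
            return mergerHosts.contains(host);
        }
    }

    /** Returns true if the push is accepted, false if the service rejects it. */
    static boolean pushSucceeds(ToyShuffleService ess, String appId, String blockId) {
        try {
            ess.pushBlock(appId, blockId);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ToyShuffleService ess = new ToyShuffleService();
        ToyDriver driver = new ToyDriver();
        String appId = "app-1"; // hypothetical application id

        // Step 1: the block manager registers first, so the host becomes a merger candidate.
        driver.registerBlockManager("node-x");
        System.out.println("merger candidate: " + driver.isMergerCandidate("node-x"));

        // Step 2: a push arrives before the executor has registered with the shuffle service.
        System.out.println("push before registration ok: "
            + pushSucceeds(ess, appId, "shuffle_170_568_174"));

        // Step 3: the delayed executor registration arrives; later pushes are accepted.
        ess.registerExecutor(appId, Arrays.asList("/grid/b/tmp/yarn/"));
        System.out.println("push after registration ok: "
            + pushSucceeds(ess, appId, "shuffle_170_568_174"));
    }
}
```

In this model every push that lands between step 1 and step 3 is rejected, which mirrors the failures seen in the logs.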

The local executor on node-x is executor 754, and its block manager registration happened at 13:28:11:
{code:java}
22/06/24 13:28:11 INFO ExecutorAllocationManager: New executor 754 has registered (new total is 1200)

22/06/24 13:28:11 INFO BlockManagerMasterEndpoint: Registering block manager node-x:16747 with 2004.6 MB RAM, BlockManagerId(754, node-x, 16747, None)
{code}
The application was registered with the shuffle server on node-x at 13:29:40:
{code:java}
2022-06-24 13:29:40,343 INFO org.apache.spark.network.shuffle.RemoteBlockPushResolver: Updated the active local dirs [/grid/i/tmp/yarn/, /grid/g/tmp/yarn/, /grid/b/tmp/yarn/, /grid/e/tmp/yarn/, /grid/h/tmp/yarn/, /grid/f/tmp/yarn/, /grid/d/tmp/yarn/, /grid/c/tmp/yarn/] for application application_1653753500486_3193550
{code}

node-x was selected as a merger by the driver after 13:28:11, and once the executors started pushing to it, all of those pushes failed until 13:29:40.
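As a rough check of the timeline, the gap between the two registrations can be computed from the log timestamps. This is a sketch only: the class and method names are hypothetical, the timestamps are truncated to whole seconds, and the real failure window is at most this long because node-x was selected as a merger at some point after 13:28:11.

```java
import java.time.Duration;
import java.time.LocalTime;

public class PushFailureWindow {

    /** Length of the interval between two HH:mm:ss timestamps on the same day. */
    static Duration window(String start, String end) {
        return Duration.between(LocalTime.parse(start), LocalTime.parse(end));
    }

    /** True if the given time falls in [start, end). */
    static boolean inWindow(String time, String start, String end) {
        LocalTime t = LocalTime.parse(time);
        return !t.isBefore(LocalTime.parse(start)) && t.isBefore(LocalTime.parse(end));
    }

    public static void main(String[] args) {
        // Block manager registration and shuffle service registration, from the logs above.
        Duration gap = window("13:28:11", "13:29:40");
        System.out.println("gap in seconds: " + gap.getSeconds());

        // The failed push logged at 13:29:14 falls inside that gap.
        System.out.println("failed push inside gap: "
            + inWindow("13:29:14", "13:28:11", "13:29:40"));
    }
}
```

The gap works out to 89 seconds, and the push failure at 13:29:14 lies inside it.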

We can fix this by having the executor register with the ESS before it registers the block manager with the {{BlockManagerMaster}}.
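The proposed ordering can be sketched as follows. This is not the actual Spark change: {{ShuffleServiceClient}}, {{BlockManagerMasterClient}} and {{startExecutor}} are hypothetical names used only to show the order of the two calls, and the sketch assumes the ESS registration completes synchronously before the next call is made.

```java
import java.util.ArrayList;
import java.util.List;

public class RegistrationOrderSketch {

    /** Hypothetical client for the shuffle service on the local node. */
    interface ShuffleServiceClient {
        void registerExecutor(String executorId);
    }

    /** Hypothetical client for the driver-side block manager master. */
    interface BlockManagerMasterClient {
        void registerBlockManager(String executorId);
    }

    /** Proposed startup order: shuffle service first, block manager master second. */
    static void startExecutor(String executorId,
                              ShuffleServiceClient ess,
                              BlockManagerMasterClient master) {
        // The shuffle service learns the local dirs before the host can be seen by the driver.
        ess.registerExecutor(executorId);
        // Only now does the host become visible to the driver as a merger candidate.
        master.registerBlockManager(executorId);
    }

    /** Records the order in which the two registrations are issued. */
    static List<String> demo() {
        List<String> events = new ArrayList<>();
        startExecutor("754",
            id -> events.add("ess:" + id),
            id -> events.add("master:" + id));
        return events;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With this order, by the time the driver can select the host as a merger, the local shuffle service already holds the directory metadata it needs to merge pushed blocks.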


