Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2021/11/29 14:28:00 UTC

[jira] [Commented] (SPARK-37488) With enough resources, the task may still be permanently pending

    [ https://issues.apache.org/jira/browse/SPARK-37488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450492#comment-17450492 ] 

Apache Spark commented on SPARK-37488:
--------------------------------------

User 'guiyanakuang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34743

> With enough resources, the task may still be permanently pending
> ----------------------------------------------------------------
>
>                 Key: SPARK-37488
>                 URL: https://issues.apache.org/jira/browse/SPARK-37488
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 3.0.3, 3.1.2, 3.2.0
>         Environment: Spark 3.1.2,Default Configuration
>            Reporter: Yiqun Zhang
>            Priority: Major
>
> {code:java}
> // In the production environment this job imports Hive partition data into TiDB;
> // the logic can be simplified as follows:
>     SparkSession testApp = SparkSession.builder()
>         .master("local[*]")
>         .appName("test app")
>         .enableHiveSupport()
>         .getOrCreate();
>     Dataset<Row> dataset = testApp.sql("select * from default.test where dt = '20211129'");
>     dataset.persist(StorageLevel.MEMORY_AND_DISK());
>     dataset.count();
> {code}
> I have observed that tasks become permanently blocked, and rerunning the job reliably reproduces the issue.
> Since it is only reproducible in the online environment, I used arthas at runtime to inspect the arguments and return values of the functions inside TaskSetManager.
> https://gist.github.com/guiyanakuang/431584f191645513552a937d16ae8fbd
> At the NODE_LOCAL level, because persist was called, pendingTasks.forHost holds a set of pending tasks, but they point to the machine where the cached partition blocks live. Since the only resource Spark has acquired is the driver, those tasks can never be scheduled there. getAllowedLocalityLevel therefore returns the wrong locality level, so the tasks cannot be run at TaskLocality.ANY.
> The tasks stay pending permanently because each scheduling round is very short, so the timeout that would raise the locality level is never reached.
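> As a workaround sketch (an assumption on my part, not a fix for the underlying scheduler bug): since the task set is stuck waiting for a NODE_LOCAL slot on a host where Spark has no executor, disabling the locality wait should make the scheduler fall through to TaskLocality.ANY immediately. The config key spark.locality.wait is standard; the rest mirrors the reproduction above.
> {code:java}
> // Hypothetical workaround sketch: skip the NODE_LOCAL/RACK_LOCAL wait entirely,
> // so tasks whose preferred host has no executor are scheduled at ANY right away.
> SparkSession testApp = SparkSession.builder()
>     .master("local[*]")
>     .appName("test app")
>     .config("spark.locality.wait", "0s")  // default is 3s; 0s disables the wait
>     .enableHiveSupport()
>     .getOrCreate();
> {code}
> This only sidesteps the symptom by trading locality for progress; the wrong level returned by getAllowedLocalityLevel still needs a proper fix.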



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org