Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/10/04 07:48:00 UTC

[jira] [Assigned] (SPARK-40647) DAGScheduler should fail job until all related running tasks have been killed

     [ https://issues.apache.org/jira/browse/SPARK-40647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40647:
------------------------------------

    Assignee: Apache Spark

> DAGScheduler should fail job until all related running tasks have been killed
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-40647
>                 URL: https://issues.apache.org/jira/browse/SPARK-40647
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.2.2
>            Reporter: Wechar
>            Assignee: Apache Spark
>            Priority: Major
>
> *Issue Description*
> Sometimes the staging directory under the table location is not removed when a {{CTAS}} statement fails.
> This is troublesome when the new table is a managed table and we want to recreate it.
> *Root Cause*
> The SchedulerBackend kills tasks via the {{KillTask}} message, which is asynchronous, so a job may already be marked as failed while its tasks are still running and creating temporary files. Even though those running tasks eventually fail and delete the files they generated, the temporary staging directory is left behind.
> *Solution*
> Before failing a job, the DAGScheduler should make sure that all related running tasks have been killed.
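> The race and the proposed wait can be sketched in plain shell (a hypothetical illustration only, not Spark code): a background "task" keeps writing a staging file, the parent declares failure after sending only a fire-and-forget stop signal, and only an explicit wait guarantees the task is really gone.

```shell
# Hypothetical illustration of the async-kill race (not Spark code).
dir=$(mktemp -d)
# Background "task": keeps writing into the staging dir until it sees
# the stop flag, like a task that only dies after KillTask arrives.
( while [ ! -e "$dir/stop" ]; do : > "$dir/.hive-staging.tmp"; done ) &
task=$!

: > "$dir/stop"   # asynchronous "KillTask": a fire-and-forget signal
# The job is now "failed", but the task may not have stopped yet:
if kill -0 "$task" 2>/dev/null; then
  echo "task still observable right after failure"
fi

wait "$task"      # the proposed fix: block until the task is really dead
if ! kill -0 "$task" 2>/dev/null; then
  echo "task is gone after waiting"
fi
rm -rf "$dir"
```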
> *How to Reproduce*
> Step 1: create a source table and insert data so that the file count on HDFS exceeds 20
> {code:sql}
> -- create source table
> CREATE TABLE IF NOT EXISTS default.test_wechar
> (name string)
> PARTITIONED BY (grass_date date)
> STORED AS PARQUET;
> -- insert data 24 times
> insert into default.test_wechar partition (grass_date='2022-09-03')
> select uuid()
> from (select 1) t
> lateral view explode(sequence(1,2000)) as temp_view;
> {code}
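> The "insert data 24 times" comment above can be scripted with a simple loop; {{SQL_CMD}} below is a hypothetical stand-in for however you submit the statement (for example {{spark-sql -f insert.sql}}), defaulting to a harmless dry-run echo:

```shell
# Re-run the INSERT 24 times so the partition accumulates >20 files on HDFS.
# SQL_CMD is a placeholder for your SQL runner (e.g. "spark-sql -f insert.sql");
# the echo default makes this loop a dry run.
SQL_CMD=${SQL_CMD:-"echo spark-sql -f insert.sql"}
runs=0
for i in $(seq 1 24); do
  $SQL_CMD >/dev/null
  runs=$((runs + 1))
done
echo "executed $runs inserts"
```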
> Step 2: create a new path for the new table and set its name quota to 20
> {code:bash}
> $hadoop fs -count -q hdfs://Test-DMP01/user/weiqiang.yu/tmp/test_wechar_tmp
>           20              19            none             inf            1            0                  0 hdfs://Test-DMP01/user/weiqiang.yu/tmp/test_wechar_tmp
> {code}
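> For reference, the 20-name quota shown above can be applied with the standard HDFS quota command (requires HDFS superuser privileges; the path is the example path from this report, so it cannot run outside that cluster):

```shell
# Cap the directory at 20 names (files + directories) so the CTAS job
# fails partway through once it tries to create more output files.
hdfs dfsadmin -setQuota 20 hdfs://Test-DMP01/user/weiqiang.yu/tmp/test_wechar_tmp
```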
> Step 3: create new table from source table
> {code:sql}
> create table if not exists default.test_wechar_tmp 
> location 'hdfs://Test-DMP01/user/weiqiang.yu/tmp/test_wechar_tmp' 
> as select * from default.test_wechar;
> {code}
>  
> Step 4: check the location of the new table after the job has failed
> {code:bash}
> $hadoop fs -ls /user/weiqiang.yu/tmp/test_wechar_tmp/*
> Found 1 items
> drwxrwxr-x   - weiqiang.yu weiqiang.yu          0 2022-10-04 12:56 /user/weiqiang.yu/tmp/test_wechar_tmp/.hive-staging_hive_2022-10-04_12-56-21_545_2745177084386740362-1/-ext-10000
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org