Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:16:11 UTC
[jira] [Resolved] (SPARK-19273) Stage is not retried when shuffle file is lost
[ https://issues.apache.org/jira/browse/SPARK-19273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-19273.
----------------------------------
Resolution: Incomplete
> Stage is not retried when shuffle file is lost
> ---------------------------------------------
>
> Key: SPARK-19273
> URL: https://issues.apache.org/jira/browse/SPARK-19273
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: xukun
> Priority: Major
> Labels: bulk-closed
>
> Execute "insert into table select * from a join b on a.xx = b.xx". During the shuffle stage, we delete a shuffle file; the shuffle stage is not retried, and the job fails because the task fails 4 times.
> Details:
> First, create two external tables using the TPC-DS schema:
> {quote}
> create external table date_dim
> (
> d_date_sk int,
> d_date_id string,
> d_date string,
> d_month_seq int,
> d_week_seq int,
> d_quarter_seq int,
> d_year int,
> d_dow int,
> d_moy int,
> d_dom int,
> d_qoy int,
> d_fy_year int,
> d_fy_quarter_seq int,
> d_fy_week_seq int,
> d_day_name string,
> d_quarter_name string,
> d_holiday string,
> d_weekend string,
> d_following_holiday string,
> d_first_dom int,
> d_last_dom int,
> d_same_day_ly int,
> d_same_day_lq int,
> d_current_day string,
> d_current_week string,
> d_current_month string,
> d_current_quarter string,
> d_current_year string
> )
> row format delimited fields terminated by '|'
> location 'path1';
> create external table web_sales
> (
> ws_sold_date_sk int,
> ws_sold_time_sk int,
> ws_ship_date_sk int,
> ws_item_sk int,
> ws_bill_customer_sk int,
> ws_bill_cdemo_sk int,
> ws_bill_hdemo_sk int,
> ws_bill_addr_sk int,
> ws_ship_customer_sk int,
> ws_ship_cdemo_sk int,
> ws_ship_hdemo_sk int,
> ws_ship_addr_sk int,
> ws_web_page_sk int,
> ws_web_site_sk int,
> ws_ship_mode_sk int,
> ws_warehouse_sk int,
> ws_promo_sk int,
> ws_order_number int,
> ws_quantity int,
> ws_wholesale_cost float,
> ws_list_price float,
> ws_sales_price float,
> ws_ext_discount_amt float,
> ws_ext_sales_price float,
> ws_ext_wholesale_cost float,
> ws_ext_list_price float,
> ws_ext_tax float,
> ws_coupon_amt float,
> ws_ext_ship_cost float,
> ws_net_paid float,
> ws_net_paid_inc_tax float,
> ws_net_paid_inc_ship float,
> ws_net_paid_inc_ship_tax float,
> ws_net_profit float
> )
> row format delimited fields terminated by '|'
> location 'path2';
> {quote}
> Then execute SQL like this:
> {quote}
> create table web_sales1
> (
> ws_sold_date_sk int,
> ws_sold_time_sk int,
> ws_ship_date_sk int,
> ws_item_sk int,
> ws_bill_customer_sk int,
> ws_bill_cdemo_sk int,
> ws_bill_hdemo_sk int,
> ws_bill_addr_sk int,
> ws_ship_customer_sk int,
> ws_ship_cdemo_sk int,
> ws_ship_hdemo_sk int,
> ws_ship_addr_sk int,
> ws_web_page_sk int,
> ws_web_site_sk int,
> ws_ship_mode_sk int,
> ws_warehouse_sk int,
> ws_promo_sk int,
> ws_order_number int,
> ws_quantity int,
> ws_wholesale_cost float,
> ws_list_price float,
> ws_sales_price float,
> ws_ext_discount_amt float,
> ws_ext_sales_price float,
> ws_ext_wholesale_cost float,
> ws_ext_list_price float,
> ws_ext_tax float,
> ws_coupon_amt float,
> ws_ext_ship_cost float,
> ws_net_paid float,
> ws_net_paid_inc_tax float,
> ws_net_paid_inc_ship float,
> ws_net_paid_inc_ship_tax float,
> ws_net_profit float
> )
> partitioned by (ws_sold_date string)
> stored as parquet;
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.exec.max.dynamic.partitions=100;
> set spark.sql.autoBroadcastJoinThreshold = 1;
> insert overwrite table web_sales1 partition (ws_sold_date)
> select
> ws.ws_sold_date_sk,
> ws.ws_sold_time_sk,
> ws.ws_ship_date_sk,
> ws.ws_item_sk,
> ws.ws_bill_customer_sk,
> ws.ws_bill_cdemo_sk,
> ws.ws_bill_hdemo_sk,
> ws.ws_bill_addr_sk,
> ws.ws_ship_customer_sk,
> ws.ws_ship_cdemo_sk,
> ws.ws_ship_hdemo_sk,
> ws.ws_ship_addr_sk,
> ws.ws_web_page_sk,
> ws.ws_web_site_sk,
> ws.ws_ship_mode_sk,
> ws.ws_warehouse_sk,
> ws.ws_promo_sk,
> ws.ws_order_number,
> ws.ws_quantity,
> ws.ws_wholesale_cost,
> ws.ws_list_price,
> ws.ws_sales_price,
> ws.ws_ext_discount_amt,
> ws.ws_ext_sales_price,
> ws.ws_ext_wholesale_cost,
> ws.ws_ext_list_price,
> ws.ws_ext_tax,
> ws.ws_coupon_amt,
> ws.ws_ext_ship_cost,
> ws.ws_net_paid,
> ws.ws_net_paid_inc_tax,
> ws.ws_net_paid_inc_ship,
> ws.ws_net_paid_inc_ship_tax,
> ws.ws_net_profit,
> dd.d_date as ws_sold_date
> from tpcds_text.web_sales ws
> join tpcds_text.date_dim dd
> on (ws.ws_sold_date_sk = dd.d_date_sk);
> {quote}
> After the map stage completes, delete an executor's shuffle file; the job then fails. The log is:
> {quote}
> 17/01/18 16:28:49 INFO TaskSetManager: Starting task 37.1 in stage 8.0 (TID 52, xk3, executor 6, partition 37, NODE_LOCAL, 7416 bytes)
> 17/01/18 16:28:49 INFO TaskSetManager: Lost task 38.0 in stage 8.0 (TID 49) on xk3, executor 6: org.apache.spark.SparkException (Task failed while writing rows.) [duplicate 1]
> 17/01/18 16:28:49 INFO TaskSetManager: Starting task 38.1 in stage 8.0 (TID 53, xk3, executor 2, partition 38, NODE_LOCAL, 7416 bytes)
> 17/01/18 16:28:49 INFO TaskSetManager: Lost task 39.0 in stage 8.0 (TID 50) on xk3, executor 2: org.apache.spark.SparkException (Task failed while writing rows.) [duplicate 2]
> 17/01/18 16:28:49 INFO TaskSetManager: Starting task 39.1 in stage 8.0 (TID 54, xk3, executor 3, partition 39, NODE_LOCAL, 7416 bytes)
> 17/01/18 16:28:49 INFO TaskSetManager: Lost task 40.0 in stage 8.0 (TID 51) on xk3, executor 3: org.apache.spark.SparkException (Task failed while writing rows.) [duplicate 3]
> 17/01/18 16:28:49 INFO TaskSetManager: Starting task 40.1 in stage 8.0 (TID 55, xk3, executor 6, partition 40, NODE_LOCAL, 7416 bytes)
> 17/01/18 16:28:49 INFO TaskSetManager: Lost task 37.1 in stage 8.0 (TID 52) on xk3, executor 6: org.apache.spark.SparkException (Task failed while writing rows.) [duplicate 4]
> 17/01/18 16:28:49 INFO TaskSetManager: Starting task 37.2 in stage 8.0 (TID 56, xk2, executor 4, partition 37, NODE_LOCAL, 7416 bytes)
> 17/01/18 16:28:49 INFO TaskSetManager: Finished task 34.0 in stage 8.0 (TID 45) in 1209 ms on xk2 (executor 4) (35/200)
> 17/01/18 16:28:49 INFO TaskSetManager: Starting task 41.0 in stage 8.0 (TID 57, xk3, executor 3, partition 41, NODE_LOCAL, 7416 bytes)
> 17/01/18 16:28:49 INFO TaskSetManager: Lost task 39.1 in stage 8.0 (TID 54) on xk3, executor 3: org.apache.spark.SparkException (Task failed while writing rows.) [duplicate 5]
> 17/01/18 16:28:49 INFO TaskSetManager: Starting task 39.2 in stage 8.0 (TID 58, xk3, executor 2, partition 39, NODE_LOCAL, 7416 bytes)
> 17/01/18 16:28:49 INFO TaskSetManager: Lost task 38.1 in stage 8.0 (TID 53) on xk3, executor 2: org.apache.spark.SparkException (Task failed while writing rows.) [duplicate 6]
> 17/01/18 16:28:49 INFO TaskSetManager: Starting task 38.2 in stage 8.0 (TID 59, xk2, executor 1, partition 38, NODE_LOCAL, 7416 bytes)
> 17/01/18 16:28:49 INFO TaskSetManager: Finished task 30.0 in stage 8.0 (TID 41) in 2224 ms on xk2 (executor 1) (36/200)
> 17/01/18 16:28:49 INFO TaskSetManager: Starting task 42.0 in stage 8.0 (TID 60, xk2, executor 4, partition 42, NODE_LOCAL, 7416 bytes)
> 17/01/18 16:28:49 INFO TaskSetManager: Lost task 37.2 in stage 8.0 (TID 56) on xk2, executor 4: org.apache.spark.SparkException (Task failed while writing rows.) [duplicate 7]
> 17/01/18 16:28:49 INFO TaskSetManager: Starting task 37.3 in stage 8.0 (TID 61, xk3, executor 3, partition 37, NODE_LOCAL, 7416 bytes)
> 17/01/18 16:28:49 INFO TaskSetManager: Lost task 41.0 in stage 8.0 (TID 57) on xk3, executor 3: org.apache.spark.SparkException (Task failed while writing rows.) [duplicate 8]
> 17/01/18 16:28:49 INFO TaskSetManager: Starting task 41.1 in stage 8.0 (TID 62, xk3, executor 6, partition 41, NODE_LOCAL, 7416 bytes)
> 17/01/18 16:28:49 INFO TaskSetManager: Lost task 40.1 in stage 8.0 (TID 55) on xk3, executor 6: org.apache.spark.SparkException (Task failed while writing rows.) [duplicate 9]
> 17/01/18 16:28:49 INFO TaskSetManager: Starting task 40.2 in stage 8.0 (TID 63, xk2, executor 4, partition 40, NODE_LOCAL, 7416 bytes)
> 17/01/18 16:28:49 INFO TaskSetManager: Lost task 42.0 in stage 8.0 (TID 60) on xk2, executor 4: org.apache.spark.SparkException (Task failed while writing rows.) [duplicate 10]
> 17/01/18 16:28:50 INFO TaskSetManager: Starting task 42.1 in stage 8.0 (TID 64, xk2, executor 1, partition 42, NODE_LOCAL, 7416 bytes)
> 17/01/18 16:28:50 INFO TaskSetManager: Lost task 38.2 in stage 8.0 (TID 59) on xk2, executor 1: org.apache.spark.SparkException (Task failed while writing rows.) [duplicate 11]
> 17/01/18 16:28:50 INFO TaskSetManager: Starting task 38.3 in stage 8.0 (TID 65, xk3, executor 6, partition 38, NODE_LOCAL, 7416 bytes)
> 17/01/18 16:28:50 INFO TaskSetManager: Lost task 41.1 in stage 8.0 (TID 62) on xk3, executor 6: org.apache.spark.SparkException (Task failed while writing rows.) [duplicate 12]
> 17/01/18 16:28:50 INFO TaskSetManager: Starting task 41.2 in stage 8.0 (TID 66, xk3, executor 2, partition 41, NODE_LOCAL, 7416 bytes)
> 17/01/18 16:28:50 INFO TaskSetManager: Lost task 39.2 in stage 8.0 (TID 58) on xk3, executor 2: org.apache.spark.SparkException (Task failed while writing rows.) [duplicate 13]
> 17/01/18 16:28:50 INFO TaskSetManager: Starting task 39.3 in stage 8.0 (TID 67, xk3, executor 3, partition 39, NODE_LOCAL, 7416 bytes)
> 17/01/18 16:28:50 INFO TaskSetManager: Lost task 37.3 in stage 8.0 (TID 61) on xk3, executor 3: org.apache.spark.SparkException (Task failed while writing rows.) [duplicate 14]
> 17/01/18 16:28:50 ERROR TaskSetManager: Task 37 in stage 8.0 failed 4 times; aborting job
> 17/01/18 16:28:50 INFO YarnScheduler: Cancelling stage 8
> 17/01/18 16:28:50 INFO YarnScheduler: Stage 8 was cancelled
> 17/01/18 16:28:50 INFO DAGScheduler: ResultStage 8 (processCmd at CliDriver.java:377) failed in 13.292 s due to Job aborted due to stage failure: Task 37 in stage 8.0 failed 4 times, most recent failure: Lost task 37.3 in stage 8.0 (TID 61, xk3, executor 3): org.apache.spark.SparkException: Task failed while writing rows.
> at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:328)
> at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:210)
> at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:210)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: java.io.FileNotFoundException: /srv/BigData/hadoop/data1/nm/localdir/usercache/super/appcache/application_1484570747988_0012/blockmgr-3c7cc8f2-a11e-4fd4-b671-b8c7db6132ca/32/shuffle_1_1_0.index (No such file or directory)
> {quote}
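The key symptom in the log above is a SparkException ("Task failed while writing rows.") whose *cause* is a FetchFailedException. The following is a minimal sketch, in hypothetical Python (not Spark source code; the names `classify_failure` and the return labels are illustrative assumptions), of how a fetch failure hidden behind a wrapper exception can be counted as an ordinary task failure instead of triggering a map-stage retry:

```python
# Hypothetical model of task-failure classification. In Spark, a fetch
# failure should resubmit the map stage (to regenerate lost shuffle files),
# while ordinary task failures count toward spark.task.maxFailures (default 4).

class FetchFailedException(Exception):
    """Stands in for org.apache.spark.shuffle.FetchFailedException."""

class SparkException(Exception):
    """Stands in for org.apache.spark.SparkException."""

def classify_failure(exc):
    """Naive: inspects only the top-level exception, mirroring the
    reported behavior -- the wrapped fetch failure goes unnoticed."""
    if isinstance(exc, FetchFailedException):
        return "RESUBMIT_MAP_STAGE"
    return "COUNT_TASK_FAILURE"   # 4 of these abort the job

def classify_failure_fixed(exc):
    """Walks the cause chain, so a wrapped fetch failure still
    triggers a map-stage retry."""
    while exc is not None:
        if isinstance(exc, FetchFailedException):
            return "RESUBMIT_MAP_STAGE"
        exc = exc.__cause__
    return "COUNT_TASK_FAILURE"

# Reconstruct the shape of the failure seen in the log:
fetch = FetchFailedException("shuffle_1_1_0.index (No such file or directory)")
wrapped = SparkException("Task failed while writing rows.")
wrapped.__cause__ = fetch
```

With this model, the naive classifier returns "COUNT_TASK_FAILURE" for the wrapped exception, matching the observed job abort after 4 attempts, while the cause-chain-walking variant would return "RESUBMIT_MAP_STAGE".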
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org