Posted to issues@spark.apache.org by "Huiqiang Liu (JIRA)" <ji...@apache.org> on 2018/04/08 08:39:00 UTC

[jira] [Created] (SPARK-23895) Job continues to run even though some tasks have been failed

Huiqiang Liu created SPARK-23895:
------------------------------------

             Summary: Job continues to run even though some tasks have been failed
                 Key: SPARK-23895
                 URL: https://issues.apache.org/jira/browse/SPARK-23895
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.6.0
         Environment: Ubuntu 14.04.4 LTS

Spark standalone deployment
            Reporter: Huiqiang Liu


We are using com.databricks.spark-redshift to write a DataFrame into a Redshift table. Internally, the library first writes the DataFrame to S3 and then runs a query to load the data from S3 into the final Redshift table. The problem occurred during the S3 write phase: one executor went down due to a JVM memory allocation failure (see the executor log below), yet the whole job was still reported as successful. The library then proceeded to load the incomplete data from S3 into Redshift, which resulted in data loss.
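For reference, a minimal sketch of the write path described above (Spark 1.6 with the databricks spark-redshift data source). The JDBC URL, table name and S3 tempdir below are hypothetical placeholders, not the values from the failing job:

{code:scala}
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch only: all option values are placeholders.
def writeToRedshift(df: DataFrame): Unit = {
  df.write
    .format("com.databricks.spark.redshift")
    // Hypothetical connection details
    .option("url", "jdbc:redshift://example-cluster:5439/dev?user=u&password=p")
    .option("dbtable", "target_table")
    // Intermediate S3 staging directory: the job first writes files here,
    // then the library issues a load query from here into Redshift.
    .option("tempdir", "s3n://example-bucket/spark-redshift-tmp/")
    .mode(SaveMode.Append)
    .save()
}
{code}

In our case the failure happened while the staging files were being written to the tempdir, but the subsequent load step still ran against the partial output.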

The executor log:

{{18/04/01 15:06:25 INFO MemoryStore: Block broadcast_664 stored as values in memory (estimated size 114.6 KB, free 63.3 MB)}}
{{18/04/01 15:06:25 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 11, fetching them}}
{{18/04/01 15:06:25 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@172.19.103.84:43248)}}
{{18/04/01 15:06:25 INFO MapOutputTrackerWorker: Got the output locations}}
{{18/04/01 15:06:25 INFO ShuffleBlockFetcherIterator: Getting 200 non-empty blocks out of 200 blocks}}
{{18/04/01 15:06:25 INFO ShuffleBlockFetcherIterator: Started 15 remote fetches in 4 ms}}
{{18/04/01 15:06:25 INFO DefaultWriterContainer: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter}}
{{Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f823e54e000, 65536, 1) failed; error='Cannot allocate memory' (errno=12)}}

 

The driver log:

{{18/04/01 15:06:39 INFO DAGScheduler: ShuffleMapStage 29 (mapPartitions at RedshiftWriter.scala:237) finished in 1.579 s}}
{{18/04/01 15:06:39 INFO DAGScheduler: looking for newly runnable stages}}
{{18/04/01 15:06:39 INFO DAGScheduler: running: Set()}}
{{18/04/01 15:06:39 INFO DAGScheduler: waiting: Set(ResultStage 30)}}
{{18/04/01 15:06:39 INFO DAGScheduler: failed: Set()}}
{{18/04/01 15:06:39 INFO DAGScheduler: Submitting ResultStage 30 (MapPartitionsRDD[749] at createDataFrame at RedshiftWriter.scala:275), which has no missing parents}}
{{18/04/01 15:06:39 INFO MemoryStore: Block broadcast_667 stored as values in memory (estimated size 114.6 KB, free 160.1 MB)}}
{{18/04/01 15:06:39 INFO MemoryStore: Block broadcast_667_piece0 stored as bytes in memory (estimated size 44.1 KB, free 160.2 MB)}}
{{18/04/01 15:06:39 INFO BlockManagerInfo: Added broadcast_667_piece0 in memory on 172.19.103.84:18128 (size: 44.1 KB, free: 492.3 MB)}}
{{18/04/01 15:06:39 INFO SparkContext: Created broadcast 667 from broadcast at DAGScheduler.scala:1006}}
{{18/04/01 15:06:39 INFO DAGScheduler: Submitting 5 missing tasks from ResultStage 30 (MapPartitionsRDD[749] at createDataFrame at RedshiftWriter.scala:275)}}
{{18/04/01 15:06:39 INFO TaskSchedulerImpl: Adding task set 30.1 with 5 tasks}}
{{18/04/01 15:06:39 INFO TaskSetManager: Starting task 0.0 in stage 30.1 (TID 5529, ip-172-19-103-87.ec2.internal, partition 3,PROCESS_LOCAL, 2061 bytes)}}
{{18/04/01 15:06:39 INFO TaskSetManager: Starting task 1.0 in stage 30.1 (TID 5530, ip-172-19-105-221.ec2.internal, partition 6,PROCESS_LOCAL, 2061 bytes)}}
{{18/04/01 15:06:39 INFO TaskSetManager: Starting task 2.0 in stage 30.1 (TID 5531, ip-172-19-101-76.ec2.internal, partition 11,PROCESS_LOCAL, 2061 bytes)}}
{{18/04/01 15:06:39 INFO TaskSetManager: Starting task 3.0 in stage 30.1 (TID 5532, ip-172-19-103-87.ec2.internal, partition 13,PROCESS_LOCAL, 2061 bytes)}}
{{18/04/01 15:06:39 INFO TaskSetManager: Starting task 4.0 in stage 30.1 (TID 5533, ip-172-19-105-117.ec2.internal, partition 14,PROCESS_LOCAL, 2061 bytes)}}
{{18/04/01 15:06:39 INFO BlockManagerInfo: Added broadcast_667_piece0 in memory on ip-172-19-101-76.ec2.internal:16864 (size: 44.1 KB, free: 1928.8 MB)}}
{{18/04/01 15:06:39 INFO BlockManagerInfo: Added broadcast_667_piece0 in memory on ip-172-19-103-87.ec2.internal:62681 (size: 44.1 KB, free: 1929.9 MB)}}
{{18/04/01 15:06:39 INFO BlockManagerInfo: Added broadcast_667_piece0 in memory on ip-172-19-105-221.ec2.internal:52999 (size: 44.1 KB, free: 1937.7 MB)}}
{{18/04/01 15:06:39 INFO BlockManagerInfo: Added broadcast_667_piece0 in memory on ip-172-19-105-117.ec2.internal:45766 (size: 44.1 KB, free: 1929.6 MB)}}
{{18/04/01 15:06:39 INFO BlockManagerInfo: Added broadcast_667_piece0 in memory on ip-172-19-103-87.ec2.internal:13372 (size: 44.1 KB, free: 1974.1 MB)}}
{{18/04/01 15:06:39 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 11 to ip-172-19-101-76.ec2.internal:44931}}
{{18/04/01 15:06:39 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 11 is 1229 bytes}}
{{18/04/01 15:06:39 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 11 to ip-172-19-105-221.ec2.internal:16426}}
{{18/04/01 15:06:39 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 11 to ip-172-19-103-87.ec2.internal:36114}}
{{18/04/01 15:06:39 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 11 to ip-172-19-105-117.ec2.internal:13679}}
{{18/04/01 15:06:39 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 11 to ip-172-19-103-87.ec2.internal:36668}}
{{18/04/01 15:06:43 WARN TaskSetManager: Lost task 6.0 in stage 30.0 (TID 5470, ip-172-19-102-6.ec2.internal): org.apache.spark.SparkException: Task failed while writing rows.}}
{{        at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:272)}}
{{        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)}}
{{        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)}}
{{        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)}}
{{        at org.apache.spark.scheduler.Task.run(Task.scala:89)}}
{{        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)}}
{{        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)}}
{{        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)}}
{{        at java.lang.Thread.run(Thread.java:748)}}
{{Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect to ip-172-19-109-92.ec2.internal/172.19.109.92:22910}}
{{        at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)}}
{{        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)}}
{{        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)}}
{{        at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)}}
{{        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)}}
{{        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)}}
{{        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)}}
{{        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)}}
{{        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)}}
{{        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)}}
{{        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)}}
{{        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)}}
{{        at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:262)}}
{{        ... 8 more}}
{{Caused by: java.io.IOException: Failed to connect to ip-172-19-109-92.ec2.internal/172.19.109.92:22910}}
{{        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)}}
{{        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)}}
{{        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)}}
{{        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)}}
{{        at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)}}
{{        at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)}}
{{        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)}}
{{        at java.util.concurrent.FutureTask.run(FutureTask.java:266)}}
{{        ... 3 more}}
{{Caused by: java.net.ConnectException: Connection refused: ip-172-19-109-92.ec2.internal/172.19.109.92:22910}}
{{        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)}}
{{        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)}}
{{        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)}}
{{        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)}}
{{        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)}}
{{        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)}}
{{        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)}}
{{        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)}}
{{        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)}}
{{        ... 1 more}}

{{18/04/01 15:06:43 INFO TaskSetManager: Lost task 11.0 in stage 30.0 (TID 5475) on executor ip-172-19-100-11.ec2.internal: org.apache.spark.SparkException (Task failed while writing rows.) [duplicate 1]}}
{{18/04/01 15:06:43 INFO TaskSetManager: Finished task 0.0 in stage 30.1 (TID 5529) in 3999 ms on ip-172-19-103-87.ec2.internal (1/5)}}
{{18/04/01 15:06:44 INFO TaskSetManager: Lost task 3.0 in stage 30.0 (TID 5467) on executor ip-172-19-107-199.ec2.internal: org.apache.spark.SparkException (Task failed while writing rows.) [duplicate 2]}}
{{18/04/01 15:06:44 INFO TaskSetManager: Finished task 4.0 in stage 30.1 (TID 5533) in 4322 ms on ip-172-19-105-117.ec2.internal (2/5)}}
{{18/04/01 15:06:44 INFO TaskSetManager: Lost task 14.0 in stage 30.0 (TID 5478) on executor ip-172-19-106-52.ec2.internal: org.apache.spark.SparkException (Task failed while writing rows.) [duplicate 3]}}
{{18/04/01 15:06:44 INFO TaskSchedulerImpl: Removed TaskSet 30.0, whose tasks have all completed, from pool}}
{{18/04/01 15:06:44 INFO TaskSetManager: Finished task 2.0 in stage 30.1 (TID 5531) in 4742 ms on ip-172-19-101-76.ec2.internal (3/5)}}
{{18/04/01 15:06:44 INFO TaskSetManager: Finished task 1.0 in stage 30.1 (TID 5530) in 4866 ms on ip-172-19-105-221.ec2.internal (4/5)}}
{{18/04/01 15:06:46 INFO TaskSetManager: Finished task 3.0 in stage 30.1 (TID 5532) in 6694 ms on ip-172-19-103-87.ec2.internal (5/5)}}
{{18/04/01 15:06:46 INFO TaskSchedulerImpl: Removed TaskSet 30.1, whose tasks have all completed, from pool}}
{{18/04/01 15:06:46 INFO DAGScheduler: ResultStage 30 (save at RedshiftWriter.scala:278) finished in 6.694 s}}
{{18/04/01 15:06:46 INFO DAGScheduler: Job 12 finished: save at RedshiftWriter.scala:278, took 33.277686 s}}
{{18/04/01 15:07:10 INFO DefaultWriterContainer: Job job_201804011506_0000 committed.}}


