Posted to user@spark.apache.org by 阎志涛 <to...@tendcloud.com> on 2018/10/07 12:24:12 UTC
Executor hang
Hi, All,
I am running Spark 2.1 on Hadoop 2.7.2 with YARN. While executing Spark tasks, some executors keep running forever without finishing. From the following screenshot:
[screenshot: image002.jpg, executor metrics table]
We can see that executor 4 has been running for 26 minutes, and its shuffle read size/records have stayed unchanged for those 26 minutes as well. The thread dump for the thread is as follows:
[screenshots: image004.jpg and image009.jpg, thread dump]
The Linux version is 4.14.62-70.117.amzn2.x86_64 (mockbuild@ip-10-0-1-79) and the JDK version is Oracle JDK 1.8.0_181. Running jstack on the machine, I can see the following thread dump:
"Executor task launch worker for task 3806" #54 daemon prio=5 os_prio=0 tid=0x0000000001230800 nid=0x1fc runnable [0x00007fba0e600000]
java.lang.Thread.State: RUNNABLE
at java.lang.StringCoding.encode(StringCoding.java:364)
at java.lang.String.getBytes(String.java:941)
at org.apache.spark.unsafe.types.UTF8String.fromString(UTF8String.java:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
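For anyone trying to reproduce this kind of diagnosis: the dump above came from jstack, and the nid field is the OS-level thread id in hex. A small sketch of the cross-referencing step (the jstack/jps commands are shown in comments only, since they need a running JVM):

```shell
# The dump was taken with `jstack <executor-jvm-pid>`; on a YARN node the
# executor pid can usually be found with:
#   jps -m | grep CoarseGrainedExecutorBackend
# In the dump, nid=0x1fc is the OS thread id in hex. Converting it to
# decimal lets you match the hot thread against `top -H -p <pid>` output:
printf '%d\n' 0x1fc
```

This prints 508, which is the thread to look for in per-thread CPU listings.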
I wonder why this happens. Is it related to my environment, or is it a bug in Spark?
Thanks and Regards,
Tony
阎志涛 (Zhitao Yan)
VP of R&D
M: +86 139 1181 5695
WeChat: zhitao_yan
Beijing TendCloud Technology Co., Ltd.
Room 608, Building 2, Yard 39, Dongzhimenwai Street, Beijing, 100027
TalkingData.com
Re: Re: Executor hang
Posted by 阎志涛 <to...@tendcloud.com>.
Yes, the problem was worked around by setting spark.sql.codegen.wholeStage=false.
Thanks and Regards,
Tony
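A minimal sketch of how that config can be applied (the class and jar names below are placeholders, not from this thread):

```shell
# Disable whole-stage code generation for a single job at submit time:
spark-submit \
  --conf spark.sql.codegen.wholeStage=false \
  --class com.example.MyJob my-job.jar

# Alternatively, set it cluster-wide in conf/spark-defaults.conf:
#   spark.sql.codegen.wholeStage   false
```

It can also be set per session with spark.conf.set("spark.sql.codegen.wholeStage", "false") before running the query.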
From: kathleen li <ka...@gmail.com>
Sent: October 8, 2018, 12:15
To: 阎志涛 <to...@tendcloud.com>
Cc: user@spark.apache.org
Subject: Re: Re: Executor hang
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-whole-stage-codegen.html#spark.sql.codegen.wholeStage
Sent from my iPhone
Re: Re: Executor hang
Posted by kathleen li <ka...@gmail.com>.
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-whole-stage-codegen.html#spark.sql.codegen.wholeStage
Sent from my iPhone
Re: Executor hang
Posted by 阎志涛 <to...@tendcloud.com>.
It is not a data skew issue. The executor ran for more than 8 hours and I had to kill it. I suspect the problem is caused by Spark codegen, but it seems spark.sql.codegen.wholeStage=false does not work. I hit another problem with codegen, which caused my Spark job to fail with the following stack trace:
18/10/08 01:58:36 WARN TaskSetManager: Lost task 1.0 in stage 7.0 (TID 1912, data1.aidata360.com, executor 4): java.util.NoSuchElementException
at java.util.LinkedList.removeFirst(LinkedList.java:270)
at org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.hasNext(BytesToBytesMap.java:310)
at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap$1.next(UnsafeFixedWidthAggregationMap.java:177)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.Dataset$$anonfun$56$$anon$1.hasNext(Dataset.scala:2712)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
How can I disable whole-stage codegen?
Thanks and Regards,
Tony
From: kathleen li <ka...@gmail.com>
Sent: October 8, 2018, 1:04
To: 阎志涛 <to...@tendcloud.com>
Cc: user@spark.apache.org
Subject: Re: Executor hang
It seems you have a data skew issue, since the shuffle read size for executor 4 is almost 2x that of the other executors, and its GC time of 11s is almost 15 to 20 times that of the others.
Kathleen
Sent from my iPhone
Re: Executor hang
Posted by kathleen li <ka...@gmail.com>.
It seems you have a data skew issue, since the shuffle read size for executor 4 is almost 2x that of the other executors, and its GC time of 11s is almost 15 to 20 times that of the others.
Kathleen
Sent from my iPhone
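The "almost 2x" observation can be checked mechanically against the executor table. A small sketch with made-up shuffle-read sizes (not the actual values from the thread's screenshot), flagging executors that read more than 1.5x the average:

```shell
# Illustrative per-executor shuffle-read sizes in MB (hypothetical numbers):
printf 'exec1 210\nexec2 198\nexec3 205\nexec4 410\n' |
awk '{ total += $2; n++; size[$1] = $2 }
     END { avg = total / n
           for (e in size) if (size[e] > 1.5 * avg) print e, size[e] }'
```

With these numbers, only exec4 is flagged, matching the kind of imbalance described above.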