Posted to issues@kylin.apache.org by "Shao Feng Shi (Jira)" <ji...@apache.org> on 2022/12/16 11:17:00 UTC

[jira] [Updated] (KYLIN-5008) When backend spark was failed, but corresponding job status is shown as finished in WebUI

     [ https://issues.apache.org/jira/browse/KYLIN-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shao Feng Shi updated KYLIN-5008:
---------------------------------
    Summary: When backend spark was failed, but corresponding job status is shown as finished in WebUI   (was: backend spark was failed, but corresponding job status is shown as finished in WebUI )

> When backend spark was failed, but corresponding job status is shown as finished in WebUI 
> ------------------------------------------------------------------------------------------
>
>                 Key: KYLIN-5008
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5008
>             Project: Kylin
>          Issue Type: Bug
>    Affects Versions: v4.0.0-beta
>            Reporter: ZHANGHONGJIA
>            Assignee: Yaqian Zhang
>            Priority: Major
>             Fix For: v4.0.3
>
>         Attachments: image-2021-06-10-16-46-35-919.png, image-2021-06-15-15-27-45-099.png, image-2021-06-15-15-52-10-118.png, image-2021-06-15-15-52-31-635.png, merge-job.log
>
>
> According to the log shown below, the Spark application failed because its container was killed by YARN for exceeding memory limits, yet in the Kylin WebUI the status of the merge job is shown as finished. Besides, the amount of data in the merged segment is about three times the actual data size. It seems that Kylin did not detect the failure of this merge job.
>  
> Here is the merge job log:
> ===============================================================
>  at org.apache.kylin.engine.spark.job.BuildLayoutWithUpdate$1.call(BuildLayoutWithUpdate.java:43)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  ... 3 more
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 244 in stage 1108.0 failed 4 times, most recent failure: Lost task 244.3 in stage 1108.0 (TID 78736, r4200h1-app.travelsky.com, executor 109): ExecutorLostFailure (executor 109 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 39.0 GB of 36 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
> Driver stacktrace:
>  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
>  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
>  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
>  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
>  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
>  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
>  at scala.Option.foreach(Option.scala:257)
>  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
>  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
>  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
>  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
>  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
>  ... 34 more
> }
> RetryInfo{
>  overrideConf : {spark.executor.memory=36618MB, spark.executor.memoryOverhead=7323MB},
>  throwable : java.lang.RuntimeException: Error execute org.apache.kylin.engine.spark.job.CubeMergeJob
>  at org.apache.kylin.engine.spark.application.SparkApplication.execute(SparkApplication.java:92)
>  at org.apache.spark.application.JobWorker$$anon$2.run(JobWorker.scala:55)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted.
>  at org.apache.kylin.engine.spark.job.BuildLayoutWithUpdate.updateLayout(BuildLayoutWithUpdate.java:70)
>  at org.apache.kylin.engine.spark.job.CubeMergeJob.mergeSegments(CubeMergeJob.java:122)
>  at org.apache.kylin.engine.spark.job.CubeMergeJob.doExecute(CubeMergeJob.java:82)
>  at org.apache.kylin.engine.spark.application.SparkApplication.execute(SparkApplication.java:298)
>  at org.apache.kylin.engine.spark.application.SparkApplication.execute(SparkApplication.java:89)
>  ... 4 more
> Caused by: org.apache.spark.SparkException: Job aborted.
>  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
>  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>  at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
>  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
>  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
>  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
>  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
>  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
>  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
>  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:677)
>  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:286)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:272)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:230)
>  at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:567)
>  at org.apache.kylin.engine.spark.storage.ParquetStorage.saveTo(ParquetStorage.scala:28)
>  at org.apache.kylin.engine.spark.job.CubeMergeJob.saveAndUpdateCuboid(CubeMergeJob.java:171)
>  at org.apache.kylin.engine.spark.job.CubeMergeJob.access$000(CubeMergeJob.java:59)
>  at org.apache.kylin.engine.spark.job.CubeMergeJob$1.build(CubeMergeJob.java:118)
>  at org.apache.kylin.engine.spark.job.BuildLayoutWithUpdate$1.call(BuildLayoutWithUpdate.java:51)
>  at org.apache.kylin.engine.spark.job.BuildLayoutWithUpdate$1.call(BuildLayoutWithUpdate.java:43)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  ... 3 more
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 428 in stage 360.0 failed 4 times, most recent failure: Lost task 428.3 in stage 360.0 (TID 26130, umetrip40-hdp2.6-140.travelsky.com, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 48.4 GB of 46 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
> Driver stacktrace:
>  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
>  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
>  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
>  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
>  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
>  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
>  at scala.Option.foreach(Option.scala:257)
>  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
>  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
>  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
>  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
>  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
>  ... 34 more
> }
>  
> The WebUI monitor:
> !image-2021-06-10-16-46-35-919.png!
>  
>  
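The stack traces above suggest two workarounds for the container kills themselves (separate from the status-reporting bug): raising the executor memory overhead, or disabling the YARN vmem check per YARN-4714 (the latter would likely not help here, since the limit exceeded is physical memory). In Kylin 4 the build engine's Spark settings are normally passed through kylin.properties with the kylin.engine.spark-conf. prefix; the values below are only an illustrative sketch under that assumption, not the configuration used in this report:

    # kylin.properties -- illustrative values only, tune to the cluster
    kylin.engine.spark-conf.spark.executor.memory=36g
    kylin.engine.spark-conf.spark.executor.memoryOverhead=8g

Independent of any tuning, the real outcome of the backend application can be verified against YARN rather than the Kylin WebUI, for example with "yarn application -status <Application-Id>", whose Final-State should be SUCCEEDED; for this merge job it would presumably report FAILED even though the step was shown as finished.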



--
This message was sent by Atlassian Jira
(v8.20.10#820010)