Posted to dev@kylin.apache.org by "ZHANGHONGJIA (Jira)" <ji...@apache.org> on 2021/06/10 08:54:00 UTC

[jira] [Created] (KYLIN-5008) backend Spark job failed, but the corresponding job status is shown as FINISHED in the WebUI

ZHANGHONGJIA created KYLIN-5008:
-----------------------------------

             Summary: backend Spark job failed, but the corresponding job status is shown as FINISHED in the WebUI
                 Key: KYLIN-5008
                 URL: https://issues.apache.org/jira/browse/KYLIN-5008
             Project: Kylin
          Issue Type: Bug
    Affects Versions: v4.0.0-beta
            Reporter: ZHANGHONGJIA
         Attachments: image-2021-06-10-16-46-35-919.png, merge-job.log

According to the log shown below, the Spark job failed because the container was killed by YARN for exceeding memory limits, yet in the Kylin WebUI the status of the merge job is shown as FINISHED. In addition, the amount of data in the merged segment is about three times the actual amount of data. It seems that Kylin did not detect the failure of this merge job.

 

Here is the merge job log:

===============================================================
 at org.apache.kylin.engine.spark.job.BuildLayoutWithUpdate$1.call(BuildLayoutWithUpdate.java:43)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 ... 3 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 244 in stage 1108.0 failed 4 times, most recent failure: Lost task 244.3 in stage 1108.0 (TID 78736, r4200h1-app.travelsky.com, executor 109): ExecutorLostFailure (executor 109 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 39.0 GB of 36 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
 at scala.Option.foreach(Option.scala:257)
 at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
 at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
 ... 34 more

}
RetryInfo{
 overrideConf : {spark.executor.memory=36618MB, spark.executor.memoryOverhead=7323MB},
 throwable : java.lang.RuntimeException: Error execute org.apache.kylin.engine.spark.job.CubeMergeJob
 at org.apache.kylin.engine.spark.application.SparkApplication.execute(SparkApplication.java:92)
 at org.apache.spark.application.JobWorker$$anon$2.run(JobWorker.scala:55)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted.
 at org.apache.kylin.engine.spark.job.BuildLayoutWithUpdate.updateLayout(BuildLayoutWithUpdate.java:70)
 at org.apache.kylin.engine.spark.job.CubeMergeJob.mergeSegments(CubeMergeJob.java:122)
 at org.apache.kylin.engine.spark.job.CubeMergeJob.doExecute(CubeMergeJob.java:82)
 at org.apache.kylin.engine.spark.application.SparkApplication.execute(SparkApplication.java:298)
 at org.apache.kylin.engine.spark.application.SparkApplication.execute(SparkApplication.java:89)
 ... 4 more
Caused by: org.apache.spark.SparkException: Job aborted.
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
 at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
 at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
 at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
 at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
 at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
 at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
 at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
 at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
 at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
 at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
 at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
 at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
 at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
 at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
 at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
 at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:677)
 at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:286)
 at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:272)
 at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:230)
 at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:567)
 at org.apache.kylin.engine.spark.storage.ParquetStorage.saveTo(ParquetStorage.scala:28)
 at org.apache.kylin.engine.spark.job.CubeMergeJob.saveAndUpdateCuboid(CubeMergeJob.java:171)
 at org.apache.kylin.engine.spark.job.CubeMergeJob.access$000(CubeMergeJob.java:59)
 at org.apache.kylin.engine.spark.job.CubeMergeJob$1.build(CubeMergeJob.java:118)
 at org.apache.kylin.engine.spark.job.BuildLayoutWithUpdate$1.call(BuildLayoutWithUpdate.java:51)
 at org.apache.kylin.engine.spark.job.BuildLayoutWithUpdate$1.call(BuildLayoutWithUpdate.java:43)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 ... 3 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 428 in stage 360.0 failed 4 times, most recent failure: Lost task 428.3 in stage 360.0 (TID 26130, umetrip40-hdp2.6-140.travelsky.com, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 48.4 GB of 46 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
 at scala.Option.foreach(Option.scala:257)
 at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
 at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
 ... 34 more

}
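
For reference, the YARN message in the log ("Consider boosting spark.yarn.executor.memoryOverhead ...") points at a possible workaround for the container kill itself. In Kylin 4, Spark settings can typically be overridden through the kylin.engine.spark-conf.* passthrough in kylin.properties; the sketch below is illustrative only (the values are assumptions, not taken from this cluster), and note it only addresses the memory kill, not the status-reporting bug this issue is about:

```properties
# Illustrative values only -- tune to the actual container sizes.
# Kylin 4 forwards kylin.engine.spark-conf.* entries directly to Spark.
kylin.engine.spark-conf.spark.executor.memory=36g
# Raise the off-heap overhead so the executor stays under YARN's
# physical-memory limit (newer Spark name; older releases use
# spark.yarn.executor.memoryOverhead, as in the log above).
kylin.engine.spark-conf.spark.executor.memoryOverhead=8g
```

The RetryInfo block in the log shows Kylin's auto-retry already raising memoryOverhead to 7323MB, so a manual override may need to go higher still; the underlying bug remains that the retried failure was not reflected in the job status.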

 

The WebUI monitor:

!image-2021-06-10-16-46-35-919.png!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)