Posted to issues@spark.apache.org by "Neelesh Shastry (JIRA)" <ji...@apache.org> on 2015/12/21 19:10:46 UTC

[jira] [Updated] (SPARK-12452) Add exception details to TaskCompletionListener/TaskContext

     [ https://issues.apache.org/jira/browse/SPARK-12452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neelesh Shastry updated SPARK-12452:
------------------------------------
     Affects Version/s: 1.5.2
           Environment:     (was: spark 1.3.1, hadoop 2.6.0, 6 nodes, each node has 32 cores and 32g memory  )
    Remaining Estimate:     (was: 168h)
     Original Estimate:     (was: 168h)
              Priority: Minor  (was: Critical)
           Description: 
TaskCompletionListeners are called without any success/failure details.
If we change this:
{code}
trait TaskCompletionListener extends EventListener {
  def onTaskCompletion(context: TaskContext)
}

class TaskContextImpl {
  ....
  private[spark] def markTaskCompleted(throwable: Option[Throwable]): Unit
  ....
  listener.onTaskCompletion(this, throwable)
}
{code}

to something like:
{code}
trait TaskCompletionListener extends EventListener {
  def onTaskCompletion(context: TaskContext, throwable: Option[Throwable] = None)
}

{code}

... and in Task.scala:
{code}
   var throwable: Option[Throwable] = None
   try {
     runTask(context)
   } catch {
     case t: Throwable =>
       throwable = Some(t)   // capture the failure so completion listeners can see it
       throw t
   } finally {
     context.markTaskCompleted(throwable)
     TaskContext.unset()
   }

{code}
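For illustration, a listener written against the proposed two-argument callback could look like the sketch below. This is only a sketch of the intended usage, not existing API: it would not compile against current Spark, the class name LoggingCompletionListener is hypothetical, and it assumes {{throwable}} is {{Some(...)}} only when the task failed.

{code}
import org.apache.spark.TaskContext
import org.apache.spark.util.TaskCompletionListener

// Hypothetical listener targeting the proposed signature; it simply logs
// whether the task finished successfully or with an exception.
class LoggingCompletionListener extends TaskCompletionListener {
  override def onTaskCompletion(context: TaskContext, throwable: Option[Throwable]): Unit = {
    throwable match {
      case Some(t) => println(s"Task in partition ${context.partitionId()} failed: ${t.getMessage}")
      case None    => println(s"Task in partition ${context.partitionId()} completed successfully")
    }
  }
}

// Registered from task code the same way as today, e.g.
// TaskContext.get().addTaskCompletionListener(new LoggingCompletionListener)
{code}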

  was:
Each node is allocated 30g of memory by YARN.
My application receives messages from Kafka via direct stream. Each application consists of 4 DStream windows.
The Spark application is submitted with this command:
spark-submit --class spark_security.safe.SafeSockPuppet  --driver-memory 3g --executor-memory 3g --num-executors 3 --executor-cores 4  --name safeSparkDealerUser --master yarn  --deploy-mode cluster  spark_Security-1.0-SNAPSHOT.jar.nocalse hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/safedealer.properties

After about 1 hour, some executors exit. There are no more YARN logs after an executor exits, and there is no stack trace.
The YARN node manager log shows the following:


2015-08-17 17:25:41,550 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1439803298368_0005_01_000001 by user root
2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Creating a new application reference for app application_1439803298368_0005
2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root	IP=172.19.160.102	OPERATION=Start Container Request	TARGET=ContainerManageImpl	RESULT=SUCCESS	APPID=application_1439803298368_0005	CONTAINERID=container_1439803298368_0005_01_000001
2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1439803298368_0005 transitioned from NEW to INITING
2015-08-17 17:25:41,552 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1439803298368_0005_01_000001 to application application_1439803298368_0005
2015-08-17 17:25:41,557 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2015-08-17 17:25:41,663 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1439803298368_0005 transitioned from INITING to RUNNING
2015-08-17 17:25:41,664 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000001 transitioned from NEW to LOCALIZING
2015-08-17 17:25:41,664 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1439803298368_0005
2015-08-17 17:25:41,664 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1439803298368_0005_01_000001
2015-08-17 17:25:41,665 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark-assembly-1.3.1-hadoop2.6.0.jar transitioned from INIT to DOWNLOADING
2015-08-17 17:25:41,665 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark_Security-1.0-SNAPSHOT.jar transitioned from INIT to DOWNLOADING
2015-08-17 17:25:41,665 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1439803298368_0005_01_000001
2015-08-17 17:25:41,668 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /export/servers/hadoop2.6.0/tmp/nm-local-dir/nmPrivate/container_1439803298368_0005_01_000001.tokens. Credentials list: 
2015-08-17 17:25:41,682 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Initializing user root
2015-08-17 17:25:41,686 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Copying from /export/servers/hadoop2.6.0/tmp/nm-local-dir/nmPrivate/container_1439803298368_0005_01_000001.tokens to /export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005/container_1439803298368_0005_01_000001.tokens
2015-08-17 17:25:41,686 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Localizer CWD set to /export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005 = file:/export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005
2015-08-17 17:25:42,240 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark-assembly-1.3.1-hadoop2.6.0.jar(->/export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/filecache/14/spark-assembly-1.3.1-hadoop2.6.0.jar) transitioned from DOWNLOADING to LOCALIZED
2015-08-17 17:25:42,508 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark_Security-1.0-SNAPSHOT.jar(->/export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/filecache/15/spark_Security-1.0-SNAPSHOT.jar) transitioned from DOWNLOADING to LOCALIZED
2015-08-17 17:25:42,508 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000001 transitioned from LOCALIZING to LOCALIZED
2015-08-17 17:25:42,548 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000001 transitioned from LOCALIZED to RUNNING
................................................
2015-08-17 17:26:20,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1439803298368_0005_01_000003 by user root
2015-08-17 17:26:20,367 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1439803298368_0005_01_000003 to application application_1439803298368_0005
2015-08-17 17:26:20,368 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000003 transitioned from NEW to LOCALIZING
2015-08-17 17:26:20,368 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1439803298368_0005
2015-08-17 17:26:20,368 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1439803298368_0005_01_000003
2015-08-17 17:26:20,369 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000003 transitioned from LOCALIZING to LOCALIZED
2015-08-17 17:26:20,370 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root	IP=172.19.160.102	OPERATION=Start Container Request	TARGET=ContainerManageImpl	RESULT=SUCCESS	APPID=application_1439803298368_0005	CONTAINERID=container_1439803298368_0005_01_000003
2015-08-17 17:26:20,443 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000003 transitioned from LOCALIZED to RUNNING
2015-08-17 17:26:20,443 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Neither virutal-memory nor physical-memory monitoring is needed. Not running the monitor-thread
2015-08-17 17:26:20,449 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005/container_1439803298368_0005_01_000003/default_container_executor.sh]
..........................................
   
2015-08-18 01:50:30,297 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1439803298368_0005_01_000003 succeeded 
2015-08-18 01:50:30,440 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000003 transitioned from RUNNING to EXITED_WITH_SUCCESS
2015-08-18 01:50:30,465 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1439803298368_0005_01_000003
2015-08-18 01:50:35,046 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root	OPERATION=Container Finished - Succeeded	TARGET=ContainerImpl	RESULT=SUCCESS	APPID=application_1439803298368_0005	CONTAINERID=container_1439803298368_0005_01_000003
2015-08-18 01:50:35,062 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000003 transitioned from EXITED_WITH_SUCCESS to DONE
2015-08-18 01:50:35,065 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1439803298368_0005_01_000003 from application application_1439803298368_0005
2015-08-18 01:50:35,070 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Neither virutal-memory nor physical-memory monitoring is needed. Not running the monitor-thread
2015-08-18 01:50:35,082 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1439803298368_0005_01_000003 for log-aggregation
2015-08-18 01:50:35,089 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1439803298368_0005
2015-08-18 01:50:35,099 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_1439803298368_0005_01_000003
2015-08-18 01:50:35,105 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005/container_1439803298368_0005_01_000003
2015-08-18 01:50:47,601 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1439803298368_0005_01_000001 is : 15
2015-08-18 01:50:48,401 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1439803298368_0005_01_000001 and exit code: 15
ExitCodeException exitCode=15: 
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
	at org.apache.hadoop.util.Shell.run(Shell.java:455)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

container_1439803298368_0005_01_000003 was started at 2015-08-17 17:26:20 and ran normally, but it succeeded at 2015-08-18 01:50:30 and finally transitioned to CONTAINER_STOP. container_1439803298368_0005_01_000001 was started at 2015-08-17 17:25:42 and exited suddenly at 2015-08-18 01:50:48.

According to the node manager log, container_1439803298368_0005_01_000003 transitioned from RUNNING to EXITED_WITH_SUCCESS.


           Component/s:     (was: YARN)
            Issue Type: Improvement  (was: Bug)

> Add exception details to TaskCompletionListener/TaskContext
> -----------------------------------------------------------
>
>                 Key: SPARK-12452
>                 URL: https://issues.apache.org/jira/browse/SPARK-12452
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.5.2
>            Reporter: Neelesh Shastry
>            Priority: Minor
>
> TaskCompletionListeners are called without any success/failure details.
> If we change this:
> {code}
> trait TaskCompletionListener extends EventListener {
>   def onTaskCompletion(context: TaskContext)
> }
> class TaskContextImpl {
>   ....
>   private[spark] def markTaskCompleted(throwable: Option[Throwable]): Unit
>   ....
>   listener.onTaskCompletion(this, throwable)
> }
> {code}
> to something like:
> {code}
> trait TaskCompletionListener extends EventListener {
>   def onTaskCompletion(context: TaskContext, throwable: Option[Throwable] = None)
> }
> {code}
> ... and in Task.scala:
> {code}
>    var throwable: Option[Throwable] = None
>    try {
>      runTask(context)
>    } catch {
>      case t: Throwable =>
>        throwable = Some(t)   // capture the failure so completion listeners can see it
>        throw t
>    } finally {
>      context.markTaskCompleted(throwable)
>      TaskContext.unset()
>    }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org