You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@tez.apache.org by Johannes Zillmann <jz...@googlemail.com> on 2016/02/25 10:03:10 UTC

Strange behaviour of session DAG

Hey guys,

have a Tez job on a customer deployment which has very strange symptoms. Maybe you have some pointers for me what to look into!

App master log:
—————————————
2016-02-16 12:05:16,666 INFO [main] app.DAGAppMaster: Created DAGAppMaster for application appattempt_1455209830330_35852_000001, versionInfo=[ component=tez-dag, version=0.6.0, revision=25e41bc0f29f5e25adbfc17258b00ad702a17888, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git <http://git-wip-us.apache.org/repos/asf/tez.git>, buildTime=2015-03-03T18:01:25Z ]
2016-02-16 12:05:16,807 INFO [main] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
2016-02-16 12:05:16,816 INFO [main] Configuration.deprecation: dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode
2016-02-16 12:05:16,987 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-02-16 12:05:17,134 INFO [main] app.DAGAppMaster: Comparing client version with AM version, clientVersion=0.6.0, AMVersion=0.6.0
2016-02-16 12:05:18,158 WARN [main] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
2016-02-16 12:05:18,298 INFO [main] app.DAGAppMaster: Adding session token to jobTokenSecretManager for application
2016-02-16 12:05:18,305 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.container.AMContainerEventType for class org.apache.tez.dag.app.rm.container.AMContainerMap
2016-02-16 12:05:18,306 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.node.AMNodeEventType for class org.apache.tez.dag.app.rm.node.AMNodeTracker
2016-02-16 12:05:18,308 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGAppMasterEventType for class org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler
2016-02-16 12:05:18,309 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGEventType for class org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher
2016-02-16 12:05:18,310 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.VertexEventType for class org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher
2016-02-16 12:05:18,311 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher
2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskAttemptEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher
2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.tez.dag.app.DAGAppMaster$4
2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler
2016-02-16 12:05:18,357 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.AMSchedulerEventType for class org.apache.tez.dag.app.rm.TaskSchedulerEventHandler
2016-02-16 12:05:18,359 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.NMCommunicatorEventType for class org.apache.tez.dag.app.launcher.ContainerLauncherImpl
2016-02-16 12:05:18,483 INFO [main] node.AMNodeTracker: blacklistDisablePercent is 33, blacklistingEnabled: true, maxTaskFailuresPerNode: 10
2016-02-16 12:05:18,485 INFO [main] launcher.ContainerLauncherImpl: Upper limit on the thread pool size is 500
2016-02-16 12:05:18,485 INFO [main] history.HistoryEventHandler: Initializing HistoryEventHandler
2016-02-16 12:05:18,489 INFO [main] impl.SimpleHistoryLoggingService: Log file location for SimpleHistoryLoggingService not specified, defaulting to containerLogDir=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001
2016-02-16 12:05:18,490 INFO [main] impl.SimpleHistoryLoggingService: Initializing SimpleHistoryLoggingService, logFileLocation=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/history.txt.appattempt_1455209830330_35852_000001, maxErrors=10
2016-02-16 12:05:18,490 INFO [main] recovery.RecoveryService: Initializing RecoveryService
2016-02-16 12:05:18,491 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:APP_LAUNCHED]: applicationId=application_1455209830330_35852, appSubmitTime=1455642293317, launchTime=1455642316658
2016-02-16 12:05:18,492 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:AM_LAUNCHED]: appAttemptId=appattempt_1455209830330_35852_000001, appSubmitTime=1455642293317, 

… user specific code

2016-02-16 12:05:19,343 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Starting DAG submitted via RPC: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
2016-02-16 12:05:19,351 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type
2016-02-16 12:05:19,352 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
2016-02-16 12:05:19,354 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
2016-02-16 12:05:19,480 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Generating DAG graphviz file, dagId=dag_1455209830330_35852_1, filePath=/u01/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/dag_1455209830330_35852_1.dot
2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] common.TezUtilsInternal: Redirecting log file based on addend: dag_1455209830330_35852_1
—————————————

DAG-log:
--------------------------
2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Running DAG: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
2016-02-16 12:05:19,966 INFO [HistoryEventHandlingThread] impl.SimpleHistoryLoggingService: Writing event DAG_SUBMITTED to history file
… user specific code

2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream) (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root logger level for hadoop task to WARN: 
2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] hdfs.ClientContext: Existing client context 'default' does not match requested configuration.  Existing: shortCircuitStreamsCacheSize = 256, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = false, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = true, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000
2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
--------------------------


Latest line of the (single) task:
--------------------------
...
2016-02-16 12:07:00,166 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output using OutputSpec: { destinationVertexName=output, physicalEdgeCount=0, outputClassName=datameer.plugin.tez.output.DoNothingOutput }
2016-02-16 12:07:00,168 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Creating Input
2016-02-16 12:07:00,168 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Creating Output
2016-02-16 12:07:00,175 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output with dest edge: output
2016-02-16 12:07:00,176 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Initializing Input with src edge: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
2016-02-16 12:07:00,176 INFO [Initializer 0] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: INPUT, componentVertexName: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
2016-02-16 12:07:00,176 INFO [Initializer 1] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: OUTPUT, componentVertexName: output
2016-02-16 12:07:00,177 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initialized Output with dest edge: output

… user specific code

2016-02-16 12:07:01,871 INFO [TezChild] util.LoggingUtil: Setting root logger level for hadoop task to WARN: 
2016-02-16 12:07:03,990 WARN [TezChild] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
—————————————

The client log contains a shutdown exception:
—————————————
INFO [2016-02-16 12:13:30.412] [MrPlanRunnerV2] (TezClient.java:470) - Shutting down Tez Session, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
 INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClientUtils.java:838) - Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/ <http://xxx:8088/proxy/application_1455209830330_35852/>, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0

 INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClient.java:484) - Failed to shutdown Tez Session via proxy
org.apache.tez.dag.api.SessionNotRunning: Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/ <http://xxx:8088/proxy/application_1455209830330_35852/>, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0

	at org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:839)
	at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:669)
	at org.apache.tez.client.TezClient.stop(TezClient.java:476)
	at datameer.plugin.tez.session.TezSessionImpl.close(TezSessionImpl.java:69)
	at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSession(PoolingTezSessionFactory.java:173)
	at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSessions(PoolingTezSessionFactory.java:167)
	at datameer.plugin.tez.session.PoolingTezSessionFactory.close(PoolingTezSessionFactory.java:162)
	at datameer.com.google.common.io.Closer.close(Closer.java:214)
	at datameer.plugin.tez.smart.SmartClusterJobFlowResolver.close(SmartClusterJobFlowResolver.java:147)
	at datameer.com.google.common.io.Closer.close(Closer.java:214)
	at datameer.dap.sdk.util.IoUtil.close(IoUtil.java:330)
	at datameer.dap.common.graphv2.mixedframework.MixedClusterSession.commitJob(MixedClusterSession.java:65)
	at datameer.dap.common.graphv2.ClusterSession.commitJob(ClusterSession.java:81)
	at datameer.dap.common.graphv2.MrPlanRunnerV2.run(MrPlanRunnerV2.java:97)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:96)
	at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:173)
	at datameer.dap.common.security.RunAsThread.run(RunAsThread.java:22)
 INFO [2016-02-16 12:13:30.582] [MrPlanRunnerV2] (TezClient.java:489) - Could not connect to AM, killing session via YARN, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852

—————————————


The counter and the event shows that:
- the correct amount of input records has been ingested
- the expected amount of output records is missing (we have zero records produced which is reflected in both, the counters and the output itself)
- for Tez the task and the DAG are both handled as successful

The customers says that the symptoms of the job are always the same:
- there is that standby warning
- there is the problem with the shutdown
- (and the data is missing)

To me its still unclear if there are strange things going on with Tez or with our application logic.
The same job is successful when executed with pure MapReduce.
Is the standby thing something to worry about ?

best
Johannes

Re: Strange behaviour of session DAG

Posted by Johannes Zillmann <jz...@googlemail.com>.
@Jeff, yes its a custom input-initialiser with custom format getting the file-system via FileSystem.get(uri, con).

thanks for the insight! Looking into this!
Johannes

> On 29 Feb 2016, at 13:26, Jeff Zhang <zj...@gmail.com> wrote:
> 
> It looks like the issue is in MapChainVertex's InputInitializer.  It time out after 3 minutes due to the HA issue.  Do they implement a custom InputInitializer or using a custom InputFormat ? If they use a custom InputInitializer then it might be the user code issue. 
> 
> 
> 2016-02-16 12:05:21,435 INFO [InputInitializer [MapChainVertex:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream) (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] graphv2.ClusterTaskRuntime: Initializing cluster task configuration...
> 2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream) (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root logger level for hadoop task to WARN:
> 2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] hdfs.ClientContext: Existing client context 'default' does not match requested configuration.  Existing: shortCircuitStreamsCacheSize = 256, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = false, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = true, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000
> 2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> 2016-02-16 12:08:12,515 WARN [AsyncDispatcher event handler] filesystem.FileSystemNirvana: No file-system provider configuration setup!
> 2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> 
> On Mon, Feb 29, 2016 at 8:14 PM, Johannes Zillmann <jzillmann@googlemail.com <ma...@googlemail.com>> wrote:
> Here you go:
> 
> 
> Johannes 
> 
>> On 26 Feb 2016, at 02:56, Jeff Zhang <zjffdu@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Is it possible for you to get the whole app log from your customer ?
>> 
>> On Thu, Feb 25, 2016 at 6:26 PM, Johannes Zillmann <jzillmann@googlemail.com <ma...@googlemail.com>> wrote:
>> Looks like there is the same warning in MR as well:
>> 
>> 016-02-23 11:59:24,568 WARN [main] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>> 
>> But in this case it doesn’t seem to have a bad effect…
>> Johannes
>> 
>>> On 25 Feb 2016, at 10:46, Jeff Zhang <zjffdu@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> Might be, does only tez job affected ? What about other jobs like MR ?
>>> 
>>> On Thu, Feb 25, 2016 at 5:32 PM, Johannes Zillmann <jzillmann@googlemail.com <ma...@googlemail.com>> wrote:
>>> So yes, i think HA is enabled.
>>> But this whole behaviour is not a one shot thing, its happen repeatedly for this job. So my guess is its rather a unhappy configuration then the name node switching again and again !?
>>> 
>>> Johannes
>>> 
>>> 
>>>> On 25 Feb 2016, at 10:29, Jeff Zhang <zjffdu@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> The standby warning is because it try to connect to the standby node. (HA should be enabled I think).  Maybe this cause the output missing, could you attach the whole app log ?
>>>> 
>>>> On Thu, Feb 25, 2016 at 5:03 PM, Johannes Zillmann <jzillmann@googlemail.com <ma...@googlemail.com>> wrote:
>>>> Hey guys,
>>>> 
>>>> have a Tez job on a customer deployment which has very strange symptoms. Maybe you have some pointers for me what to look into!
>>>> 
>>>> App master log:
>>>> —————————————
>>>> 2016-02-16 12:05:16,666 INFO [main] app.DAGAppMaster: Created DAGAppMaster for application appattempt_1455209830330_35852_000001, versionInfo=[ component=tez-dag, version=0.6.0, revision=25e41bc0f29f5e25adbfc17258b00ad702a17888, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git <http://git-wip-us.apache.org/repos/asf/tez.git>, buildTime=2015-03-03T18:01:25Z ]
>>>> 2016-02-16 12:05:16,807 INFO [main] Configuration.deprecation: fs.default.name <http://fs.default.name/> is deprecated. Instead, use fs.defaultFS
>>>> 2016-02-16 12:05:16,816 INFO [main] Configuration.deprecation: dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode
>>>> 2016-02-16 12:05:16,987 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>> 2016-02-16 12:05:17,134 INFO [main] app.DAGAppMaster: Comparing client version with AM version, clientVersion=0.6.0, AMVersion=0.6.0
>>>> 2016-02-16 12:05:18,158 WARN [main] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>>> 2016-02-16 12:05:18,298 INFO [main] app.DAGAppMaster: Adding session token to jobTokenSecretManager for application
>>>> 2016-02-16 12:05:18,305 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.container.AMContainerEventType for class org.apache.tez.dag.app.rm.container.AMContainerMap
>>>> 2016-02-16 12:05:18,306 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.node.AMNodeEventType for class org.apache.tez.dag.app.rm.node.AMNodeTracker
>>>> 2016-02-16 12:05:18,308 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGAppMasterEventType for class org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler
>>>> 2016-02-16 12:05:18,309 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGEventType for class org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher
>>>> 2016-02-16 12:05:18,310 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.VertexEventType for class org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher
>>>> 2016-02-16 12:05:18,311 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher
>>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskAttemptEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher
>>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.tez.dag.app.DAGAppMaster$4
>>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler
>>>> 2016-02-16 12:05:18,357 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.AMSchedulerEventType for class org.apache.tez.dag.app.rm.TaskSchedulerEventHandler
>>>> 2016-02-16 12:05:18,359 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.NMCommunicatorEventType for class org.apache.tez.dag.app.launcher.ContainerLauncherImpl
>>>> 2016-02-16 12:05:18,483 INFO [main] node.AMNodeTracker: blacklistDisablePercent is 33, blacklistingEnabled: true, maxTaskFailuresPerNode: 10
>>>> 2016-02-16 12:05:18,485 INFO [main] launcher.ContainerLauncherImpl: Upper limit on the thread pool size is 500
>>>> 2016-02-16 12:05:18,485 INFO [main] history.HistoryEventHandler: Initializing HistoryEventHandler
>>>> 2016-02-16 12:05:18,489 INFO [main] impl.SimpleHistoryLoggingService: Log file location for SimpleHistoryLoggingService not specified, defaulting to containerLogDir=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001
>>>> 2016-02-16 12:05:18,490 INFO [main] impl.SimpleHistoryLoggingService: Initializing SimpleHistoryLoggingService, logFileLocation=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/history.txt.appattempt_1455209830330_35852_000001, maxErrors=10
>>>> 2016-02-16 12:05:18,490 INFO [main] recovery.RecoveryService: Initializing RecoveryService
>>>> 2016-02-16 12:05:18,491 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:APP_LAUNCHED]: applicationId=application_1455209830330_35852, appSubmitTime=1455642293317, launchTime=1455642316658
>>>> 2016-02-16 12:05:18,492 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:AM_LAUNCHED]: appAttemptId=appattempt_1455209830330_35852_000001, appSubmitTime=1455642293317, 
>>>> 
>>>> … user specific code
>>>> 
>>>> 2016-02-16 12:05:19,343 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Starting DAG submitted via RPC: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>>>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
>>>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: fs.default.name <http://fs.default.name/> is deprecated. Instead, use fs.defaultFS
>>>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
>>>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
>>>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
>>>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
>>>> 2016-02-16 12:05:19,351 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type
>>>> 2016-02-16 12:05:19,352 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
>>>> 2016-02-16 12:05:19,354 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
>>>> 2016-02-16 12:05:19,480 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Generating DAG graphviz file, dagId=dag_1455209830330_35852_1, filePath=/u01/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/dag_1455209830330_35852_1.dot
>>>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] common.TezUtilsInternal: Redirecting log file based on addend: dag_1455209830330_35852_1
>>>> —————————————
>>>> 
>>>> DAG-log:
>>>> --------------------------
>>>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Running DAG: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>>>> 2016-02-16 12:05:19,966 INFO [HistoryEventHandlingThread] impl.SimpleHistoryLoggingService: Writing event DAG_SUBMITTED to history file
>>>> … user specific code
>>>> 
>>>> 2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream) (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root logger level for hadoop task to WARN: 
>>>> 2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] hdfs.ClientContext: Existing client context 'default' does not match requested configuration.  Existing: shortCircuitStreamsCacheSize = 256, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = false, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = true, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000
>>>> 2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>>> 2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>>> --------------------------
>>>> 
>>>> 
>>>> Latest line of the (single) task:
>>>> --------------------------
>>>> ...
>>>> 2016-02-16 12:07:00,166 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output using OutputSpec: { destinationVertexName=output, physicalEdgeCount=0, outputClassName=datameer.plugin.tez.output.DoNothingOutput }
>>>> 2016-02-16 12:07:00,168 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Creating Input
>>>> 2016-02-16 12:07:00,168 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Creating Output
>>>> 2016-02-16 12:07:00,175 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output with dest edge: output
>>>> 2016-02-16 12:07:00,176 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Initializing Input with src edge: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
>>>> 2016-02-16 12:07:00,176 INFO [Initializer 0] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: INPUT, componentVertexName: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
>>>> 2016-02-16 12:07:00,176 INFO [Initializer 1] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: OUTPUT, componentVertexName: output
>>>> 2016-02-16 12:07:00,177 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initialized Output with dest edge: output
>>>> 
>>>> … user specific code
>>>> 
>>>> 2016-02-16 12:07:01,871 INFO [TezChild] util.LoggingUtil: Setting root logger level for hadoop task to WARN: 
>>>> 2016-02-16 12:07:03,990 WARN [TezChild] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>>> —————————————
>>>> 
>>>> The client log contains a shutdown exception:
>>>> —————————————
>>>> INFO [2016-02-16 12:13:30.412] [MrPlanRunnerV2] (TezClient.java:470) - Shutting down Tez Session, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
>>>>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClientUtils.java:838) - Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/ <http://xxx:8088/proxy/application_1455209830330_35852/>, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
>>>> 
>>>>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClient.java:484) - Failed to shutdown Tez Session via proxy
>>>> org.apache.tez.dag.api.SessionNotRunning: Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/ <http://xxx:8088/proxy/application_1455209830330_35852/>, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
>>>> 
>>>> 	at org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:839)
>>>> 	at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:669)
>>>> 	at org.apache.tez.client.TezClient.stop(TezClient.java:476)
>>>> 	at datameer.plugin.tez.session.TezSessionImpl.close(TezSessionImpl.java:69)
>>>> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSession(PoolingTezSessionFactory.java:173)
>>>> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSessions(PoolingTezSessionFactory.java:167)
>>>> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.close(PoolingTezSessionFactory.java:162)
>>>> 	at datameer.com.google.common.io.Closer.close(Closer.java:214)
>>>> 	at datameer.plugin.tez.smart.SmartClusterJobFlowResolver.close(SmartClusterJobFlowResolver.java:147)
>>>> 	at datameer.com.google.common.io.Closer.close(Closer.java:214)
>>>> 	at datameer.dap.sdk.util.IoUtil.close(IoUtil.java:330)
>>>> 	at datameer.dap.common.graphv2.mixedframework.MixedClusterSession.commitJob(MixedClusterSession.java:65)
>>>> 	at datameer.dap.common.graphv2.ClusterSession.commitJob(ClusterSession.java:81)
>>>> 	at datameer.dap.common.graphv2.MrPlanRunnerV2.run(MrPlanRunnerV2.java:97)
>>>> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>> 	at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:96)
>>>> 	at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:173)
>>>> 	at datameer.dap.common.security.RunAsThread.run(RunAsThread.java:22)
>>>>  INFO [2016-02-16 12:13:30.582] [MrPlanRunnerV2] (TezClient.java:489) - Could not connect to AM, killing session via YARN, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
>>>> 
>>>> —————————————
>>>> 
>>>> 
>>>> The counter and the event shows that:
>>>> - the correct amount of input records has been ingested
>>>> - the expected amount of output records is missing (we have zero records produced which is reflected in both, the counters and the output itself)
>>>> - for Tez the task and the DAG are both handled as successful
>>>> 
>>>> The customers says that the symptoms of the job are always the same:
>>>> - there is that standby warning
>>>> - there is the problem with the shutdown
>>>> - (and the data is missing)
>>>> 
>>>> To me its still unclear if there are strange things going on with Tez or with our application logic.
>>>> The same job is successful when executed with pure MapReduce.
>>>> Is the standby thing something to worry about ?
>>>> 
>>>> best
>>>> Johannes
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Best Regards
>>>> 
>>>> Jeff Zhang
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Best Regards
>>> 
>>> Jeff Zhang
>> 
>> 
>> 
>> 
>> -- 
>> Best Regards
>> 
>> Jeff Zhang
> 
> 
> 
> 
> 
> -- 
> Best Regards
> 
> Jeff Zhang


Re: Strange behaviour of session DAG

Posted by Jeff Zhang <zj...@gmail.com>.
It looks like the issue is in MapChainVertex's InputInitializer.  It time
out after 3 minutes due to the HA issue.  Do they implement a custom
InputInitializer or using a custom InputFormat ? If they use a
custom InputInitializer then it might be the user code issue.


2016-02-16 12:05:21,435 INFO [InputInitializer [MapChainVertex:Export job
(349): ClaimLineExportToDB2#export(Disconnected record stream)
(2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] graphv2.ClusterTaskRuntime:
Initializing cluster task configuration...
2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job
(349): ClaimLineExportToDB2#export(Disconnected record stream)
(2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root
logger level for hadoop task to WARN:
2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler]
hdfs.ClientContext: Existing client context 'default' does not match
requested configuration.  Existing: shortCircuitStreamsCacheSize = 256,
shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256,
shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout
= 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity
= 16, socketCacheExpiry = 3000, shortCircuitLocalReads = false,
useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false,
shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested:
shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs =
300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs =
3600000, shortCircuitMmapCacheRetryTimeout = 300000,
shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16,
socketCacheExpiry = 3000, shortCircuitLocalReads = true,
useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false,
shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000
*2016-02-16 12:05:21,732* WARN [AsyncDispatcher event handler] ipc.Client:
Exception encountered while connecting to the server :
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
Operation category READ is not supported in state standby
*2016-02-16 12:08:12*,515 WARN [AsyncDispatcher event handler]
filesystem.FileSystemNirvana: No file-system provider configuration setup!
2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client:
Exception encountered while connecting to the server :
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
Operation category READ is not supported in state standby

On Mon, Feb 29, 2016 at 8:14 PM, Johannes Zillmann <jzillmann@googlemail.com
> wrote:

> Here you go:
>
>
> Johannes
>
> On 26 Feb 2016, at 02:56, Jeff Zhang <zj...@gmail.com> wrote:
>
> Is it possible for you to get the whole app log from your customer ?
>
> On Thu, Feb 25, 2016 at 6:26 PM, Johannes Zillmann <
> jzillmann@googlemail.com> wrote:
>
>> Looks like there is the same warning in MR as well:
>>
>> 016-02-23 11:59:24,568 WARN [main] org.apache.hadoop.ipc.Client:
>> Exception encountered while connecting to the server :
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>> Operation category READ is not supported in state standby
>>
>> But in this case it doesn’t seem to have a bad effect…
>> Johannes
>>
>> On 25 Feb 2016, at 10:46, Jeff Zhang <zj...@gmail.com> wrote:
>>
>> Might be, does only tez job affected ? What about other jobs like MR ?
>>
>> On Thu, Feb 25, 2016 at 5:32 PM, Johannes Zillmann <
>> jzillmann@googlemail.com> wrote:
>>
>>> So yes, i think HA is enabled.
>>> But this whole behaviour is not a one shot thing, its happen repeatedly
>>> for this job. So my guess is its rather a unhappy configuration then the
>>> name node switching again and again !?
>>>
>>> Johannes
>>>
>>>
>>> On 25 Feb 2016, at 10:29, Jeff Zhang <zj...@gmail.com> wrote:
>>>
>>> The standby warning is because it try to connect to the standby node.
>>> (HA should be enabled I think).  Maybe this cause the output missing, could
>>> you attach the whole app log ?
>>>
>>> On Thu, Feb 25, 2016 at 5:03 PM, Johannes Zillmann <
>>> jzillmann@googlemail.com> wrote:
>>>
>>>> Hey guys,
>>>>
>>>> have a Tez job on a customer deployment which has very strange
>>>> symptoms. Maybe you have some pointers for me what to look into!
>>>>
>>>> App master log:
>>>> —————————————
>>>>
>>>> 2016-02-16 12:05:16,666 INFO [main] app.DAGAppMaster: Created DAGAppMaster for application appattempt_1455209830330_35852_000001, versionInfo=[ component=tez-dag, version=0.6.0, revision=25e41bc0f29f5e25adbfc17258b00ad702a17888, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2015-03-03T18:01:25Z ]
>>>> 2016-02-16 12:05:16,807 INFO [main] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
>>>> 2016-02-16 12:05:16,816 INFO [main] Configuration.deprecation: dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode
>>>> 2016-02-16 12:05:16,987 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>> 2016-02-16 12:05:17,134 INFO [main] app.DAGAppMaster: Comparing client version with AM version, clientVersion=0.6.0, AMVersion=0.6.0
>>>> 2016-02-16 12:05:18,158 WARN [main] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>>> 2016-02-16 12:05:18,298 INFO [main] app.DAGAppMaster: Adding session token to jobTokenSecretManager for application
>>>> 2016-02-16 12:05:18,305 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.container.AMContainerEventType for class org.apache.tez.dag.app.rm.container.AMContainerMap
>>>> 2016-02-16 12:05:18,306 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.node.AMNodeEventType for class org.apache.tez.dag.app.rm.node.AMNodeTracker
>>>> 2016-02-16 12:05:18,308 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGAppMasterEventType for class org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler
>>>> 2016-02-16 12:05:18,309 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGEventType for class org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher
>>>> 2016-02-16 12:05:18,310 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.VertexEventType for class org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher
>>>> 2016-02-16 12:05:18,311 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher
>>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskAttemptEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher
>>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.tez.dag.app.DAGAppMaster$4
>>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler
>>>> 2016-02-16 12:05:18,357 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.AMSchedulerEventType for class org.apache.tez.dag.app.rm.TaskSchedulerEventHandler
>>>> 2016-02-16 12:05:18,359 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.NMCommunicatorEventType for class org.apache.tez.dag.app.launcher.ContainerLauncherImpl
>>>> 2016-02-16 12:05:18,483 INFO [main] node.AMNodeTracker: blacklistDisablePercent is 33, blacklistingEnabled: true, maxTaskFailuresPerNode: 10
>>>> 2016-02-16 12:05:18,485 INFO [main] launcher.ContainerLauncherImpl: Upper limit on the thread pool size is 500
>>>> 2016-02-16 12:05:18,485 INFO [main] history.HistoryEventHandler: Initializing HistoryEventHandler
>>>> 2016-02-16 12:05:18,489 INFO [main] impl.SimpleHistoryLoggingService: Log file location for SimpleHistoryLoggingService not specified, defaulting to containerLogDir=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001
>>>> 2016-02-16 12:05:18,490 INFO [main] impl.SimpleHistoryLoggingService: Initializing SimpleHistoryLoggingService, logFileLocation=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/history.txt.appattempt_1455209830330_35852_000001, maxErrors=10
>>>> 2016-02-16 12:05:18,490 INFO [main] recovery.RecoveryService: Initializing RecoveryService
>>>> 2016-02-16 12:05:18,491 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:APP_LAUNCHED]: applicationId=application_1455209830330_35852, appSubmitTime=1455642293317, launchTime=1455642316658
>>>> 2016-02-16 12:05:18,492 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:AM_LAUNCHED]: appAttemptId=appattempt_1455209830330_35852_000001, appSubmitTime=1455642293317,
>>>>
>>>>
>>>> … user specific code
>>>>
>>>> 2016-02-16 12:05:19,343 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Starting DAG submitted via RPC: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>>>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
>>>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
>>>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
>>>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
>>>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
>>>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
>>>> 2016-02-16 12:05:19,351 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type
>>>> 2016-02-16 12:05:19,352 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
>>>> 2016-02-16 12:05:19,354 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
>>>> 2016-02-16 12:05:19,480 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Generating DAG graphviz file, dagId=dag_1455209830330_35852_1, filePath=/u01/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/dag_1455209830330_35852_1.dot
>>>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] common.TezUtilsInternal: Redirecting log file based on addend: dag_1455209830330_35852_1
>>>>
>>>> —————————————
>>>>
>>>> DAG-log:
>>>> --------------------------
>>>>
>>>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Running DAG: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>>>> 2016-02-16 12:05:19,966 INFO [HistoryEventHandlingThread] impl.SimpleHistoryLoggingService: Writing event DAG_SUBMITTED to history file
>>>>
>>>> … user specific code
>>>>
>>>> 2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream) (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root logger level for hadoop task to WARN:
>>>> 2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] hdfs.ClientContext: Existing client context 'default' does not match requested configuration.  Existing: shortCircuitStreamsCacheSize = 256, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = false, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = true, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000
>>>> 2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>>> 2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>>>
>>>> --------------------------
>>>>
>>>>
>>>> Latest line of the (single) task:
>>>> --------------------------
>>>>
>>>> ...
>>>>
>>>> 2016-02-16 12:07:00,166 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output using OutputSpec: { destinationVertexName=output, physicalEdgeCount=0, outputClassName=datameer.plugin.tez.output.DoNothingOutput }
>>>> 2016-02-16 12:07:00,168 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Creating Input
>>>> 2016-02-16 12:07:00,168 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Creating Output
>>>> 2016-02-16 12:07:00,175 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output with dest edge: output
>>>> 2016-02-16 12:07:00,176 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Initializing Input with src edge: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
>>>> 2016-02-16 12:07:00,176 INFO [Initializer 0] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: INPUT, componentVertexName: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
>>>> 2016-02-16 12:07:00,176 INFO [Initializer 1] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: OUTPUT, componentVertexName: output
>>>> 2016-02-16 12:07:00,177 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initialized Output with dest edge: output
>>>>
>>>>
>>>> … user specific code
>>>>
>>>> 2016-02-16 12:07:01,871 INFO [TezChild] util.LoggingUtil: Setting root
>>>> logger level for hadoop task to WARN:
>>>>
>>>> 2016-02-16 12:07:03,990 WARN [TezChild] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>>>
>>>> —————————————
>>>>
>>>> The client log contains a shutdown exception:
>>>> —————————————
>>>>
>>>> INFO [2016-02-16 12:13:30.412] [MrPlanRunnerV2] (TezClient.java:470) - Shutting down Tez Session, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
>>>>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClientUtils.java:838) - Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
>>>>
>>>>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClient.java:484) - Failed to shutdown Tez Session via proxy
>>>> org.apache.tez.dag.api.SessionNotRunning: Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
>>>>
>>>> 	at org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:839)
>>>> 	at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:669)
>>>> 	at org.apache.tez.client.TezClient.stop(TezClient.java:476)
>>>> 	at datameer.plugin.tez.session.TezSessionImpl.close(TezSessionImpl.java:69)
>>>> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSession(PoolingTezSessionFactory.java:173)
>>>> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSessions(PoolingTezSessionFactory.java:167)
>>>> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.close(PoolingTezSessionFactory.java:162)
>>>> 	at datameer.com.google.common.io.Closer.close(Closer.java:214)
>>>> 	at datameer.plugin.tez.smart.SmartClusterJobFlowResolver.close(SmartClusterJobFlowResolver.java:147)
>>>> 	at datameer.com.google.common.io.Closer.close(Closer.java:214)
>>>> 	at datameer.dap.sdk.util.IoUtil.close(IoUtil.java:330)
>>>> 	at datameer.dap.common.graphv2.mixedframework.MixedClusterSession.commitJob(MixedClusterSession.java:65)
>>>> 	at datameer.dap.common.graphv2.ClusterSession.commitJob(ClusterSession.java:81)
>>>> 	at datameer.dap.common.graphv2.MrPlanRunnerV2.run(MrPlanRunnerV2.java:97)
>>>> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>> 	at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:96)
>>>> 	at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:173)
>>>> 	at datameer.dap.common.security.RunAsThread.run(RunAsThread.java:22)
>>>>  INFO [2016-02-16 12:13:30.582] [MrPlanRunnerV2] (TezClient.java:489) - Could not connect to AM, killing session via YARN, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
>>>>
>>>>
>>>> —————————————
>>>>
>>>>
>>>> The counter and the event shows that:
>>>> - the correct amount of input records has been ingested
>>>> - the expected amount of output records is missing (we have zero
>>>> records produced which is reflected in both, the counters and the output
>>>> itself)
>>>> - for Tez the task and the DAG are both handled as successful
>>>>
>>>> The customers says that the symptoms of the job are always the same:
>>>> - there is that standby warning
>>>> - there is the problem with the shutdown
>>>> - (and the data is missing)
>>>>
>>>> To me its still unclear if there are strange things going on with Tez
>>>> or with our application logic.
>>>> The same job is successful when executed with pure MapReduce.
>>>> Is the standby thing something to worry about ?
>>>>
>>>> best
>>>> Johannes
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>>
>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
>
>
>


-- 
Best Regards

Jeff Zhang

Re: Strange behaviour of session DAG

Posted by Johannes Zillmann <jz...@googlemail.com>.
Here you go:


Johannes 

> On 26 Feb 2016, at 02:56, Jeff Zhang <zj...@gmail.com> wrote:
> 
> Is it possible for you to get the whole app log from your customer ?
> 
> On Thu, Feb 25, 2016 at 6:26 PM, Johannes Zillmann <jzillmann@googlemail.com <ma...@googlemail.com>> wrote:
> Looks like there is the same warning in MR as well:
> 
> 016-02-23 11:59:24,568 WARN [main] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> 
> But in this case it doesn’t seem to have a bad effect…
> Johannes
> 
>> On 25 Feb 2016, at 10:46, Jeff Zhang <zjffdu@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Might be, does only tez job affected ? What about other jobs like MR ?
>> 
>> On Thu, Feb 25, 2016 at 5:32 PM, Johannes Zillmann <jzillmann@googlemail.com <ma...@googlemail.com>> wrote:
>> So yes, i think HA is enabled.
>> But this whole behaviour is not a one shot thing, its happen repeatedly for this job. So my guess is its rather a unhappy configuration then the name node switching again and again !?
>> 
>> Johannes
>> 
>> 
>>> On 25 Feb 2016, at 10:29, Jeff Zhang <zjffdu@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> The standby warning is because it try to connect to the standby node. (HA should be enabled I think).  Maybe this cause the output missing, could you attach the whole app log ?
>>> 
>>> On Thu, Feb 25, 2016 at 5:03 PM, Johannes Zillmann <jzillmann@googlemail.com <ma...@googlemail.com>> wrote:
>>> Hey guys,
>>> 
>>> have a Tez job on a customer deployment which has very strange symptoms. Maybe you have some pointers for me what to look into!
>>> 
>>> App master log:
>>> —————————————
>>> 2016-02-16 12:05:16,666 INFO [main] app.DAGAppMaster: Created DAGAppMaster for application appattempt_1455209830330_35852_000001, versionInfo=[ component=tez-dag, version=0.6.0, revision=25e41bc0f29f5e25adbfc17258b00ad702a17888, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git <http://git-wip-us.apache.org/repos/asf/tez.git>, buildTime=2015-03-03T18:01:25Z ]
>>> 2016-02-16 12:05:16,807 INFO [main] Configuration.deprecation: fs.default.name <http://fs.default.name/> is deprecated. Instead, use fs.defaultFS
>>> 2016-02-16 12:05:16,816 INFO [main] Configuration.deprecation: dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode
>>> 2016-02-16 12:05:16,987 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> 2016-02-16 12:05:17,134 INFO [main] app.DAGAppMaster: Comparing client version with AM version, clientVersion=0.6.0, AMVersion=0.6.0
>>> 2016-02-16 12:05:18,158 WARN [main] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>> 2016-02-16 12:05:18,298 INFO [main] app.DAGAppMaster: Adding session token to jobTokenSecretManager for application
>>> 2016-02-16 12:05:18,305 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.container.AMContainerEventType for class org.apache.tez.dag.app.rm.container.AMContainerMap
>>> 2016-02-16 12:05:18,306 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.node.AMNodeEventType for class org.apache.tez.dag.app.rm.node.AMNodeTracker
>>> 2016-02-16 12:05:18,308 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGAppMasterEventType for class org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler
>>> 2016-02-16 12:05:18,309 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGEventType for class org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher
>>> 2016-02-16 12:05:18,310 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.VertexEventType for class org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher
>>> 2016-02-16 12:05:18,311 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher
>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskAttemptEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher
>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.tez.dag.app.DAGAppMaster$4
>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler
>>> 2016-02-16 12:05:18,357 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.AMSchedulerEventType for class org.apache.tez.dag.app.rm.TaskSchedulerEventHandler
>>> 2016-02-16 12:05:18,359 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.NMCommunicatorEventType for class org.apache.tez.dag.app.launcher.ContainerLauncherImpl
>>> 2016-02-16 12:05:18,483 INFO [main] node.AMNodeTracker: blacklistDisablePercent is 33, blacklistingEnabled: true, maxTaskFailuresPerNode: 10
>>> 2016-02-16 12:05:18,485 INFO [main] launcher.ContainerLauncherImpl: Upper limit on the thread pool size is 500
>>> 2016-02-16 12:05:18,485 INFO [main] history.HistoryEventHandler: Initializing HistoryEventHandler
>>> 2016-02-16 12:05:18,489 INFO [main] impl.SimpleHistoryLoggingService: Log file location for SimpleHistoryLoggingService not specified, defaulting to containerLogDir=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001
>>> 2016-02-16 12:05:18,490 INFO [main] impl.SimpleHistoryLoggingService: Initializing SimpleHistoryLoggingService, logFileLocation=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/history.txt.appattempt_1455209830330_35852_000001, maxErrors=10
>>> 2016-02-16 12:05:18,490 INFO [main] recovery.RecoveryService: Initializing RecoveryService
>>> 2016-02-16 12:05:18,491 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:APP_LAUNCHED]: applicationId=application_1455209830330_35852, appSubmitTime=1455642293317, launchTime=1455642316658
>>> 2016-02-16 12:05:18,492 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:AM_LAUNCHED]: appAttemptId=appattempt_1455209830330_35852_000001, appSubmitTime=1455642293317, 
>>> 
>>> … user specific code
>>> 
>>> 2016-02-16 12:05:19,343 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Starting DAG submitted via RPC: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
>>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: fs.default.name <http://fs.default.name/> is deprecated. Instead, use fs.defaultFS
>>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
>>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
>>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
>>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
>>> 2016-02-16 12:05:19,351 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type
>>> 2016-02-16 12:05:19,352 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
>>> 2016-02-16 12:05:19,354 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
>>> 2016-02-16 12:05:19,480 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Generating DAG graphviz file, dagId=dag_1455209830330_35852_1, filePath=/u01/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/dag_1455209830330_35852_1.dot
>>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] common.TezUtilsInternal: Redirecting log file based on addend: dag_1455209830330_35852_1
>>> —————————————
>>> 
>>> DAG-log:
>>> --------------------------
>>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Running DAG: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>>> 2016-02-16 12:05:19,966 INFO [HistoryEventHandlingThread] impl.SimpleHistoryLoggingService: Writing event DAG_SUBMITTED to history file
>>> … user specific code
>>> 
>>> 2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream) (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root logger level for hadoop task to WARN: 
>>> 2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] hdfs.ClientContext: Existing client context 'default' does not match requested configuration.  Existing: shortCircuitStreamsCacheSize = 256, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = false, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = true, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000
>>> 2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>> 2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>> --------------------------
>>> 
>>> 
>>> Latest line of the (single) task:
>>> --------------------------
>>> ...
>>> 2016-02-16 12:07:00,166 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output using OutputSpec: { destinationVertexName=output, physicalEdgeCount=0, outputClassName=datameer.plugin.tez.output.DoNothingOutput }
>>> 2016-02-16 12:07:00,168 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Creating Input
>>> 2016-02-16 12:07:00,168 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Creating Output
>>> 2016-02-16 12:07:00,175 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output with dest edge: output
>>> 2016-02-16 12:07:00,176 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Initializing Input with src edge: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
>>> 2016-02-16 12:07:00,176 INFO [Initializer 0] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: INPUT, componentVertexName: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
>>> 2016-02-16 12:07:00,176 INFO [Initializer 1] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: OUTPUT, componentVertexName: output
>>> 2016-02-16 12:07:00,177 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initialized Output with dest edge: output
>>> 
>>> … user specific code
>>> 
>>> 2016-02-16 12:07:01,871 INFO [TezChild] util.LoggingUtil: Setting root logger level for hadoop task to WARN: 
>>> 2016-02-16 12:07:03,990 WARN [TezChild] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>> —————————————
>>> 
>>> The client log contains a shutdown exception:
>>> —————————————
>>> INFO [2016-02-16 12:13:30.412] [MrPlanRunnerV2] (TezClient.java:470) - Shutting down Tez Session, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
>>>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClientUtils.java:838) - Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/ <http://xxx:8088/proxy/application_1455209830330_35852/>, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
>>> 
>>>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClient.java:484) - Failed to shutdown Tez Session via proxy
>>> org.apache.tez.dag.api.SessionNotRunning: Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/ <http://xxx:8088/proxy/application_1455209830330_35852/>, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
>>> 
>>> 	at org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:839)
>>> 	at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:669)
>>> 	at org.apache.tez.client.TezClient.stop(TezClient.java:476)
>>> 	at datameer.plugin.tez.session.TezSessionImpl.close(TezSessionImpl.java:69)
>>> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSession(PoolingTezSessionFactory.java:173)
>>> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSessions(PoolingTezSessionFactory.java:167)
>>> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.close(PoolingTezSessionFactory.java:162)
>>> 	at datameer.com.google.common.io.Closer.close(Closer.java:214)
>>> 	at datameer.plugin.tez.smart.SmartClusterJobFlowResolver.close(SmartClusterJobFlowResolver.java:147)
>>> 	at datameer.com.google.common.io.Closer.close(Closer.java:214)
>>> 	at datameer.dap.sdk.util.IoUtil.close(IoUtil.java:330)
>>> 	at datameer.dap.common.graphv2.mixedframework.MixedClusterSession.commitJob(MixedClusterSession.java:65)
>>> 	at datameer.dap.common.graphv2.ClusterSession.commitJob(ClusterSession.java:81)
>>> 	at datameer.dap.common.graphv2.MrPlanRunnerV2.run(MrPlanRunnerV2.java:97)
>>> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>> 	at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:96)
>>> 	at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:173)
>>> 	at datameer.dap.common.security.RunAsThread.run(RunAsThread.java:22)
>>>  INFO [2016-02-16 12:13:30.582] [MrPlanRunnerV2] (TezClient.java:489) - Could not connect to AM, killing session via YARN, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
>>> 
>>> —————————————
>>> 
>>> 
>>> The counter and the event shows that:
>>> - the correct amount of input records has been ingested
>>> - the expected amount of output records is missing (we have zero records produced which is reflected in both, the counters and the output itself)
>>> - for Tez the task and the DAG are both handled as successful
>>> 
>>> The customers says that the symptoms of the job are always the same:
>>> - there is that standby warning
>>> - there is the problem with the shutdown
>>> - (and the data is missing)
>>> 
>>> To me its still unclear if there are strange things going on with Tez or with our application logic.
>>> The same job is successful when executed with pure MapReduce.
>>> Is the standby thing something to worry about ?
>>> 
>>> best
>>> Johannes
>>> 
>>> 
>>> 
>>> -- 
>>> Best Regards
>>> 
>>> Jeff Zhang
>> 
>> 
>> 
>> 
>> -- 
>> Best Regards
>> 
>> Jeff Zhang
> 
> 
> 
> 
> -- 
> Best Regards
> 
> Jeff Zhang


Re: Strange behaviour of session DAG

Posted by Jeff Zhang <zj...@gmail.com>.
Is it possible for you to get the whole app log from your customer ?

On Thu, Feb 25, 2016 at 6:26 PM, Johannes Zillmann <jzillmann@googlemail.com
> wrote:

> Looks like there is the same warning in MR as well:
>
> 016-02-23 11:59:24,568 WARN [main] org.apache.hadoop.ipc.Client: Exception
> encountered while connecting to the server :
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> Operation category READ is not supported in state standby
>
> But in this case it doesn’t seem to have a bad effect…
> Johannes
>
> On 25 Feb 2016, at 10:46, Jeff Zhang <zj...@gmail.com> wrote:
>
> Might be, does only tez job affected ? What about other jobs like MR ?
>
> On Thu, Feb 25, 2016 at 5:32 PM, Johannes Zillmann <
> jzillmann@googlemail.com> wrote:
>
>> So yes, i think HA is enabled.
>> But this whole behaviour is not a one shot thing, its happen repeatedly
>> for this job. So my guess is its rather a unhappy configuration then the
>> name node switching again and again !?
>>
>> Johannes
>>
>>
>> On 25 Feb 2016, at 10:29, Jeff Zhang <zj...@gmail.com> wrote:
>>
>> The standby warning is because it try to connect to the standby node. (HA
>> should be enabled I think).  Maybe this cause the output missing, could you
>> attach the whole app log ?
>>
>> On Thu, Feb 25, 2016 at 5:03 PM, Johannes Zillmann <
>> jzillmann@googlemail.com> wrote:
>>
>>> Hey guys,
>>>
>>> have a Tez job on a customer deployment which has very strange symptoms.
>>> Maybe you have some pointers for me what to look into!
>>>
>>> App master log:
>>> —————————————
>>>
>>> 2016-02-16 12:05:16,666 INFO [main] app.DAGAppMaster: Created DAGAppMaster for application appattempt_1455209830330_35852_000001, versionInfo=[ component=tez-dag, version=0.6.0, revision=25e41bc0f29f5e25adbfc17258b00ad702a17888, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2015-03-03T18:01:25Z ]
>>> 2016-02-16 12:05:16,807 INFO [main] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
>>> 2016-02-16 12:05:16,816 INFO [main] Configuration.deprecation: dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode
>>> 2016-02-16 12:05:16,987 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> 2016-02-16 12:05:17,134 INFO [main] app.DAGAppMaster: Comparing client version with AM version, clientVersion=0.6.0, AMVersion=0.6.0
>>> 2016-02-16 12:05:18,158 WARN [main] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>> 2016-02-16 12:05:18,298 INFO [main] app.DAGAppMaster: Adding session token to jobTokenSecretManager for application
>>> 2016-02-16 12:05:18,305 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.container.AMContainerEventType for class org.apache.tez.dag.app.rm.container.AMContainerMap
>>> 2016-02-16 12:05:18,306 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.node.AMNodeEventType for class org.apache.tez.dag.app.rm.node.AMNodeTracker
>>> 2016-02-16 12:05:18,308 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGAppMasterEventType for class org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler
>>> 2016-02-16 12:05:18,309 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGEventType for class org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher
>>> 2016-02-16 12:05:18,310 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.VertexEventType for class org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher
>>> 2016-02-16 12:05:18,311 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher
>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskAttemptEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher
>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.tez.dag.app.DAGAppMaster$4
>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler
>>> 2016-02-16 12:05:18,357 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.AMSchedulerEventType for class org.apache.tez.dag.app.rm.TaskSchedulerEventHandler
>>> 2016-02-16 12:05:18,359 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.NMCommunicatorEventType for class org.apache.tez.dag.app.launcher.ContainerLauncherImpl
>>> 2016-02-16 12:05:18,483 INFO [main] node.AMNodeTracker: blacklistDisablePercent is 33, blacklistingEnabled: true, maxTaskFailuresPerNode: 10
>>> 2016-02-16 12:05:18,485 INFO [main] launcher.ContainerLauncherImpl: Upper limit on the thread pool size is 500
>>> 2016-02-16 12:05:18,485 INFO [main] history.HistoryEventHandler: Initializing HistoryEventHandler
>>> 2016-02-16 12:05:18,489 INFO [main] impl.SimpleHistoryLoggingService: Log file location for SimpleHistoryLoggingService not specified, defaulting to containerLogDir=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001
>>> 2016-02-16 12:05:18,490 INFO [main] impl.SimpleHistoryLoggingService: Initializing SimpleHistoryLoggingService, logFileLocation=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/history.txt.appattempt_1455209830330_35852_000001, maxErrors=10
>>> 2016-02-16 12:05:18,490 INFO [main] recovery.RecoveryService: Initializing RecoveryService
>>> 2016-02-16 12:05:18,491 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:APP_LAUNCHED]: applicationId=application_1455209830330_35852, appSubmitTime=1455642293317, launchTime=1455642316658
>>> 2016-02-16 12:05:18,492 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:AM_LAUNCHED]: appAttemptId=appattempt_1455209830330_35852_000001, appSubmitTime=1455642293317,
>>>
>>>
>>> … user specific code
>>>
>>> 2016-02-16 12:05:19,343 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Starting DAG submitted via RPC: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
>>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
>>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
>>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
>>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
>>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
>>> 2016-02-16 12:05:19,351 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type
>>> 2016-02-16 12:05:19,352 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
>>> 2016-02-16 12:05:19,354 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
>>> 2016-02-16 12:05:19,480 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Generating DAG graphviz file, dagId=dag_1455209830330_35852_1, filePath=/u01/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/dag_1455209830330_35852_1.dot
>>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] common.TezUtilsInternal: Redirecting log file based on addend: dag_1455209830330_35852_1
>>>
>>> —————————————
>>>
>>> DAG-log:
>>> --------------------------
>>>
>>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Running DAG: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>>> 2016-02-16 12:05:19,966 INFO [HistoryEventHandlingThread] impl.SimpleHistoryLoggingService: Writing event DAG_SUBMITTED to history file
>>>
>>> … user specific code
>>>
>>> 2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream) (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root logger level for hadoop task to WARN:
>>> 2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] hdfs.ClientContext: Existing client context 'default' does not match requested configuration.  Existing: shortCircuitStreamsCacheSize = 256, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = false, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = true, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000
>>> 2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>> 2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>>
>>> --------------------------
>>>
>>>
>>> Latest line of the (single) task:
>>> --------------------------
>>>
>>> ...
>>>
>>> 2016-02-16 12:07:00,166 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output using OutputSpec: { destinationVertexName=output, physicalEdgeCount=0, outputClassName=datameer.plugin.tez.output.DoNothingOutput }
>>> 2016-02-16 12:07:00,168 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Creating Input
>>> 2016-02-16 12:07:00,168 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Creating Output
>>> 2016-02-16 12:07:00,175 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output with dest edge: output
>>> 2016-02-16 12:07:00,176 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Initializing Input with src edge: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
>>> 2016-02-16 12:07:00,176 INFO [Initializer 0] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: INPUT, componentVertexName: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
>>> 2016-02-16 12:07:00,176 INFO [Initializer 1] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: OUTPUT, componentVertexName: output
>>> 2016-02-16 12:07:00,177 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initialized Output with dest edge: output
>>>
>>>
>>> … user specific code
>>>
>>> 2016-02-16 12:07:01,871 INFO [TezChild] util.LoggingUtil: Setting root
>>> logger level for hadoop task to WARN:
>>>
>>> 2016-02-16 12:07:03,990 WARN [TezChild] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>>
>>> —————————————
>>>
>>> The client log contains a shutdown exception:
>>> —————————————
>>>
>>> INFO [2016-02-16 12:13:30.412] [MrPlanRunnerV2] (TezClient.java:470) - Shutting down Tez Session, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
>>>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClientUtils.java:838) - Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
>>>
>>>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClient.java:484) - Failed to shutdown Tez Session via proxy
>>> org.apache.tez.dag.api.SessionNotRunning: Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
>>>
>>> 	at org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:839)
>>> 	at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:669)
>>> 	at org.apache.tez.client.TezClient.stop(TezClient.java:476)
>>> 	at datameer.plugin.tez.session.TezSessionImpl.close(TezSessionImpl.java:69)
>>> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSession(PoolingTezSessionFactory.java:173)
>>> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSessions(PoolingTezSessionFactory.java:167)
>>> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.close(PoolingTezSessionFactory.java:162)
>>> 	at datameer.com.google.common.io.Closer.close(Closer.java:214)
>>> 	at datameer.plugin.tez.smart.SmartClusterJobFlowResolver.close(SmartClusterJobFlowResolver.java:147)
>>> 	at datameer.com.google.common.io.Closer.close(Closer.java:214)
>>> 	at datameer.dap.sdk.util.IoUtil.close(IoUtil.java:330)
>>> 	at datameer.dap.common.graphv2.mixedframework.MixedClusterSession.commitJob(MixedClusterSession.java:65)
>>> 	at datameer.dap.common.graphv2.ClusterSession.commitJob(ClusterSession.java:81)
>>> 	at datameer.dap.common.graphv2.MrPlanRunnerV2.run(MrPlanRunnerV2.java:97)
>>> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>> 	at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:96)
>>> 	at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:173)
>>> 	at datameer.dap.common.security.RunAsThread.run(RunAsThread.java:22)
>>>  INFO [2016-02-16 12:13:30.582] [MrPlanRunnerV2] (TezClient.java:489) - Could not connect to AM, killing session via YARN, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
>>>
>>>
>>> —————————————
>>>
>>>
>>> The counter and the event shows that:
>>> - the correct amount of input records has been ingested
>>> - the expected amount of output records is missing (we have zero records
>>> produced which is reflected in both, the counters and the output itself)
>>> - for Tez the task and the DAG are both handled as successful
>>>
>>> The customers says that the symptoms of the job are always the same:
>>> - there is that standby warning
>>> - there is the problem with the shutdown
>>> - (and the data is missing)
>>>
>>> To me its still unclear if there are strange things going on with Tez or
>>> with our application logic.
>>> The same job is successful when executed with pure MapReduce.
>>> Is the standby thing something to worry about ?
>>>
>>> best
>>> Johannes
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
>
>


-- 
Best Regards

Jeff Zhang

Re: Strange behaviour of session DAG

Posted by Johannes Zillmann <jz...@googlemail.com>.
Looks like there is the same warning in MR as well:

016-02-23 11:59:24,568 WARN [main] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby

But in this case it doesn’t seem to have a bad effect…
Johannes

> On 25 Feb 2016, at 10:46, Jeff Zhang <zj...@gmail.com> wrote:
> 
> Might be, does only tez job affected ? What about other jobs like MR ?
> 
> On Thu, Feb 25, 2016 at 5:32 PM, Johannes Zillmann <jzillmann@googlemail.com <ma...@googlemail.com>> wrote:
> So yes, i think HA is enabled.
> But this whole behaviour is not a one shot thing, its happen repeatedly for this job. So my guess is its rather a unhappy configuration then the name node switching again and again !?
> 
> Johannes
> 
> 
>> On 25 Feb 2016, at 10:29, Jeff Zhang <zjffdu@gmail.com <ma...@gmail.com>> wrote:
>> 
>> The standby warning is because it try to connect to the standby node. (HA should be enabled I think).  Maybe this cause the output missing, could you attach the whole app log ?
>> 
>> On Thu, Feb 25, 2016 at 5:03 PM, Johannes Zillmann <jzillmann@googlemail.com <ma...@googlemail.com>> wrote:
>> Hey guys,
>> 
>> have a Tez job on a customer deployment which has very strange symptoms. Maybe you have some pointers for me what to look into!
>> 
>> App master log:
>> —————————————
>> 2016-02-16 12:05:16,666 INFO [main] app.DAGAppMaster: Created DAGAppMaster for application appattempt_1455209830330_35852_000001, versionInfo=[ component=tez-dag, version=0.6.0, revision=25e41bc0f29f5e25adbfc17258b00ad702a17888, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git <http://git-wip-us.apache.org/repos/asf/tez.git>, buildTime=2015-03-03T18:01:25Z ]
>> 2016-02-16 12:05:16,807 INFO [main] Configuration.deprecation: fs.default.name <http://fs.default.name/> is deprecated. Instead, use fs.defaultFS
>> 2016-02-16 12:05:16,816 INFO [main] Configuration.deprecation: dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode
>> 2016-02-16 12:05:16,987 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 2016-02-16 12:05:17,134 INFO [main] app.DAGAppMaster: Comparing client version with AM version, clientVersion=0.6.0, AMVersion=0.6.0
>> 2016-02-16 12:05:18,158 WARN [main] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>> 2016-02-16 12:05:18,298 INFO [main] app.DAGAppMaster: Adding session token to jobTokenSecretManager for application
>> 2016-02-16 12:05:18,305 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.container.AMContainerEventType for class org.apache.tez.dag.app.rm.container.AMContainerMap
>> 2016-02-16 12:05:18,306 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.node.AMNodeEventType for class org.apache.tez.dag.app.rm.node.AMNodeTracker
>> 2016-02-16 12:05:18,308 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGAppMasterEventType for class org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler
>> 2016-02-16 12:05:18,309 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGEventType for class org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher
>> 2016-02-16 12:05:18,310 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.VertexEventType for class org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher
>> 2016-02-16 12:05:18,311 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher
>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskAttemptEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher
>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.tez.dag.app.DAGAppMaster$4
>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler
>> 2016-02-16 12:05:18,357 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.AMSchedulerEventType for class org.apache.tez.dag.app.rm.TaskSchedulerEventHandler
>> 2016-02-16 12:05:18,359 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.NMCommunicatorEventType for class org.apache.tez.dag.app.launcher.ContainerLauncherImpl
>> 2016-02-16 12:05:18,483 INFO [main] node.AMNodeTracker: blacklistDisablePercent is 33, blacklistingEnabled: true, maxTaskFailuresPerNode: 10
>> 2016-02-16 12:05:18,485 INFO [main] launcher.ContainerLauncherImpl: Upper limit on the thread pool size is 500
>> 2016-02-16 12:05:18,485 INFO [main] history.HistoryEventHandler: Initializing HistoryEventHandler
>> 2016-02-16 12:05:18,489 INFO [main] impl.SimpleHistoryLoggingService: Log file location for SimpleHistoryLoggingService not specified, defaulting to containerLogDir=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001
>> 2016-02-16 12:05:18,490 INFO [main] impl.SimpleHistoryLoggingService: Initializing SimpleHistoryLoggingService, logFileLocation=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/history.txt.appattempt_1455209830330_35852_000001, maxErrors=10
>> 2016-02-16 12:05:18,490 INFO [main] recovery.RecoveryService: Initializing RecoveryService
>> 2016-02-16 12:05:18,491 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:APP_LAUNCHED]: applicationId=application_1455209830330_35852, appSubmitTime=1455642293317, launchTime=1455642316658
>> 2016-02-16 12:05:18,492 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:AM_LAUNCHED]: appAttemptId=appattempt_1455209830330_35852_000001, appSubmitTime=1455642293317, 
>> 
>> … user specific code
>> 
>> 2016-02-16 12:05:19,343 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Starting DAG submitted via RPC: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: fs.default.name <http://fs.default.name/> is deprecated. Instead, use fs.defaultFS
>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
>> 2016-02-16 12:05:19,351 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type
>> 2016-02-16 12:05:19,352 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
>> 2016-02-16 12:05:19,354 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
>> 2016-02-16 12:05:19,480 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Generating DAG graphviz file, dagId=dag_1455209830330_35852_1, filePath=/u01/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/dag_1455209830330_35852_1.dot
>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] common.TezUtilsInternal: Redirecting log file based on addend: dag_1455209830330_35852_1
>> —————————————
>> 
>> DAG-log:
>> --------------------------
>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Running DAG: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>> 2016-02-16 12:05:19,966 INFO [HistoryEventHandlingThread] impl.SimpleHistoryLoggingService: Writing event DAG_SUBMITTED to history file
>> … user specific code
>> 
>> 2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream) (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root logger level for hadoop task to WARN: 
>> 2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] hdfs.ClientContext: Existing client context 'default' does not match requested configuration.  Existing: shortCircuitStreamsCacheSize = 256, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = false, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = true, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000
>> 2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>> 2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>> --------------------------
>> 
>> 
>> Latest line of the (single) task:
>> --------------------------
>> ...
>> 2016-02-16 12:07:00,166 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output using OutputSpec: { destinationVertexName=output, physicalEdgeCount=0, outputClassName=datameer.plugin.tez.output.DoNothingOutput }
>> 2016-02-16 12:07:00,168 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Creating Input
>> 2016-02-16 12:07:00,168 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Creating Output
>> 2016-02-16 12:07:00,175 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output with dest edge: output
>> 2016-02-16 12:07:00,176 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Initializing Input with src edge: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
>> 2016-02-16 12:07:00,176 INFO [Initializer 0] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: INPUT, componentVertexName: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
>> 2016-02-16 12:07:00,176 INFO [Initializer 1] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: OUTPUT, componentVertexName: output
>> 2016-02-16 12:07:00,177 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initialized Output with dest edge: output
>> 
>> … user specific code
>> 
>> 2016-02-16 12:07:01,871 INFO [TezChild] util.LoggingUtil: Setting root logger level for hadoop task to WARN: 
>> 2016-02-16 12:07:03,990 WARN [TezChild] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>> —————————————
>> 
>> The client log contains a shutdown exception:
>> —————————————
>> INFO [2016-02-16 12:13:30.412] [MrPlanRunnerV2] (TezClient.java:470) - Shutting down Tez Session, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
>>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClientUtils.java:838) - Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/ <http://xxx:8088/proxy/application_1455209830330_35852/>, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
>> 
>>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClient.java:484) - Failed to shutdown Tez Session via proxy
>> org.apache.tez.dag.api.SessionNotRunning: Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/ <http://xxx:8088/proxy/application_1455209830330_35852/>, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
>> 
>> 	at org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:839)
>> 	at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:669)
>> 	at org.apache.tez.client.TezClient.stop(TezClient.java:476)
>> 	at datameer.plugin.tez.session.TezSessionImpl.close(TezSessionImpl.java:69)
>> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSession(PoolingTezSessionFactory.java:173)
>> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSessions(PoolingTezSessionFactory.java:167)
>> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.close(PoolingTezSessionFactory.java:162)
>> 	at datameer.com.google.common.io.Closer.close(Closer.java:214)
>> 	at datameer.plugin.tez.smart.SmartClusterJobFlowResolver.close(SmartClusterJobFlowResolver.java:147)
>> 	at datameer.com.google.common.io.Closer.close(Closer.java:214)
>> 	at datameer.dap.sdk.util.IoUtil.close(IoUtil.java:330)
>> 	at datameer.dap.common.graphv2.mixedframework.MixedClusterSession.commitJob(MixedClusterSession.java:65)
>> 	at datameer.dap.common.graphv2.ClusterSession.commitJob(ClusterSession.java:81)
>> 	at datameer.dap.common.graphv2.MrPlanRunnerV2.run(MrPlanRunnerV2.java:97)
>> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> 	at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:96)
>> 	at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:173)
>> 	at datameer.dap.common.security.RunAsThread.run(RunAsThread.java:22)
>>  INFO [2016-02-16 12:13:30.582] [MrPlanRunnerV2] (TezClient.java:489) - Could not connect to AM, killing session via YARN, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
>> 
>> —————————————
>> 
>> 
>> The counter and the event shows that:
>> - the correct amount of input records has been ingested
>> - the expected amount of output records is missing (we have zero records produced which is reflected in both, the counters and the output itself)
>> - for Tez the task and the DAG are both handled as successful
>> 
>> The customers says that the symptoms of the job are always the same:
>> - there is that standby warning
>> - there is the problem with the shutdown
>> - (and the data is missing)
>> 
>> To me its still unclear if there are strange things going on with Tez or with our application logic.
>> The same job is successful when executed with pure MapReduce.
>> Is the standby thing something to worry about ?
>> 
>> best
>> Johannes
>> 
>> 
>> 
>> -- 
>> Best Regards
>> 
>> Jeff Zhang
> 
> 
> 
> 
> -- 
> Best Regards
> 
> Jeff Zhang


Re: Strange behaviour of session DAG

Posted by Jeff Zhang <zj...@gmail.com>.
Might be, does only tez job affected ? What about other jobs like MR ?

On Thu, Feb 25, 2016 at 5:32 PM, Johannes Zillmann <jzillmann@googlemail.com
> wrote:

> So yes, i think HA is enabled.
> But this whole behaviour is not a one shot thing, its happen repeatedly
> for this job. So my guess is its rather a unhappy configuration then the
> name node switching again and again !?
>
> Johannes
>
>
> On 25 Feb 2016, at 10:29, Jeff Zhang <zj...@gmail.com> wrote:
>
> The standby warning is because it try to connect to the standby node. (HA
> should be enabled I think).  Maybe this cause the output missing, could you
> attach the whole app log ?
>
> On Thu, Feb 25, 2016 at 5:03 PM, Johannes Zillmann <
> jzillmann@googlemail.com> wrote:
>
>> Hey guys,
>>
>> have a Tez job on a customer deployment which has very strange symptoms.
>> Maybe you have some pointers for me what to look into!
>>
>> App master log:
>> —————————————
>>
>> 2016-02-16 12:05:16,666 INFO [main] app.DAGAppMaster: Created DAGAppMaster for application appattempt_1455209830330_35852_000001, versionInfo=[ component=tez-dag, version=0.6.0, revision=25e41bc0f29f5e25adbfc17258b00ad702a17888, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2015-03-03T18:01:25Z ]
>> 2016-02-16 12:05:16,807 INFO [main] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
>> 2016-02-16 12:05:16,816 INFO [main] Configuration.deprecation: dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode
>> 2016-02-16 12:05:16,987 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 2016-02-16 12:05:17,134 INFO [main] app.DAGAppMaster: Comparing client version with AM version, clientVersion=0.6.0, AMVersion=0.6.0
>> 2016-02-16 12:05:18,158 WARN [main] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>> 2016-02-16 12:05:18,298 INFO [main] app.DAGAppMaster: Adding session token to jobTokenSecretManager for application
>> 2016-02-16 12:05:18,305 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.container.AMContainerEventType for class org.apache.tez.dag.app.rm.container.AMContainerMap
>> 2016-02-16 12:05:18,306 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.node.AMNodeEventType for class org.apache.tez.dag.app.rm.node.AMNodeTracker
>> 2016-02-16 12:05:18,308 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGAppMasterEventType for class org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler
>> 2016-02-16 12:05:18,309 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGEventType for class org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher
>> 2016-02-16 12:05:18,310 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.VertexEventType for class org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher
>> 2016-02-16 12:05:18,311 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher
>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskAttemptEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher
>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.tez.dag.app.DAGAppMaster$4
>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler
>> 2016-02-16 12:05:18,357 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.AMSchedulerEventType for class org.apache.tez.dag.app.rm.TaskSchedulerEventHandler
>> 2016-02-16 12:05:18,359 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.NMCommunicatorEventType for class org.apache.tez.dag.app.launcher.ContainerLauncherImpl
>> 2016-02-16 12:05:18,483 INFO [main] node.AMNodeTracker: blacklistDisablePercent is 33, blacklistingEnabled: true, maxTaskFailuresPerNode: 10
>> 2016-02-16 12:05:18,485 INFO [main] launcher.ContainerLauncherImpl: Upper limit on the thread pool size is 500
>> 2016-02-16 12:05:18,485 INFO [main] history.HistoryEventHandler: Initializing HistoryEventHandler
>> 2016-02-16 12:05:18,489 INFO [main] impl.SimpleHistoryLoggingService: Log file location for SimpleHistoryLoggingService not specified, defaulting to containerLogDir=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001
>> 2016-02-16 12:05:18,490 INFO [main] impl.SimpleHistoryLoggingService: Initializing SimpleHistoryLoggingService, logFileLocation=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/history.txt.appattempt_1455209830330_35852_000001, maxErrors=10
>> 2016-02-16 12:05:18,490 INFO [main] recovery.RecoveryService: Initializing RecoveryService
>> 2016-02-16 12:05:18,491 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:APP_LAUNCHED]: applicationId=application_1455209830330_35852, appSubmitTime=1455642293317, launchTime=1455642316658
>> 2016-02-16 12:05:18,492 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:AM_LAUNCHED]: appAttemptId=appattempt_1455209830330_35852_000001, appSubmitTime=1455642293317,
>>
>>
>> … user specific code
>>
>> 2016-02-16 12:05:19,343 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Starting DAG submitted via RPC: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
>> 2016-02-16 12:05:19,351 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type
>> 2016-02-16 12:05:19,352 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
>> 2016-02-16 12:05:19,354 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
>> 2016-02-16 12:05:19,480 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Generating DAG graphviz file, dagId=dag_1455209830330_35852_1, filePath=/u01/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/dag_1455209830330_35852_1.dot
>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] common.TezUtilsInternal: Redirecting log file based on addend: dag_1455209830330_35852_1
>>
>> —————————————
>>
>> DAG-log:
>> --------------------------
>>
>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Running DAG: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>> 2016-02-16 12:05:19,966 INFO [HistoryEventHandlingThread] impl.SimpleHistoryLoggingService: Writing event DAG_SUBMITTED to history file
>>
>> … user specific code
>>
>> 2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream) (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root logger level for hadoop task to WARN:
>> 2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] hdfs.ClientContext: Existing client context 'default' does not match requested configuration.  Existing: shortCircuitStreamsCacheSize = 256, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = false, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = true, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000
>> 2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>> 2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>
>> --------------------------
>>
>>
>> Latest line of the (single) task:
>> --------------------------
>>
>> ...
>>
>> 2016-02-16 12:07:00,166 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output using OutputSpec: { destinationVertexName=output, physicalEdgeCount=0, outputClassName=datameer.plugin.tez.output.DoNothingOutput }
>> 2016-02-16 12:07:00,168 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Creating Input
>> 2016-02-16 12:07:00,168 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Creating Output
>> 2016-02-16 12:07:00,175 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output with dest edge: output
>> 2016-02-16 12:07:00,176 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Initializing Input with src edge: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
>> 2016-02-16 12:07:00,176 INFO [Initializer 0] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: INPUT, componentVertexName: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
>> 2016-02-16 12:07:00,176 INFO [Initializer 1] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: OUTPUT, componentVertexName: output
>> 2016-02-16 12:07:00,177 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initialized Output with dest edge: output
>>
>>
>> … user specific code
>>
>> 2016-02-16 12:07:01,871 INFO [TezChild] util.LoggingUtil: Setting root
>> logger level for hadoop task to WARN:
>>
>> 2016-02-16 12:07:03,990 WARN [TezChild] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>
>> —————————————
>>
>> The client log contains a shutdown exception:
>> —————————————
>>
>> INFO [2016-02-16 12:13:30.412] [MrPlanRunnerV2] (TezClient.java:470) - Shutting down Tez Session, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
>>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClientUtils.java:838) - Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
>>
>>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClient.java:484) - Failed to shutdown Tez Session via proxy
>> org.apache.tez.dag.api.SessionNotRunning: Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
>>
>> 	at org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:839)
>> 	at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:669)
>> 	at org.apache.tez.client.TezClient.stop(TezClient.java:476)
>> 	at datameer.plugin.tez.session.TezSessionImpl.close(TezSessionImpl.java:69)
>> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSession(PoolingTezSessionFactory.java:173)
>> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSessions(PoolingTezSessionFactory.java:167)
>> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.close(PoolingTezSessionFactory.java:162)
>> 	at datameer.com.google.common.io.Closer.close(Closer.java:214)
>> 	at datameer.plugin.tez.smart.SmartClusterJobFlowResolver.close(SmartClusterJobFlowResolver.java:147)
>> 	at datameer.com.google.common.io.Closer.close(Closer.java:214)
>> 	at datameer.dap.sdk.util.IoUtil.close(IoUtil.java:330)
>> 	at datameer.dap.common.graphv2.mixedframework.MixedClusterSession.commitJob(MixedClusterSession.java:65)
>> 	at datameer.dap.common.graphv2.ClusterSession.commitJob(ClusterSession.java:81)
>> 	at datameer.dap.common.graphv2.MrPlanRunnerV2.run(MrPlanRunnerV2.java:97)
>> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> 	at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:96)
>> 	at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:173)
>> 	at datameer.dap.common.security.RunAsThread.run(RunAsThread.java:22)
>>  INFO [2016-02-16 12:13:30.582] [MrPlanRunnerV2] (TezClient.java:489) - Could not connect to AM, killing session via YARN, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
>>
>>
>> —————————————
>>
>>
>> The counter and the event shows that:
>> - the correct amount of input records has been ingested
>> - the expected amount of output records is missing (we have zero records
>> produced which is reflected in both, the counters and the output itself)
>> - for Tez the task and the DAG are both handled as successful
>>
>> The customers says that the symptoms of the job are always the same:
>> - there is that standby warning
>> - there is the problem with the shutdown
>> - (and the data is missing)
>>
>> To me its still unclear if there are strange things going on with Tez or
>> with our application logic.
>> The same job is successful when executed with pure MapReduce.
>> Is the standby thing something to worry about ?
>>
>> best
>> Johannes
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
>
>


-- 
Best Regards

Jeff Zhang

Re: Strange behaviour of session DAG

Posted by Johannes Zillmann <jz...@googlemail.com>.
So yes, i think HA is enabled.
But this whole behaviour is not a one shot thing, its happen repeatedly for this job. So my guess is its rather a unhappy configuration then the name node switching again and again !?

Johannes


> On 25 Feb 2016, at 10:29, Jeff Zhang <zj...@gmail.com> wrote:
> 
> The standby warning is because it try to connect to the standby node. (HA should be enabled I think).  Maybe this cause the output missing, could you attach the whole app log ?
> 
> On Thu, Feb 25, 2016 at 5:03 PM, Johannes Zillmann <jzillmann@googlemail.com <ma...@googlemail.com>> wrote:
> Hey guys,
> 
> have a Tez job on a customer deployment which has very strange symptoms. Maybe you have some pointers for me what to look into!
> 
> App master log:
> —————————————
> 2016-02-16 12:05:16,666 INFO [main] app.DAGAppMaster: Created DAGAppMaster for application appattempt_1455209830330_35852_000001, versionInfo=[ component=tez-dag, version=0.6.0, revision=25e41bc0f29f5e25adbfc17258b00ad702a17888, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git <http://git-wip-us.apache.org/repos/asf/tez.git>, buildTime=2015-03-03T18:01:25Z ]
> 2016-02-16 12:05:16,807 INFO [main] Configuration.deprecation: fs.default.name <http://fs.default.name/> is deprecated. Instead, use fs.defaultFS
> 2016-02-16 12:05:16,816 INFO [main] Configuration.deprecation: dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode
> 2016-02-16 12:05:16,987 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2016-02-16 12:05:17,134 INFO [main] app.DAGAppMaster: Comparing client version with AM version, clientVersion=0.6.0, AMVersion=0.6.0
> 2016-02-16 12:05:18,158 WARN [main] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> 2016-02-16 12:05:18,298 INFO [main] app.DAGAppMaster: Adding session token to jobTokenSecretManager for application
> 2016-02-16 12:05:18,305 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.container.AMContainerEventType for class org.apache.tez.dag.app.rm.container.AMContainerMap
> 2016-02-16 12:05:18,306 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.node.AMNodeEventType for class org.apache.tez.dag.app.rm.node.AMNodeTracker
> 2016-02-16 12:05:18,308 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGAppMasterEventType for class org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler
> 2016-02-16 12:05:18,309 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGEventType for class org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher
> 2016-02-16 12:05:18,310 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.VertexEventType for class org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher
> 2016-02-16 12:05:18,311 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher
> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskAttemptEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher
> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.tez.dag.app.DAGAppMaster$4
> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler
> 2016-02-16 12:05:18,357 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.AMSchedulerEventType for class org.apache.tez.dag.app.rm.TaskSchedulerEventHandler
> 2016-02-16 12:05:18,359 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.NMCommunicatorEventType for class org.apache.tez.dag.app.launcher.ContainerLauncherImpl
> 2016-02-16 12:05:18,483 INFO [main] node.AMNodeTracker: blacklistDisablePercent is 33, blacklistingEnabled: true, maxTaskFailuresPerNode: 10
> 2016-02-16 12:05:18,485 INFO [main] launcher.ContainerLauncherImpl: Upper limit on the thread pool size is 500
> 2016-02-16 12:05:18,485 INFO [main] history.HistoryEventHandler: Initializing HistoryEventHandler
> 2016-02-16 12:05:18,489 INFO [main] impl.SimpleHistoryLoggingService: Log file location for SimpleHistoryLoggingService not specified, defaulting to containerLogDir=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001
> 2016-02-16 12:05:18,490 INFO [main] impl.SimpleHistoryLoggingService: Initializing SimpleHistoryLoggingService, logFileLocation=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/history.txt.appattempt_1455209830330_35852_000001, maxErrors=10
> 2016-02-16 12:05:18,490 INFO [main] recovery.RecoveryService: Initializing RecoveryService
> 2016-02-16 12:05:18,491 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:APP_LAUNCHED]: applicationId=application_1455209830330_35852, appSubmitTime=1455642293317, launchTime=1455642316658
> 2016-02-16 12:05:18,492 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:AM_LAUNCHED]: appAttemptId=appattempt_1455209830330_35852_000001, appSubmitTime=1455642293317, 
> 
> … user specific code
> 
> 2016-02-16 12:05:19,343 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Starting DAG submitted via RPC: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: fs.default.name <http://fs.default.name/> is deprecated. Instead, use fs.defaultFS
> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
> 2016-02-16 12:05:19,351 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type
> 2016-02-16 12:05:19,352 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
> 2016-02-16 12:05:19,354 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
> 2016-02-16 12:05:19,480 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Generating DAG graphviz file, dagId=dag_1455209830330_35852_1, filePath=/u01/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/dag_1455209830330_35852_1.dot
> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] common.TezUtilsInternal: Redirecting log file based on addend: dag_1455209830330_35852_1
> —————————————
> 
> DAG-log:
> --------------------------
> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Running DAG: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
> 2016-02-16 12:05:19,966 INFO [HistoryEventHandlingThread] impl.SimpleHistoryLoggingService: Writing event DAG_SUBMITTED to history file
> … user specific code
> 
> 2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream) (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root logger level for hadoop task to WARN: 
> 2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] hdfs.ClientContext: Existing client context 'default' does not match requested configuration.  Existing: shortCircuitStreamsCacheSize = 256, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = false, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = true, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000
> 2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> 2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> --------------------------
> 
> 
> Latest line of the (single) task:
> --------------------------
> ...
> 2016-02-16 12:07:00,166 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output using OutputSpec: { destinationVertexName=output, physicalEdgeCount=0, outputClassName=datameer.plugin.tez.output.DoNothingOutput }
> 2016-02-16 12:07:00,168 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Creating Input
> 2016-02-16 12:07:00,168 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Creating Output
> 2016-02-16 12:07:00,175 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output with dest edge: output
> 2016-02-16 12:07:00,176 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Initializing Input with src edge: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
> 2016-02-16 12:07:00,176 INFO [Initializer 0] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: INPUT, componentVertexName: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
> 2016-02-16 12:07:00,176 INFO [Initializer 1] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: OUTPUT, componentVertexName: output
> 2016-02-16 12:07:00,177 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initialized Output with dest edge: output
> 
> … user specific code
> 
> 2016-02-16 12:07:01,871 INFO [TezChild] util.LoggingUtil: Setting root logger level for hadoop task to WARN: 
> 2016-02-16 12:07:03,990 WARN [TezChild] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> —————————————
> 
> The client log contains a shutdown exception:
> —————————————
> INFO [2016-02-16 12:13:30.412] [MrPlanRunnerV2] (TezClient.java:470) - Shutting down Tez Session, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClientUtils.java:838) - Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/ <http://xxx:8088/proxy/application_1455209830330_35852/>, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
> 
>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClient.java:484) - Failed to shutdown Tez Session via proxy
> org.apache.tez.dag.api.SessionNotRunning: Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/ <http://xxx:8088/proxy/application_1455209830330_35852/>, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
> 
> 	at org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:839)
> 	at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:669)
> 	at org.apache.tez.client.TezClient.stop(TezClient.java:476)
> 	at datameer.plugin.tez.session.TezSessionImpl.close(TezSessionImpl.java:69)
> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSession(PoolingTezSessionFactory.java:173)
> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSessions(PoolingTezSessionFactory.java:167)
> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.close(PoolingTezSessionFactory.java:162)
> 	at datameer.com.google.common.io.Closer.close(Closer.java:214)
> 	at datameer.plugin.tez.smart.SmartClusterJobFlowResolver.close(SmartClusterJobFlowResolver.java:147)
> 	at datameer.com.google.common.io.Closer.close(Closer.java:214)
> 	at datameer.dap.sdk.util.IoUtil.close(IoUtil.java:330)
> 	at datameer.dap.common.graphv2.mixedframework.MixedClusterSession.commitJob(MixedClusterSession.java:65)
> 	at datameer.dap.common.graphv2.ClusterSession.commitJob(ClusterSession.java:81)
> 	at datameer.dap.common.graphv2.MrPlanRunnerV2.run(MrPlanRunnerV2.java:97)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> 	at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:96)
> 	at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:173)
> 	at datameer.dap.common.security.RunAsThread.run(RunAsThread.java:22)
>  INFO [2016-02-16 12:13:30.582] [MrPlanRunnerV2] (TezClient.java:489) - Could not connect to AM, killing session via YARN, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
> 
> —————————————
> 
> 
> The counter and the event shows that:
> - the correct amount of input records has been ingested
> - the expected amount of output records is missing (we have zero records produced which is reflected in both, the counters and the output itself)
> - for Tez the task and the DAG are both handled as successful
> 
> The customers says that the symptoms of the job are always the same:
> - there is that standby warning
> - there is the problem with the shutdown
> - (and the data is missing)
> 
> To me its still unclear if there are strange things going on with Tez or with our application logic.
> The same job is successful when executed with pure MapReduce.
> Is the standby thing something to worry about ?
> 
> best
> Johannes
> 
> 
> 
> -- 
> Best Regards
> 
> Jeff Zhang


Re: Strange behaviour of session DAG

Posted by Jeff Zhang <zj...@gmail.com>.
The standby warning is because it try to connect to the standby node. (HA
should be enabled I think).  Maybe this cause the output missing, could you
attach the whole app log ?

On Thu, Feb 25, 2016 at 5:03 PM, Johannes Zillmann <jzillmann@googlemail.com
> wrote:

> Hey guys,
>
> have a Tez job on a customer deployment which has very strange symptoms.
> Maybe you have some pointers for me what to look into!
>
> App master log:
> —————————————
>
> 2016-02-16 12:05:16,666 INFO [main] app.DAGAppMaster: Created DAGAppMaster for application appattempt_1455209830330_35852_000001, versionInfo=[ component=tez-dag, version=0.6.0, revision=25e41bc0f29f5e25adbfc17258b00ad702a17888, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2015-03-03T18:01:25Z ]
> 2016-02-16 12:05:16,807 INFO [main] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
> 2016-02-16 12:05:16,816 INFO [main] Configuration.deprecation: dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode
> 2016-02-16 12:05:16,987 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2016-02-16 12:05:17,134 INFO [main] app.DAGAppMaster: Comparing client version with AM version, clientVersion=0.6.0, AMVersion=0.6.0
> 2016-02-16 12:05:18,158 WARN [main] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> 2016-02-16 12:05:18,298 INFO [main] app.DAGAppMaster: Adding session token to jobTokenSecretManager for application
> 2016-02-16 12:05:18,305 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.container.AMContainerEventType for class org.apache.tez.dag.app.rm.container.AMContainerMap
> 2016-02-16 12:05:18,306 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.node.AMNodeEventType for class org.apache.tez.dag.app.rm.node.AMNodeTracker
> 2016-02-16 12:05:18,308 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGAppMasterEventType for class org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler
> 2016-02-16 12:05:18,309 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGEventType for class org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher
> 2016-02-16 12:05:18,310 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.VertexEventType for class org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher
> 2016-02-16 12:05:18,311 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher
> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskAttemptEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher
> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.tez.dag.app.DAGAppMaster$4
> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler
> 2016-02-16 12:05:18,357 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.AMSchedulerEventType for class org.apache.tez.dag.app.rm.TaskSchedulerEventHandler
> 2016-02-16 12:05:18,359 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.NMCommunicatorEventType for class org.apache.tez.dag.app.launcher.ContainerLauncherImpl
> 2016-02-16 12:05:18,483 INFO [main] node.AMNodeTracker: blacklistDisablePercent is 33, blacklistingEnabled: true, maxTaskFailuresPerNode: 10
> 2016-02-16 12:05:18,485 INFO [main] launcher.ContainerLauncherImpl: Upper limit on the thread pool size is 500
> 2016-02-16 12:05:18,485 INFO [main] history.HistoryEventHandler: Initializing HistoryEventHandler
> 2016-02-16 12:05:18,489 INFO [main] impl.SimpleHistoryLoggingService: Log file location for SimpleHistoryLoggingService not specified, defaulting to containerLogDir=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001
> 2016-02-16 12:05:18,490 INFO [main] impl.SimpleHistoryLoggingService: Initializing SimpleHistoryLoggingService, logFileLocation=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/history.txt.appattempt_1455209830330_35852_000001, maxErrors=10
> 2016-02-16 12:05:18,490 INFO [main] recovery.RecoveryService: Initializing RecoveryService
> 2016-02-16 12:05:18,491 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:APP_LAUNCHED]: applicationId=application_1455209830330_35852, appSubmitTime=1455642293317, launchTime=1455642316658
> 2016-02-16 12:05:18,492 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:AM_LAUNCHED]: appAttemptId=appattempt_1455209830330_35852_000001, appSubmitTime=1455642293317,
>
>
> … user specific code
>
> 2016-02-16 12:05:19,343 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Starting DAG submitted via RPC: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
> 2016-02-16 12:05:19,351 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type
> 2016-02-16 12:05:19,352 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
> 2016-02-16 12:05:19,354 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
> 2016-02-16 12:05:19,480 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Generating DAG graphviz file, dagId=dag_1455209830330_35852_1, filePath=/u01/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/dag_1455209830330_35852_1.dot
> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] common.TezUtilsInternal: Redirecting log file based on addend: dag_1455209830330_35852_1
>
> —————————————
>
> DAG-log:
> --------------------------
>
> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Running DAG: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
> 2016-02-16 12:05:19,966 INFO [HistoryEventHandlingThread] impl.SimpleHistoryLoggingService: Writing event DAG_SUBMITTED to history file
>
> … user specific code
>
> 2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream) (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root logger level for hadoop task to WARN:
> 2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] hdfs.ClientContext: Existing client context 'default' does not match requested configuration.  Existing: shortCircuitStreamsCacheSize = 256, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = false, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = true, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000
> 2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> 2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>
> --------------------------
>
>
> Latest line of the (single) task:
> --------------------------
>
> ...
>
> 2016-02-16 12:07:00,166 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output using OutputSpec: { destinationVertexName=output, physicalEdgeCount=0, outputClassName=datameer.plugin.tez.output.DoNothingOutput }
> 2016-02-16 12:07:00,168 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Creating Input
> 2016-02-16 12:07:00,168 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Creating Output
> 2016-02-16 12:07:00,175 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output with dest edge: output
> 2016-02-16 12:07:00,176 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Initializing Input with src edge: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
> 2016-02-16 12:07:00,176 INFO [Initializer 0] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: INPUT, componentVertexName: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
> 2016-02-16 12:07:00,176 INFO [Initializer 1] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: OUTPUT, componentVertexName: output
> 2016-02-16 12:07:00,177 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initialized Output with dest edge: output
>
>
> … user specific code
>
> 2016-02-16 12:07:01,871 INFO [TezChild] util.LoggingUtil: Setting root
> logger level for hadoop task to WARN:
>
> 2016-02-16 12:07:03,990 WARN [TezChild] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>
> —————————————
>
> The client log contains a shutdown exception:
> —————————————
>
> INFO [2016-02-16 12:13:30.412] [MrPlanRunnerV2] (TezClient.java:470) - Shutting down Tez Session, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClientUtils.java:838) - Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
>
>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClient.java:484) - Failed to shutdown Tez Session via proxy
> org.apache.tez.dag.api.SessionNotRunning: Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
>
> 	at org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:839)
> 	at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:669)
> 	at org.apache.tez.client.TezClient.stop(TezClient.java:476)
> 	at datameer.plugin.tez.session.TezSessionImpl.close(TezSessionImpl.java:69)
> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSession(PoolingTezSessionFactory.java:173)
> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSessions(PoolingTezSessionFactory.java:167)
> 	at datameer.plugin.tez.session.PoolingTezSessionFactory.close(PoolingTezSessionFactory.java:162)
> 	at datameer.com.google.common.io.Closer.close(Closer.java:214)
> 	at datameer.plugin.tez.smart.SmartClusterJobFlowResolver.close(SmartClusterJobFlowResolver.java:147)
> 	at datameer.com.google.common.io.Closer.close(Closer.java:214)
> 	at datameer.dap.sdk.util.IoUtil.close(IoUtil.java:330)
> 	at datameer.dap.common.graphv2.mixedframework.MixedClusterSession.commitJob(MixedClusterSession.java:65)
> 	at datameer.dap.common.graphv2.ClusterSession.commitJob(ClusterSession.java:81)
> 	at datameer.dap.common.graphv2.MrPlanRunnerV2.run(MrPlanRunnerV2.java:97)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> 	at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:96)
> 	at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:173)
> 	at datameer.dap.common.security.RunAsThread.run(RunAsThread.java:22)
>  INFO [2016-02-16 12:13:30.582] [MrPlanRunnerV2] (TezClient.java:489) - Could not connect to AM, killing session via YARN, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
>
>
> —————————————
>
>
> The counter and the event shows that:
> - the correct amount of input records has been ingested
> - the expected amount of output records is missing (we have zero records
> produced which is reflected in both, the counters and the output itself)
> - for Tez the task and the DAG are both handled as successful
>
> The customers says that the symptoms of the job are always the same:
> - there is that standby warning
> - there is the problem with the shutdown
> - (and the data is missing)
>
> To me its still unclear if there are strange things going on with Tez or
> with our application logic.
> The same job is successful when executed with pure MapReduce.
> Is the standby thing something to worry about ?
>
> best
> Johannes
>



-- 
Best Regards

Jeff Zhang

Re: Strange behaviour of session DAG

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> Operation category READ is not supported in state standby

Quick q - is the system running NameNode-HA?

Something like QJM or WanDISCO?

This looks like the FileOutputCommitter is running when the NameNode does
a "switch places!".

Cheers,
Gopal