Posted to commits@samza.apache.org by "Chris Riccomini (JIRA)" <ji...@apache.org> on 2014/08/14 22:06:18 UTC

[jira] [Commented] (SAMZA-376) ApplicationMaster Timeout after LeaderNotAvailableException

    [ https://issues.apache.org/jira/browse/SAMZA-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14097543#comment-14097543 ] 

Chris Riccomini commented on SAMZA-376:
---------------------------------------

[~zjshen], does this sound accurate to you? It seems possible to me. We make a blocking call to Kafka in SamzaAppMaster before we call amClient.start. If the Kafka calls take a long time (say, several minutes), would it lead to this behavior?
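
To make the suspected ordering problem concrete, here is a minimal, self-contained sketch. It is not the actual SamzaAppMaster code: the class HeartbeatOrdering, the method heartbeatsWhileBlocked, and the sleep that stands in for the blocking Kafka metadata call are all hypothetical, and a scheduled task stands in for the heartbeat that amClient.start kicks off. It only counts how many heartbeats go out while the blocking call is in progress, for each of the two orderings.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; none of these names come from Samza or YARN.
final class HeartbeatOrdering {

    // Stand-in for the blocking Kafka metadata lookup (assumed to simply block).
    static void blockingMetadataCall(long blockMillis) {
        try {
            Thread.sleep(blockMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
    }

    // Returns how many heartbeats were sent while the metadata call was blocked.
    static int heartbeatsWhileBlocked(boolean startHeartbeatFirst,
                                      long blockMillis,
                                      long intervalMillis) {
        AtomicInteger heartbeats = new AtomicInteger();
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        try {
            if (startHeartbeatFirst) {
                // Analogous to calling amClient.start before touching Kafka.
                scheduler.scheduleAtFixedRate(heartbeats::incrementAndGet,
                        0, intervalMillis, TimeUnit.MILLISECONDS);
            }
            blockingMetadataCall(blockMillis);
            // In the other ordering the heartbeat would only be started here,
            // after the blocking call has already returned.
            return heartbeats.get();
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("blocking call first: "
                + heartbeatsWhileBlocked(false, 200, 20) + " heartbeats");
        System.out.println("heartbeat first:     "
                + heartbeatsWhileBlocked(true, 200, 20) + " heartbeats");
    }
}
```

If the blocking call outlasts the RM's liveness timeout, the first ordering sends nothing for the whole wait, which would match the timeout described in this issue. Whether the heartbeat can safely be started before the task manager is constructed is an open question, not something this sketch establishes.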

> ApplicationMaster Timeout after LeaderNotAvailableException
> -----------------------------------------------------------
>
>                 Key: SAMZA-376
>                 URL: https://issues.apache.org/jira/browse/SAMZA-376
>             Project: Samza
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Nicolas Bär
>            Priority: Minor
>
> The application master does not send a heartbeat to the resource manager if the leader of the topic is not available. It retries until the leader is available and only then sends the heartbeat. If the Kafka cluster is busy during this time, the leader election can take long enough that the heartbeat timeout is reached, resulting in a shutdown of the application master.
> I hit this issue on our testbed and received a few follow-up error messages after the application master was restarted: 
> {quote}
> ERROR security.UserGroupInformation: PriviledgedActionException as:baer (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Password not found for ApplicationAttempt appattempt_1407522131931_0001_000001
> {quote}
> I will investigate this further, but I assume it is better suited to the YARN mailing list.
> Here is the relevant part from our discussion on IRC (criccomini):
> {quote}
> SamzaAppMaster
> you'll see:       amClient.start
> and later,       amClient.stop
> the start is starting the YARN AMClient's heartbeat
> now
> SamzaAppMasterTaskManager
> calls assignContainerToSSPTaskNames
> in Util
> which calls Util.getInputStreamPartitions(config)
> and THAT is where Kafka is called
> so basically
> before amClient.start is called
> that getInputStreamPartitions method is invoked
> which will block on metadata timeouts
> until it can get the data it needs
> so SamzaAppMaster is constructing SamzaAppMasterTaskManager before it calls amClient.start
> {quote}



--
This message was sent by Atlassian JIRA
(v6.2#6252)