Posted to commits@samza.apache.org by "Navina Ramesh (JIRA)" <ji...@apache.org> on 2017/03/31 17:43:41 UTC

[jira] [Commented] (SAMZA-1181) Fix AppMaster hang after submitting jobs to Yarn

    [ https://issues.apache.org/jira/browse/SAMZA-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951373#comment-15951373 ] 

Navina Ramesh commented on SAMZA-1181:
--------------------------------------

[~xinyu] Based on this explanation, wouldn't the correct fix be to remove the locality manager from JobModel? This issue has already been identified in SAMZA-889. I am linking the SAMZA-899 JIRA, as that one makes the fix. This JIRA merely reverts a buggy patch that was committed earlier.

> Fix AppMaster hang after submitting jobs to Yarn
> ------------------------------------------------
>
>                 Key: SAMZA-1181
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1181
>             Project: Samza
>          Issue Type: Bug
>    Affects Versions: 0.13.0
>            Reporter: Xinyu Liu
>            Assignee: Shanthoosh Venkataraman
>            Priority: Blocker
>
> Currently, when a job is submitted to Yarn, it hangs after the AppMaster is created. The log shows that it hangs while bootstrapping from the coordinator stream. Further debugging shows that the job hangs during the second bootstrap, while reading locality data from LocalityManager. The sequence is as follows:
> 1. JobModelManager creates a CoordinatorStreamConsumer and bootstraps it.
> 2. LocalityManager writes locality info into the coordinator stream.
> 3. JobModelManager closes the CoordinatorStreamConsumer (*).
> 4. Later, LocalityManager bootstraps the CoordinatorStreamConsumer again.
> Step 3 is the problem. Since the CoordinatorStreamConsumer is still held by LocalityManager, it must not be closed at that point. Step 3 was introduced in SAMZA-1154 as part of refactoring JobModelManager for the task REST endpoint. To fix this issue, we will revert the step 3 change.
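
The four-step sequence in the description can be illustrated with a minimal Java sketch. This is not Samza code: StubConsumer and StubLocalityReader are hypothetical stand-ins for CoordinatorStreamConsumer and LocalityManager, and the stub throws on use-after-close so that the example terminates, whereas the issue reports a hang.

```java
public class SharedConsumerSketch {
    /** Hypothetical stand-in for a coordinator stream consumer; not Samza's class. */
    static class StubConsumer {
        private boolean open = true;

        void bootstrap() {
            // Assumption: a closed consumer fails fast here. The real issue reports a hang.
            if (!open) {
                throw new IllegalStateException("consumer already closed");
            }
        }

        void close() {
            open = false;
        }
    }

    /** Hypothetical stand-in for a component that keeps a reference to the shared consumer. */
    static class StubLocalityReader {
        private final StubConsumer consumer;

        StubLocalityReader(StubConsumer consumer) {
            this.consumer = consumer;
        }

        void readLocality() {
            consumer.bootstrap(); // step 4: second bootstrap on the shared consumer
        }
    }

    /** Replays steps 1 to 4; closeEarly toggles the premature close of step 3. */
    static String run(boolean closeEarly) {
        StubConsumer consumer = new StubConsumer();
        consumer.bootstrap();                                         // step 1
        StubLocalityReader reader = new StubLocalityReader(consumer); // step 2: second holder
        if (closeEarly) {
            consumer.close();                                         // step 3: premature close
        }
        try {
            reader.readLocality();                                    // step 4
            return "ok";
        } catch (IllegalStateException e) {
            return "failed: " + e.getMessage();
        } finally {
            consumer.close(); // close only after every holder is done with it
        }
    }

    public static void main(String[] args) {
        System.out.println(run(true));  // prints "failed: consumer already closed"
        System.out.println(run(false)); // prints "ok"
    }
}
```

In this sketch, skipping the early close corresponds to reverting step 3, while giving the reader its own consumer would correspond to decoupling the two components.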
