Posted to commits@samza.apache.org by "Yi Pan (Data Infrastructure) (JIRA)" <ji...@apache.org> on 2015/07/31 20:33:07 UTC

[jira] [Commented] (SAMZA-750) Run YARN RM Recovery test to uncover any potential issues with SamzaAppMaster

    [ https://issues.apache.org/jira/browse/SAMZA-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649619#comment-14649619 ] 

Yi Pan (Data Infrastructure) commented on SAMZA-750:
----------------------------------------------------

Just to move some of the discussion here from SAMZA-563:

Samza built against YARN 2.4 works with YARN HA. However, when ResourceManager recovery is enabled, the container ids assigned to jobs get an extra character in them. The Samza ApplicationMaster is unable to parse them and crashes on startup, as it expects that part of the container id to be an integer. Googling around quickly uncovered that this is a version-mismatch issue people often see with old MapReduce jobs. The solution there is to rebuild the MapReduce job against YARN 2.6. I suspect the same solution will work for Samza.
  
Yi Pan (Data Infrastructure) added a comment - 2 days ago
Richard Lee, thanks for the details. We will try it w/ YARN 2.6 first and open a separate JIRA if this is confirmed to be a problem w/ the current Samza app master.
    
Richard Lee added a comment - 2 days ago - edited
You need to enable RM restart phase 2 (work-preserving recovery) to see the problem w/ Samza. In particular, the addition of the 'epoch' information to the container id seems to be what screws up the AM.
See https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html
The ContainerId string format changes if the RM restarts with work-preserving recovery enabled. It used to have the format: Container_{clusterTimestamp}_{appId}_{attemptId}_{containerId}, e.g. Container_1410901177871_0001_01_000005.

It is now: Container_e{epoch}_{clusterTimestamp}_{appId}_{attemptId}_{containerId}, e.g. Container_e17_1410901177871_0001_01_000005.
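
To make the difference between the two formats concrete, here is a minimal Python sketch of a parser that accepts both, treating the e{epoch} segment as optional. It is only an illustration of the string formats quoted above, not the Samza or YARN parsing code; the names parse_container_id and ParsedContainerId are hypothetical, and matching the "Container" prefix case-insensitively is an assumption.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern for illustration; it only mirrors the two formats
# quoted above. The "e{epoch}_" segment is optional, so ids issued before
# and after a work-preserving RM restart both match.
_CONTAINER_ID = re.compile(
    r"container_(?:e(?P<epoch>\d+)_)?"
    r"(?P<cluster_ts>\d+)_(?P<app_id>\d+)_(?P<attempt_id>\d+)_(?P<container_id>\d+)",
    re.IGNORECASE,  # assumption: accept "Container_" as well as "container_"
)


class ParsedContainerId(NamedTuple):
    epoch: Optional[int]  # None when the id carries no epoch segment
    cluster_timestamp: int
    app_id: int
    attempt_id: int
    container_id: int


def parse_container_id(text: str) -> ParsedContainerId:
    """Parse a container id string with or without the epoch segment."""
    m = _CONTAINER_ID.fullmatch(text)
    if m is None:
        raise ValueError(f"unrecognized container id: {text!r}")
    epoch = m.group("epoch")
    return ParsedContainerId(
        epoch=int(epoch) if epoch is not None else None,
        cluster_timestamp=int(m.group("cluster_ts")),
        app_id=int(m.group("app_id")),
        attempt_id=int(m.group("attempt_id")),
        container_id=int(m.group("container_id")),
    )
```

A parser that instead splits on "_" and converts the second field to an integer would fail on "e17", which matches the crash described in this thread.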


> Run YARN RM Recovery test to uncover any potential issues with SamzaAppMaster
> -----------------------------------------------------------------------------
>
>                 Key: SAMZA-750
>                 URL: https://issues.apache.org/jira/browse/SAMZA-750
>             Project: Samza
>          Issue Type: Test
>          Components: yarn
>    Affects Versions: 0.10.0
>            Reporter: Yi Pan (Data Infrastructure)
>
> Currently, there are no tests for YARN RM Recovery support in Samza.
> As pointed out by [~llamahunter], there is likely an issue with how SamzaAppMaster handles containerID versioning in the RM Recovery case. There might be more issues.
> This JIRA is to track the effort to uncover the issues related to the YARN RM Recovery feature. The expected outcome is:
> 1) A test suite that runs Samza jobs in YARN with RM Recovery enabled
> 2) A list of issues discovered. Each issue may result in a separate JIRA to be fixed.


