You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@samza.apache.org by Jake Maes <ja...@gmail.com> on 2016/02/09 02:10:31 UTC

Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/
-----------------------------------------------------------

Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).


Repository: samza


Description
-------

* Throw a SamzaContainerStartException whenever the container start request to the NM fails.
* Catch the exception in the Allocator and re-request for a container on ANY-HOST
* Release the failed container
* Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.


Diffs
-----

  samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 9ee2daccc3c44202308637207a084def81b49c09 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 7c57a866114aabf76a36d3b7ca4c5810628e0c77 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java ab3061eae2cfc2da6681ce2034492b165d0d8b96 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java ff22dbfe5537ba2a3c55c4e6063680ff4c9e55f4 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerStartException.java PRE-CREATION 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 

Diff: https://reviews.apache.org/r/43350/diff/


Testing
-------

Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.


Thanks,

Jake Maes


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by Jake Maes <ja...@gmail.com>.

> On Feb. 9, 2016, 2:18 a.m., Jagadish Venkatraman wrote:
> > samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java, line 261
> > <https://reviews.apache.org/r/43350/diff/1/?file=1238001#file1238001line261>
> >
> >     Great work on separating out various methods :)
> >     
> >     Isn't it cleaner for the API to look like:
> >     getEscapedEnvVariables(CmdBuilder builder);
> >     
> >     I'm not sure about containerID being in the API. If it is merely for logging purpose, can the caller print the containerID?
> >     
> >     Ideally, I imagine this method to be a part of the builder sub-class or some Util class unrelated to Yarn.

I debated about this. You're right, it would be nicer to have the commandBuilder expose this method, but there are a number of reasons this is more of a headache than it's worth:
1. If we put this logic in ShellCommandBuilder, it will need to be replicated for any user-defined CommandBuilders too.
2. If we put it in the parent class (which is in the samza-api module) then we need access to Util.envVarEscape() in samza-api and I don't think that's the right place for it
3. I could create an AbstractCommandBuilder that is in samza-core and implements this method, but there's no guarantee the users will extend it, and ContainerUtil depends on it for correctness. 

So, unless there are any objections, I'm keeping the code in ContainerUtil, but refactoring it so the method signatures make more sense.


> On Feb. 9, 2016, 2:18 a.m., Jagadish Venkatraman wrote:
> > samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java, line 194
> > <https://reviews.apache.org/r/43350/diff/1/?file=1238003#file1238003line194>
> >
> >     It's debatable, maybe getRunningSamzaContainerID () is a better name for this public API?

Haha, I went back and fourth on this too. Is it primarily a running-container or a samza-container? Since you've voiced an opinion, I'll sway your way. :-)


> On Feb. 9, 2016, 2:18 a.m., Jagadish Venkatraman wrote:
> > samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerStartException.java, line 20
> > <https://reviews.apache.org/r/43350/diff/1/?file=1238004#file1238004line20>
> >
> >     nit: prefer to add some java doc comments here. 
> >     "A wrapper class for all exceptions thrown during a SamzaContainer launch."

You caught me. :-)
Added JavaDoc.


> On Feb. 9, 2016, 2:18 a.m., Jagadish Venkatraman wrote:
> > samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerStartException.java, line 21
> > <https://reviews.apache.org/r/43350/diff/1/?file=1238004#file1238004line21>
> >
> >     nit: Can we rename the class to be SamzaContainerLaunchException. I think it may be more accurate? thoughts?

No preference here. I'll rename this and the method in ContainerUtil for consistency.


> On Feb. 9, 2016, 2:18 a.m., Jagadish Venkatraman wrote:
> > samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java, line 103
> > <https://reviews.apache.org/r/43350/diff/1/?file=1237998#file1237998line103>
> >
> >     Add a line in the API - saying,the preferred Host to run the container or ANY_HOST if there are no host preferences.

Good catch. Done.


> On Feb. 9, 2016, 2:18 a.m., Jagadish Venkatraman wrote:
> > samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java, line 121
> > <https://reviews.apache.org/r/43350/diff/1/?file=1237998#file1237998line121>
> >
> >     The behavior is to keep retrying indefinitely here, until allocation succeeds on ANY_HOST, right?

Yes. The idea is that eventually, we should find a host on which we can launch the container (note, for the purposes of this code, we don't know or care what happens immediately after launch, just that we were able to instruct the NM to launch it). 

If we can't reach any node on the cluster, there's something horribly wrong and our (AM's) container is likely to go down next. 

If the necessary resources aren't available, the RM won't allocate us another container.


- Jake


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/#review118346
-----------------------------------------------------------


On Feb. 9, 2016, 1:10 a.m., Jake Maes wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43350/
> -----------------------------------------------------------
> 
> (Updated Feb. 9, 2016, 1:10 a.m.)
> 
> 
> Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> * Throw a SamzaContainerStartException whenever the container start request to the NM fails.
> * Catch the exception in the Allocator and re-request for a container on ANY-HOST
> * Release the failed container
> * Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.
> 
> 
> Diffs
> -----
> 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 9ee2daccc3c44202308637207a084def81b49c09 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 7c57a866114aabf76a36d3b7ca4c5810628e0c77 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java ab3061eae2cfc2da6681ce2034492b165d0d8b96 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java ff22dbfe5537ba2a3c55c4e6063680ff4c9e55f4 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerStartException.java PRE-CREATION 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 
> 
> Diff: https://reviews.apache.org/r/43350/diff/
> 
> 
> Testing
> -------
> 
> Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.
> 
> 
> Thanks,
> 
> Jake Maes
> 
>


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by Jagadish Venkatraman <ja...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/#review118346
-----------------------------------------------------------




samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java (line 103)
<https://reviews.apache.org/r/43350/#comment179572>

    Add a line in the API - saying,the preferred Host to run the container or ANY_HOST if there are no host preferences.



samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java (line 121)
<https://reviews.apache.org/r/43350/#comment179571>

    The behavior is to keep retrying indefinitely here, until allocation succeeds on ANY_HOST, right?



samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java (line 53)
<https://reviews.apache.org/r/43350/#comment179566>

    +1 on moving this to the parent abstract class.



samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java (line 228)
<https://reviews.apache.org/r/43350/#comment179574>

    Great work on separating out various methods :)
    
    Isn't it cleaner for the API to look like:
    getEscapedEnvVariables(CmdBuilder builder);
    
    I'm not sure about containerID being in the API. If it is merely for logging purpose, can the caller print the containerID?
    
    Ideally, I imagine this method to be a part of the builder sub-class or some Util class unrelated to Yarn.



samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java (line 194)
<https://reviews.apache.org/r/43350/#comment179575>

    It's debatable, maybe getRunningSamzaContainerID () is a better name for this public API?



samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerStartException.java (line 20)
<https://reviews.apache.org/r/43350/#comment179567>

    nit: prefer to add some java doc comments here. 
    "A wrapper class for all exceptions thrown during a SamzaContainer launch."



samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerStartException.java (line 21)
<https://reviews.apache.org/r/43350/#comment179568>

    nit: Can we rename the class to be SamzaContainerLaunchException. I think it may be more accurate? thoughts?


- Jagadish Venkatraman


On Feb. 9, 2016, 1:10 a.m., Jake Maes wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43350/
> -----------------------------------------------------------
> 
> (Updated Feb. 9, 2016, 1:10 a.m.)
> 
> 
> Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> * Throw a SamzaContainerStartException whenever the container start request to the NM fails.
> * Catch the exception in the Allocator and re-request for a container on ANY-HOST
> * Release the failed container
> * Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.
> 
> 
> Diffs
> -----
> 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 9ee2daccc3c44202308637207a084def81b49c09 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 7c57a866114aabf76a36d3b7ca4c5810628e0c77 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java ab3061eae2cfc2da6681ce2034492b165d0d8b96 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java ff22dbfe5537ba2a3c55c4e6063680ff4c9e55f4 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerStartException.java PRE-CREATION 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 
> 
> Diff: https://reviews.apache.org/r/43350/diff/
> 
> 
> Testing
> -------
> 
> Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.
> 
> 
> Thanks,
> 
> Jake Maes
> 
>


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by Jagadish Venkatraman <ja...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/#review118554
-----------------------------------------------------------


Ship it!




Thanks for addressing the nits :-) Looks Great to me!

- Jagadish Venkatraman


On Feb. 10, 2016, 2:40 a.m., Jake Maes wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43350/
> -----------------------------------------------------------
> 
> (Updated Feb. 10, 2016, 2:40 a.m.)
> 
> 
> Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> * Throw a SamzaContainerStartException whenever the container start request to the NM fails.
> * Catch the exception in the Allocator and re-request for a container on ANY-HOST
> * Release the failed container
> * Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.
> 
> 
> Diffs
> -----
> 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 9ee2daccc3c44202308637207a084def81b49c09 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 7c57a866114aabf76a36d3b7ca4c5810628e0c77 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java ab3061eae2cfc2da6681ce2034492b165d0d8b96 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java ff22dbfe5537ba2a3c55c4e6063680ff4c9e55f4 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 
> 
> Diff: https://reviews.apache.org/r/43350/diff/
> 
> 
> Testing
> -------
> 
> Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.
> 
> 
> Thanks,
> 
> Jake Maes
> 
>


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by Jake Maes <ja...@gmail.com>.

> On Feb. 10, 2016, 7:48 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java, line 166
> > <https://reviews.apache.org/r/43350/diff/2/?file=1239206#file1239206line166>
> >
> >     nit: I would prefer to keep UNUSED_CONTAINER_ID private since they are not directly referred to outside the class, except for this use case. We can have a static method SamzaAppState.isValidContainerId(containerId) here.

Works for me. See the latest diff.


- Jake


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/#review118610
-----------------------------------------------------------


On Feb. 10, 2016, 2:40 a.m., Jake Maes wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43350/
> -----------------------------------------------------------
> 
> (Updated Feb. 10, 2016, 2:40 a.m.)
> 
> 
> Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> * Throw a SamzaContainerStartException whenever the container start request to the NM fails.
> * Catch the exception in the Allocator and re-request for a container on ANY-HOST
> * Release the failed container
> * Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.
> 
> 
> Diffs
> -----
> 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 9ee2daccc3c44202308637207a084def81b49c09 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 7c57a866114aabf76a36d3b7ca4c5810628e0c77 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java ab3061eae2cfc2da6681ce2034492b165d0d8b96 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java ff22dbfe5537ba2a3c55c4e6063680ff4c9e55f4 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 
> 
> Diff: https://reviews.apache.org/r/43350/diff/
> 
> 
> Testing
> -------
> 
> Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.
> 
> 
> Thanks,
> 
> Jake Maes
> 
>


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by "Yi Pan (Data Infrastructure)" <yi...@linkedin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/#review118610
-----------------------------------------------------------



LGTM. Just one minor comment.


samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java (line 158)
<https://reviews.apache.org/r/43350/#comment179892>

    nit: I would prefer to keep UNUSED_CONTAINER_ID private since they are not directly referred to outside the class, except for this use case. We can have a static method SamzaAppState.isValidContainerId(containerId) here.


- Yi Pan (Data Infrastructure)


On Feb. 10, 2016, 2:40 a.m., Jake Maes wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43350/
> -----------------------------------------------------------
> 
> (Updated Feb. 10, 2016, 2:40 a.m.)
> 
> 
> Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> * Throw a SamzaContainerStartException whenever the container start request to the NM fails.
> * Catch the exception in the Allocator and re-request for a container on ANY-HOST
> * Release the failed container
> * Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.
> 
> 
> Diffs
> -----
> 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 9ee2daccc3c44202308637207a084def81b49c09 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 7c57a866114aabf76a36d3b7ca4c5810628e0c77 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java ab3061eae2cfc2da6681ce2034492b165d0d8b96 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java ff22dbfe5537ba2a3c55c4e6063680ff4c9e55f4 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 
> 
> Diff: https://reviews.apache.org/r/43350/diff/
> 
> 
> Testing
> -------
> 
> Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.
> 
> 
> Thanks,
> 
> Jake Maes
> 
>


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by Jake Maes <ja...@gmail.com>.

> On Feb. 17, 2016, 5:34 p.m., Yi Pan (Data Infrastructure) wrote:
> > samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java, line 57
> > <https://reviews.apache.org/r/43350/diff/3/?file=1239498#file1239498line57>
> >
> >     This is a little subjective. My opinion on this one is: if we are rolling the containerRequestState usage within the base class (as this runContainer() method is doing), we would better wrap it all. Not being able to see the state read/write pattern in a single file is painful when something went wrong. We can keep the runContainer() method but leave the state update outside.

My feeling is that if we extract the state update from the method, it would defeat the purpose. The goals were to:
1. Avoid the triplicate code
2. Tie the state update and running the container together, you should never do one without the other.


- Jake


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/#review119488
-----------------------------------------------------------


On Feb. 10, 2016, 3:57 p.m., Jake Maes wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43350/
> -----------------------------------------------------------
> 
> (Updated Feb. 10, 2016, 3:57 p.m.)
> 
> 
> Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> * Throw a SamzaContainerStartException whenever the container start request to the NM fails.
> * Catch the exception in the Allocator and re-request for a container on ANY-HOST
> * Release the failed container
> * Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.
> 
> 
> Diffs
> -----
> 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 9ee2daccc3c44202308637207a084def81b49c09 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 7c57a866114aabf76a36d3b7ca4c5810628e0c77 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java ab3061eae2cfc2da6681ce2034492b165d0d8b96 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java ff22dbfe5537ba2a3c55c4e6063680ff4c9e55f4 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 
> 
> Diff: https://reviews.apache.org/r/43350/diff/
> 
> 
> Testing
> -------
> 
> Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.
> 
> 
> Thanks,
> 
> Jake Maes
> 
>


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by "Yi Pan (Data Infrastructure)" <yi...@linkedin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/#review119488
-----------------------------------------------------------




samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java (line 53)
<https://reviews.apache.org/r/43350/#comment180852>

    This is a little subjective. My opinion on this one is: if we are rolling the containerRequestState usage within the base class (as this runContainer() method is doing), we would better wrap it all. Not being able to see the state read/write pattern in a single file is painful when something went wrong. We can keep the runContainer() method but leave the state update outside.


- Yi Pan (Data Infrastructure)


On Feb. 10, 2016, 3:57 p.m., Jake Maes wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43350/
> -----------------------------------------------------------
> 
> (Updated Feb. 10, 2016, 3:57 p.m.)
> 
> 
> Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> * Throw a SamzaContainerStartException whenever the container start request to the NM fails.
> * Catch the exception in the Allocator and re-request for a container on ANY-HOST
> * Release the failed container
> * Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.
> 
> 
> Diffs
> -----
> 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 9ee2daccc3c44202308637207a084def81b49c09 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 7c57a866114aabf76a36d3b7ca4c5810628e0c77 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java ab3061eae2cfc2da6681ce2034492b165d0d8b96 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java ff22dbfe5537ba2a3c55c4e6063680ff4c9e55f4 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 
> 
> Diff: https://reviews.apache.org/r/43350/diff/
> 
> 
> Testing
> -------
> 
> Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.
> 
> 
> Thanks,
> 
> Jake Maes
> 
>


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by "Yi Pan (Data Infrastructure)" <yi...@linkedin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/#review122363
-----------------------------------------------------------


Ship it!




Ship It!

- Yi Pan (Data Infrastructure)


On Feb. 24, 2016, 2:05 a.m., Jake Maes wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43350/
> -----------------------------------------------------------
> 
> (Updated Feb. 24, 2016, 2:05 a.m.)
> 
> 
> Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> * Throw a SamzaContainerStartException whenever the container start request to the NM fails.
> * Catch the exception in the Allocator and re-request for a container on ANY-HOST
> * Release the failed container
> * Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.
> * Finally, fix a logic bug in the HostAwareContainerAllocator wherein the expiration logic was flipped.
> 
> 
> Diffs
> -----
> 
>   samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala 51e9e99e506c065938b18d081df6fe1ff24f79ec 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 2e192eecf412525f45846edb62b23d261d35cbda 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 31fcc572d499d4e17cdf1340d7072c1562ccfdb7 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java 54db5e5b2b1b109d202e814809adbfd2bc84fb4b 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java 8e1db77db74e1dc01bb0ec64e102befd225123e9 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestSamzaTaskManager.java 88d9f24d16fc3d9842b387cfc22edaf1dfa6fd06 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerAllocator.java 5fcad82fc10f87510e994896ed7380f1e636a18f 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 
> 
> Diff: https://reviews.apache.org/r/43350/diff/
> 
> 
> Testing
> -------
> 
> Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.
> 
> The testing was successful for both scenarios. 
> 1. RM is unaware that some NMs are down when job is starting:
>    a. RM attempts to start AM on hosts, retrying up to the limit (50) in the case of connection errors to the NMs
>    b. AM requests containers (accounting for host affinity) and tries to start the containers
>    c. YarnClient retries connection errors (an excessive number of times) and then throws to the AM
>    d. AM releases the bad container and requests a new one on ANY_HOST
>    e. Container is eventually started on a good host. 
> 2. RM is unaware that some NMs are down when job is already running:
>    a. NM-RM heartbeat interval expires
>    b. RM is aware of the missing NMs and containers
>    c. RM notifies AM of the completed containers
>    d. AM requests new containers which the RM allocates only on good hosts (since it is already aware of the bad hosts)
>    e. Container is started on a good host.
>    
> In both cases, the job stays alive and the full set of containers is eventually running on good hosts. Also, the state counters, like "neededContainers" are correct throughout the test.
> 
> 
> Thanks,
> 
> Jake Maes
> 
>


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by Jake Maes <ja...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/
-----------------------------------------------------------

(Updated Feb. 24, 2016, 2:05 a.m.)


Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).


Repository: samza


Description
-------

* Throw a SamzaContainerStartException whenever the container start request to the NM fails.
* Catch the exception in the Allocator and re-request for a container on ANY-HOST
* Release the failed container
* Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.
* Finally, fix a logic bug in the HostAwareContainerAllocator wherein the expiration logic was flipped.


Diffs (updated)
-----

  samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala 51e9e99e506c065938b18d081df6fe1ff24f79ec 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 2e192eecf412525f45846edb62b23d261d35cbda 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 31fcc572d499d4e17cdf1340d7072c1562ccfdb7 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java 54db5e5b2b1b109d202e814809adbfd2bc84fb4b 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java 8e1db77db74e1dc01bb0ec64e102befd225123e9 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestSamzaTaskManager.java 88d9f24d16fc3d9842b387cfc22edaf1dfa6fd06 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerAllocator.java 5fcad82fc10f87510e994896ed7380f1e636a18f 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 

Diff: https://reviews.apache.org/r/43350/diff/


Testing
-------

Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.

The testing was successful for both scenarios. 
1. RM is unaware that some NMs are down when job is starting:
   a. RM attempts to start AM on hosts, retrying up to the limit (50) in the case of connection errors to the NMs
   b. AM requests containers (accounting for host affinity) and tries to start the containers
   c. YarnClient retries connection errors (an excessive number of times) and then throws to the AM
   d. AM releases the bad container and requests a new one on ANY_HOST
   e. Container is eventually started on a good host. 
2. RM is unaware that some NMs are down when job is already running:
   a. NM-RM heartbeat interval expires
   b. RM is aware of the missing NMs and containers
   c. RM notifies AM of the completed containers
   d. AM requests new containers which the RM allocates only on good hosts (since it is already aware of the bad hosts)
   e. Container is started on a good host.
   
In both cases, the job stays alive and the full set of containers is eventually running on good hosts. Also, the state counters, like "neededContainers" are correct throughout the test.


Thanks,

Jake Maes


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by Jake Maes <ja...@gmail.com>.

> On Feb. 23, 2016, 10:26 p.m., Xinyu Liu wrote:
> > samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala, line 97
> > <https://reviews.apache.org/r/43350/diff/5/?file=1255125#file1255125line97>
> >
> >     Usually this is done in Scala using option.getOrElse. So instead of adding a new method, we can just do
> >     
> >     config.getCommandClass.getOrElse(defaultValue).

This is a good suggestion, but in this case, the caller is written in Java and there isn't a concise way to pass a function to the getOrElse() method from Java.


- Jake


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/#review120393
-----------------------------------------------------------


On Feb. 23, 2016, 9:43 p.m., Jake Maes wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43350/
> -----------------------------------------------------------
> 
> (Updated Feb. 23, 2016, 9:43 p.m.)
> 
> 
> Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> * Throw a SamzaContainerStartException whenever the container start request to the NM fails.
> * Catch the exception in the Allocator and re-request for a container on ANY-HOST
> * Release the failed container
> * Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.
> * Finally, fix a logic bug in the HostAwareContainerAllocator wherein the expiration logic was flipped.
> 
> 
> Diffs
> -----
> 
>   samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala 51e9e99e506c065938b18d081df6fe1ff24f79ec 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 2e192eecf412525f45846edb62b23d261d35cbda 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 31fcc572d499d4e17cdf1340d7072c1562ccfdb7 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java 54db5e5b2b1b109d202e814809adbfd2bc84fb4b 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java 8e1db77db74e1dc01bb0ec64e102befd225123e9 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestSamzaTaskManager.java 88d9f24d16fc3d9842b387cfc22edaf1dfa6fd06 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerAllocator.java 5fcad82fc10f87510e994896ed7380f1e636a18f 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 
> 
> Diff: https://reviews.apache.org/r/43350/diff/
> 
> 
> Testing
> -------
> 
> Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.
> 
> The testing was successful for both scenarios. 
> 1. RM is unaware that some NMs are down when job is starting:
>    a. RM attempts to start AM on hosts, retrying up to the limit (50) in the case of connection errors to the NMs
>    b. AM requests containers (accounting for host affinity) and tries to start the containers
>    c. YarnClient retries connection errors (an excessive number of times) and then throws to the AM
>    d. AM releases the bad container and requests a new one on ANY_HOST
>    e. Container is eventually started on a good host. 
> 2. RM is unaware that some NMs are down when job is already running:
>    a. NM-RM heartbeat interval expires
>    b. RM is aware of the missing NMs and containers
>    c. RM notifies AM of the completed containers
>    d. AM requests new containers which the RM allocates only on good hosts (since it is already aware of the bad hosts)
>    e. Container is started on a good host.
>    
> In both cases, the job stays alive and the full set of containers is eventually running on good hosts. Also, the state counters, like "neededContainers" are correct throughout the test.
> 
> 
> Thanks,
> 
> Jake Maes
> 
>


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by Xinyu Liu <xi...@linkedin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/#review120393
-----------------------------------------------------------


Fix it, then Ship it!




Looks good to me. A few minor suggestions below.


samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala (line 97)
<https://reviews.apache.org/r/43350/#comment181833>

    Usually this is done in Scala using option.getOrElse. So instead of adding a new method, we can just do
    
    config.getCommandClass.getOrElse(defaultValue).



samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java (line 114)
<https://reviews.apache.org/r/43350/#comment181837>

    SamzaException instead of NullPointerException?



samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java (line 27)
<https://reviews.apache.org/r/43350/#comment181838>

    is serialVersionUID needed here? If so please use the autogenerated one.


- Xinyu Liu


On Feb. 23, 2016, 9:43 p.m., Jake Maes wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43350/
> -----------------------------------------------------------
> 
> (Updated Feb. 23, 2016, 9:43 p.m.)
> 
> 
> Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> * Throw a SamzaContainerStartException whenever the container start request to the NM fails.
> * Catch the exception in the Allocator and re-request for a container on ANY-HOST
> * Release the failed container
> * Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.
> * Finally, fix a logic bug in the HostAwareContainerAllocator wherein the expiration logic was flipped.
> 
> 
> Diffs
> -----
> 
>   samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala 51e9e99e506c065938b18d081df6fe1ff24f79ec 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 2e192eecf412525f45846edb62b23d261d35cbda 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 31fcc572d499d4e17cdf1340d7072c1562ccfdb7 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java 54db5e5b2b1b109d202e814809adbfd2bc84fb4b 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java 8e1db77db74e1dc01bb0ec64e102befd225123e9 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestSamzaTaskManager.java 88d9f24d16fc3d9842b387cfc22edaf1dfa6fd06 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerAllocator.java 5fcad82fc10f87510e994896ed7380f1e636a18f 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 
> 
> Diff: https://reviews.apache.org/r/43350/diff/
> 
> 
> Testing
> -------
> 
> Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.
> 
> The testing was successful for both scenarios. 
> 1. RM is unaware that some NMs are down when job is starting:
>    a. RM attempts to start AM on hosts, retrying up to the limit (50) in the case of connection errors to the NMs
>    b. AM requests containers (accounting for host affinity) and tries to start the containers
>    c. YarnClient retries connection errors (an excessive number of times) and then throws to the AM
>    d. AM releases the bad container and requests a new one on ANY_HOST
>    e. Container is eventually started on a good host. 
> 2. RM is unaware that some NMs are down when job is already running:
>    a. NM-RM heartbeat interval expires
>    b. RM is aware of the missing NMs and containers
>    c. RM notifies AM of the completed containers
>    d. AM requests new containers which the RM allocates only on good hosts (since it is already aware of the bad hosts)
>    e. Container is started on a good host.
>    
> In both cases, the job stays alive and the full set of containers is eventually running on good hosts. Also, the state counters, like "neededContainers" are correct throughout the test.
> 
> 
> Thanks,
> 
> Jake Maes
> 
>


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by Jake Maes <ja...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/
-----------------------------------------------------------

(Updated Feb. 23, 2016, 9:43 p.m.)


Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).


Changes
-------

Updating with manual test results


Repository: samza


Description
-------

* Throw a SamzaContainerStartException whenever the container start request to the NM fails.
* Catch the exception in the Allocator and re-request for a container on ANY-HOST
* Release the failed container
* Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.
* Finally, fix a logic bug in the HostAwareContainerAllocator wherein the expiration logic was flipped.


Diffs
-----

  samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala 51e9e99e506c065938b18d081df6fe1ff24f79ec 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 2e192eecf412525f45846edb62b23d261d35cbda 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 31fcc572d499d4e17cdf1340d7072c1562ccfdb7 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java 54db5e5b2b1b109d202e814809adbfd2bc84fb4b 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java 8e1db77db74e1dc01bb0ec64e102befd225123e9 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestSamzaTaskManager.java 88d9f24d16fc3d9842b387cfc22edaf1dfa6fd06 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerAllocator.java 5fcad82fc10f87510e994896ed7380f1e636a18f 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 

Diff: https://reviews.apache.org/r/43350/diff/


Testing (updated)
-------

Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.

The testing was successful for both scenarios. 
1. RM is unaware that some NMs are down when job is starting:
   a. RM attempts to start AM on hosts, retrying up to the limit (50) in the case of connection errors to the NMs
   b. AM requests containers (accounting for host affinity) and tries to start the containers
   c. YarnClient retries connection errors (an excessive number of times) and then throws to the AM
   d. AM releases the bad container and requests a new one on ANY_HOST
   e. Container is eventually started on a good host. 
2. RM is unaware that some NMs are down when job is already running:
   a. NM-RM heartbeat interval expires
   b. RM is aware of the missing NMs and containers
   c. RM notifies AM of the completed containers
   d. AM requests new containers which the RM allocates only on good hosts (since it is already aware of the bad hosts)
   e. Container is started on a good host.
   
In both cases, the job stays alive and the full set of containers is eventually running on good hosts. Also, the state counters, like "neededContainers" are correct throughout the test.


Thanks,

Jake Maes


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by Jake Maes <ja...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/
-----------------------------------------------------------

(Updated Feb. 18, 2016, 10:29 p.m.)


Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).


Repository: samza


Description (updated)
-------

* Throw a SamzaContainerStartException whenever the container start request to the NM fails.
* Catch the exception in the Allocator and re-request for a container on ANY-HOST
* Release the failed container
* Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.
* Finally, fix a logic bug in the HostAwareContainerAllocator wherein the expiration logic was flipped.


Diffs
-----

  samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala 51e9e99e506c065938b18d081df6fe1ff24f79ec 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 2e192eecf412525f45846edb62b23d261d35cbda 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 31fcc572d499d4e17cdf1340d7072c1562ccfdb7 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java 54db5e5b2b1b109d202e814809adbfd2bc84fb4b 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java 8e1db77db74e1dc01bb0ec64e102befd225123e9 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestSamzaTaskManager.java 88d9f24d16fc3d9842b387cfc22edaf1dfa6fd06 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerAllocator.java 5fcad82fc10f87510e994896ed7380f1e636a18f 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 

Diff: https://reviews.apache.org/r/43350/diff/


Testing
-------

Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.


Thanks,

Jake Maes


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by Jake Maes <ja...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/
-----------------------------------------------------------

(Updated Feb. 18, 2016, 10:28 p.m.)


Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).


Repository: samza


Description
-------

* Throw a SamzaContainerStartException whenever the container start request to the NM fails.
* Catch the exception in the Allocator and re-request for a container on ANY-HOST
* Release the failed container
* Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.


Diffs (updated)
-----

  samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala 51e9e99e506c065938b18d081df6fe1ff24f79ec 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 2e192eecf412525f45846edb62b23d261d35cbda 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 31fcc572d499d4e17cdf1340d7072c1562ccfdb7 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java 54db5e5b2b1b109d202e814809adbfd2bc84fb4b 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java 8e1db77db74e1dc01bb0ec64e102befd225123e9 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestSamzaTaskManager.java 88d9f24d16fc3d9842b387cfc22edaf1dfa6fd06 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerAllocator.java 5fcad82fc10f87510e994896ed7380f1e636a18f 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 

Diff: https://reviews.apache.org/r/43350/diff/


Testing
-------

Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.


Thanks,

Jake Maes


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by Jake Maes <ja...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/
-----------------------------------------------------------

(Updated Feb. 18, 2016, 10:20 p.m.)


Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).


Changes
-------

After discussion with Yi, making containerRequestState a private field and pushing all access to AbstractContainerAllocator.

Also, spotted and fixed a logic bug in HostAwareContainerAllocator regarding request expiration.


Repository: samza


Description
-------

* Throw a SamzaContainerStartException whenever the container start request to the NM fails.
* Catch the exception in the Allocator and re-request for a container on ANY-HOST
* Release the failed container
* Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.


Diffs (updated)
-----

  samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala 51e9e99e506c065938b18d081df6fe1ff24f79ec 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 2e192eecf412525f45846edb62b23d261d35cbda 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 31fcc572d499d4e17cdf1340d7072c1562ccfdb7 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java 54db5e5b2b1b109d202e814809adbfd2bc84fb4b 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java 8e1db77db74e1dc01bb0ec64e102befd225123e9 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestSamzaTaskManager.java 88d9f24d16fc3d9842b387cfc22edaf1dfa6fd06 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerAllocator.java 5fcad82fc10f87510e994896ed7380f1e636a18f 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 

Diff: https://reviews.apache.org/r/43350/diff/


Testing
-------

Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.


Thanks,

Jake Maes


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by Jake Maes <ja...@gmail.com>.

> On Feb. 11, 2016, 10:50 p.m., Navina Ramesh wrote:
> > samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java, line 57
> > <https://reviews.apache.org/r/43350/diff/3/?file=1239498#file1239498line57>
> >
> >     Is there any particular reason for updating state within the runContainer method? 
> >     Imo, if the method is called "runContainer", it should do things related to running a container. Updating state is very important in this workflow and I want to make sure that it is obvious when you read the code. 
> >     Can you consider renaming the method or moving back the state update statement?

There are a couple subjective things in the balance here:
1. What is the most accurate description of what these 4 lines are accomplishing? To me, it's running the container on the specified host and the state update is just something that should always happen when you run a container. That is to say, the state update is not the primary action here, but a consequence. I'm open to other names if they're accurate. 
2. It could have been named updateStateAndRunContainer, which is more descriptive, but more verbose too. 

I think we agree, though, that this code should not be duplicated in 3 places. That was the goal of the refactor. I'm happy to name it whatever you prefer.


- Jake


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/#review118935
-----------------------------------------------------------


On Feb. 10, 2016, 3:57 p.m., Jake Maes wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43350/
> -----------------------------------------------------------
> 
> (Updated Feb. 10, 2016, 3:57 p.m.)
> 
> 
> Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> * Throw a SamzaContainerStartException whenever the container start request to the NM fails.
> * Catch the exception in the Allocator and re-request for a container on ANY-HOST
> * Release the failed container
> * Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.
> 
> 
> Diffs
> -----
> 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 9ee2daccc3c44202308637207a084def81b49c09 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 7c57a866114aabf76a36d3b7ca4c5810628e0c77 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java ab3061eae2cfc2da6681ce2034492b165d0d8b96 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java ff22dbfe5537ba2a3c55c4e6063680ff4c9e55f4 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 
> 
> Diff: https://reviews.apache.org/r/43350/diff/
> 
> 
> Testing
> -------
> 
> Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.
> 
> 
> Thanks,
> 
> Jake Maes
> 
>


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by Jake Maes <ja...@gmail.com>.

> On Feb. 11, 2016, 10:50 p.m., Navina Ramesh wrote:
> > samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java, line 244
> > <https://reviews.apache.org/r/43350/diff/3/?file=1239500#file1239500line244>
> >
> >     You can modify the "TaskConfig.scala" to add an overrid method - getCommandClass(String defaultvalue). Or use the TaskConfigJava.java. We seem to be stuck in a limbo with the scala/java configs

That's a nice improvement. I'll make that change once we have the above issue resolved, thanks!


> On Feb. 11, 2016, 10:50 p.m., Navina Ramesh wrote:
> > samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java, line 22
> > <https://reviews.apache.org/r/43350/diff/3/?file=1239503#file1239503line22>
> >
> >     documentation says: "SamzaContainerStartException" where as the class is "SamzaContainerLaunchException".
> >     
> >     Any reason for not extending "SamzaException" ?

Thanks for catching that. Will fix once the runContainer() method name is decided. 

It doesn't extend SamzaException because SamzaException is a RuntimeException, which means it doesn't have to be declared in function signatures and handled by callers. SamzaContainerLaunchException on the other hand should not be blindly propagated upward (as most RuntimeExceptions are). It should be handled explicitly.


- Jake


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/#review118935
-----------------------------------------------------------


On Feb. 10, 2016, 3:57 p.m., Jake Maes wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43350/
> -----------------------------------------------------------
> 
> (Updated Feb. 10, 2016, 3:57 p.m.)
> 
> 
> Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> * Throw a SamzaContainerStartException whenever the container start request to the NM fails.
> * Catch the exception in the Allocator and re-request for a container on ANY-HOST
> * Release the failed container
> * Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.
> 
> 
> Diffs
> -----
> 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 9ee2daccc3c44202308637207a084def81b49c09 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 7c57a866114aabf76a36d3b7ca4c5810628e0c77 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java ab3061eae2cfc2da6681ce2034492b165d0d8b96 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java ff22dbfe5537ba2a3c55c4e6063680ff4c9e55f4 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 
> 
> Diff: https://reviews.apache.org/r/43350/diff/
> 
> 
> Testing
> -------
> 
> Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.
> 
> 
> Thanks,
> 
> Jake Maes
> 
>


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by Navina Ramesh <nr...@linkedin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/#review118935
-----------------------------------------------------------




samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java (line 53)
<https://reviews.apache.org/r/43350/#comment180241>

    Is there any particular reason for updating state within the runContainer method? 
    Imo, if the method is called "runContainer", it should do things related to running a container. Updating state is very important in this workflow and I want to make sure that it is obvious when you read the code. 
    Can you consider renaming the method or moving back the state update statement?



samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java (line 212)
<https://reviews.apache.org/r/43350/#comment180243>

    You can modify the "TaskConfig.scala" to add an overrid method - getCommandClass(String defaultvalue). Or use the TaskConfigJava.java. We seem to be stuck in a limbo with the scala/java configs



samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java (line 22)
<https://reviews.apache.org/r/43350/#comment180245>

    documentation says: "SamzaContainerStartException" where as the class is "SamzaContainerLaunchException".
    
    Any reason for not extending "SamzaException" ?


- Navina Ramesh


On Feb. 10, 2016, 3:57 p.m., Jake Maes wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43350/
> -----------------------------------------------------------
> 
> (Updated Feb. 10, 2016, 3:57 p.m.)
> 
> 
> Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> * Throw a SamzaContainerStartException whenever the container start request to the NM fails.
> * Catch the exception in the Allocator and re-request for a container on ANY-HOST
> * Release the failed container
> * Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.
> 
> 
> Diffs
> -----
> 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 9ee2daccc3c44202308637207a084def81b49c09 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 7c57a866114aabf76a36d3b7ca4c5810628e0c77 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java ab3061eae2cfc2da6681ce2034492b165d0d8b96 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java ff22dbfe5537ba2a3c55c4e6063680ff4c9e55f4 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 
> 
> Diff: https://reviews.apache.org/r/43350/diff/
> 
> 
> Testing
> -------
> 
> Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.
> 
> 
> Thanks,
> 
> Jake Maes
> 
>


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by Jake Maes <ja...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/
-----------------------------------------------------------

(Updated Feb. 10, 2016, 3:57 p.m.)


Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).


Repository: samza


Description
-------

* Throw a SamzaContainerStartException whenever the container start request to the NM fails.
* Catch the exception in the Allocator and re-request for a container on ANY-HOST
* Release the failed container
* Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.


Diffs (updated)
-----

  samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 9ee2daccc3c44202308637207a084def81b49c09 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 7c57a866114aabf76a36d3b7ca4c5810628e0c77 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java ab3061eae2cfc2da6681ce2034492b165d0d8b96 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java ff22dbfe5537ba2a3c55c4e6063680ff4c9e55f4 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 

Diff: https://reviews.apache.org/r/43350/diff/


Testing
-------

Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.


Thanks,

Jake Maes


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by Jake Maes <ja...@gmail.com>.

> On Feb. 10, 2016, 7:19 a.m., Yi Pan (Data Infrastructure) wrote:
> > First question: is this an independent patch or depend on RB 43074?

It depends on 43074, but I see you've committed that one now, so this one should be good to go now.


- Jake


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/#review118604
-----------------------------------------------------------


On Feb. 10, 2016, 2:40 a.m., Jake Maes wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43350/
> -----------------------------------------------------------
> 
> (Updated Feb. 10, 2016, 2:40 a.m.)
> 
> 
> Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> * Throw a SamzaContainerStartException whenever the container start request to the NM fails.
> * Catch the exception in the Allocator and re-request for a container on ANY-HOST
> * Release the failed container
> * Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.
> 
> 
> Diffs
> -----
> 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 9ee2daccc3c44202308637207a084def81b49c09 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 7c57a866114aabf76a36d3b7ca4c5810628e0c77 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java ab3061eae2cfc2da6681ce2034492b165d0d8b96 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java ff22dbfe5537ba2a3c55c4e6063680ff4c9e55f4 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 
> 
> Diff: https://reviews.apache.org/r/43350/diff/
> 
> 
> Testing
> -------
> 
> Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.
> 
> 
> Thanks,
> 
> Jake Maes
> 
>


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by "Yi Pan (Data Infrastructure)" <yi...@linkedin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/#review118604
-----------------------------------------------------------



First question: is this an independent patch or depend on RB 43074?

- Yi Pan (Data Infrastructure)


On Feb. 10, 2016, 2:40 a.m., Jake Maes wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43350/
> -----------------------------------------------------------
> 
> (Updated Feb. 10, 2016, 2:40 a.m.)
> 
> 
> Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> * Throw a SamzaContainerStartException whenever the container start request to the NM fails.
> * Catch the exception in the Allocator and re-request for a container on ANY-HOST
> * Release the failed container
> * Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.
> 
> 
> Diffs
> -----
> 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 9ee2daccc3c44202308637207a084def81b49c09 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 7c57a866114aabf76a36d3b7ca4c5810628e0c77 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java ab3061eae2cfc2da6681ce2034492b165d0d8b96 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java ff22dbfe5537ba2a3c55c4e6063680ff4c9e55f4 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
>   samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 
> 
> Diff: https://reviews.apache.org/r/43350/diff/
> 
> 
> Testing
> -------
> 
> Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.
> 
> 
> Thanks,
> 
> Jake Maes
> 
>


Re: Review Request 43350: SAMZA-867 Fix job restart/shutdown in the event of a node outage.

Posted by Jake Maes <ja...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/
-----------------------------------------------------------

(Updated Feb. 10, 2016, 2:40 a.m.)


Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure).


Repository: samza


Description
-------

* Throw a SamzaContainerStartException whenever the container start request to the NM fails.
* Catch the exception in the Allocator and re-request for a container on ANY-HOST
* Release the failed container
* Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code.


Diffs (updated)
-----

  samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 9ee2daccc3c44202308637207a084def81b49c09 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 7c57a866114aabf76a36d3b7ca4c5810628e0c77 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java ab3061eae2cfc2da6681ce2034492b165d0d8b96 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java ff22dbfe5537ba2a3c55c4e6063680ff4c9e55f4 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d 

Diff: https://reviews.apache.org/r/43350/diff/


Testing
-------

Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host.


Thanks,

Jake Maes