Posted to dev@ambari.apache.org by Sebastian Toader <st...@hortonworks.com> on 2016/02/24 17:39:25 UTC

Review Request 43948: RM fails to start: IOException: /ats/active does not exist

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43948/
-----------------------------------------------------------

Review request for Ambari, Alejandro Fernandez, Andrew Onischuk, Sumit Mohanty, and Sid Wagle.


Bugs: AMBARI-15158
    https://issues.apache.org/jira/browse/AMBARI-15158


Repository: ambari


Description
-------

If ATS is installed, then after starting, the Resource Manager checks whether the directories where ATS stores timeline data for active and completed applications exist in DFS. There might be cases where RM comes up well before ATS has created these directories. In those situations RM stops with an "IOException: /ats/active does not exist" error message.

To avoid this, the Python script responsible for starting the RM component has been modified to check for the existence of these directories before the RM process is started. This check is performed only if ATS is installed and either yarn.timeline-service.entity-group-fs-store.active-dir or yarn.timeline-service.entity-group-fs-store.done-dir is set.
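The check described above can be sketched roughly as follows. This is an illustrative outline only, not Ambari's actual code: `dir_exists` is a stub parameter standing in for a real DFS client call (e.g. via webhdfs), and the retry counts mirror the "8 * 20 secs" figure mentioned later in this thread.

```python
# Illustrative sketch (not the actual Ambari implementation): wait for
# the ATS timeline directories to appear in DFS before launching RM.
import time

def wait_for_dfs_directories(dirs, dir_exists, retries=8, delay=20):
    """Return True once every directory in `dirs` exists in DFS,
    retrying up to `retries` times with `delay` seconds in between."""
    for attempt in range(retries):
        missing = [d for d in dirs if not dir_exists(d)]
        if not missing:
            return True
        if attempt < retries - 1:
            time.sleep(delay)
    return False
```

With defaults this gives up after roughly 8 attempts spaced 20 seconds apart, matching the behavior discussed in the review comments below.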


Diffs
-----

  ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/params_linux.py 2ef404d 
  ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py ec7799e 

Diff: https://reviews.apache.org/r/43948/diff/


Testing
-------

Manual testing:
1. Created secure and non-secure clusters with a Blueprint where NN, RM and ATS were deployed to different nodes. Both cases were tested: HDFS with webhdfs enabled and with webhdfs disabled.
2. Created a cluster using the UI where NN, RM and ATS were deployed to different nodes. The cluster was then kerberized and tested with webhdfs both enabled and disabled.

Python test results:
----------------------------------------------------------------------
Total run:902
Total errors:0
Total failures:0
OK


Thanks,

Sebastian Toader


Re: Review Request 43948: RM fails to start: IOException: /ats/active does not exist

Posted by Andrew Onischuk <ao...@hortonworks.com>.

> On Feb. 25, 2016, 1:36 p.m., Andrew Onischuk wrote:
> > ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py, line 257
> > <https://reviews.apache.org/r/43948/diff/2/?file=1270961#file1270961line257>
> >
> >     Can we change it to this:
> >     
> >     if HdfsResourceProvider.parse_path(dir_path) in ignored_dfs_dirs:
> >     
> >     
> >     In this case different spellings of the same folder will be matched
> >     
> >     like
> >     
> >     /a/b/c
> >     //a/b/c
> >     hdfs:///a/b/c
> >     hdfs://nn1:8020/a/b/c
> 
> Sebastian Toader wrote:
>     It's already there: see line 252

Yep, my bad.
+1


- Andrew


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43948/#review120695
-----------------------------------------------------------




Re: Review Request 43948: RM fails to start: IOException: /ats/active does not exist

Posted by Sebastian Toader <st...@hortonworks.com>.

> On Feb. 25, 2016, 2:36 p.m., Andrew Onischuk wrote:
> > ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py, line 257
> > <https://reviews.apache.org/r/43948/diff/2/?file=1270961#file1270961line257>
> >
> >     Can we change it to this:
> >     
> >     if HdfsResourceProvider.parse_path(dir_path) in ignored_dfs_dirs:
> >     
> >     
> >     In this case different spellings of the same folder will be matched
> >     
> >     like
> >     
> >     /a/b/c
> >     //a/b/c
> >     hdfs:///a/b/c
> >     hdfs://nn1:8020/a/b/c

It's already there: see line 252


- Sebastian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43948/#review120695
-----------------------------------------------------------




Re: Review Request 43948: RM fails to start: IOException: /ats/active does not exist

Posted by Andrew Onischuk <ao...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43948/#review120695
-----------------------------------------------------------




ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py (line 254)
<https://reviews.apache.org/r/43948/#comment182157>

    Can we change it to this:
    
    if HdfsResourceProvider.parse_path(dir_path) in ignored_dfs_dirs:
    
    In this case different spellings of the same folder will be matched
    
    like
    
    /a/b/c
    //a/b/c
    hdfs:///a/b/c
    hdfs://nn1:8020/a/b/c
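The normalization being asked for here can be sketched like this. `normalize_dfs_path` is a simplified, hypothetical stand-in for what Ambari's `HdfsResourceProvider.parse_path` does in this role, not its real implementation:

```python
# Hypothetical normalizer: different spellings of the same DFS folder
# should compare equal after stripping the scheme/authority prefix and
# collapsing duplicate slashes.
import re

def normalize_dfs_path(path):
    """Canonicalize a DFS path: drop any scheme and authority
    (e.g. hdfs://nn1:8020) and collapse runs of '/'."""
    path = re.sub(r'^[a-zA-Z]+://[^/]*', '', path)  # drop scheme + authority
    path = re.sub(r'/+', '/', path)                 # collapse '//' runs
    return path

# All four spellings from the comment above canonicalize to /a/b/c:
# /a/b/c, //a/b/c, hdfs:///a/b/c, hdfs://nn1:8020/a/b/c
```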


- Andrew Onischuk




Re: Review Request 43948: RM fails to start: IOException: /ats/active does not exist

Posted by Andrew Onischuk <ao...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43948/#review120730
-----------------------------------------------------------



Ship It!

- Andrew Onischuk




Re: Review Request 43948: RM fails to start: IOException: /ats/active does not exist

Posted by Sebastian Toader <st...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43948/
-----------------------------------------------------------

(Updated Feb. 25, 2016, 2:14 p.m.)


Review request for Ambari, Alejandro Fernandez, Andrew Onischuk, Sumit Mohanty, and Sid Wagle.


Changes
-------

1. Skip directories listed in /var/lib/ambari-agent/data/.hdfs_resource_ignore
2. Optimize the code so that kinit is invoked fewer times
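The first change above can be sketched roughly as follows. The file format (one DFS path per line) and the helper names are assumptions for illustration, not Ambari's actual API:

```python
# Sketch: skip DFS directories listed in the agent's ignore file.
import os

def load_ignored_dfs_dirs(ignore_file="/var/lib/ambari-agent/data/.hdfs_resource_ignore"):
    """Return the set of DFS paths to skip, or an empty set if the
    ignore file does not exist. Assumes one path per line."""
    if not os.path.isfile(ignore_file):
        return set()
    with open(ignore_file) as f:
        return {line.strip() for line in f if line.strip()}

def should_skip(dir_path, ignored):
    """True if `dir_path` is one of the ignored DFS directories."""
    return dir_path in ignored
```

In the real patch the lookup is done on normalized paths (via `HdfsResourceProvider.parse_path`, per the review comments), so different spellings of the same folder match.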


Bugs: AMBARI-15158
    https://issues.apache.org/jira/browse/AMBARI-15158


Repository: ambari


Description
-------

If ATS is installed, then after starting, the Resource Manager checks whether the directories where ATS stores timeline data for active and completed applications exist in DFS. There might be cases where RM comes up well before ATS has created these directories. In those situations RM stops with an "IOException: /ats/active does not exist" error message.

To avoid this, the Python script responsible for starting the RM component has been modified to check for the existence of these directories before the RM process is started. This check is performed only if ATS is installed and either yarn.timeline-service.entity-group-fs-store.active-dir or yarn.timeline-service.entity-group-fs-store.done-dir is set.


Diffs (updated)
-----

  ambari-common/src/main/python/resource_management/libraries/providers/hdfs_resource.py b73ae56 
  ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/params_linux.py 2ef404d 
  ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py ec7799e 

Diff: https://reviews.apache.org/r/43948/diff/


Testing
-------

Manual testing:
1. Created secure and non-secure clusters with a Blueprint where NN, RM and ATS were deployed to different nodes. Both cases were tested: HDFS with webhdfs enabled and with webhdfs disabled.
2. Created a cluster using the UI where NN, RM and ATS were deployed to different nodes. The cluster was then kerberized and tested with webhdfs both enabled and disabled.

Python test results:
----------------------------------------------------------------------
Total run:902
Total errors:0
Total failures:0
OK


Thanks,

Sebastian Toader


Re: Review Request 43948: RM fails to start: IOException: /ats/active does not exist

Posted by Alejandro Fernandez <af...@hortonworks.com>.

> On Feb. 24, 2016, 6:56 p.m., Alejandro Fernandez wrote:
> > ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py, line 123
> > <https://reviews.apache.org/r/43948/diff/1/?file=1267791#file1267791line123>
> >
> >     If this happens during cluster install, why don't we put a dependency in role_command_order.json that RM must start after ATS.
> >     
> >     If ATS is on host1 and RM on host2, and during fresh cluster install we fail to install ATS, then RM will keep waiting.
> 
> Sebastian Toader wrote:
>     role_command_order.json won't work with Blueprints, as with Blueprints there is no cluster-wide ordering.
>     
>     RM will keep waiting only until it exhausts the retries (8 * 20 secs)

Can we make Blueprints respect role_command_order?
Please include Robert Nettleton in the code review.


- Alejandro


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43948/#review120537
-----------------------------------------------------------




Re: Review Request 43948: RM fails to start: IOException: /ats/active does not exist

Posted by Alejandro Fernandez <af...@hortonworks.com>.

> On Feb. 24, 2016, 6:56 p.m., Alejandro Fernandez wrote:
> > ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py, line 123
> > <https://reviews.apache.org/r/43948/diff/1/?file=1267791#file1267791line123>
> >
> >     If this happens during cluster install, why don't we put a dependency in role_command_order.json that RM must start after ATS.
> >     
> >     If ATS is on host1 and RM on host2, and during fresh cluster install we fail to install ATS, then RM will keep waiting.
> 
> Sebastian Toader wrote:
>     role_command_order.json won't work with Blueprints, as with Blueprints there is no cluster-wide ordering.
>     
>     RM will keep waiting only until it exhausts the retries (8 * 20 secs)
> 
> Alejandro Fernandez wrote:
>     Can we make Blueprints respect role_command_order?
>     Please include Robert Nettleton in the code review.
> 
> Andrew Onischuk wrote:
>     Alejandro, we made BP not respect RCO to speed up deployments for one of the users, who is very sensitive to deployment times. If we revert that change, we are going to run into that problem for him again.
> 
> Alejandro Fernandez wrote:
>     I think this is a fix for one item in a larger picture. If BP doesn't respect RCO, then there are bound to be many more ordering errors like this.
>     In that case, we may spend a lot of effort adding hacks to components so they keep retrying certain operations because other components on other hosts are not fully up;
>     think of Hive, Spark, History Server, and Tez trying to upload tarballs to HDFS, which may not be ready yet.
>     
>     A more flexible way of fixing this is to enable auto-start for this environment. If RM fails because ATS hasn't yet created directories in HDFS, then keep retrying RM. That's a simpler and more general solution.
>     
>     @Sumit Mohanty, what do you think?

I think it's ok for 2.2.2, but we should create a Jira for 2.4 to handle the case of Blueprints ignoring RCO more generally. Would you mind adding some Python comments so we know why this was added? I don't know if there's a way to make that check apply only if the cluster was installed via Blueprints.


- Alejandro


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43948/#review120537
-----------------------------------------------------------




Re: Review Request 43948: RM fails to start: IOException: /ats/active does not exist

Posted by Andrew Onischuk <ao...@hortonworks.com>.

> On Feb. 24, 2016, 6:56 p.m., Alejandro Fernandez wrote:
> > ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py, line 123
> > <https://reviews.apache.org/r/43948/diff/1/?file=1267791#file1267791line123>
> >
> >     If this happens during cluster install, why don't we put a dependency in role_command_order.json that RM must start after ATS.
> >     
> >     If ATS is on host1 and RM on host2, and during fresh cluster install we fail to install ATS, then RM will keep waiting.
> 
> Sebastian Toader wrote:
>     role_command_order.json won't work with Blueprints, as with Blueprints there is no cluster-wide ordering.
>     
>     RM will keep waiting only until it exhausts the retries (8 * 20 secs)
> 
> Alejandro Fernandez wrote:
>     Can we make Blueprints respect role_command_order?
>     Please include Robert Nettleton in the code review.

Alejandro, we made BP not respect RCO to speed up deployments for one of the users, who is very sensitive to deployment times. If we revert that change, we are going to run into that problem for him again.


- Andrew


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43948/#review120537
-----------------------------------------------------------




Re: Review Request 43948: RM fails to start: IOException: /ats/active does not exist

Posted by Alejandro Fernandez <af...@hortonworks.com>.

> On Feb. 24, 2016, 6:56 p.m., Alejandro Fernandez wrote:
> > ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py, line 123
> > <https://reviews.apache.org/r/43948/diff/1/?file=1267791#file1267791line123>
> >
> >     If this happens during cluster install, why don't we put a dependency in role_command_order.json that RM must start after ATS.
> >     
> >     If ATS is on host1 and RM on host2, and during fresh cluster install we fail to install ATS, then RM will keep waiting.
> 
> Sebastian Toader wrote:
>     role_command_order.json won't work with Blueprints, as with Blueprints there is no cluster-wide ordering.
>     
>     RM will keep waiting only until it exhausts the retries (8 * 20 secs)
> 
> Alejandro Fernandez wrote:
>     Can we make Blueprints respect role_command_order?
>     Please include Robert Nettleton in the code review.
> 
> Andrew Onischuk wrote:
>     Alejandro, we made BP not respect RCO to speed up deployments for one of the users, who is very sensitive to deployment times. If we revert that change, we are going to run into that problem for him again.

I think this is a fix for one item in a larger picture. If BP doesn't respect RCO, then there are bound to be many more ordering errors like this.
In that case, we may spend a lot of effort adding hacks to components so they keep retrying certain operations because other components on other hosts are not fully up;
think of Hive, Spark, History Server, and Tez trying to upload tarballs to HDFS, which may not be ready yet.

A more flexible way of fixing this is to enable auto-start for this environment. If RM fails because ATS hasn't yet created directories in HDFS, then keep retrying RM. That's a simpler and more general solution.

@Sumit Mohanty, what do you think?


- Alejandro


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43948/#review120537
-----------------------------------------------------------


On Feb. 25, 2016, 1:14 p.m., Sebastian Toader wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43948/
> -----------------------------------------------------------
> 
> (Updated Feb. 25, 2016, 1:14 p.m.)
> 
> 
> Review request for Ambari, Alejandro Fernandez, Andrew Onischuk, Sumit Mohanty, and Sid Wagle.
> 
> 
> Bugs: AMBARI-15158
>     https://issues.apache.org/jira/browse/AMBARI-15158
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> If ATS is installed, then after starting, the Resource Manager checks whether the directories where ATS stores timeline data for active and completed applications exist in DFS. There might be cases when RM comes up much earlier than ATS creates these directories. In these situations RM stops with an "IOException: /ats/active does not exist" error message.
> 
> In order to avoid this situation, the Python script responsible for starting the RM component has been modified to check the existence of these directories upfront, before the RM process is started. This check is performed only if ATS is installed and either yarn.timeline-service.entity-group-fs-store.active-dir or yarn.timeline-service.entity-group-fs-store.done-dir is set.
> 
> 
> Diffs
> -----
> 
>   ambari-common/src/main/python/resource_management/libraries/providers/hdfs_resource.py b73ae56 
>   ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/params_linux.py 2ef404d 
>   ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py ec7799e 
> 
> Diff: https://reviews.apache.org/r/43948/diff/
> 
> 
> Testing
> -------
> 
> Manual testing:
> 1. Created secure/non-secure clusters with Blueprint where NN, RM and ATS were deployed to different nodes. This was tested with both cases when HDFS has webhdfs enabled and disabled.
> 2. Created a cluster using the UI where NN, RM and ATS were deployed to different nodes. After the cluster was kerberized and was tested with both cases when HDFS has webhdfs enabled and disabled.
> 
> Python tests results:
> ----------------------------------------------------------------------
> Total run:902
> Total errors:0
> Total failures:0
> OK
> 
> 
> Thanks,
> 
> Sebastian Toader
> 
>


Re: Review Request 43948: RM fails to start: IOException: /ats/active does not exist

Posted by Sebastian Toader <st...@hortonworks.com>.

> On Feb. 24, 2016, 7:56 p.m., Alejandro Fernandez wrote:
> > ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py, line 123
> > <https://reviews.apache.org/r/43948/diff/1/?file=1267791#file1267791line123>
> >
> >     If this happens during cluster install, why don't we put a dependency in role_command_order.json that RM must start after ATS.
> >     
> >     If ATS is on host1 and RM on host2, and during fresh cluster install we fail to install ATS, then RM will keep waiting.

role_command_order.json won't work with Blueprints, as with Blueprints there is no cluster-wide ordering.

RM will keep waiting only until it exhausts the retries (8 * 20 secs)
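The bounded wait described above (a fixed number of retries with a sleep between attempts) can be sketched roughly as follows. The function and parameter names here are illustrative, not the actual Ambari code; `dir_exists_fn` stands in for whatever existence check the script performs (hadoop binary or WebHDFS):

```python
import time


def wait_for_hdfs_directory(path, dir_exists_fn, tries=8, try_sleep=20):
    """Poll until the HDFS directory exists or the retries are exhausted.

    dir_exists_fn is an illustrative callable returning True once the
    directory is visible in DFS. With the defaults this waits at most
    8 attempts * 20 seconds before giving up.
    """
    for attempt in range(tries):
        if dir_exists_fn(path):
            return True
        if attempt < tries - 1:
            time.sleep(try_sleep)
    raise IOError("%s does not exist after %d attempts" % (path, tries))
```

If the directory never appears, the helper fails loudly instead of letting RM crash later with the opaque IOException.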


- Sebastian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43948/#review120537
-----------------------------------------------------------


On Feb. 24, 2016, 5:39 p.m., Sebastian Toader wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43948/
> -----------------------------------------------------------
> 
> (Updated Feb. 24, 2016, 5:39 p.m.)
> 
> 
> Review request for Ambari, Alejandro Fernandez, Andrew Onischuk, Sumit Mohanty, and Sid Wagle.
> 
> 
> Bugs: AMBARI-15158
>     https://issues.apache.org/jira/browse/AMBARI-15158
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> If ATS is installed, then after starting, the Resource Manager checks whether the directories where ATS stores timeline data for active and completed applications exist in DFS. There might be cases when RM comes up much earlier than ATS creates these directories. In these situations RM stops with an "IOException: /ats/active does not exist" error message.
> 
> In order to avoid this situation, the Python script responsible for starting the RM component has been modified to check the existence of these directories upfront, before the RM process is started. This check is performed only if ATS is installed and either yarn.timeline-service.entity-group-fs-store.active-dir or yarn.timeline-service.entity-group-fs-store.done-dir is set.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/params_linux.py 2ef404d 
>   ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py ec7799e 
> 
> Diff: https://reviews.apache.org/r/43948/diff/
> 
> 
> Testing
> -------
> 
> Manual testing:
> 1. Created secure/non-secure clusters with Blueprint where NN, RM and ATS were deployed to different nodes. This was tested with both cases when HDFS has webhdfs enabled and disabled.
> 2. Created a cluster using the UI where NN, RM and ATS were deployed to different nodes. After the cluster was kerberized and was tested with both cases when HDFS has webhdfs enabled and disabled.
> 
> Python tests results:
> ----------------------------------------------------------------------
> Total run:902
> Total errors:0
> Total failures:0
> OK
> 
> 
> Thanks,
> 
> Sebastian Toader
> 
>


Re: Review Request 43948: RM fails to start: IOException: /ats/active does not exist

Posted by Alejandro Fernandez <af...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43948/#review120537
-----------------------------------------------------------




ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py (line 120)
<https://reviews.apache.org/r/43948/#comment182000>

    If this happens during cluster install, why don't we put a dependency in role_command_order.json that RM must start after ATS.
    
    If ATS is on host1 and RM on host2, and during fresh cluster install we fail to install ATS, then RM will keep waiting.



ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py (line 122)
<https://reviews.apache.org/r/43948/#comment182001>

    Can slightly optimize here by passing a list of dirs, so that we only have to kinit once and make fewer calls.
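The batching idea could look roughly like this. Everything here is hypothetical (the helper name, the keytab path, and `run_as_hdfs_cmd`, which stands in for however the script shells out as the hdfs user); the point is one kinit followed by one existence test per directory:

```python
def dirs_missing(paths, run_as_hdfs_cmd):
    """Return the subset of HDFS paths that do not yet exist.

    run_as_hdfs_cmd is an illustrative callable: given a shell command,
    it runs it as the hdfs user and returns the exit code. We kinit once
    up front instead of once per directory.
    """
    run_as_hdfs_cmd("kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs")
    # `hdfs dfs -test -d <path>` exits 0 when the directory exists.
    return [p for p in paths if run_as_hdfs_cmd("hdfs dfs -test -d %s" % p) != 0]
```

The caller would then retry only the paths still reported missing.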



ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py (line 253)
<https://reviews.apache.org/r/43948/#comment182002>

    Good practice to declare variables like dir_exists in the same scope, so before the "if" block.


- Alejandro Fernandez


On Feb. 24, 2016, 4:39 p.m., Sebastian Toader wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43948/
> -----------------------------------------------------------
> 
> (Updated Feb. 24, 2016, 4:39 p.m.)
> 
> 
> Review request for Ambari, Alejandro Fernandez, Andrew Onischuk, Sumit Mohanty, and Sid Wagle.
> 
> 
> Bugs: AMBARI-15158
>     https://issues.apache.org/jira/browse/AMBARI-15158
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> If ATS is installed, then after starting, the Resource Manager checks whether the directories where ATS stores timeline data for active and completed applications exist in DFS. There might be cases when RM comes up much earlier than ATS creates these directories. In these situations RM stops with an "IOException: /ats/active does not exist" error message.
> 
> In order to avoid this situation, the Python script responsible for starting the RM component has been modified to check the existence of these directories upfront, before the RM process is started. This check is performed only if ATS is installed and either yarn.timeline-service.entity-group-fs-store.active-dir or yarn.timeline-service.entity-group-fs-store.done-dir is set.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/params_linux.py 2ef404d 
>   ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py ec7799e 
> 
> Diff: https://reviews.apache.org/r/43948/diff/
> 
> 
> Testing
> -------
> 
> Manual testing:
> 1. Created secure/non-secure clusters with Blueprint where NN, RM and ATS were deployed to different nodes. This was tested with both cases when HDFS has webhdfs enabled and disabled.
> 2. Created a cluster using the UI where NN, RM and ATS were deployed to different nodes. After the cluster was kerberized and was tested with both cases when HDFS has webhdfs enabled and disabled.
> 
> Python tests results:
> ----------------------------------------------------------------------
> Total run:902
> Total errors:0
> Total failures:0
> OK
> 
> 
> Thanks,
> 
> Sebastian Toader
> 
>


Re: Review Request 43948: RM fails to start: IOException: /ats/active does not exist

Posted by Sebastian Toader <st...@hortonworks.com>.

> On Feb. 24, 2016, 6 p.m., Andrew Onischuk wrote:
> > ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py, line 234
> > <https://reviews.apache.org/r/43948/diff/1/?file=1267791#file1267791line234>
> >
> >     Would this be better to move this to libraries/functions, in case we will need this in other services?
> >     
> >     Also might be better to name it waitForHdfsDirectoryCreated, rather than checkHdfsDir, so it's easier to understand what function does.

Renamed the function as suggested. I haven't spent time on making the function generic enough for libraries/functions, as I wanted to get this bug fixed and committed quickly. If this function could be used in other places as well, then I think we can raise a separate JIRA to make it generic (maybe extend the HdfsResource class to provide this functionality).


- Sebastian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43948/#review120519
-----------------------------------------------------------


On Feb. 24, 2016, 5:39 p.m., Sebastian Toader wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43948/
> -----------------------------------------------------------
> 
> (Updated Feb. 24, 2016, 5:39 p.m.)
> 
> 
> Review request for Ambari, Alejandro Fernandez, Andrew Onischuk, Sumit Mohanty, and Sid Wagle.
> 
> 
> Bugs: AMBARI-15158
>     https://issues.apache.org/jira/browse/AMBARI-15158
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> If ATS is installed, then after starting, the Resource Manager checks whether the directories where ATS stores timeline data for active and completed applications exist in DFS. There might be cases when RM comes up much earlier than ATS creates these directories. In these situations RM stops with an "IOException: /ats/active does not exist" error message.
> 
> In order to avoid this situation, the Python script responsible for starting the RM component has been modified to check the existence of these directories upfront, before the RM process is started. This check is performed only if ATS is installed and either yarn.timeline-service.entity-group-fs-store.active-dir or yarn.timeline-service.entity-group-fs-store.done-dir is set.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/params_linux.py 2ef404d 
>   ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py ec7799e 
> 
> Diff: https://reviews.apache.org/r/43948/diff/
> 
> 
> Testing
> -------
> 
> Manual testing:
> 1. Created secure/non-secure clusters with Blueprint where NN, RM and ATS were deployed to different nodes. This was tested with both cases when HDFS has webhdfs enabled and disabled.
> 2. Created a cluster using the UI where NN, RM and ATS were deployed to different nodes. After the cluster was kerberized and was tested with both cases when HDFS has webhdfs enabled and disabled.
> 
> Python tests results:
> ----------------------------------------------------------------------
> Total run:902
> Total errors:0
> Total failures:0
> OK
> 
> 
> Thanks,
> 
> Sebastian Toader
> 
>


Re: Review Request 43948: RM fails to start: IOException: /ats/active does not exist

Posted by Andrew Onischuk <ao...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43948/#review120519
-----------------------------------------------------------




ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py (line 231)
<https://reviews.apache.org/r/43948/#comment181982>

    Would this be better to move this to libraries/functions, in case we will need this in other services?
    
    Also might be better to name it waitForHdfsDirectoryCreated, rather than checkHdfsDir, so it's easier to understand what function does.


- Andrew Onischuk


On Feb. 24, 2016, 4:39 p.m., Sebastian Toader wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43948/
> -----------------------------------------------------------
> 
> (Updated Feb. 24, 2016, 4:39 p.m.)
> 
> 
> Review request for Ambari, Alejandro Fernandez, Andrew Onischuk, Sumit Mohanty, and Sid Wagle.
> 
> 
> Bugs: AMBARI-15158
>     https://issues.apache.org/jira/browse/AMBARI-15158
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> If ATS is installed, then after starting, the Resource Manager checks whether the directories where ATS stores timeline data for active and completed applications exist in DFS. There might be cases when RM comes up much earlier than ATS creates these directories. In these situations RM stops with an "IOException: /ats/active does not exist" error message.
> 
> In order to avoid this situation, the Python script responsible for starting the RM component has been modified to check the existence of these directories upfront, before the RM process is started. This check is performed only if ATS is installed and either yarn.timeline-service.entity-group-fs-store.active-dir or yarn.timeline-service.entity-group-fs-store.done-dir is set.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/params_linux.py 2ef404d 
>   ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py ec7799e 
> 
> Diff: https://reviews.apache.org/r/43948/diff/
> 
> 
> Testing
> -------
> 
> Manual testing:
> 1. Created secure/non-secure clusters with Blueprint where NN, RM and ATS were deployed to different nodes. This was tested with both cases when HDFS has webhdfs enabled and disabled.
> 2. Created a cluster using the UI where NN, RM and ATS were deployed to different nodes. After the cluster was kerberized and was tested with both cases when HDFS has webhdfs enabled and disabled.
> 
> Python tests results:
> ----------------------------------------------------------------------
> Total run:902
> Total errors:0
> Total failures:0
> OK
> 
> 
> Thanks,
> 
> Sebastian Toader
> 
>


Re: Review Request 43948: RM fails to start: IOException: /ats/active does not exist

Posted by Andrew Onischuk <ao...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43948/#review120520
-----------------------------------------------------------




ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py (line 250)
<https://reviews.apache.org/r/43948/#comment181983>

    This will result in a slowdown on those WASB clusters, the thing we tried to reduce recently.
    
    Any operations with the hadoop binary are really slow there.
    
    Can we check if the directory is in /var/lib/ambari-agent/data/.hdfs_resource_ignore before doing that? (Basically, check if it's precreated - that's what we do there.)
    
    cc @Sumit Mohanty
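The suggested shortcut could be sketched along these lines. This is a simplification under an assumed file layout (one path per line); the real `.hdfs_resource_ignore` format kept by the agent may differ, and the helper name is hypothetical:

```python
import os

IGNORE_FILE = "/var/lib/ambari-agent/data/.hdfs_resource_ignore"


def is_precreated(path, ignore_file=IGNORE_FILE):
    """Return True if the path is already recorded as precreated, so the
    expensive hadoop-binary existence check can be skipped.

    Assumes one path per line in the ignore file; the actual file layout
    used by the agent may differ.
    """
    if not os.path.isfile(ignore_file):
        return False
    with open(ignore_file) as f:
        return path in (line.strip() for line in f)
```

Only when the path is not recorded would the script fall back to the slow hadoop-binary check.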


- Andrew Onischuk


On Feb. 24, 2016, 4:39 p.m., Sebastian Toader wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43948/
> -----------------------------------------------------------
> 
> (Updated Feb. 24, 2016, 4:39 p.m.)
> 
> 
> Review request for Ambari, Alejandro Fernandez, Andrew Onischuk, Sumit Mohanty, and Sid Wagle.
> 
> 
> Bugs: AMBARI-15158
>     https://issues.apache.org/jira/browse/AMBARI-15158
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> If ATS is installed, then after starting, the Resource Manager checks whether the directories where ATS stores timeline data for active and completed applications exist in DFS. There might be cases when RM comes up much earlier than ATS creates these directories. In these situations RM stops with an "IOException: /ats/active does not exist" error message.
> 
> In order to avoid this situation, the Python script responsible for starting the RM component has been modified to check the existence of these directories upfront, before the RM process is started. This check is performed only if ATS is installed and either yarn.timeline-service.entity-group-fs-store.active-dir or yarn.timeline-service.entity-group-fs-store.done-dir is set.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/params_linux.py 2ef404d 
>   ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py ec7799e 
> 
> Diff: https://reviews.apache.org/r/43948/diff/
> 
> 
> Testing
> -------
> 
> Manual testing:
> 1. Created secure/non-secure clusters with Blueprint where NN, RM and ATS were deployed to different nodes. This was tested with both cases when HDFS has webhdfs enabled and disabled.
> 2. Created a cluster using the UI where NN, RM and ATS were deployed to different nodes. After the cluster was kerberized and was tested with both cases when HDFS has webhdfs enabled and disabled.
> 
> Python tests results:
> ----------------------------------------------------------------------
> Total run:902
> Total errors:0
> Total failures:0
> OK
> 
> 
> Thanks,
> 
> Sebastian Toader
> 
>