Posted to dev@ambari.apache.org by Alejandro Fernandez <af...@hortonworks.com> on 2015/09/23 00:17:50 UTC

Review Request 38651: AMBARI-13194. Alert definition when DataNode data dirs are likely to become unmounted

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38651/
-----------------------------------------------------------

Review request for Ambari, Andrew Onischuk, Dmitro Lisnichenko, Jonathan Hurley, Nate Cole, Sumit Mohanty, Srimanth Gunturi, and Sid Wagle.


Bugs: AMBARI-13194
    https://issues.apache.org/jira/browse/AMBARI-13194


Repository: ambari


Description
-------

Ambari uses the dfs.datanode.data.dir.mount.file property in HDFS, whose value is typically /etc/hadoop/conf/dfs_data_dir_mount.hist
to track the mount points for each of the data dirs.

E.g.,
{code}
/hadoop01/data,/device1
/hadoop02/data,/device2
/hadoop03/data,/     # this one is on root, the others are all on mount points.
{code}

Whenever a drive becomes unmounted, Ambari detects that it was previously on a mount and will not create that data dir; HDFS can still tolerate the failure if dfs.datanode.failed.volumes.tolerated is greater than 0.
Now, if the /etc/hadoop/conf/dfs_data_dir_mount.hist file is deleted, then Ambari won't have this knowledge, and will create the datadir (even if it's on the root partition).
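
As a rough illustration of the behavior described above (not the actual Ambari code), the following Python sketch parses history lines of the form data_dir,mount_point and decides whether a data dir should be created. The names parse_mount_history and should_create_data_dir are hypothetical, and stripping "#" annotations is an assumption made only so that the example above parses; the real file may not contain comments.

```python
def parse_mount_history(text):
    """Parse 'data_dir,mount_point' lines into a dict.

    Blank lines and anything after '#' are ignored (an assumption for this sketch).
    """
    history = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        data_dir, _, mount_point = line.partition(",")
        if mount_point:
            history[data_dir.strip()] = mount_point.strip()
    return history


def should_create_data_dir(data_dir, history, current_mount):
    """Return False when a dir that used to live on a mount now resolves to root."""
    last_mount = history.get(data_dir)
    if last_mount is None:
        # No record (for example, the history file was deleted): nothing to
        # compare against, so the dir would be created.
        return True
    if last_mount != "/" and current_mount == "/":
        # Previously on a dedicated mount, now on root: the drive is likely unmounted.
        return False
    return True
```

With an empty history (the deleted-file case), should_create_data_dir always returns True, which is exactly the gap the proposed alert is meant to surface.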

To improve tracking, create an alert definition that checks the following
* warning status if the /etc/hadoop/conf/dfs_data_dir_mount.hist file is deleted
* critical status if at least one of the data dirs is mounted on the root partition, and at least one data dir is on a mount
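
The two checks above could be sketched roughly as follows. This is a minimal sketch and not the contents of alert_datanode_unmounted_data_dir.py: the function name evaluate_alert, the plain-string status values, the message texts, and the choice to test for the missing file first are all assumptions.

```python
OK, WARNING, CRITICAL = "OK", "WARNING", "CRITICAL"


def evaluate_alert(history_file_exists, mount_by_data_dir):
    """Return a (status, message) tuple for the two conditions in the description.

    mount_by_data_dir maps each data dir to the mount point it currently sits on.
    """
    if not history_file_exists:
        # Assumption: the missing-file check takes precedence over the mount check.
        return WARNING, "Mount history file is missing."

    on_root = sorted(d for d, m in mount_by_data_dir.items() if m == "/")
    on_mount = sorted(d for d, m in mount_by_data_dir.items() if m != "/")

    if on_root and on_mount:
        # Mixed layout: some dirs on root while others are on dedicated mounts.
        return CRITICAL, "Data dirs on the root partition: " + ", ".join(on_root)

    return OK, "All data dirs are consistently placed."
```

A host whose data dirs are all on root (or all on mounts) reports OK in this sketch, because the critical condition requires a mix of both.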


Diffs
-----

  ambari-common/src/main/python/resource_management/core/providers/system.py 213adc5 
  ambari-common/src/main/python/resource_management/libraries/functions/dfs_datanode_helper.py a05e162 
  ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/alerts.json 477fd95 
  ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_datanode_unmounted_data_dir.py PRE-CREATION 
  ambari-server/src/test/python/stacks/2.0.6/HDFS/test_alert_datanode_unmounted_data_dir.py PRE-CREATION 

Diff: https://reviews.apache.org/r/38651/diff/


Testing
-------

* Python unit tests passed
* Verified that the alert worked on several hosts for all 3 types of statuses (WARNING, CRITICAL, OK)
* Also checked that it did not run on a host without DataNode, and it did run once I added DataNode to that host


Thanks,

Alejandro Fernandez


Re: Review Request 38651: AMBARI-13194. Alert definition when DataNode data dirs are likely to become unmounted

Posted by Alejandro Fernandez <af...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38651/#review100080
-----------------------------------------------------------



ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/alerts.json (line 654)
<https://reviews.apache.org/r/38651/#comment157182>

    New alert definition.



ambari-server/src/test/python/stacks/2.0.6/HDFS/test_alert_datanode_unmounted_data_dir.py (line 44)
<https://reviews.apache.org/r/38651/#comment157185>

    Unit tests


- Alejandro Fernandez


On Sept. 22, 2015, 10:17 p.m., Alejandro Fernandez wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/38651/
> -----------------------------------------------------------
> 
> (Updated Sept. 22, 2015, 10:17 p.m.)
> 
> 
> Review request for Ambari, Andrew Onischuk, Dmitro Lisnichenko, Jonathan Hurley, Nate Cole, Sumit Mohanty, Srimanth Gunturi, and Sid Wagle.
> 
> 
> Bugs: AMBARI-13194
>     https://issues.apache.org/jira/browse/AMBARI-13194
> 
> 
> Repository: ambari
> 
> 


Re: Review Request 38651: AMBARI-13194. Alert definition when DataNode data dirs become unmounted

Posted by Alejandro Fernandez <af...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38651/
-----------------------------------------------------------

(Updated Sept. 23, 2015, 9:34 p.m.)


Review request for Ambari, Andrew Onischuk, Dmitro Lisnichenko, Jonathan Hurley, Nate Cole, Sumit Mohanty, Srimanth Gunturi, and Sid Wagle.


Changes
-------

Added more logic to compare the actual mount points against the expected in the history file.
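
A comparison of this kind might look like the sketch below. The updated diff itself is not shown in this thread, so the helper names get_mount_point and find_unmounted_data_dirs are hypothetical, and walking up the path with os.path.ismount is only one plausible way to find the current mount point.

```python
import os


def get_mount_point(path):
    """Walk up from path until a mount point is reached (ends at the filesystem root)."""
    path = os.path.realpath(path)
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path


def find_unmounted_data_dirs(expected, actual):
    """Return data dirs recorded on a non-root mount that now resolve to root.

    expected: data dir -> mount point from the history file
    actual:   data dir -> mount point observed now
    """
    return sorted(
        data_dir
        for data_dir, expected_mount in expected.items()
        if expected_mount != "/" and actual.get(data_dir) == "/"
    )
```

In practice, actual could be built with get_mount_point for each configured data dir, while expected comes from the history file.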


Summary (updated)
-----------------

AMBARI-13194. Alert definition when DataNode data dirs become unmounted


Bugs: AMBARI-13194
    https://issues.apache.org/jira/browse/AMBARI-13194


Repository: ambari


Description
-------

Ambari uses the dfs.datanode.data.dir.mount.file property in HDFS, whose value is typically /etc/hadoop/conf/dfs_data_dir_mount.hist
to track the mount points for each of the data dirs.

E.g.,
{code}
/hadoop01/data,/device1
/hadoop02/data,/device2
/hadoop03/data,/     # this one is on root, the others are all on mount points.
{code}

Whenever a drive becomes unmounted, Ambari detects that it was previously on a mount and will not create that data dir; HDFS can still tolerate the failure if dfs.datanode.failed.volumes.tolerated is greater than 0.
Now, if the /etc/hadoop/conf/dfs_data_dir_mount.hist file is deleted, then Ambari won't have this knowledge, and will create the datadir (even if it's on the root partition).

To improve tracking, create an alert definition that checks the following
* warning status if the /etc/hadoop/conf/dfs_data_dir_mount.hist file is deleted
* critical status if at least one of the data dirs is mounted on the root partition, and at least one data dir is on a mount


Diffs (updated)
-----

  ambari-agent/src/test/python/resource_management/TestDatanodeHelper.py e348cc4 
  ambari-common/src/main/python/resource_management/core/providers/system.py 213adc5 
  ambari-common/src/main/python/resource_management/libraries/functions/dfs_datanode_helper.py a05e162 
  ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/alerts.json 2fcacc8 
  ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_datanode_unmounted_data_dir.py PRE-CREATION 
  ambari-server/src/test/python/stacks/2.0.6/HDFS/test_alert_datanode_unmounted_data_dir.py PRE-CREATION 

Diff: https://reviews.apache.org/r/38651/diff/


Testing
-------

* Python unit tests passed
* Verified that the alert worked on several hosts for all 3 types of statuses (WARNING, CRITICAL, OK)
* Also checked that it did not run on a host without DataNode, and it did run once I added DataNode to that host


Thanks,

Alejandro Fernandez


Re: Review Request 38651: AMBARI-13194. Alert definition when DataNode data dirs are likely to become unmounted

Posted by Alejandro Fernandez <af...@hortonworks.com>.

> On Sept. 23, 2015, 5:17 p.m., Nate Cole wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_datanode_unmounted_data_dir.py, lines 76-77
> > <https://reviews.apache.org/r/38651/diff/1/?file=1081590#file1081590line76>
> >
> >     What is the purpose of the data_dir_mount_file here? I don't see you referencing it other than to see if it exists. Do you have to cross-reference the contents at all? (Likely I don't know the purpose of said file.)

The file is a poor man's DB of the last known mapping of each data dir to the mount device it was on.
E.g.,
/grid/0/data,/device1
/grid/1/data,/device2

If the file is not present, the alert is merely a WARNING instead of a CRITICAL. Users upgrading from Ambari 1.7.0 to 2.1.2 need to restart DataNode to generate that file.
If the file is deleted inadvertently, then Ambari won't know whether a data dir was previously on a mount and is now unmounted.


- Alejandro


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38651/#review100224
-----------------------------------------------------------


On Sept. 23, 2015, 4:46 p.m., Alejandro Fernandez wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/38651/
> -----------------------------------------------------------
> 
> (Updated Sept. 23, 2015, 4:46 p.m.)
> 
> 
> Review request for Ambari, Andrew Onischuk, Dmitro Lisnichenko, Jonathan Hurley, Nate Cole, Sumit Mohanty, Srimanth Gunturi, and Sid Wagle.
> 
> 
> Bugs: AMBARI-13194
>     https://issues.apache.org/jira/browse/AMBARI-13194
> 
> 
> Repository: ambari
> 
> 


Re: Review Request 38651: AMBARI-13194. Alert definition when DataNode data dirs are likely to become unmounted

Posted by Nate Cole <nc...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38651/#review100224
-----------------------------------------------------------

Ship it!



ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_datanode_unmounted_data_dir.py (lines 76 - 77)
<https://reviews.apache.org/r/38651/#comment157383>

    What is the purpose of the data_dir_mount_file here? I don't see you referencing it other than to see if it exists. Do you have to cross-reference the contents at all? (Likely I don't know the purpose of said file.)


- Nate Cole


On Sept. 23, 2015, 12:46 p.m., Alejandro Fernandez wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/38651/
> -----------------------------------------------------------
> 
> (Updated Sept. 23, 2015, 12:46 p.m.)
> 
> 
> Review request for Ambari, Andrew Onischuk, Dmitro Lisnichenko, Jonathan Hurley, Nate Cole, Sumit Mohanty, Srimanth Gunturi, and Sid Wagle.
> 
> 
> Bugs: AMBARI-13194
>     https://issues.apache.org/jira/browse/AMBARI-13194
> 
> 
> Repository: ambari
> 
> 


Re: Review Request 38651: AMBARI-13194. Alert definition when DataNode data dirs are likely to become unmounted

Posted by Nate Cole <nc...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38651/#review100269
-----------------------------------------------------------

Ship it!


- Nate Cole


On Sept. 23, 2015, 12:46 p.m., Alejandro Fernandez wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/38651/
> -----------------------------------------------------------
> 
> (Updated Sept. 23, 2015, 12:46 p.m.)
> 
> 
> Review request for Ambari, Andrew Onischuk, Dmitro Lisnichenko, Jonathan Hurley, Nate Cole, Sumit Mohanty, Srimanth Gunturi, and Sid Wagle.
> 
> 
> Bugs: AMBARI-13194
>     https://issues.apache.org/jira/browse/AMBARI-13194
> 
> 
> Repository: ambari
> 
> 


Re: Review Request 38651: AMBARI-13194. Alert definition when DataNode data dirs are likely to become unmounted

Posted by Alejandro Fernandez <af...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38651/
-----------------------------------------------------------

(Updated Sept. 23, 2015, 4:46 p.m.)


Review request for Ambari, Andrew Onischuk, Dmitro Lisnichenko, Jonathan Hurley, Nate Cole, Sumit Mohanty, Srimanth Gunturi, and Sid Wagle.


Changes
-------

Addressed comments


Bugs: AMBARI-13194
    https://issues.apache.org/jira/browse/AMBARI-13194


Repository: ambari


Description
-------

Ambari uses the dfs.datanode.data.dir.mount.file property in HDFS, whose value is typically /etc/hadoop/conf/dfs_data_dir_mount.hist
to track the mount points for each of the data dirs.

E.g.,
{code}
/hadoop01/data,/device1
/hadoop02/data,/device2
/hadoop03/data,/     # this one is on root, the others are all on mount points.
{code}

Whenever a drive becomes unmounted, Ambari detects that it was previously on a mount and will not create that data dir; HDFS can still tolerate the failure if dfs.datanode.failed.volumes.tolerated is greater than 0.
Now, if the /etc/hadoop/conf/dfs_data_dir_mount.hist file is deleted, then Ambari won't have this knowledge, and will create the datadir (even if it's on the root partition).

To improve tracking, create an alert definition that checks the following
* warning status if the /etc/hadoop/conf/dfs_data_dir_mount.hist file is deleted
* critical status if at least one of the data dirs is mounted on the root partition, and at least one data dir is on a mount


Diffs (updated)
-----

  ambari-common/src/main/python/resource_management/core/providers/system.py 213adc5 
  ambari-common/src/main/python/resource_management/libraries/functions/dfs_datanode_helper.py a05e162 
  ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/alerts.json 2fcacc8 
  ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_datanode_unmounted_data_dir.py PRE-CREATION 
  ambari-server/src/test/python/stacks/2.0.6/HDFS/test_alert_datanode_unmounted_data_dir.py PRE-CREATION 

Diff: https://reviews.apache.org/r/38651/diff/


Testing
-------

* Python unit tests passed
* Verified that the alert worked on several hosts for all 3 types of statuses (WARNING, CRITICAL, OK)
* Also checked that it did not run on a host without DataNode, and it did run once I added DataNode to that host


Thanks,

Alejandro Fernandez


Re: Review Request 38651: AMBARI-13194. Alert definition when DataNode data dirs are likely to become unmounted

Posted by Alejandro Fernandez <af...@hortonworks.com>.

> On Sept. 23, 2015, 4:25 p.m., Andrew Onischuk wrote:
> > ambari-common/src/main/python/resource_management/core/providers/system.py, line 156
> > <https://reviews.apache.org/r/38651/diff/1/?file=1081587#file1081587line156>
> >
> >     Why exactly do we need these messages, since we already have
> >     Directory[.., recursive=True] in the logs? I guess printing once when the directory doesn't exist is enough.
> 
> Andrew Onischuk wrote:
>     meaning "Creating directory %s"

I can remove it if you want. In general, I think our logs should be more human-readable and easier to troubleshoot from; too often we have to reproduce an issue because the logs don't provide enough information to determine the code path taken.


- Alejandro


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38651/#review100207
-----------------------------------------------------------


On Sept. 22, 2015, 10:17 p.m., Alejandro Fernandez wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/38651/
> -----------------------------------------------------------
> 
> (Updated Sept. 22, 2015, 10:17 p.m.)
> 
> 
> Review request for Ambari, Andrew Onischuk, Dmitro Lisnichenko, Jonathan Hurley, Nate Cole, Sumit Mohanty, Srimanth Gunturi, and Sid Wagle.
> 
> 
> Bugs: AMBARI-13194
>     https://issues.apache.org/jira/browse/AMBARI-13194
> 
> 
> Repository: ambari
> 
> 


Re: Review Request 38651: AMBARI-13194. Alert definition when DataNode data dirs are likely to become unmounted

Posted by Andrew Onischuk <ao...@hortonworks.com>.

> On Sept. 23, 2015, 4:25 p.m., Andrew Onischuk wrote:
> > ambari-common/src/main/python/resource_management/core/providers/system.py, line 156
> > <https://reviews.apache.org/r/38651/diff/1/?file=1081587#file1081587line156>
> >
> >     Why exactly do we need these messages, since we already have
> >     Directory[.., recursive=True] in the logs? I guess printing once when the directory doesn't exist is enough.
> 
> Andrew Onischuk wrote:
>     meaning "Creating directory %s"
> 
> Alejandro Fernandez wrote:
>     I can remove it if you want. In general, I think our logs should be more human-readable and easier to troubleshoot from; too often we have to reproduce an issue because the logs don't provide enough information to determine the code path taken.

Alejandro, can you take a look at this:

Directory["/tmp", recursive=True]
Creating directory /tmp because it doesn't exist
Creating recursive directory /tmp

I think at least one of the last 2 messages is redundant and doesn't give new info. Let's delete it so we don't flood the logs.
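
One way to keep a single informative message is sketched below. It uses only the Python standard library and is not the Directory provider from system.py; the function name ensure_directory, the logger name, and the message wording are assumptions for illustration.

```python
import logging
import os

logger = logging.getLogger("directory_sketch")


def ensure_directory(path, recursive=False):
    """Create path if it is missing, logging once. Return True if it was created."""
    if os.path.isdir(path):
        return False
    # A single message carries both the reason and the mode, so there is no
    # need for a second "Creating recursive directory" line.
    logger.info("Creating directory %s since it doesn't exist (recursive=%s)", path, recursive)
    if recursive:
        os.makedirs(path)
    else:
        os.mkdir(path)
    return True
```

Calling ensure_directory a second time on the same path logs nothing and returns False, so repeated runs do not add noise.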


- Andrew


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38651/#review100207
-----------------------------------------------------------


On Sept. 22, 2015, 10:17 p.m., Alejandro Fernandez wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/38651/
> -----------------------------------------------------------
> 
> (Updated Sept. 22, 2015, 10:17 p.m.)
> 
> 
> Review request for Ambari, Andrew Onischuk, Dmitro Lisnichenko, Jonathan Hurley, Nate Cole, Sumit Mohanty, Srimanth Gunturi, and Sid Wagle.
> 
> 
> Bugs: AMBARI-13194
>     https://issues.apache.org/jira/browse/AMBARI-13194
> 
> 
> Repository: ambari
> 
> 


Re: Review Request 38651: AMBARI-13194. Alert definition when DataNode data dirs are likely to become unmounted

Posted by Andrew Onischuk <ao...@hortonworks.com>.

> On Sept. 23, 2015, 4:25 p.m., Andrew Onischuk wrote:
> > ambari-common/src/main/python/resource_management/core/providers/system.py, line 156
> > <https://reviews.apache.org/r/38651/diff/1/?file=1081587#file1081587line156>
> >
> >     Why exactly do we need these messages, since we already have
> >     Directory[.., recursive=True] in the logs? I guess printing once when the directory doesn't exist is enough.

meaning "Creating directory %s"


- Andrew


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38651/#review100207
-----------------------------------------------------------


On Sept. 22, 2015, 10:17 p.m., Alejandro Fernandez wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/38651/
> -----------------------------------------------------------
> 
> (Updated Sept. 22, 2015, 10:17 p.m.)
> 
> 
> Review request for Ambari, Andrew Onischuk, Dmitro Lisnichenko, Jonathan Hurley, Nate Cole, Sumit Mohanty, Srimanth Gunturi, and Sid Wagle.
> 
> 
> Bugs: AMBARI-13194
>     https://issues.apache.org/jira/browse/AMBARI-13194
> 
> 
> Repository: ambari
> 
> 


Re: Review Request 38651: AMBARI-13194. Alert definition when DataNode data dirs are likely to become unmounted

Posted by Andrew Onischuk <ao...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38651/#review100207
-----------------------------------------------------------



ambari-common/src/main/python/resource_management/core/providers/system.py (line 156)
<https://reviews.apache.org/r/38651/#comment157364>

    Why exactly do we need these messages? We already have
    Directory[.., recursive=True] in the logs. I guess printing once, when the directory doesn't exist, is enough.


- Andrew Onischuk




Re: Review Request 38651: AMBARI-13194. Alert definition when DataNode data dirs are likely to become unmounted

Posted by Sumit Mohanty <sm...@hortonworks.com>.

> On Sept. 22, 2015, 10:31 p.m., Sumit Mohanty wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_datanode_unmounted_data_dir.py, line 77
> > <https://reviews.apache.org/r/38651/diff/1/?file=1081590#file1081590line77>
> >
> >     Will it result in an alert after an Ambari upgrade? I'm not sure that requiring a DN restart to get rid of an alert is a good idea.
> 
> Alejandro Fernandez wrote:
>     No upgrade is needed to pick up added alert definitions in Ambari 2.1; ambari-server actually loads them from the json file on start.
>     It checks if the history file exists, if the data dirs exist, and if it's possible for the data dirs to have become unmounted.
>     One way to fix the missing history file or missing data dir is to restart DN, but that's not necessarily required.

What I meant is that when I upgrade from Ambari 2.1.0 to 2.1.2, the history file will not exist. Will we see a WARN alert?


- Sumit


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38651/#review100084
-----------------------------------------------------------




Re: Review Request 38651: AMBARI-13194. Alert definition when DataNode data dirs are likely to become unmounted

Posted by Alejandro Fernandez <af...@hortonworks.com>.

> On Sept. 22, 2015, 10:31 p.m., Sumit Mohanty wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_datanode_unmounted_data_dir.py, line 77
> > <https://reviews.apache.org/r/38651/diff/1/?file=1081590#file1081590line77>
> >
> >     Will it result in an alert after an Ambari upgrade? I'm not sure that requiring a DN restart to get rid of an alert is a good idea.

No upgrade is needed to pick up added alert definitions in Ambari 2.1; ambari-server actually loads them from the json file on start.
It checks if the history file exists, if the data dirs exist, and if it's possible for the data dirs to have become unmounted.
One way to fix the missing history file or missing data dir is to restart DN, but that's not necessarily required.
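
The three checks could be sketched roughly as follows. This is an illustration only, not the contents of alert_datanode_unmounted_data_dir.py: parse_mount_history and check_data_dirs are hypothetical names, the message strings are placeholders, and returning WARNING for a missing data dir is an assumption. The history format (data_dir,mount_point per line) is the one shown in the review description.

```python
import os

OK, WARNING, CRITICAL = "OK", "WARNING", "CRITICAL"


def parse_mount_history(text):
    """Parse 'data_dir,mount_point' lines into a dict, skipping blanks and # comments."""
    mounts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or "," not in line:
            continue
        data_dir, mount_point = [part.strip() for part in line.split(",", 1)]
        mounts[data_dir] = mount_point
    return mounts


def check_data_dirs(history_text, data_dirs, dir_exists=os.path.isdir):
    """Return (status, message); history_text is None when the history file is missing."""
    if history_text is None:
        return WARNING, "Mount history file is missing"
    missing = [d for d in data_dirs if not dir_exists(d)]
    if missing:
        # Assumption: the status used for a missing data dir is not stated in this thread.
        return WARNING, "Data dirs do not exist: {0}".format(", ".join(missing))
    mounts = parse_mount_history(history_text)
    on_root = sorted(d for d in data_dirs if mounts.get(d) == "/")
    on_mount = [d for d in data_dirs if mounts.get(d, "/") != "/"]
    if on_root and on_mount:
        # A mix of root and mounted dirs suggests a drive may have become unmounted.
        return CRITICAL, "Data dirs on the root partition: {0}".format(", ".join(on_root))
    return OK, "Data dirs are consistent with the mount history"
```

Passing dir_exists as a parameter keeps the sketch testable without touching the real file system.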


- Alejandro


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38651/#review100084
-----------------------------------------------------------




Re: Review Request 38651: AMBARI-13194. Alert definition when DataNode data dirs are likely to become unmounted

Posted by Alejandro Fernandez <af...@hortonworks.com>.

> On Sept. 22, 2015, 10:31 p.m., Sumit Mohanty wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_datanode_unmounted_data_dir.py, line 77
> > <https://reviews.apache.org/r/38651/diff/1/?file=1081590#file1081590line77>
> >
> >     Will it result in an alert after an Ambari upgrade? I'm not sure that requiring a DN restart to get rid of an alert is a good idea.
> 
> Alejandro Fernandez wrote:
>     No upgrade is needed to pick up added alert definitions in Ambari 2.1; ambari-server actually loads them from the json file on start.
>     It checks if the history file exists, if the data dirs exist, and if it's possible for the data dirs to have become unmounted.
>     One way to fix the missing history file or missing data dir is to restart DN, but that's not necessarily required.
> 
> Sumit Mohanty wrote:
>     What I meant is that when I upgrade from Ambari 2.1.0 to 2.1.2, the history file will not exist. Will we see a WARN alert?

The history file was added in either Ambari 1.7.0 or 2.0.0, and it is created the first time DataNode starts.
This means that existing clusters should not see any warnings; warnings only show up during the installation of a brand-new cluster.


- Alejandro


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38651/#review100084
-----------------------------------------------------------




Re: Review Request 38651: AMBARI-13194. Alert definition when DataNode data dirs are likely to become unmounted

Posted by Sumit Mohanty <sm...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38651/#review100084
-----------------------------------------------------------



ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_datanode_unmounted_data_dir.py (line 77)
<https://reviews.apache.org/r/38651/#comment157188>

    Will it result in an alert after an Ambari upgrade? I'm not sure that requiring a DN restart to get rid of an alert is a good idea.


- Sumit Mohanty




Re: Review Request 38651: AMBARI-13194. Alert definition when DataNode data dirs are likely to become unmounted

Posted by Alejandro Fernandez <af...@hortonworks.com>.

> On Sept. 23, 2015, 12:10 a.m., Jonathan Hurley wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_datanode_unmounted_data_dir.py, line 122
> > <https://reviews.apache.org/r/38651/diff/1/?file=1081590#file1081590line122>
> >
> >     Can this message be clearer? You want to let them know this is a problem. Something like "The mounted data directories are writing to the file system root: {0}"

My goal was to keep the messages short because the UI trims them and only shows the full message on hover.
I can make it more descriptive.


> On Sept. 23, 2015, 12:10 a.m., Jonathan Hurley wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/alerts.json, line 656
> > <https://reviews.apache.org/r/38651/diff/1/?file=1081589#file1081589line656>
> >
> >     Run-on sentence. Consider revising to:
> >     This host-level alert is triggered if a host has 1 or more mounted data directories as well as 1 or more unmounted data directories. This can indicate that the data directory is writing to the root partition, which is undesirable.

Thanks, will fix.


- Alejandro


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38651/#review100099
-----------------------------------------------------------




Re: Review Request 38651: AMBARI-13194. Alert definition when DataNode data dirs are likely to become unmounted

Posted by Jonathan Hurley <jh...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38651/#review100099
-----------------------------------------------------------



ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/alerts.json (line 656)
<https://reviews.apache.org/r/38651/#comment157214>

    Run-on sentence. Consider revising to:
    This host-level alert is triggered if a host has 1 or more mounted data directories as well as 1 or more unmounted data directories. This can indicate that the data directory is writing to the root partition, which is undesirable.



ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_datanode_unmounted_data_dir.py (line 109)
<https://reviews.apache.org/r/38651/#comment157219>

    Nice.



ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_datanode_unmounted_data_dir.py (line 122)
<https://reviews.apache.org/r/38651/#comment157220>

    Can this message be clearer? You want to let them know this is a problem. Something like "The mounted data directories are writing to the file system root: {0}"


- Jonathan Hurley

