
Posted to reviews@ambari.apache.org by Aravindan Vijayan <av...@hortonworks.com> on 2017/07/28 04:50:30 UTC

Review Request 61203: AMBARI-21593 : AMS stopped after RU [AMS distributed mode with 2 collectors]

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61203/
-----------------------------------------------------------

Review request for Ambari, Dmytro Sen, Sumit Mohanty, and Sid Wagle.


Bugs: AMBARI-21593
    https://issues.apache.org/jira/browse/AMBARI-21593


Repository: ambari


Description
-------

PROBLEM
When 2 metric collectors are started up simultaneously, both of them fail to start.

BUG
There is a race condition in the Metric Collector HA controller initialization, introduced through AMBARI-20179. When a Helix controller instance finds that the /ambari-metrics-collector znode exists but a child node does not exist, it deletes the entire znode and recreates it. If another controller instance initializes at the same time, each instance can end up cancelling the work of the other.

FIX
Do not delete and recreate the znode. Instead, wait and retry for a few seconds to check whether /ambari-metrics-collector has been fully initialized.
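
A minimal sketch of the wait-and-retry idea described above, in Java. This is not the code from the patch: the class name ZnodeInitWaiter, the method waitUntilInitialized, and the isFullyInitialized check are hypothetical names chosen for illustration, and the attempt count and sleep interval are left to the caller.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration only; not taken from the patch.
final class ZnodeInitWaiter {

    /**
     * Polls the supplied check up to maxAttempts times, sleeping sleepMillis
     * between attempts. It never deletes or recreates anything.
     *
     * @return true if the check passed, false if attempts ran out or the
     *         thread was interrupted while sleeping
     */
    static boolean waitUntilInitialized(BooleanSupplier isFullyInitialized,
                                        int maxAttempts,
                                        long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (isFullyInitialized.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and give up waiting.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

The check is passed in as a BooleanSupplier so the sketch does not assume any particular Helix or ZooKeeper call.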


Diffs
-----

  ambari-metrics/ambari-metrics-timelineservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/metrics/timeline/availability/MetricCollectorHAController.java 53e6304 


Diff: https://reviews.apache.org/r/61203/diff/1/


Testing
-------

Manually tested.


Thanks,

Aravindan Vijayan

Re: Review Request 61203: AMBARI-21593 : AMS stopped after RU [AMS distributed mode with 2 collectors]

Posted by Sumit Mohanty <sm...@hortonworks.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61203/#review181701
-----------------------------------------------------------


Ship it!




Ship It!

- Sumit Mohanty



Re: Review Request 61203: AMBARI-21593 : AMS stopped after RU [AMS distributed mode with 2 collectors]

Posted by Sid Wagle <sw...@hortonworks.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61203/#review181846
-----------------------------------------------------------


Ship it!




Ship It!

- Sid Wagle



Re: Review Request 61203: AMBARI-21593 : AMS stopped after RU [AMS distributed mode with 2 collectors]

Posted by Aravindan Vijayan <av...@hortonworks.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61203/
-----------------------------------------------------------

(Updated July 31, 2017, 4:25 p.m.)

Review request for Ambari, Dmytro Sen, Sumit Mohanty, and Sid Wagle.

Bugs: AMBARI-21593
https://issues.apache.org/jira/browse/AMBARI-21593

Repository: ambari

Description
-------

PROBLEM
When 2 metric collectors are started up simultaneously, both of them fail to start.

BUG
There is a race condition in the Metric Collector HA controller initialization, introduced through AMBARI-20179. When a Helix controller instance finds that the /ambari-metrics-collector znode exists but a child node does not exist, it deletes the entire znode and recreates it. If another controller instance initializes at the same time, each instance can end up cancelling the work of the other.

FIX
Do not delete and recreate the znode. Instead, wait and retry for a few seconds to check whether /ambari-metrics-collector has been fully initialized.

Diffs (updated)
-----

ambari-metrics/ambari-metrics-timelineservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/metrics/timeline/availability/MetricCollectorHAController.java 53e6304

Diff: https://reviews.apache.org/r/61203/diff/2/

Changes: https://reviews.apache.org/r/61203/diff/1-2/

Testing
-------

Manually tested.

Thanks,

Aravindan Vijayan

Re: Review Request 61203: AMBARI-21593 : AMS stopped after RU [AMS distributed mode with 2 collectors]

Posted by Sid Wagle <sw...@hortonworks.com>.


> On July 28, 2017, 5:17 a.m., Sid Wagle wrote:
> > ambari-metrics/ambari-metrics-timelineservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/metrics/timeline/availability/MetricCollectorHAController.java
> > Lines 137 (patched)
> > <https://reviews.apache.org/r/61203/diff/1/?file=1785078#file1785078line141>
> >
> >     So what happens if one collector only created a partial structure? This situation would require a restart to get out of.
> 
> Aravindan Vijayan wrote:
>     Yes. Do you suggest we go ahead and recreate the znode when a collector has still failed after sleeping and retrying 5 times? The reason would be that either the other collector did not create it fully, or there is no other collector and a partial znode was somehow created.

Yea, that makes it foolproof.


- Sid


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61203/#review181647
-----------------------------------------------------------



Re: Review Request 61203: AMBARI-21593 : AMS stopped after RU [AMS distributed mode with 2 collectors]

Posted by Aravindan Vijayan <av...@hortonworks.com>.


> On July 28, 2017, 5:17 a.m., Sid Wagle wrote:
> > ambari-metrics/ambari-metrics-timelineservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/metrics/timeline/availability/MetricCollectorHAController.java
> > Lines 137 (patched)
> > <https://reviews.apache.org/r/61203/diff/1/?file=1785078#file1785078line141>
> >
> >     So what happens if one collector only created a partial structure? This situation would require a restart to get out of.

Yes. Do you suggest we go ahead and recreate the znode when a collector has still failed after sleeping and retrying 5 times? The reason would be that either the other collector did not create it fully, or there is no other collector and a partial znode was somehow created.
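
A rough Java sketch of that fallback, under stated assumptions: the names RetryThenRecreate, ensureCluster, and Outcome are hypothetical, the recreate action is whatever delete-and-create logic the controller already has, and the 5 attempts come from the proposal above rather than from the final patch.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of "retry first, recreate only as a last resort".
final class RetryThenRecreate {

    enum Outcome { INITIALIZED_BY_PEER, RECREATED_LOCALLY }

    static Outcome ensureCluster(BooleanSupplier isFullyInitialized,
                                 Runnable recreate,
                                 int maxAttempts,
                                 long sleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (isFullyInitialized.getAsBoolean()) {
                // Another collector finished the structure; nothing to do.
                return Outcome.INITIALIZED_BY_PEER;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
        // All attempts failed: either the peer never completed the znode,
        // or there is no peer and a partial znode was left behind.
        recreate.run();
        return Outcome.RECREATED_LOCALLY;
    }
}
```

Recreating only after the retries are exhausted keeps the common two-collector startup free of deletes, while still recovering from a partially created znode without a restart.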


- Aravindan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61203/#review181647
-----------------------------------------------------------



Re: Review Request 61203: AMBARI-21593 : AMS stopped after RU [AMS distributed mode with 2 collectors]

Posted by Sid Wagle <sw...@hortonworks.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61203/#review181647
-----------------------------------------------------------




ambari-metrics/ambari-metrics-timelineservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/metrics/timeline/availability/MetricCollectorHAController.java
Lines 137 (patched)
<https://reviews.apache.org/r/61203/#comment257276>

    So what happens if one collector only created a partial structure? This situation would require a restart to get out of.


- Sid Wagle



Re: Review Request 61203: AMBARI-21593 : AMS stopped after RU [AMS distributed mode with 2 collectors]

Posted by Aravindan Vijayan <av...@hortonworks.com>.


> On July 28, 2017, 5:02 a.m., Sumit Mohanty wrote:
> > ambari-metrics/ambari-metrics-timelineservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/metrics/timeline/availability/MetricCollectorHAController.java
> > Line 126 (original), 127 (patched)
> > <https://reviews.apache.org/r/61203/diff/1/?file=1785078#file1785078line127>
> >
> >     What was the reason both instances failed?

It is because of the code that deleted and recreated the znode whenever a sub-path was not found (i.e., a ZkNoNodeException is thrown from ZkHelixAdmin).

Let's say collectors A & B start this at the same time. 

A : Check parent /ambari-metrics-cluster. Not found. Create parent /ambari-metrics-cluster
B : Check parent /ambari-metrics-cluster. Found. So return true.
B : Try to check child C2. Not yet created by A. ZkNoNodeException thrown.
B : Catch exception. Delete the entire znode.
A : Try to create a child node. Someone deleted the top level znode itself. ZkNoNodeException thrown.
A : Catch exception. Try to Delete the entire znode.
A : Deleted children C1, C3
B : Created /ambari-metrics-cluster and children nodes C1, C2, C3.
A : Deleted child C2. 
B : Trying to delete root node. Failed since directory not empty --------> FAILED START
B : Finished creating /ambari-metrics-cluster.
A : Access C2. Not found. ------> FAILED START.


- Aravindan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61203/#review181646
-----------------------------------------------------------



Re: Review Request 61203: AMBARI-21593 : AMS stopped after RU [AMS distributed mode with 2 collectors]

Posted by Aravindan Vijayan <av...@hortonworks.com>.


> On July 28, 2017, 5:02 a.m., Sumit Mohanty wrote:
> > ambari-metrics/ambari-metrics-timelineservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/metrics/timeline/availability/MetricCollectorHAController.java
> > Lines 147 (patched)
> > <https://reviews.apache.org/r/61203/diff/1/?file=1785078#file1785078line151>
> >
> >     Should we add some randomness to the sleep (say 5 + a random value between 0 and 5) so that both instances do not retry at the same time?

Both collector instances should not error out and reach this step, since one of them will be creating the znode undisturbed while the other sleeps and retries until the first one completes the work.
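
For reference, the jitter suggested in the quoted comment could be computed as in the Java sketch below. This only illustrates the suggestion; the class and method names are hypothetical, and, as explained above, the patch does not rely on it.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical illustration of "base + random jitter" for a retry sleep.
final class JitteredSleep {

    /** Returns a delay in [baseSeconds, baseSeconds + maxJitterSeconds]. */
    static long delaySeconds(long baseSeconds, long maxJitterSeconds) {
        // nextLong(origin, bound) excludes the bound, so add 1 to make
        // maxJitterSeconds itself a possible value.
        long jitter = ThreadLocalRandom.current().nextLong(0, maxJitterSeconds + 1);
        return baseSeconds + jitter;
    }
}
```

With delaySeconds(5, 5), each instance would sleep between 5 and 10 seconds, which makes simultaneous retries less likely.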


- Aravindan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61203/#review181646
-----------------------------------------------------------



Re: Review Request 61203: AMBARI-21593 : AMS stopped after RU [AMS distributed mode with 2 collectors]

Posted by Aravindan Vijayan <av...@hortonworks.com>.


> On July 28, 2017, 5:02 a.m., Sumit Mohanty wrote:
> > ambari-metrics/ambari-metrics-timelineservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/metrics/timeline/availability/MetricCollectorHAController.java
> > Line 126 (original), 127 (patched)
> > <https://reviews.apache.org/r/61203/diff/1/?file=1785078#file1785078line127>
> >
> >     What was the reason both instances failed?
> 
> Aravindan Vijayan wrote:
>     It is because of the code that deleted and recreated the znode whenever a sub-path was not found (i.e., a ZkNoNodeException is thrown from ZkHelixAdmin).
>     
>     Let's say collectors A & B start this at the same time. 
>     
>     A : Check parent /ambari-metrics-cluster. Not found. Create parent /ambari-metrics-cluster
>     B : Check parent /ambari-metrics-cluster. Found. So return true.
>     B : Try to check child C2. Not yet created by A. ZkNoNodeException thrown.
>     B : Catch exception. Delete the entire znode.
>     A : Try to create a child node. Someone deleted the top level znode itself. ZkNoNodeException thrown.
>     A : Catch exception. Try to Delete the entire znode.
>     A : Deleted children C1, C3
>     B : Created /ambari-metrics-cluster and children nodes C1, C2, C3.
>     A : Deleted child C2. 
>     B : Trying to delete root node. Failed since directory not empty --------> FAILED START
>     B : Finished creating /ambari-metrics-cluster.
>     A : Access C2. Not found. ------> FAILED START.

Small mistake in the flow above. The correct flow:


A : Check parent /ambari-metrics-cluster. Not found. Create parent /ambari-metrics-cluster
B : Check parent /ambari-metrics-cluster. Found. So return true.
B : Try to check child C2. Not yet created by A. ZkNoNodeException thrown.
B : Catch exception. Delete the entire znode.
A : Try to create a child node. Someone deleted the top level znode itself. ZkNoNodeException thrown.
A : Catch exception. Try to Delete the entire znode.
A : Deleted children C1, C3
B : Created /ambari-metrics-cluster and children nodes C1, C2, C3.
A : Deleted child C2. 
A : Trying to delete root node. Failed since directory not empty (C1 and C3 are there) --------> FAILED START
B : Finished creating /ambari-metrics-cluster.
B : Access C2. Not found. ------> FAILED START.
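
The opening steps of this flow can be replayed on a toy in-memory model, sketched below in Java. This is a simplification under stated assumptions: ZnodeRaceModel is a hypothetical class, not ZooKeeper or Helix, it only mimics the rule that a child cannot be created under a missing parent, and it covers only the first five steps (up to A failing to create a child), since the later steps interleave in ways a single-threaded replay does not capture.

```java
import java.util.TreeSet;

// Toy model of a znode tree; hypothetical, for illustration only.
final class ZnodeRaceModel {
    private final TreeSet<String> paths = new TreeSet<>();

    boolean exists(String path) {
        return paths.contains(path);
    }

    /** Creates a node; fails if the parent does not exist. */
    void create(String path) {
        String parent = path.substring(0, path.lastIndexOf('/'));
        if (!parent.isEmpty() && !paths.contains(parent)) {
            throw new IllegalStateException("No node: " + parent);
        }
        paths.add(path);
    }

    /** Deletes a node and everything below it. */
    void deleteRecursive(String path) {
        paths.removeIf(p -> p.equals(path) || p.startsWith(path + "/"));
    }

    /** Replays the first five steps; returns true if A's child create fails. */
    static boolean replayOpeningSteps() {
        ZnodeRaceModel zk = new ZnodeRaceModel();
        String root = "/ambari-metrics-cluster";

        zk.create(root);                               // A: parent not found, create it
        boolean bSeesParent = zk.exists(root);         // B: parent found
        boolean bSeesChild = zk.exists(root + "/C2");  // B: C2 not yet created by A
        if (bSeesParent && !bSeesChild) {
            zk.deleteRecursive(root);                  // B: delete the entire znode
        }
        try {
            zk.create(root + "/C1");                   // A: parent is gone now
            return false;
        } catch (IllegalStateException e) {
            return true;                               // A: "no node" failure
        }
    }
}
```

In this model replayOpeningSteps() returns true, matching the step where A gets a ZkNoNodeException because B removed the top-level znode.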


- Aravindan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61203/#review181646
-----------------------------------------------------------



Re: Review Request 61203: AMBARI-21593 : AMS stopped after RU [AMS distributed mode with 2 collectors]

Posted by Sumit Mohanty <sm...@hortonworks.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61203/#review181646
-----------------------------------------------------------




ambari-metrics/ambari-metrics-timelineservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/metrics/timeline/availability/MetricCollectorHAController.java
Line 126 (original), 127 (patched)
<https://reviews.apache.org/r/61203/#comment257274>

    What was the reason both instances failed?



ambari-metrics/ambari-metrics-timelineservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/metrics/timeline/availability/MetricCollectorHAController.java
Lines 147 (patched)
<https://reviews.apache.org/r/61203/#comment257275>

    Should we add some randomness to the sleep (say 5 + a random value between 0 and 5) so that both instances do not retry at the same time?


- Sumit Mohanty

