Posted to commits@helix.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/03/26 19:17:00 UTC

[jira] [Commented] (HELIX-683) Clean monitoring cache upon helix controller enable monitoring

    [ https://issues.apache.org/jira/browse/HELIX-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414392#comment-16414392 ] 

ASF GitHub Bot commented on HELIX-683:
--------------------------------------

GitHub user zhan849 opened a pull request:

    https://github.com/apache/helix/pull/162

    [HELIX-683] clean monitoring cache upon helix controller enable monitoring

    In this PR I added methods to clear cached monitoring records when cluster status monitoring is enabled. I also added a test that reproduces the following sequence: a resource misses its top state, the controller loses leadership, the resource regains its top state, and then the controller regains leadership. Without the cache cleanup, this sequence causes a metrics reporting problem.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhan849/helix harry/controller-monitor-cache-cleanup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/helix/pull/162.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #162
    
----
commit 373da77547fa1ea4a39c760e80da75e9d453d4f5
Author: Harry Zhang <zh...@...>
Date:   2018-03-26T19:14:07Z

    [HELIX-683] clean monitoring cache upon helix controller enable monitoring

----


> Clean monitoring cache upon helix controller enable monitoring
> --------------------------------------------------------------
>
>                 Key: HELIX-683
>                 URL: https://issues.apache.org/jira/browse/HELIX-683
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Hao Zhang
>            Priority: Major
>
> We found a bug in reporting the cluster status metric for partition masterless duration.
> The root cause is that the duration is calculated from the controller's cache, and this cache is currently not cleaned when leadership changes. As a result, if controller A starts a mastership handoff but is interrupted, the recorded start time is kept in the cache until the next mastership handoff happens on the same partition. The later handoff's duration is then calculated from the stale start time, which can make the reported value extremely large.
> To fix it, we might consider cleaning the cache when leadership changes.
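
The fix described above can be sketched as follows. This is a minimal illustration, not Helix's actual API: the class and method names below are hypothetical. The idea is that per-partition handoff start times live in a controller-side cache, and the cache is wiped whenever the controller (re-)enables monitoring after a leadership change, so a new leadership session never computes a duration from a stale start time.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed fix. In Helix the real logic lives in
// the controller's cluster status monitoring code; names here are illustrative.
public class TopStateHandoffTracker {
    // Partition name -> handoff start timestamp (ms).
    private final Map<String, Long> handoffStartTimes = new HashMap<>();

    // Called when the controller regains leadership / enables monitoring,
    // discarding any start times recorded during a previous session.
    public void resetCache() {
        handoffStartTimes.clear();
    }

    public void recordHandoffStart(String partition, long startTimeMs) {
        handoffStartTimes.put(partition, startTimeMs);
    }

    // Returns the handoff duration in ms, or -1 if no start time is cached
    // (e.g. because the cache was reset on a leadership change), so the
    // caller can skip reporting instead of emitting a bogus huge value.
    public long recordHandoffEnd(String partition, long endTimeMs) {
        Long start = handoffStartTimes.remove(partition);
        return (start == null) ? -1L : endTimeMs - start;
    }
}
```

With this shape, the problematic sequence from the bug report (handoff interrupted, leadership lost and regained) yields a sentinel instead of a duration computed from the stale start time.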



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)