Posted to commits@helix.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/09/21 21:36:00 UTC
[jira] [Commented] (HELIX-753) Record top state handoff finished in single cluster data cache refresh

    [ https://issues.apache.org/jira/browse/HELIX-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16624211#comment-16624211 ] 

ASF GitHub Bot commented on HELIX-753:
--------------------------------------

GitHub user zhan849 opened a pull request:

    https://github.com/apache/helix/pull/270

    [HELIX-753] Record top state handoff finished in single cluster data cache refresh

    This PR adds top state handoff reporting for the case where a single pipeline refresh catches the entire handoff process, a case we previously missed. The rough procedure is:
    
    
    - retrieve the cached last top state instance for a partition
    - retrieve the current top state instance for that partition
    - if there is no missing top state record for that partition and the top state instance has changed, record the handoff duration
    
    The end time of the current top state is easy to find from the current state in the cluster data cache. If the handoff start time cannot be found, we use the end time of the last pipeline run as a best guess. The detailed reasoning is explained in a code comment.
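
The procedure above could be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the class `SingleRefreshHandoffSketch`, its methods, and the convention that a negative `handoffStartMs` means "start time not found" are all assumptions made for this example and are not Helix APIs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

// Illustrative sketch only: these class and method names are hypothetical, not Helix APIs.
class SingleRefreshHandoffSketch {
    // Top state instance seen for each partition at the previous refresh.
    private final Map<String, String> lastTopStateInstance = new HashMap<>();
    // Partitions with an open missing-top-state record, mapped to the time it was opened.
    private final Map<String, Long> missingTopStateStart = new HashMap<>();

    /** Opens a missing-top-state record; such handoffs belong to the multi-refresh path. */
    void recordMissing(String partition, long sinceMs) {
        missingTopStateStart.put(partition, sinceMs);
    }

    /**
     * Called once per refresh for a partition that currently has a top state instance.
     * handoffStartMs is negative when the start time could not be found (assumed convention).
     * Returns the handoff duration if the whole handoff happened between two refreshes.
     */
    OptionalLong onRefresh(String partition, String currentInstance, long topStateEndMs,
                           long handoffStartMs, long lastPipelineEndMs) {
        // Update the cache and fetch the instance seen at the previous refresh.
        String previousInstance = lastTopStateInstance.put(partition, currentInstance);
        boolean changed = previousInstance != null && !previousInstance.equals(currentInstance);
        if (!changed || missingTopStateStart.containsKey(partition)) {
            return OptionalLong.empty();
        }
        // Fall back to the last pipeline run's end time as a best guess for the start.
        long startMs = handoffStartMs >= 0 ? handoffStartMs : lastPipelineEndMs;
        return OptionalLong.of(Math.max(0L, topStateEndMs - startMs));
    }
}
```

In this sketch, a partition whose top state moves from one instance to another between two refreshes yields a duration, while a partition with an open missing-top-state record is left to the existing multi-refresh reporting path.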
    
    
    Added a test case to verify such top state handoffs, and consolidated the common parts of TestTopStateHandoffMetrics to avoid code duplication.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhan849/helix harry/topstate

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/helix/pull/270.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #270
    
----
commit d501e8fa30596d9cd98078f0d1ce7c1ecf20c595
Author: Harry Zhang <hr...@...>
Date:   2018-09-21T21:32:15Z

    [HELIX-753] Record top state handoff finished in single cluster data cache refresh

----


> Record top state handoff finished in single cluster data cache refresh
> ----------------------------------------------------------------------
>
>                 Key: HELIX-753
>                 URL: https://issues.apache.org/jira/browse/HELIX-753
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Harry Zhang
>            Assignee: Harry Zhang
>            Priority: Major
>
> Currently we calculate top state handoff duration as follows:
>  - record a missing top state when we see the top state go missing
>  - record the top state's return when we see it come back
>  - report the top state handoff duration
> This is perfectly fine for non-P2P state transitions, as the entire top state handoff process always spans >= 2 pipeline runs. However, in P2P-enabled clusters, top state handoffs are quick, and any handoff that completes faster than the cluster data refresh stage latency is never observed. We lose a lot of short top state handoffs this way, which makes the numbers look miserable on ingraph.
> We need to revise the top state handoff metrics implementation so that we don't lose data points and skew the statistics (i.e. we are losing all short handoffs now).
> AC:
>  - revise the implementation so that we catch those short top state handoffs
>  - write new tests to cover the fix if needed
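
To illustrate why the snapshot-based approach described in the issue misses short handoffs, here is a small sketch. The class `MissingTopStateTrackerSketch` and its methods are hypothetical names made up for this example, not Helix code. It only models the three steps listed in the issue: note when the top state is missing, note when it returns, and report the difference.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: a hypothetical tracker, not the Helix implementation.
class MissingTopStateTrackerSketch {
    // Time at which the top state was first seen missing, or -1 if it is not missing.
    private long missingSinceMs = -1L;
    private final List<Long> reportedDurationsMs = new ArrayList<>();

    /** Called once per pipeline run; topStateInstance is null when no top state exists. */
    void onSnapshot(long nowMs, String topStateInstance) {
        if (topStateInstance == null) {
            if (missingSinceMs < 0) {
                missingSinceMs = nowMs; // step 1: record the missing top state
            }
        } else if (missingSinceMs >= 0) {
            // steps 2 and 3: the top state came back, so report the duration
            reportedDurationsMs.add(nowMs - missingSinceMs);
            missingSinceMs = -1L;
        }
    }

    List<Long> reportedDurationsMs() {
        return reportedDurationsMs;
    }
}
```

If one snapshot sees `node1` as the top state and the next already sees `node2`, the tracker never observes a missing top state and reports nothing, which is the gap the issue describes for fast P2P handoffs.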



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)