Posted to commits@helix.apache.org by "Harry Zhang (JIRA)" <ji...@apache.org> on 2018/09/21 21:33:00 UTC

[jira] [Updated] (HELIX-753) Record top state handoff finished in single cluster data cache refresh

     [ https://issues.apache.org/jira/browse/HELIX-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harry Zhang updated HELIX-753:
------------------------------
    Summary: Record top state handoff finished in single cluster data cache refresh  (was: record top state handoff finished in single cluster data cache refresh)

> Record top state handoff finished in single cluster data cache refresh
> ----------------------------------------------------------------------
>
>                 Key: HELIX-753
>                 URL: https://issues.apache.org/jira/browse/HELIX-753
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Harry Zhang
>            Assignee: Harry Zhang
>            Priority: Major
>
> Currently we calculate the top state handoff duration as follows (see
> the sketch below):
>  - record a missing top state when we first observe the top state missing
>  - record the recovery when we observe the top state come back
>  - report the duration between the two observations
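>
> Below is a minimal sketch of this flow, using hypothetical names rather
> than Helix's actual classes:
>
>     import java.util.HashMap;
>     import java.util.Map;
>
>     public class TopStateHandoffTracker {
>       // Partition -> time we first observed its top state missing.
>       // Note: if a handoff starts AND finishes between two consecutive
>       // cache refreshes, neither method below is ever invoked for it,
>       // so the data point is lost entirely.
>       private final Map<String, Long> missingSince = new HashMap<>();
>
>       // Called in a pipeline run that sees no top-state replica.
>       public void recordTopStateMissing(String partition, long nowMs) {
>         missingSince.putIfAbsent(partition, nowMs);
>       }
>
>       // Called in a later pipeline run that sees the top state back.
>       public void recordTopStateRecovered(String partition, long nowMs) {
>         Long start = missingSince.remove(partition);
>         if (start != null) {
>           reportHandoffDuration(partition, nowMs - start);
>         }
>       }
>
>       private void reportHandoffDuration(String partition, long ms) {
>         System.out.printf("Top state handoff for %s: %d ms%n",
>             partition, ms);
>       }
>     }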
> This is perfectly fine for non-P2P state transitions, since the entire
> top state handoff process always spans at least two pipeline runs. For
> P2P-enabled clusters, however, top state handoffs are fast: whenever a
> handoff completes faster than the cluster data refresh stage, we lose
> the data point entirely, which skews the numbers reported to ingraph.
> We need to revise the top state handoff metrics implementation so that
> we do not lose data points systematically (right now we lose all short
> handoffs).
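>
> One possible direction, sketched below with hypothetical names (not the
> actual fix): diff the top-state holders between two consecutive cache
> refreshes, and recover the handoff duration from transition timestamps
> that participants are assumed to report in their current states:
>
>     import java.util.HashMap;
>     import java.util.Map;
>
>     public class SingleRefreshHandoffDetector {
>       // Finds handoffs that completed entirely between two consecutive
>       // cache refreshes and returns partition -> handoff duration (ms).
>       public Map<String, Long> detectFastHandoffs(
>           Map<String, String> prevTopStateHolders,
>           Map<String, String> currTopStateHolders,
>           Map<String, Long> handoffStartTimes,
>           Map<String, Long> handoffEndTimes) {
>         Map<String, Long> durations = new HashMap<>();
>         for (Map.Entry<String, String> e : currTopStateHolders.entrySet()) {
>           String partition = e.getKey();
>           String prevHolder = prevTopStateHolders.get(partition);
>           // Holder changed between refreshes, yet the controller never
>           // saw the top state missing: a handoff finished within one
>           // refresh, so take its duration from participant timestamps.
>           if (prevHolder != null && !prevHolder.equals(e.getValue())) {
>             Long start = handoffStartTimes.get(partition);
>             Long end = handoffEndTimes.get(partition);
>             if (start != null && end != null && end >= start) {
>               durations.put(partition, end - start);
>             }
>           }
>         }
>         return durations;
>       }
>     }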
> AC:
>  - revise the implementation so we catch those short top state handoffs
>  - write new tests to verify the fix where needed (a hedged example
>    follows)
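>
> A hedged example of the kind of test the second item asks for,
> exercising the detector sketched above: simulate a handoff that starts
> and finishes between two refreshes and assert that its duration is
> still reported.
>
>     import java.util.HashMap;
>     import java.util.Map;
>
>     public class TestFastHandoff {
>       public static void main(String[] args) {
>         Map<String, String> prev = new HashMap<>();
>         Map<String, String> curr = new HashMap<>();
>         prev.put("p0", "instanceA");
>         curr.put("p0", "instanceB"); // moved between two refreshes
>
>         Map<String, Long> starts = new HashMap<>();
>         Map<String, Long> ends = new HashMap<>();
>         starts.put("p0", 1000L);
>         ends.put("p0", 1050L); // a 50 ms P2P handoff
>
>         Map<String, Long> durations = new SingleRefreshHandoffDetector()
>             .detectFastHandoffs(prev, curr, starts, ends);
>
>         // The short handoff must still yield a data point.
>         if (durations.get("p0") != 50L) {
>           throw new AssertionError("fast handoff lost");
>         }
>         System.out.println("durations = " + durations);
>       }
>     }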



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)