You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ignite.apache.org by "Alexey Scherbakov (Jira)" <ji...@apache.org> on 2020/04/28 15:47:00 UTC
[jira] [Comment Edited] (IGNITE-12935) Disadvantages in log of historical rebalance

    [ https://issues.apache.org/jira/browse/IGNITE-12935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094615#comment-17094615 ] 

Alexey Scherbakov edited comment on IGNITE-12935 at 4/28/20, 3:46 PM:
----------------------------------------------------------------------

[~v.pyatkov] [~irakov]

My thoughts on a patch:

1. The string constants describing reservation stop reason should be replaced with an enum, like:

public enum ReservationResult {
     ...
}

2. Reasons are a subject to change:

NOT_RESERVED_WAL_REASON - not needed, it's just a particular case of reservation error when start pointer is invalid.

WAL_SEG_CORRUPTED_REASON - should be renamed to WAL_RESERVATION_ERROR - describes any unexpected error during processing of previous checkpoint.

NO_MORE_HISTORY_REASON - OK

CHECKPOINT_NOT_APPLICABLE_REASON - OK

FULL_HISTORY_REASON - let's rename it to NO_MORE_DATA_EARLIER_IN_HISTORY.

NO_PARTITIONS_OWNED_REASON - this is actually same as NO_MORE_DATA_EARLIER_IN_HISTORY but no history reserved at all (cp will be empty in log) 

_REASON suffix is not needed.

3. It's currently impossible to understand from a loggging if a specific partition have history for rebalancing, because each partition history is reserved independently and we currently log the "best" partition (having deepest history).
Is it done intentionally ?

4. Avoid using hard-coded strings in tests. This makes changing log format a PITA. All log messages should be stored in a single place.

5. We should log (by "Following caches were not reserved") groups having partitions excluded by reservation by partition size heuristic with corresponding reason. If a group is partially reserved, let's priint a group in both sections with a list of excluded/reserved partitions.

6. Fix typos in a word "supplay".


was (Author: ascherbakov):
[~v.pyatkov] [~irakov]

My thoughts on a patch:

1. The string constants describing reservation stop reason should be replaced with an enum, like:

public enum ReservationResult {
     ...
}

2. Reasons are a subject to change:

NOT_RESERVED_WAL_REASON - not needed, it's just a particular case of reservation error when start pointer is invalid.

WAL_SEG_CORRUPTED_REASON - should be renamed to WAL_RESERVATION_ERROR - describes any unexpected error during processing of previous checkpoint.

NO_MORE_HISTORY_REASON - OK

CHECKPOINT_NOT_APPLICABLE_REASON - OK

FULL_HISTORY_REASON - let's rename it to NO_MORE_DATA_EARLIER_IN_HISTORY.

NO_PARTITIONS_OWNED_REASON - this is actually same as NO_MORE_DATA_EARLIER_IN_HISTORY but no history reserved at all (cp will be empty in log) 

_REASON suffix is not needed.

3. It's currently impossible to understand from a loggging if a specific partition have history for rebalancing, because each partition history is reserved independently and we currently log the "best" partition (having deepest history).
Is it done intentionally ?

4. Avoid using hard-coded strings in tests. This makes changing log format a PITA. All log messages should be stored in a single place.

5. We should log (by "Following caches were not reserved") groups having partitions excluded by reservation by partition size heuristic with corresponding reason.

6. Fix typos in a word "supplay".

> Disadvantages in log of historical rebalance
> --------------------------------------------
>
>                 Key: IGNITE-12935
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12935
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Vladislav Pyatkov
>            Assignee: Vladislav Pyatkov
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> # Mention in the log only partitions for which there are no nodes that suit as historical supplier
>  For these partitions, print minimal counter (since which we should perform historical rebalancing) with corresponding node and maximum reserved counter (since which cluster can perform historical rebalancing) with corresponding node.
>  This will let us know:
>  ## Whether history was reserved at all
>  ## How much reserved history we lack to perform a historical rebalancing
>  ## I see resulting output like this:
> {noformat}
>  Historical rebalancing wasn't scheduled for some partitions:
>  History wasn't reserved for: [list of partitions and groups]
>  History was reserved, but minimum present counter is less than maximum reserved: [[grp=GRP, part=ID, minCntr=cntr, minNodeId=ID, maxReserved=cntr, maxReservedNodeId=ID], ...]{noformat}
>  ## We can also aggregate previous message by (minNodeId) to easily find the exact node (or nodes) which were the reason of full rebalance.
>  # Log results of {{reserveHistoryForExchange()}}. They can be compactly represented as mappings: {{(grpId -> checkpoint (id, timestamp))}}. For every group, also log message about why the previous checkpoint wasn't successfully reserved.
>  There can be three reasons:
>  ## Previous checkpoint simply isn't present in the history (the oldest is reserved)
>  ## WAL reservation failure (call below returned false)
> {code:java}
> chpEntry = entry(cpTs);
> boolean reserved = cctx.wal().reserve(chpEntry.checkpointMark());// If checkpoint WAL history can't be reserved, stop searching. 
> if (!reserved) 
>   break;{code}
> ## Checkpoint was marked as inapplicable for historical rebalancing
> {code:java}
> for (Integer grpId : new HashSet<>(groupsAndPartitions.keySet()))
>    if (!isCheckpointApplicableForGroup(grpId, chpEntry))
>      groupsAndPartitions.remove(grpId);{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)