Posted to issues@ignite.apache.org by "Ivan Rakov (JIRA)" <ji...@apache.org> on 2018/04/18 17:37:00 UTC

[jira] [Commented] (IGNITE-8241) Docs: Triggering automatic rebalancing if the whole baseline topology is not recovered

    [ https://issues.apache.org/jira/browse/IGNITE-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16442919#comment-16442919 ] 

Ivan Rakov commented on IGNITE-8241:
------------------------------------

I propose the following version of BaselineWatcher:
{noformat}
package org.apache.ignite.examples.events;

import java.util.Set;
import java.util.stream.Collectors;
import org.apache.ignite.Ignite;
import org.apache.ignite.cluster.BaselineNode;
import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;
import org.apache.ignite.internal.IgniteEx;
import org.apache.ignite.internal.processors.timeout.GridTimeoutObjectAdapter;

/**
 * Task that mimics old behavior without baseline topology. Only one task should be started for the whole cluster.
 * In case of server node leave/join, BLT will be automatically reset after {@link #bltChangeDelayMillis} delay.
 */
public class BaselineWatcher {
    /** Ignite. */
    private final IgniteEx ignite;

    /** BLT change delay millis. */
    private final long bltChangeDelayMillis;

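    /**
     * @param ignite Ignite.
     * @param bltChangeDelayMillis Delay in milliseconds between a topology change and the baseline reset.
     */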
    public BaselineWatcher(Ignite ignite, long bltChangeDelayMillis) {
        this.ignite = (IgniteEx)ignite;
        this.bltChangeDelayMillis = bltChangeDelayMillis;
    }

    /**
     * Starts listening for server node join/leave/failure events on the local node.
     */
    public void start() {
        ignite.events().localListen(event -> {
            DiscoveryEvent e = (DiscoveryEvent)event;

            Set<Object> aliveSrvNodes = e.topologyNodes().stream()
                .filter(n -> !n.isClient())
                .map(ClusterNode::consistentId)
                .collect(Collectors.toSet());

            Set<Object> baseline = ignite.cluster().currentBaselineTopology().stream()
                .map(BaselineNode::consistentId)
                .collect(Collectors.toSet());

            final long topVer = e.topologyVersion();

            if (!aliveSrvNodes.equals(baseline))
                ignite.context().timeout().addTimeoutObject(new GridTimeoutObjectAdapter(bltChangeDelayMillis) {
                    @Override public void onTimeout() {
                        if (ignite.cluster().topologyVersion() == topVer)
                            ignite.cluster().setBaselineTopology(topVer);
                    }
                });

            return true;
        }, EventType.EVT_NODE_FAILED, EventType.EVT_NODE_LEFT, EventType.EVT_NODE_JOINED);
    }
}
{noformat}

Pros:
1) The baseline is changed only once when several topology changes occur in quick succession.
2) The baseline is changed back if the missing node eventually returns.
Simply put, the cluster will behave just like it did in 2.3.
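
To illustrate the first point without a running cluster, here is a minimal stand-alone sketch of the same "reset only if the topology version is unchanged after the delay" check. It uses only the JDK; the class BaselineDebounceSketch and all of its methods are hypothetical names made up for this illustration and are not part of the Ignite API. Node IDs are plain strings and the topology version is a local counter, which is an assumption of the model rather than how Ignite tracks it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-alone model of the delayed baseline reset; not Ignite API. */
public class BaselineDebounceSketch {
    /** Single timer thread that runs the delayed checks. */
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    /** Delay between a topology change and the reset attempt. */
    private final long delayMillis;

    /** Number of baseline resets actually performed. */
    private final AtomicInteger resets = new AtomicInteger();

    /** Local stand-in for the cluster topology version. */
    private long topVer;

    /** Currently alive server nodes. */
    private Set<String> alive;

    /** Current baseline. */
    private Set<String> baseline;

    public BaselineDebounceSketch(Set<String> initialNodes, long delayMillis) {
        this.alive = new HashSet<>(initialNodes);
        this.baseline = new HashSet<>(initialNodes);
        this.delayMillis = delayMillis;
    }

    /** Models a node join/leave/failure event carrying the new set of alive server nodes. */
    public synchronized void onTopologyChange(Set<String> aliveNow) {
        alive = new HashSet<>(aliveNow);

        final long ver = ++topVer;

        // Schedule a reset only when the alive nodes differ from the baseline.
        if (!alive.equals(baseline))
            timer.schedule(() -> resetIfUnchanged(ver), delayMillis, TimeUnit.MILLISECONDS);
    }

    /** Resets the baseline only if no further topology change happened during the delay. */
    private synchronized void resetIfUnchanged(long ver) {
        if (topVer == ver) {
            baseline = new HashSet<>(alive);
            resets.incrementAndGet();
        }
    }

    /** Stops accepting new events and waits for already scheduled checks to run. */
    public boolean awaitPendingTimers() {
        timer.shutdown();

        try {
            return timer.awaitTermination(10, TimeUnit.SECONDS);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();

            return false;
        }
    }

    public synchronized Set<String> baseline() {
        return new HashSet<>(baseline);
    }

    public int resetCount() {
        return resets.get();
    }

    public static void main(String[] args) {
        BaselineDebounceSketch w =
            new BaselineDebounceSketch(new HashSet<>(Arrays.asList("A", "B", "C")), 500);

        // Two nodes leave one right after the other, well within the delay.
        w.onTopologyChange(new HashSet<>(Arrays.asList("A", "B")));
        w.onTopologyChange(new HashSet<>(Arrays.asList("A")));

        w.awaitPendingTimers();

        System.out.println(w.baseline() + " after " + w.resetCount() + " reset(s)");
    }
}
```

In this model the timer scheduled for the first change finds a newer topology version and does nothing, so only the last change within the delay window resets the baseline. If the missing node comes back before the delay expires, no reset happens at all.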

> Docs: Triggering automatic rebalancing if the whole baseline topology is not recovered
> --------------------------------------------------------------------------------------
>
>                 Key: IGNITE-8241
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8241
>             Project: Ignite
>          Issue Type: Task
>          Components: documentation
>    Affects Versions: 2.4
>            Reporter: Denis Magda
>            Assignee: Denis Magda
>            Priority: Critical
>             Fix For: 2.5
>
>         Attachments: BaselineWatcher.java
>
>
> The ticket is created as a result of the following discussion:
> http://apache-ignite-developers.2346864.n4.nabble.com/Triggering-rebalancing-on-timeout-or-manually-if-the-baseline-topology-is-not-reassembled-td29299.html
> The rebalancing doesn't happen if one of the nodes goes down, 
> thus, shrinking the baseline topology. It complies with our assumption that 
> the node should be recovered soon and there is no need to waste 
> CPU/memory/networking resources of the cluster shifting the data around. 
> However, there are always edge cases. I was reasonably asked how to trigger 
> the rebalancing within the baseline topology manually or on timeout if: 
> * It's not expected that the failed node will be resurrected in the 
>    near future and 
> * It's not likely that that node will be replaced by another one. 
> Until we embed special facilities in the baseline topology that would consider such situations, we can document the following workaround. A user application/tool/script has to subscribe to node_left events and remove the failed node from the baseline topology after some time. Once the node is removed, the baseline topology will be changed, and the rebalancing will be kicked off.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)