Posted to issues@ignite.apache.org by "Pavel Pereslegin (Jira)" <ji...@apache.org> on 2020/08/13 16:09:00 UTC

[jira] [Commented] (IGNITE-12069) Implement file rebalancing management

    [ https://issues.apache.org/jira/browse/IGNITE-12069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177120#comment-17177120 ] 

Pavel Pereslegin commented on IGNITE-12069:
-------------------------------------------

This task has temporarily been put on hold because our testing shows the following results:
 * Index rebuilding takes a very long time; in some cases the index rebuild time (24 threads) exceeds the time of a full rebalance of the indexed cache (24 threads).
 * Both under load and without load, most of the time is spent rebuilding indexes. The current solution can be modified to transfer the index partition as well, provided that the partition distribution on the demander matches the partition distribution on the supplier (affinity can be configured this way for PARTITIONED caches).
 * Index rebuilding could be started earlier, per individual partition (once such a mode is implemented); this should slightly smooth out the index rebuild time.
 * A critical slowdown of partition file transfer on HDD drives was revealed, especially with even minor concurrent cache updates (in some cases the speed drops tenfold and long timeouts occur, which lead to an abnormal termination of the process).
 * The single-threaded file transfer mode can be switched to multi-threaded (which should give a severalfold increase in file transfer speed), because the hard disks on the demander are only lightly loaded; see the sketch below.
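
As an illustration of the last point, here is a minimal sketch of parallelizing the per-partition file transfer with a fixed thread pool; {{PartitionSender}} and its {{send}} method are hypothetical stand-ins for the real transmission code, not an existing Ignite API:
{code:java}
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

public class ParallelFileSender {
    /** Hypothetical single-file sender abstraction. */
    public interface PartitionSender {
        void send(Path partFile);
    }

    /** Submits each partition file as a separate task instead of sending one by one. */
    public static void sendAll(List<Path> partFiles, int threads, PartitionSender sender)
        throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        try {
            List<Future<?>> futs = partFiles.stream()
                .map(f -> pool.submit(() -> sender.send(f)))
                .collect(Collectors.toList());

            for (Future<?> fut : futs)
                fut.get(); // Propagate the first transfer failure, if any.
        }
        finally {
            pool.shutdown();
        }
    }
}
{code}
Throttling per physical disk would still be needed on the supplier side, but even a small pool should help while the demander's disks stay lightly loaded.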

> Implement file rebalancing management
> -------------------------------------
>
>                 Key: IGNITE-12069
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12069
>             Project: Ignite
>          Issue Type: Sub-task
>            Reporter: Maxim Muzafarov
>            Assignee: Pavel Pereslegin
>            Priority: Major
>              Labels: iep-28
>             Fix For: 2.10
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{Preloader}} should be able to do the following (a rough flow sketch follows this list):
>  # build the map of partitions and corresponding supplier nodes from which partitions will be loaded;
>  # switch the cache data storage to {{no-op}} and back to the original (the HWM must be fixed here for the needs of historical rebalance) under checkpoint, and keep the partition update counter for each partition;
>  # asynchronously run index eviction for the list of collected partitions;
>  # send a request message to each node one by one with the list of partitions to load;
>  # wait for the files to be received (listening to the transmission handler);
>  # asynchronously run index rebuild over the received partitions;
>  # run historical rebalance from LWM to HWM collected above (LWM can be read from the received file meta page);
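>
> A rough flow sketch of the steps above; every method on {{FilePreloaderOps}} is a hypothetical placeholder for the corresponding Ignite-internal operation, not a real API:
> {code:java}
> import java.util.Map;
> import java.util.Set;
> import java.util.UUID;
> import java.util.concurrent.CompletableFuture;
>
> /** Hypothetical placeholders for the Ignite-internal operations used below. */
> interface FilePreloaderOps {
>     Map<UUID, Set<Integer>> buildAssignments();
>     Map<Integer, Long> switchToNoOpUnderCheckpoint(Map<UUID, Set<Integer>> assignments);
>     CompletableFuture<Void> evictIndexesAsync(Map<UUID, Set<Integer>> assignments);
>     void sendDemandRequest(UUID supplierId, Set<Integer> parts);
>     void awaitFiles(UUID supplierId, Set<Integer> parts);
>     CompletableFuture<Void> rebuildIndexesAsync(Map<UUID, Set<Integer>> assignments);
>     Map<Integer, Long> readLwmFromMetaPages(Map<UUID, Set<Integer>> assignments);
>     void runHistoricalRebalance(Map<Integer, Long> lwm, Map<Integer, Long> hwm);
> }
>
> public class FilePreloaderFlow {
>     public void preload(FilePreloaderOps ops) {
>         // 1. Map of partitions to the supplier nodes they will be loaded from.
>         Map<UUID, Set<Integer>> assignments = ops.buildAssignments();
>
>         // 2. Switch the data store to no-op under checkpoint; the HWM
>         //    (per-partition update counters) is fixed here.
>         Map<Integer, Long> hwm = ops.switchToNoOpUnderCheckpoint(assignments);
>
>         // 3. Evict indexes for the collected partitions asynchronously.
>         CompletableFuture<Void> evictFut = ops.evictIndexesAsync(assignments);
>
>         for (Map.Entry<UUID, Set<Integer>> e : assignments.entrySet()) {
>             // 4. Request the partition files from each supplier one by one.
>             ops.sendDemandRequest(e.getKey(), e.getValue());
>
>             // 5. Wait until the files arrive via the transmission handler.
>             ops.awaitFiles(e.getKey(), e.getValue());
>         }
>
>         evictFut.join();
>
>         // 6. Rebuild indexes asynchronously over the received partitions.
>         CompletableFuture<Void> rebuildFut = ops.rebuildIndexesAsync(assignments);
>
>         // 7. Historical rebalance from LWM (read from the received meta pages) to HWM.
>         ops.runHistoricalRebalance(ops.readLwmFromMetaPages(assignments), hwm);
>
>         rebuildFut.join();
>     }
> }
> {code}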
> h5. Stage 1. Implement "read-only" mode for the cache data store. Implement data store reinitialization on the updated persistence file. (A sketch of such a read-only wrapper follows the test list below.)
> h6. Tests:
>  - Switching under load.
>  - Check re-initialization of partition on new file.
>  - Check that in read-only mode
>  ** H2 indexes are not updated
>  ** update counter is updated
>  ** cache entry eviction works fine
>  ** tx/atomic updates on this partition work fine in the cluster
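>
> A minimal sketch of such a read-only wrapper, assuming a hypothetical {{StoreOps}} slice of the real cache data store interface (updates are dropped while the update counter still advances):
> {code:java}
> import java.util.concurrent.atomic.AtomicLong;
>
> /** Hypothetical slice of the cache data store interface. */
> interface StoreOps {
>     void update(Object key, Object val);
>     Object read(Object key);
> }
>
> /** Read-only (no-op) wrapper: writes are skipped, reads are delegated. */
> class ReadOnlyStore implements StoreOps {
>     private final StoreOps delegate;
>     private final AtomicLong updateCntr = new AtomicLong();
>
>     ReadOnlyStore(StoreOps delegate) {
>         this.delegate = delegate;
>     }
>
>     /** The entry and its indexes are not touched, but the counter advances. */
>     @Override public void update(Object key, Object val) {
>         updateCntr.incrementAndGet();
>     }
>
>     @Override public Object read(Object key) {
>         return delegate.read(key);
>     }
>
>     /** Update counter kept while the store is in read-only mode. */
>     long updateCounter() {
>         return updateCntr.get();
>     }
> }
> {code}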
> h5. Stage 2. Build a map of requested partitions by node and add a message that will be sent to the supplier. Send a demand request, handle the response, and switch the data store when the file is received. (A sketch of the demand map building follows the test list below.)
> h6. Tests:
>  - Check partition consistency after receiving a file.
>  - File transmission under load.
>  - Failover: some of the partitions have been switched and the node has been restarted; rebalancing is expected to continue through historical rebalance only for the fully loaded large partitions, while for the rest of the partitions it should restart from the beginning.
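>
> A sketch of building the per-node demand map; {{FileDemandMessage}} and {{NodeResolver}} are illustrative types, not the real Ignite communication classes:
> {code:java}
> import java.util.HashMap;
> import java.util.HashSet;
> import java.util.Map;
> import java.util.Set;
> import java.util.UUID;
>
> /** Illustrative demand message: cache group id -> partitions to load as files. */
> class FileDemandMessage {
>     final UUID supplierId;
>     final Map<Integer, Set<Integer>> partsByGroup;
>
>     FileDemandMessage(UUID supplierId, Map<Integer, Set<Integer>> partsByGroup) {
>         this.supplierId = supplierId;
>         this.partsByGroup = partsByGroup;
>     }
> }
>
> class DemandMapBuilder {
>     /** Hypothetical lookup of the supplier node that owns a partition. */
>     interface NodeResolver {
>         UUID supplierOf(int grpId, int part);
>     }
>
>     /** Groups the demanded (group, partition) pairs by supplier node. */
>     static Map<UUID, FileDemandMessage> build(Map<Integer, Set<Integer>> demanded,
>                                               NodeResolver resolver) {
>         Map<UUID, Map<Integer, Set<Integer>>> byNode = new HashMap<>();
>
>         for (Map.Entry<Integer, Set<Integer>> e : demanded.entrySet()) {
>             for (int part : e.getValue()) {
>                 UUID supplier = resolver.supplierOf(e.getKey(), part);
>
>                 byNode.computeIfAbsent(supplier, n -> new HashMap<>())
>                     .computeIfAbsent(e.getKey(), g -> new HashSet<>())
>                     .add(part);
>             }
>         }
>
>         Map<UUID, FileDemandMessage> msgs = new HashMap<>();
>
>         byNode.forEach((node, parts) -> msgs.put(node, new FileDemandMessage(node, parts)));
>
>         return msgs;
>     }
> }
> {code}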
> h5. Stage 3. Add WAL history reservation on the supplier. Add historical rebalance triggering (LWM (partition) - HWM (read-only)). (A sketch of the per-partition range selection follows the test list below.)
> h6. Tests:
>  - File rebalancing under load and without load on atomic/tx caches (check the existing PDS-enabled rebalancing tests).
>  - Ensure that MVCC groups use regular rebalancing.
>  - Rebalancing on an unstable topology and failures of the supplier/demander nodes at different stages.
>  - (compatibility) The old nodes should use regular rebalancing.
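>
> A sketch of selecting the per-partition historical rebalance range, assuming the LWM is read from the received partition meta page and the HWM was fixed when the store was switched to read-only (all types are illustrative):
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> /** Range of update counters to demand through historical (WAL) rebalance. */
> class HistoricalRange {
>     final long from; // LWM: update counter stored in the received partition file.
>     final long to;   // HWM: update counter fixed for historical rebalance.
>
>     HistoricalRange(long from, long to) {
>         this.from = from;
>         this.to = to;
>     }
> }
>
> class HistoricalRebalancePlanner {
>     /** Builds per-partition ranges; partitions without a gap are skipped. */
>     static Map<Integer, HistoricalRange> plan(Map<Integer, Long> lwm, Map<Integer, Long> hwm) {
>         Map<Integer, HistoricalRange> ranges = new HashMap<>();
>
>         hwm.forEach((part, to) -> {
>             long from = lwm.getOrDefault(part, 0L);
>
>             // Only demand WAL history if there is something between LWM and HWM
>             // (the supplier must still have this history reserved).
>             if (from < to)
>                 ranges.put(part, new HistoricalRange(from, to));
>         });
>
>         return ranges;
>     }
> }
> {code}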
> h5. Stage 4. Eviction and rebuild of indexes. (A sketch of the asynchronous per-partition eviction/rebuild follows the test list below.)
> h6. Tests:
>  - File rebalancing of caches with H2 indexes.
>  - Check consistency of H2 indexes.
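>
> A sketch of the asynchronous per-partition eviction and rebuild of indexes; {{IndexOps}} is a hypothetical stand-in for the real indexing facade:
> {code:java}
> import java.util.Set;
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.Executor;
> import java.util.function.IntConsumer;
>
> /** Hypothetical per-partition index operations. */
> interface IndexOps {
>     void evict(int grpId, int part);    // Drop index rows of one partition.
>     void rebuild(int grpId, int part);  // Re-index one received partition.
> }
>
> class IndexMaintenance {
>     /** Drops index rows for the collected partitions before the transfer. */
>     static CompletableFuture<Void> evictAsync(IndexOps ops, int grpId,
>         Set<Integer> parts, Executor exec) {
>         return forEachAsync(parts, p -> ops.evict(grpId, p), exec);
>     }
>
>     /** Re-indexes the received partitions after the files are switched in. */
>     static CompletableFuture<Void> rebuildAsync(IndexOps ops, int grpId,
>         Set<Integer> parts, Executor exec) {
>         return forEachAsync(parts, p -> ops.rebuild(grpId, p), exec);
>     }
>
>     /** Runs one action per partition on the given executor. */
>     private static CompletableFuture<Void> forEachAsync(Set<Integer> parts,
>         IntConsumer action, Executor exec) {
>         CompletableFuture<?>[] futs = parts.stream()
>             .map(p -> CompletableFuture.runAsync(() -> action.accept(p), exec))
>             .toArray(CompletableFuture[]::new);
>
>         return CompletableFuture.allOf(futs);
>     }
> }
> {code}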



--
This message was sent by Atlassian Jira
(v8.3.4#803005)