You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ignite.apache.org by "Sergey Chugunov (Jira)" <ji...@apache.org> on 2022/09/28 12:15:00 UTC

[jira] [Updated] (IGNITE-17087) Native rebalance for PDS partitions

     [ https://issues.apache.org/jira/browse/IGNITE-17087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Chugunov updated IGNITE-17087:
-------------------------------------
    Epic Link: IGNITE-17774  (was: IGNITE-16923)

> Native rebalance for PDS partitions
> -----------------------------------
>
>                 Key: IGNITE-17087
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17087
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>
> General idea of full rebalance is described in https://issues.apache.org/jira/browse/IGNITE-17083
> For persistent storages, there's an option to avoid copy-on-write rebalance algorithms if it's desired. Intuitively, it's a preferable option. Each storage chooses its own format.
> h2. General idea
> In this case, PDS has checkpointing feature that saves consistent state on disk. I expect SQL indexes to be in the same partition file as other data.
> For every partition, its state on disk would look like this:
> {code:java}
> part-x.bin
> part-x-1.bin
> part-x-2.bin
> ...
> part-x-n.bin{code}
> part-x.bin is a baseline, and every other file is a delta that should be applied to underlying layers to get consistent data. It can be viewed like full and incremental backups.
> When rebalance snapshot is required, we could force a checkpoint and then *prohibit merging* of new deltas to delta files from the snapshot until rebalance is finished. We must guarantee that consistent state can be read from disk.
> Now, there are several strategies of data transferring:
>  * File-based. We can send baseline and delta files as files. Two possible issues here:
>  ** Files contain duplicated pages, so the volume of data will be bigger than necessary.
>  ** Baseline file has to be truncated, because some delta pages go directly into baseline file as optimization.
>  * Page-based. Latest state of every required page is sent separately. Two strategies here:
>  ** Iterate pages in order of page indexes. Overheads during reads, but writes are very effective.
>  ** Iterate pages in order of delta files, skipping already read pages in the process (like snapshots in GridGain, for example). Little overhead on read, but write won't be append-only.
> I would argue that slower reads are more appropriate then slower writes. Generally speaking, any write should be slower than any read of the same size, right?
> Should we implement all strategies and give user a choice? It's hard to predict which one is better for which scenario. In the future, I think it would be convenient to implement many options, but at first we should stick to the simplest one.
> There must be a common "infrastructure" or a framework to stream native rebalance snapshots. Data format should be as simple as possible.
> NOTE: of course, it has to be mentioned that this approach might lead to ineffective storage space usage. It can be a problem in theory, but in practice full rebalance isn't expected to occur often, and event then we don't expect that users will rewrite the entire partition data in a span of a single rebalance.
> h2. Possible problems
> Given that "raw" data is sent, including sql indexes, all incompleted indexes will be sent incompleted. Maybe we should also send a build state for each index so that the receiving side could continue from the right place, not from the beginning.
> This problem will be resolved in the future. Currently we don't have indexes implemented.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)