You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@ignite.apache.org by "Ivan Bessonov (Jira)" <ji...@apache.org> on 2022/06/03 08:22:00 UTC

[jira] [Created] (IGNITE-17087) Native rebalance for PDS partitions

Ivan Bessonov created IGNITE-17087:
--------------------------------------

Summary: Native rebalance for PDS partitions
Key: IGNITE-17087
URL: https://issues.apache.org/jira/browse/IGNITE-17087
Project: Ignite
Issue Type: Improvement
Reporter: Ivan Bessonov

General idea of full rebalance is described in https://issues.apache.org/jira/browse/IGNITE-17083

For persistent storages, there's an option to avoid copy-on-write rebalance algorithms if it's desired. Intuitively, it's a preferable option. Each storage chooses its own format.
h2. General idea

In this case, PDS has checkpointing feature that saves consistent state on disk. I expect SQL indexes to be in the same partition file as other data.

For every partition, its state on disk would look like this:
{code:java}
part-x.bin
part-x-1.bin
part-x-2.bin
...
part-x-n.bin{code}
part-x.bin is a baseline, and every other file is a delta that should be applied to underlying layers to get consistent data. It can be viewed like full and incremental backups.

When rebalance snapshot is required, we could force a checkpoint and then *prohibit merging* of new deltas to delta files from the snapshot until rebalance is finished. We must guarantee that consistent state can be read from disk.

Now, there are several strategies of data transferring:
* File-based. We can send baseline and delta files as files. Two possible issues here:
** Files contain duplicated pages, so the volume of data will be bigger than necessary.
** Baseline file has to be truncated, because some delta pages go directly into baseline file as optimization.
* Page-based. Latest state of every required page is sent separately. Two strategies here:
** Iterate pages in order of page indexes. Overheads during reads, but writes are very effective.
** Iterate pages in order of delta files, skipping already read pages in the process (like snapshots in GridGain, for example). Little overhead on read, but write won't be append-only.
I would argue that slower reads are more appropriate then slower writes. Generally speaking, any write should be slower than any read of the same size, right?

Should we implement all strategies and give user a choice? It's hard to predict which one is better for which scenario. In the future, I think it would be convenient to implement many options, but at first we should stick to the simplest one.

There must be a common "infrastructure" or a framework to stream native rebalance snapshots. Data format should be as simple as possible.

NOTE: of course, it has to be mentioned that this approach might lead to ineffective storage space usage. It can be a problem in theory, but in practice full rebalance isn't expected to occur often, and event then we don't expect that users will rewrite the entire partition data in a span of a single rebalance.
h2. Possible problems

Given that "raw" data is sent, including sql indexes, all incompleted indexes will be sent incompleted. Maybe we should also send a build state for each index so that the receiving side could continue from the right place, not from the beginning.

This problem will be resolved in the future. Currently we don't have indexes implemented.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)