Posted to issues@ignite.apache.org by "Roman Puchkovskiy (Jira)" <ji...@apache.org> on 2024/02/09 06:43:00 UTC

[jira] [Updated] (IGNITE-18595) Implement index build process during the full state transfer

     [ https://issues.apache.org/jira/browse/IGNITE-18595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roman Puchkovskiy updated IGNITE-18595:
---------------------------------------
    Description: 
Before starting to accept tuples during a full state transfer, we should take the list of all the indices of the table in question that are in states between REGISTERED and READ_ONLY at the Catalog version passed with the full state transfer. Let’s put them in the *CurrentIndices* list.
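
A minimal sketch of collecting *CurrentIndices*, assuming a simplified view of the catalog (the IndexState enum, IndexDescriptor record, and CurrentIndicesCollector below are hypothetical names, not the actual Catalog API; in this simplified enum the qualifying range happens to cover every listed state):

{code:java}
import java.util.EnumSet;
import java.util.List;
import java.util.stream.Collectors;

/** Index lifecycle states as described in this ticket (names assumed, declared in lifecycle order). */
enum IndexState { REGISTERED, BACKFILLING, AVAILABLE, STOPPING, READ_ONLY }

/** Hypothetical minimal view of a catalog index descriptor. */
record IndexDescriptor(int indexId, int tableId, IndexState state) { }

class CurrentIndicesCollector {
    /** States between REGISTERED and READ_ONLY (inclusive) that qualify for CurrentIndices. */
    private static final EnumSet<IndexState> QUALIFYING =
            EnumSet.range(IndexState.REGISTERED, IndexState.READ_ONLY);

    /**
     * Collects CurrentIndices: all indices of the given table whose state, at the
     * catalog version shipped with the full state transfer, lies between
     * REGISTERED and READ_ONLY.
     */
    static List<IndexDescriptor> currentIndices(List<IndexDescriptor> catalogIndices, int tableId) {
        return catalogIndices.stream()
                .filter(idx -> idx.tableId() == tableId)
                .filter(idx -> QUALIFYING.contains(idx.state()))
                .collect(Collectors.toList());
    }
}
{code}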

Then, for each tuple version we accept (a sketch of this decision logic follows the list):
 # If it’s committed, only consider indices from *CurrentIndices* that are not in the REGISTERED state now. We don’t need to index committed versions for REGISTERED indices, as they will be indexed by the backfiller (after the index switches to BACKFILLING). For each remaining index in {*}CurrentIndices{*}, put the tuple version into the index if one of the following is true:
 ## The index state is not READ_ONLY at the snapshot catalog version (so it’s one of BACKFILLING, AVAILABLE, STOPPING) - because these tuples can still be read by both RW and RO transactions via the index
 ## The index state is READ_ONLY at the snapshot catalog version, but at commitTs the index either did not yet exist or was in a state strictly preceding STOPPING (we don’t include tuples committed while the index was in STOPPING because, from the point of view of RO transactions, it’s impossible to query such tuples via the index [it is not queryable at those timestamps], new RW transactions don’t see the index, and old RW transactions [that saw it] have already finished)

 # If it’s a Write Intent, then:
 ## If the index is in the REGISTERED state at the snapshot catalog version, add the tuple to the index if its transaction was started while the index was in the REGISTERED state; otherwise, skip it, as it will be indexed by the backfiller.
 ## If the index is in any of BACKFILLING, AVAILABLE, STOPPING states at the snapshot catalog version, add the tuple to the index
 ## If the index is in READ_ONLY state at the snapshot catalog version, add the tuple to the index only if the transaction had been started before the index switched from the AVAILABLE state (this is to index a write intent from a finished, but not yet cleaned up, transaction)
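
Here is a sketch of the decision logic above, under the assumption that the Catalog can answer "what was the index state at this timestamp / catalog version" queries; every name below (IndexDecision, State, the state-lookup parameters) is hypothetical, not an existing API:

{code:java}
class IndexDecision {
    /** Lifecycle states in declaration order; ABSENT means the index did not exist yet. */
    enum State { REGISTERED, BACKFILLING, AVAILABLE, STOPPING, READ_ONLY, ABSENT }

    /** Rule 1: should a committed tuple version (committed at commitTs) go into the index? */
    static boolean indexCommitted(State stateNow, State stateAtSnapshot, State stateAtCommitTs) {
        if (stateNow == State.REGISTERED) {
            return false; // 1: the backfiller will index it after the switch to BACKFILLING.
        }
        if (stateAtSnapshot != State.READ_ONLY) {
            return true; // 1.1: still readable via the index by RW and RO transactions.
        }
        // 1.2: READ_ONLY at the snapshot version - index only if, at commitTs, the index
        // either did not exist yet or was in a state strictly preceding STOPPING.
        return stateAtCommitTs == State.ABSENT
                || stateAtCommitTs.ordinal() < State.STOPPING.ordinal();
    }

    /**
     * Rule 2: should a Write Intent go into the index? txStartState is the index state
     * at the moment the intent's transaction began (ABSENT if the index did not exist).
     */
    static boolean indexWriteIntent(State stateAtSnapshot, State txStartState) {
        switch (stateAtSnapshot) {
            case REGISTERED:
                // 2.1: only if the transaction started while the index was REGISTERED;
                // otherwise the backfiller will pick it up.
                return txStartState == State.REGISTERED;
            case BACKFILLING:
            case AVAILABLE:
            case STOPPING:
                return true; // 2.2
            case READ_ONLY:
                // 2.3: only if the transaction started before the index left AVAILABLE
                // (covers a finished but not yet cleaned-up transaction).
                return txStartState != State.ABSENT
                        && txStartState.ordinal() <= State.AVAILABLE.ordinal();
            default:
                return false;
        }
    }
}
{code}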

Unlike the Backfiller operation, during a full state transfer we don’t need to use the Write Intent resolution procedure, as races with transaction cleanup are not possible: we simply index the Write Intent. If, after the partition replica goes online, it gets a cleanup request with ABORT, it will clean the index itself.

If the initial state of an index during the full state transfer was BACKFILLING and, while accepting the full state transfer, we see that the index has been dropped (and moved to the [deleted] pseudostate), we should stop writing to that index (and allow it to be destroyed on that partition).

If we start a full state transfer on a partition for which an index is being built (so the index is in the BACKFILLING state), we’ll index the accepted tuples according to the rules above. After the full state transfer finishes, we’ll start getting ‘add this batch to the index’ commands from the RAFT log (as the Backfiller emits them during the backfilling process); we can either ignore or reapply them. To ignore them, we can raise a special flag in the index storage when finishing a full state transfer that started with the index in the BACKFILLING state.
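
A rough sketch of the suppression-flag idea (BackfillSuppression and its methods are hypothetical, not an existing storage API):

{code:java}
/**
 * When a full state transfer that started with the index in BACKFILLING finishes,
 * we raise a flag; backfill batch commands replayed from the RAFT log afterwards
 * become no-ops for this partition.
 */
class BackfillSuppression {
    private volatile boolean backfillCoveredByFullStateTransfer;

    void onFullStateTransferFinished(boolean indexWasBackfillingAtStart) {
        if (indexWasBackfillingAtStart) {
            backfillCoveredByFullStateTransfer = true;
        }
    }

    /** Called when an 'add this batch to the index' command arrives from the RAFT log. */
    void onBackfillBatchCommand(Runnable applyBatch) {
        if (backfillCoveredByFullStateTransfer) {
            return; // The snapshot already indexed these tuples; reapplying would also be safe.
        }
        applyBatch.run();
    }
}
{code}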
h1. Old version

Here, there is no source of information about the schema versions associated with individual inserts. The core idea of the full rebalance is that all versions of all rows are sent, while indexes are rebuilt locally on the consumer. This is unfortunate. Why, you may ask.

Imagine the following situation:
 * time T1: table A with index X is created
 * time T2: user uploads the data
 * time T3: user drops index X
 * time T4: “clean” node N enters topology and downloads data via full rebalance procedure
 * time T5: N becomes a leader and receives (already running) RO transactions with timestamp T such that T2 < T < T3

Ideally, index X should be available for timestamp T. If the index is already available, it can’t suddenly become unavailable without an explicit rebuild request from the user (I guess).

The LATEST schema version at the moment of rebalance must be known. That’s unavoidable and makes total sense. The first idea that comes to mind is updating all Registered and Available indexes. A situation where an index has more indexed rows than it requires is correct: scan queries only return indexed rows that match the corresponding value in the partition MV store. The real problem would be having less data than required.
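
An illustrative sketch (not Ignite’s actual read path) of why over-indexing is safe: each index hit is re-checked against the partition MV store before being returned, so stale or extra entries are filtered out. All types here are hypothetical:

{code:java}
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

class ValidatedIndexScan {
    /** Hypothetical lookup of the currently stored indexed value for a row. */
    interface MvStore { byte[] indexedValueOf(long rowId); }

    record IndexEntry(long rowId, byte[] indexedValue) { }

    static List<Long> scan(List<IndexEntry> indexHits, MvStore store) {
        return indexHits.stream()
                // Keep a hit only if the MV store still holds the same indexed value;
                // entries the index "over-covers" are silently dropped here.
                .filter(e -> Objects.deepEquals(e.indexedValue(), store.indexedValueOf(e.rowId())))
                .map(IndexEntry::rowId)
                .collect(Collectors.toList());
    }
}
{code}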

The approach described in the paragraph above is not quite correct. Consider a BinaryRow version: it defines the set of columns in the table at the moment of the update. Not all row versions are compatible with all indexes. For example, you cannot put data into an index if a column has been deleted. On the other hand, you can put data into the index if a column has not yet been created (assuming it has a default value). In both cases the column is missing from the row version, but the outcome is very different.

This fact has some implications. The set of indexes to be updated depends on the row version of every particular row. I propose calculating it per index as a {_}maximal continuous range of db schemas{_} that (if not empty) starts with the earliest known schema and in which _no indexed column has been dropped_ (an indexed column that has not yet been created is fine, per the above, since its default value can be used). A sketch of this calculation follows the example tables below.

For example, there’s a table T:
|DB schema version|Table columns|
|1|PK, A|
|2|PK, A, B|
|3 (LATEST)|PK, B|

 

In such a configuration, the ranges would be:
|Index columns|Schemas range|
|A|[1 ... 2]|
|B|[1 ... 3]|
|A, B|[1 ... 2]|
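
A sketch of the proposed range calculation, reproducing the table above (the schemaRange helper and its types are hypothetical):

{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class IndexSchemaRanges {
    /** schemas.get(i) is the column set of schema version i + 1, earliest first. */
    static int[] schemaRange(List<Set<String>> schemas, Set<String> indexedColumns) {
        Set<String> seen = new HashSet<>();
        int last = 0;
        for (int i = 0; i < schemas.size(); i++) {
            Set<String> columns = schemas.get(i);
            for (String col : indexedColumns) {
                if (seen.contains(col) && !columns.contains(col)) {
                    // The column existed in an earlier schema and is now gone: it was
                    // dropped, so the range ends at the previous schema version.
                    return new int[] {1, last};
                }
            }
            seen.addAll(columns);
            last = i + 1; // A column never seen so far is merely "not yet created".
        }
        return new int[] {1, last};
    }

    public static void main(String[] args) {
        List<Set<String>> schemas = List.of(
                Set.of("PK", "A"),       // version 1
                Set.of("PK", "A", "B"),  // version 2
                Set.of("PK", "B"));      // version 3 (LATEST)

        System.out.println(Arrays.toString(schemaRange(schemas, Set.of("A"))));      // [1, 2]
        System.out.println(Arrays.toString(schemaRange(schemas, Set.of("B"))));      // [1, 3]
        System.out.println(Arrays.toString(schemaRange(schemas, Set.of("A", "B")))); // [1, 2]
    }
}
{code}

Running this reproduces the ranges in the table: [1 ... 2] for A, [1 ... 3] for B, and [1 ... 2] for (A, B).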


> Implement index build process during the full state transfer
> ------------------------------------------------------------
>
>                 Key: IGNITE-18595
>                 URL: https://issues.apache.org/jira/browse/IGNITE-18595
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Ivan Bessonov
>            Assignee: Kirill Tkalenko
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)