Posted to issues@ignite.apache.org by "Andrey Mashenkov (Jira)" <ji...@apache.org> on 2022/12/01 10:03:00 UTC

[jira] [Comment Edited] (IGNITE-18171) Describe nodes start/stop scenarios

    [ https://issues.apache.org/jira/browse/IGNITE-18171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641806#comment-17641806 ] 

Andrey Mashenkov edited comment on IGNITE-18171 at 12/1/22 10:02 AM:
---------------------------------------------------------------------

[~alapin],
I've attached a pull request with ItClusterStartupTest, which checks some meaningful scenarios.
These tests pass on the PR, but there are some issues.
 # A node start future can't complete if there is no MetaStorage group quorum. This applies even to the start future of a ClusterManagementGroup node, though that may be related to item 3, as the node may be waiting for table initialization or something similar. Is it a bug?
 # A node start future fails after a 30 sec timeout if neither the CMG nor the MetaStorage group has a quorum.
Looks like a bug. Increase the NODE_JOIN_WAIT_TIMEOUT field value to reproduce, e.g. no test passes with a value of 15+ sec.
What should the correct behaviour be?
 # A table can't span a predefined set of nodes. This is not possible without distributed zones support. So, it is OK for now.
 # Calling tx.rollback() on a committed transaction instance fails with an exception. I thought this was a valid pattern.

{code:java}
Transaction tx = node.transactions().begin();
try {
    // operations

    tx.commit();
} finally {
    tx.rollback(); // FAILS here with TransactionException "Fail to finish the transaction inconsistent state"
}{code}
I've commented out the call for now.
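
For reference, here is a minimal sketch of one way to keep the rollback in a finally block without calling it after a successful commit. DemoTx, State and runInTx are hypothetical stand-ins that only mimic the behaviour reported above (finishing an already finished transaction throws). They are not the Ignite Transaction API, and the exception type is an assumption.

```java
class TxPatternSketch {
    enum State { ACTIVE, COMMITTED, ROLLED_BACK }

    /** Hypothetical stand-in for a transaction; NOT the Ignite Transaction API. */
    static final class DemoTx {
        private State state = State.ACTIVE;

        void commit() {
            finish(State.COMMITTED);
        }

        void rollback() {
            finish(State.ROLLED_BACK);
        }

        State state() {
            return state;
        }

        // Assumption: finishing an already finished transaction is an error,
        // which mirrors the failure described in the comment above.
        private void finish(State target) {
            if (state != State.ACTIVE) {
                throw new IllegalStateException("Transaction is already finished: " + state);
            }
            state = target;
        }
    }

    /** Runs the operations and commits; rolls back only if the commit did not complete. */
    static void runInTx(DemoTx tx, Runnable operations) {
        boolean committed = false;
        try {
            operations.run();
            tx.commit();
            committed = true;
        } finally {
            if (!committed) {
                tx.rollback(); // Reached only when an operation or the commit itself threw.
            }
        }
    }

    public static void main(String[] args) {
        DemoTx ok = new DemoTx();
        runInTx(ok, () -> { });
        System.out.println(ok.state());

        DemoTx failed = new DemoTx();
        try {
            runInTx(failed, () -> { throw new IllegalArgumentException("operation failed"); });
        } catch (IllegalArgumentException e) {
            System.out.println(failed.state());
        }
    }
}
```

With the flag, the original exception from the operations still propagates to the caller, and rollback() is never invoked on a committed transaction. Whether rollback() after commit() should instead be a no-op is the open question above.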


> Describe nodes start/stop scenarios
> -----------------------------------
>
>                 Key: IGNITE-18171
>                 URL: https://issues.apache.org/jira/browse/IGNITE-18171
>             Project: Ignite
>          Issue Type: Improvement
>          Components: sql
>            Reporter: Andrey Mashenkov
>            Assignee: Andrey Mashenkov
>            Priority: Major
>              Labels: ignite-3
>
> h2. Definitions.
> We can distinguish the following cluster node groups. Each node may be part of one or more groups.
>  * Cluster Management Group (CMG), which controls the process of new nodes joining.
>  * MetaStorage group (MSG), which hosts the meta storage.
>  * Data node group (DNG), which just hosts table partitions.
> The components (CMG, meta storage, table components) depend on each other, but may reside on different (even disjoint) node subsets. So, some components may become temporarily unavailable, and dependent components must be aware of such issues and handle them (wait, retry, throw an exception, or whatever) in an expected way, which also has to be documented.
> [See IEP for details|https://cwiki.apache.org/confluence/display/IGNITE/IEP-77%3A+Node+Join+Protocol+and+Initialization+for+Ignite+3]
> h2. Motivation.
> As of now, the correct way to start the grid (after it was stopped) is: start CMG nodes, then Meta Storage nodes, then Data nodes. To stop it correctly, use the reverse order. Other scenarios are not tested and may lead to unexpected behaviour.
> Let's describe all possible scenarios and the expected behaviour for each of them, and extend test coverage.


