Posted to user@ignite.apache.org by 38797715 <38...@qq.com> on 2022/10/12 13:49:10 UTC

Re: partitionLossPolicy confused

https://issues.apache.org/jira/browse/IGNITE-17835

On 2022/9/30 18:14, Вячеслав Коптилин wrote:
> Hello,
>
> In general, there are two possible ways to handle lost partitions for a 
> cluster that uses Ignite Native Persistence:
> 1.
>    - Return all failed nodes to the baseline topology.
>    - Call resetLostPartitions (see the sketch below).
>
> 2.
>    - Stop all remaining nodes in the cluster.
>    - Start all nodes in the cluster (including the previously failed 
> nodes) and activate the cluster.
>
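> For illustration, option 1 could look like this from the Java API (a 
> minimal sketch; it assumes the failed nodes have already been restarted 
> and rejoined the topology, and it uses the "City" cache from the 
> example below):
>
>     import java.util.Collections;
>     import org.apache.ignite.Ignite;
>     import org.apache.ignite.Ignition;
>
>     public class ResetLostPartitions {
>         public static void main(String[] args) {
>             // Join the existing cluster (assumes a suitable
>             // configuration on the classpath or default settings).
>             Ignite ignite = Ignition.start();
>
>             // Clear the LOST state of the cache's partitions so that
>             // reads and writes are allowed again.
>             ignite.resetLostPartitions(Collections.singleton("City"));
>         }
>     }
>
> The same reset is also exposed as the reset_lost_partitions operation 
> mentioned later in this thread.
>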
> It's important to return all failed nodes to the topology before 
> calling resetLostPartitions; otherwise the cluster could end up with 
> stale data.
>
> If some owners cannot be returned to the topology for some reason, 
> they should be excluded from the baseline before attempting to reset 
> the lost partition state, or a ClusterTopologyCheckedException will be 
> thrown with the message "Cannot reset lost partitions because no 
> baseline nodes are online [cache=someCache, partition=someLostPart]", 
> indicating that safe recovery is not possible.
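>
> For illustration, a hedged Java sketch of that sequence (it assumes 
> baseline auto-adjust is disabled, since a manual baseline change is 
> rejected otherwise; the class and cache names are illustrative):
>
>     import java.util.Collections;
>     import org.apache.ignite.Ignite;
>     import org.apache.ignite.Ignition;
>
>     public class ExcludeFailedOwners {
>         public static void main(String[] args) {
>             Ignite ignite = Ignition.start();
>
>             // Shrink the baseline to the currently alive server
>             // nodes, so that permanently failed owners are no longer
>             // part of it.
>             ignite.cluster().setBaselineTopology(
>                 ignite.cluster().topologyVersion());
>
>             // With no offline baseline owners left, the reset
>             // succeeds instead of throwing.
>             ignite.resetLostPartitions(Collections.singleton("City"));
>         }
>     }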
>
> In your particular case, the cache does not have backups, so returning 
> the node that holds a lost partition should not lead to data 
> inconsistencies.
> This particular case can be detected and automatically "resolved". I 
> will file a JIRA ticket in order to address this improvement.
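>
> As a side note, the loss could be avoided in the first place by keeping 
> backup copies. A hedged configuration sketch (the class name is 
> illustrative; with SQL DDL the same effect should be achievable via 
> WITH "backups=1"):
>
>     import org.apache.ignite.cache.PartitionLossPolicy;
>     import org.apache.ignite.configuration.CacheConfiguration;
>
>     public class CityCacheConfig {
>         public static CacheConfiguration<Object, Object> cityCache() {
>             CacheConfiguration<Object, Object> ccfg =
>                 new CacheConfiguration<>("City");
>
>             // One backup copy per partition: losing a single node no
>             // longer marks partitions as lost.
>             ccfg.setBackups(1);
>
>             // Fail reads and writes to lost partitions rather than
>             // serving possibly stale data.
>             ccfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);
>
>             return ccfg;
>         }
>     }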
>
> Thanks,
> Slava.
>
> On Mon, Sep 26, 2022 at 16:51, 38797715 <38...@qq.com> wrote:
>
>     Hello,
>
>     Start two nodes with native persistence enabled, and then activate
>     the cluster.
>
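>     For reference, a hedged Java sketch of such a setup (the actual
>     test may have used XML configuration instead; names are
>     illustrative):
>
>         import org.apache.ignite.Ignite;
>         import org.apache.ignite.Ignition;
>         import org.apache.ignite.cluster.ClusterState;
>         import org.apache.ignite.configuration.DataStorageConfiguration;
>         import org.apache.ignite.configuration.IgniteConfiguration;
>
>         public class StartPersistentNode {
>             public static void main(String[] args) {
>                 IgniteConfiguration cfg = new IgniteConfiguration();
>
>                 // Enable native persistence for the default data region.
>                 DataStorageConfiguration storage =
>                     new DataStorageConfiguration();
>                 storage.getDefaultDataRegionConfiguration()
>                     .setPersistenceEnabled(true);
>                 cfg.setDataStorageConfiguration(storage);
>
>                 Ignite ignite = Ignition.start(cfg);
>
>                 // A persistent cluster starts inactive; activate it
>                 // once both nodes have joined.
>                 ignite.cluster().state(ClusterState.ACTIVE);
>             }
>         }
>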
>     Create a table with no backups, using SQL like the following:
>
>     CREATE TABLE City (
>       ID INT,
>       Name VARCHAR,
>       CountryCode CHAR(3),
>       District VARCHAR,
>       Population INT,
>       PRIMARY KEY (ID, CountryCode)
>     ) WITH "template=partitioned, affinityKey=CountryCode,
>     CACHE_NAME=City, KEY_TYPE=demo.model.CityKey,
>     VALUE_TYPE=demo.model.City";
>
>     INSERT INTO City(ID, Name, CountryCode, District, Population)
>     VALUES (1,'Kabul','AFG','Kabol',1780000);
>     INSERT INTO City(ID, Name, CountryCode, District, Population)
>     VALUES (2,'Qandahar','AFG','Qandahar',237500);
>
>     Then execute SELECT COUNT(*) FROM City;
>
>     The result is normal.
>
>     Then kill one node.
>
>     Then execute SELECT COUNT(*) FROM City;
>
>     Failed to execute query because cache partition has been lost
>     [cacheName=City, part=0]
>
>     This is also normal.
>
>     Next, start the node that was shut down before.
>
>     Then execute SELECT COUNT(*) FROM City;
>
>     Failed to execute query because cache partition has been lost
>     [cacheName=City, part=0]
>
>     At this time, all partitions have been recovered and all baseline
>     nodes are ONLINE. Why is this error still reported? It is very
>     confusing. Executing the reset_lost_partitions operation at this
>     point seems redundant. Are there any special considerations here?
>
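>     For what it's worth, the lingering LOST state can be observed and
>     cleared from the Java API (a hedged sketch; cache name as above):
>
>         import java.util.Collection;
>         import java.util.Collections;
>         import org.apache.ignite.Ignite;
>         import org.apache.ignite.Ignition;
>
>         public class CheckLostPartitions {
>             public static void main(String[] args) {
>                 Ignite ignite = Ignition.start();
>
>                 // The LOST state survives the owner's return; it is
>                 // only cleared by an explicit reset.
>                 Collection<Integer> lost =
>                     ignite.cache("City").lostPartitions();
>                 System.out.println("Lost partitions of City: " + lost);
>
>                 if (!lost.isEmpty())
>                     ignite.resetLostPartitions(
>                         Collections.singleton("City"));
>             }
>         }
>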
>     If the whole cluster is restarted at this point and SELECT
>     COUNT(*) FROM City; is executed again, the result is normal. This
>     state is the same as the previous one, but the behavior is
>     different.
>
>
>
>
>