Posted to issues@ignite.apache.org by "Pavel Pereslegin (Jira)" <ji...@apache.org> on 2020/10/15 11:37:00 UTC
[jira] [Comment Edited] (IGNITE-13366) Special mode for maintenance of Ignite node. Employing Maintenance Mode for clearing corrupted PDS files.

    [ https://issues.apache.org/jira/browse/IGNITE-13366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214625#comment-17214625 ] 

Pavel Pereslegin edited comment on IGNITE-13366 at 10/15/20, 11:36 AM:
-----------------------------------------------------------------------

[~sergeychugunov],

could you please help me with the test failure that occurred after applying this patch?

The test verifies that we can restart the node during rebalancing.
{code:java}
public class RestartDuringRebalancingTest extends GridCommonAbstractTest {
    @Override protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
        return super.getConfiguration(igniteInstanceName).setDataStorageConfiguration(new DataStorageConfiguration()
            .setDefaultDataRegionConfiguration(new DataRegionConfiguration().setPersistenceEnabled(true)));
    }

    @Test
    public void testRestartDuringRebalancing() throws Exception {
        cleanPersistenceDir();

        startGrids(2);

        grid(0).cluster().state(ClusterState.ACTIVE);

        startGrid(2);

        resetBaselineTopology();

        stopAllGrids();

        startGrids(3).cluster().state(ClusterState.ACTIVE);

        awaitPartitionMapExchange();
    }
}
{code}
This test fails with the following exception:
{noformat}
class org.apache.ignite.IgniteCheckedException: Cache groups with potentially corrupted partition files found. To cleanup them maintenance is needed, node will enter maintenance mode on next restart. Cleanup cache group folders manually or trigger maintenance action to do that and restart the node. Corrupted files are located in subdirectories [cache-ignite-sys-cache] in a work dir /home/xtern/src/java/ignite/source/work/db/node02-a9790e24-4880-4d5b-aede-cb1e96308ad7
	at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1438)
	at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2096)
	at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1748)
	at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1143)
	at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:641)
	at org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:1229)
	at org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:1150)
	at org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:1126)
	at org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:995)
	at org.apache.ignite.testframework.junits.GridAbstractTest.startGrids(GridAbstractTest.java:837)
	at org.apache.ignite.internal.processors.cache.persistence.RestartDuringRebalancingTest.testRestartDuringRebalancing(RestartDuringRebalancingTest.java:30)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.apache.ignite.testframework.junits.GridAbstractTest$7.run(GridAbstractTest.java:2373)
	at java.lang.Thread.run(Thread.java:748)
Caused by: class org.apache.ignite.IgniteException: Cache groups with potentially corrupted partition files found. To cleanup them maintenance is needed, node will enter maintenance mode on next restart. Cleanup cache group folders manually or trigger maintenance action to do that and restart the node. Corrupted files are located in subdirectories [cache-ignite-sys-cache] in a work dir /home/xtern/src/java/ignite/source/work/db/node02-a9790e24-4880-4d5b-aede-cb1e96308ad7
	at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.beginRecover(FilePageStoreManager.java:388)
	at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.performBinaryMemoryRestore(GridCacheDatabaseSharedManager.java:1776)
	at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restoreBinaryMemory(GridCacheDatabaseSharedManager.java:837)
	at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.startMemoryRestore(GridCacheDatabaseSharedManager.java:1608)
	at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1282)
	... 20 more
{noformat}
AFAIK, we clear all restored cache data on a baseline change (see the usages of GridCacheDatabaseSharedManager#cleanupRestoredCaches), so why can't we start the node in this case?

I don't know whether this is a bug or a feature. How can I fix this test?

 


> Special mode for maintenance of Ignite node. Employing Maintenance Mode for clearing corrupted PDS files.
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-13366
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13366
>             Project: Ignite
>          Issue Type: New Feature
>          Components: persistence
>    Affects Versions: 2.8.1
>            Reporter: Sergey Chugunov
>            Assignee: Sergey Chugunov
>            Priority: Critical
>              Labels: IEP-53
>             Fix For: 2.10
>
>   Original Estimate: 168h
>          Time Spent: 1h 50m
>  Remaining Estimate: 166h 10m
>
> If a node with persistence is stopped while WAL is disabled for a cache (whether because rebalancing is in progress or by explicit user request), then on the next node start all data files of that cache are removed automatically and unconditionally.
> This behavior may be unexpected for users, as they may not understand all the consequences of disabling WAL locally (for rebalancing) or globally (via an IgniteCluster API call). It is also not smart enough, as there is no point in deleting consistent data files.
> We should change this behavior as follows: no automatic deletions whatsoever. If data files are consistent (equivalent to: no checkpoint was running when the node was stopped), start up normally. If data files are corrupted, don't let the node start.
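
The startup rule proposed in the issue description above can be sketched roughly as follows. This is only an illustrative model, not Ignite's implementation: the class StartupDecisionSketch, the CacheGroupState type, its fields, and the directory names are all hypothetical, and the sketch assumes that "corrupted" means WAL was disabled for the cache group and a checkpoint was running when the node stopped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative model only; none of these types exist in Ignite. */
public class StartupDecisionSketch {
    /** Possible outcomes of the startup check. */
    enum Action { START_NORMALLY, REFUSE_START }

    /** Hypothetical snapshot of one cache group's state at the moment the node stopped. */
    static final class CacheGroupState {
        final String dirName;
        final boolean walDisabled;
        final boolean checkpointWasRunning;

        CacheGroupState(String dirName, boolean walDisabled, boolean checkpointWasRunning) {
            this.dirName = dirName;
            this.walDisabled = walDisabled;
            this.checkpointWasRunning = checkpointWasRunning;
        }
    }

    /**
     * Assumption: files are potentially corrupted only if WAL was disabled
     * and a checkpoint was in progress when the node stopped.
     */
    static List<String> potentiallyCorruptedDirs(List<CacheGroupState> groups) {
        List<String> res = new ArrayList<>();

        for (CacheGroupState g : groups) {
            if (g.walDisabled && g.checkpointWasRunning)
                res.add(g.dirName);
        }

        return res;
    }

    /** No automatic deletion: either start normally or refuse to start. */
    static Action decide(List<CacheGroupState> groups) {
        return potentiallyCorruptedDirs(groups).isEmpty() ? Action.START_NORMALLY : Action.REFUSE_START;
    }

    public static void main(String[] args) {
        // Hypothetical directory names, used for illustration only.
        List<CacheGroupState> groups = Arrays.asList(
            new CacheGroupState("cache-a", true, false),
            new CacheGroupState("cache-b", true, true));

        System.out.println(decide(groups) + " " + potentiallyCorruptedDirs(groups));
    }
}
```

Under this reading, a cache group whose WAL was disabled but whose files are consistent does not block startup; only a group with potentially corrupted files does, which is the situation the exception in the comment above reports.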


