Posted to dev@ignite.apache.org by Ilya Kasnacheev <il...@apache.org> on 2021/01/20 16:28:26 UTC

WAL enable/disable does not work on unstable topology - removal or warning

Hello!

We have had this feature for a few versions: you can call
ignite.cluster().disableWal() to temporarily disable WAL on a specific
cache, which involves a PME (partition map exchange) and a checkpoint on
every node.

However, it has become apparent that you cannot enable or disable WAL on
any kind of unstable topology at all:
https://issues.apache.org/jira/browse/IGNITE-13976

You cannot even disable WAL while a baseline node is offline: when it comes
back, it will not sync its WAL-enabled status with the rest of the cluster,
and all subsequent "WAL enable" or "WAL disable" operations will fail on
that cache, with no clear way to recover it:

ignite.close(); // take one baseline node offline
client.cluster().disableWal(CACHE_NAME); // disable WAL while that node is down
nodes.add(Ignition.start(igniteCfg(false, consistentId))); // bring the node back
client.cluster().enableWal(CACHE_NAME); // will fail

Even if this simple scenario is fixed, there seem to be multiple failure
scenarios if you try to add or remove a node in the middle of a WAL state
change operation. We do not seem to have any expertise in the WAL
disable/enable implementation right now, and I did not find a simple way of
fixing it short of a full rewrite.

Therefore, I propose that we should *(a) disable that feature* in 2.10 or
*(b) give a clear warning* when it is used, and also mention in the
documentation that it may only be used on stable topology.
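
To illustrate option (b), here is a minimal sketch of a pre-check that
refuses a WAL state change when any baseline node is offline.
WalTopologyGuard and its methods are hypothetical names, not part of Ignite.
The consistent IDs would have to be gathered by the caller, for example from
ignite.cluster().currentBaselineTopology() and
ignite.cluster().forServers().nodes(); that wiring is an assumption and is
not shown here.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

/**
 * Hypothetical pre-check (not an Ignite class): allow a WAL state change
 * only when every baseline node is currently online.
 */
public class WalTopologyGuard {
    /** Returns the baseline consistent IDs that have no matching online server node. */
    public static Set<Object> missingBaselineNodes(Collection<?> baselineIds, Collection<?> onlineIds) {
        Set<Object> missing = new LinkedHashSet<>(baselineIds);
        missing.removeAll(onlineIds);
        return missing;
    }

    /** Throws if any baseline node is offline, so the caller never starts the WAL change. */
    public static void requireStableTopology(Collection<?> baselineIds, Collection<?> onlineIds) {
        Set<Object> missing = missingBaselineNodes(baselineIds, onlineIds);

        if (!missing.isEmpty())
            throw new IllegalStateException(
                "WAL state change refused, baseline nodes offline: " + missing);
    }
}
```

A check like this only narrows the window: a node can still join or leave
between the check and the actual disableWal() call, so it does not address
failures in the middle of a WAL state change.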

We may also want to re-mark this feature's API as @IgniteExperimental.
I have raised this ticket to Blocker.

WDYT?

Regards,

Re: WAL enable/disable does not work on unstable topology - removal or warning

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

I have created a separate ticket for that work:
https://issues.apache.org/jira/browse/IGNITE-14039
I have designated it as a blocker for 2.10.
I have submitted a pull request and will run the tests:
https://github.com/apache/ignite/pull/8688/files

Mikhail, your approach makes sense. However, I don't think we can change
the existing API: it is too late for that, and we have to maintain
compatibility.

Regards,
-- 
Ilya Kasnacheev


On Thu, 21 Jan 2021 at 15:33, Maxim Muzafarov <mm...@apache.org> wrote:

> Ilya,
>
> This issue must definitely be fixed (though I don't think we should
> rewrite it from scratch).
>
> Let's add a TODO and a warning comment referencing this issue to the
> Javadoc, and add the same warning to the documentation pages. The
> reference to the issue will let users track progress on the fix.

Re: WAL enable/disable does not work on unstable topology - removal or warning

Posted by Maxim Muzafarov <mm...@apache.org>.
Ilya,

This issue must definitely be fixed (though I don't think we should
rewrite it from scratch).

Let's add a TODO and a warning comment referencing this issue to the
Javadoc, and add the same warning to the documentation pages. The
reference to the issue will let users track progress on the fix.


On Wed, 20 Jan 2021 at 22:39, Mikhail Cherkasov <mc...@gridgain.com> wrote:
>
> Hi Ilya,
>
> WAL disable is a very powerful feature that is widely adopted by users.
> We certainly need to fix it, even if that means rewriting it.
> The warning makes sense; in that case we can even reduce the priority of
> the issue, but it is still at least a critical one, because it can lead to
> data loss (and it does).
> Instead of a warning, I would suggest something more noticeable, such as
> a method signature change:
>  boolean disableWal(String cacheName, boolean iReadJavaDockAndAwareOfTheRisk);
> This one will definitely be noticed.
>
> Thanks,
> Mike.

Re: WAL enable/disable does not work on unstable topology - removal or warning

Posted by Mikhail Cherkasov <mc...@gridgain.com>.
Hi Ilya,

WAL disable is a very powerful feature that is widely adopted by users.
We certainly need to fix it, even if that means rewriting it.
The warning makes sense; in that case we can even reduce the priority of
the issue, but it is still at least a critical one, because it can lead to
data loss (and it does).
Instead of a warning, I would suggest something more noticeable, such as
a method signature change:
 boolean disableWal(String cacheName, boolean iReadJavaDockAndAwareOfTheRisk);
This one will definitely be noticed.
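
As a rough illustration of this idea, the sketch below adds the
acknowledgment flag as an extra overload and keeps the one-argument method
in place, marked @Deprecated so that callers get a compiler warning.
WalControl is a hypothetical stand-in that only tracks state in memory; it
is not the Ignite API, and the riskAcknowledged parameter name and the
exception type are assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical in-memory stand-in used only to illustrate the overload; not the Ignite API. */
public class WalControl {
    /** Names of caches whose WAL is currently disabled. */
    private final Set<String> walDisabled = new HashSet<>();

    /** Existing-style method, kept so that current callers still compile. */
    @Deprecated
    public boolean disableWal(String cacheName) {
        // Returns true only if this call changed the state.
        return walDisabled.add(cacheName);
    }

    /** Added overload: refuses to act unless the caller explicitly accepts the risk. */
    public boolean disableWal(String cacheName, boolean riskAcknowledged) {
        if (!riskAcknowledged)
            throw new IllegalArgumentException(
                "Disabling WAL on an unstable topology can lose data; pass true to proceed");

        return disableWal(cacheName);
    }

    /** Reports whether WAL is enabled for the given cache in this toy model. */
    public boolean isWalEnabled(String cacheName) {
        return !walDisabled.contains(cacheName);
    }
}
```

An additive overload like this leaves existing code compiling, at the cost
that the deprecation warning is easier to overlook than a changed signature.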

Thanks,
Mike.

-- 
Thanks,
Mikhail.