Posted to dev@ignite.apache.org by Andrey Mashenkov <an...@gmail.com> on 2018/03/12 10:33:54 UTC

Partition recovery issue on partition loss.

Hi Igniters,

I've found that we have no documentation on how a user can recover a cache
from a cacheStore in case of partition loss.
Ignite provides some instruments (methods and events) that should help the
user solve this problem,
but it looks like these instruments have architectural flaws.

The first one is a usability issue. Ignite provides a partition loss event
so the user can handle this, but Ignite fires one event per partition.
Why can't we have a single event with a list of lost partitions?
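
Until such an event exists, an application could batch the per-partition
notifications itself. Below is a minimal sketch of that idea in plain Java.
LostPartitionBatcher and its methods are hypothetical names, not Ignite API;
the listener that feeds it is assumed to exist elsewhere.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical helper (not part of Ignite): batches per-partition loss notifications. */
class LostPartitionBatcher {
    // cache name -> sorted set of lost partition ids seen since the last drain
    private final Map<String, TreeSet<Integer>> pending = new HashMap<>();

    /** Call once for each per-partition loss event. */
    synchronized void onPartitionLost(String cacheName, int partition) {
        pending.computeIfAbsent(cacheName, k -> new TreeSet<>()).add(partition);
    }

    /** Returns and clears the partitions collected for the cache so far. */
    synchronized List<Integer> drain(String cacheName) {
        TreeSet<Integer> parts = pending.remove(cacheName);
        return parts == null ? new ArrayList<>() : new ArrayList<>(parts);
    }

    public static void main(String[] args) {
        LostPartitionBatcher batcher = new LostPartitionBatcher();
        batcher.onPartitionLost("orders", 7);
        batcher.onPartitionLost("orders", 3);
        System.out.println(batcher.drain("orders")); // [3, 7]
    }
}
```

Batching on the application side only addresses the usability complaint; it
does nothing about the race described next.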

The second one is a bug. The Ignite.resetLostPartitions() method doesn't care
which topology version the recovered partitions belonged to.
There is a race when the user calls this method after a node has failed, but
right before Ignite fires the event.
So it is possible that the state of just-lost partitions will be reset
unexpectedly.
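
The race can be reproduced with a toy model. The sketch below is not Ignite
code; UnversionedResetDemo and its methods are made up for illustration and
only mimic a reset call that clears every lost partition regardless of when
it was lost.

```java
import java.util.Set;
import java.util.TreeSet;

/** Toy model (not Ignite code) of a reset call that ignores topology versions. */
class UnversionedResetDemo {
    private final Set<Integer> lost = new TreeSet<>();      // partitions marked lost
    private final Set<Integer> reloaded = new TreeSet<>();  // partitions the app restored

    void nodeFailed(int... partitions) {
        for (int p : partitions) lost.add(p);
    }

    void reloadFromStore(int... partitions) {
        for (int p : partitions) reloaded.add(p);
    }

    /** Clears the lost state of every partition; returns those reset without a reload. */
    Set<Integer> resetLostPartitions() {
        Set<Integer> neverReloaded = new TreeSet<>(lost);
        neverReloaded.removeAll(reloaded);
        lost.clear();
        return neverReloaded;
    }

    public static void main(String[] args) {
        UnversionedResetDemo cache = new UnversionedResetDemo();
        cache.nodeFailed(1, 2);        // first failure, the app is notified
        cache.reloadFromStore(1, 2);   // the app reloads what it knows about
        cache.nodeFailed(3);           // second failure, event not delivered yet
        // The app resets based on stale knowledge, so partition 3 is reopened empty.
        System.out.println("Reset without reload: " + cache.resetLostPartitions()); // [3]
    }
}
```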


I've created a ticket for this [1] and think we should rethink the
architecture of the partition recovery mechanics and improve documentation.
Any thoughts?

[1] https://issues.apache.org/jira/browse/IGNITE-7832


-- 
Best regards,
Andrey V. Mashenkov

Re: Partition recovery issue on partition loss.

Posted by Andrey Mashenkov <an...@gmail.com>.

Dmitry,

Not yet, it looks like an architectural issue.
Ignite provides an error-prone method for resetting the lost-partition state
in its public API.

However, IGNITE-5302 is about an issue during node startup. I think we can
use some internals (e.g. the topology version) to work around it.

If you are sure that fixing IGNITE-5302 will also solve 7832, then you may
close it as a duplicate.
If IGNITE-5302 can be fixed with a workaround, e.g. by checking the
topology version when resetting partition state, then let's keep 7832 open.

On Mon, Mar 12, 2018 at 1:37 PM, Dmitry Pavlov <dp...@gmail.com>
wrote:

> Hi Andrey,
>
> I remember some issue was also found in tests:
> https://issues.apache.org/jira/browse/IGNITE-5302
>
> Is this a consequence of the same problem?
>
> Sincerely,
>
> Dmitriy Pavlov



-- 
Best regards,
Andrey V. Mashenkov

Re: Partition recovery issue on partition loss.

Posted by Dmitry Pavlov <dp...@gmail.com>.

Hi Andrey,

I remember some issue was also found in tests:
https://issues.apache.org/jira/browse/IGNITE-5302

Is this a consequence of the same problem?

Sincerely,

Dmitriy Pavlov


Re: Partition recovery issue on partition loss.

Posted by Dmitry Pavlov <dp...@gmail.com>.

Denis, it seems no one is working on it.

On Thu, Mar 22, 2018 at 9:26 PM, Denis Magda <dm...@apache.org> wrote:

> Igniters,
>
> Is anybody working on this bug? There is a high chance we can add a fix to
> 2.5 if the community agrees to release it earlier.
>
> --
> Denis

Re: Partition recovery issue on partition loss.

Posted by Denis Magda <dm...@apache.org>.

Igniters,

Is anybody working on this bug? There is a high chance we can add a fix to
2.5 if the community agrees to release it earlier.

--
Denis

On Thu, Mar 15, 2018 at 11:04 AM, Denis Magda <dm...@apache.org> wrote:

> I dared to set the fix version to 2.5 and increased the severity. It's
> important to fix the race since we've just released the partition loss
> functionality in 2.4 and it's already broken.
>
> Andrey, please keep us posted. If you don't plan to fix it, we will need to
> find another contributor.
>
> --
> Denis

Re: Partition recovery issue on partition loss.

Posted by Denis Magda <dm...@apache.org>.

I dared to set the fix version to 2.5 and increased the severity. It's
important to fix the race since we've just released the partition loss
functionality in 2.4 and it's already broken.

Andrey, please keep us posted. If you don't plan to fix it, we will need to
find another contributor.

--
Denis

On Thu, Mar 15, 2018 at 7:29 AM, Dmitry Pavlov <dp...@gmail.com>
wrote:

> Hi Andrew Mashenkov,
>
> would you like to pick up the issue?
>
> Sincerely,
> Dmitriy Pavlov

Re: Partition recovery issue on partition loss.

Posted by Dmitry Pavlov <dp...@gmail.com>.

Hi Andrew Mashenkov,

would you like to pick up the issue?

Sincerely,
Dmitriy Pavlov

On Thu, Mar 15, 2018 at 6:23 AM, Dmitriy Setrakyan <ds...@apache.org> wrote:

> Completely agree, we must fix this. I like the proposed design. We should
> also specify that the resetLostPartitions() method should return true or
> false.
>
> Val, do you mind updating the ticket with new design?
> https://issues.apache.org/jira/browse/IGNITE-7832
>
> D.

Re: Partition recovery issue on partition loss.

Posted by Dmitriy Setrakyan <ds...@apache.org>.

Completely agree, we must fix this. I like the proposed design. We should
also specify that the resetLostPartitions() method should return true or false.
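
A boolean result would let the caller notice a rejected reset and try again.
The sketch below shows one possible caller-side loop in plain Java. It assumes
the versioned reset from the proposed design; ReloadUntilReset, recover and
the three callbacks are hypothetical stand-ins, not Ignite API.

```java
import java.util.function.LongPredicate;
import java.util.function.LongSupplier;

/** Hypothetical caller-side loop; none of these names are Ignite API. */
class ReloadUntilReset {
    /**
     * Repeats reload-then-reset until the reset is accepted.
     * Returns the number of reload passes that were needed.
     */
    static int recover(LongSupplier currentVersion, Runnable reloadFromStore, LongPredicate tryReset) {
        int passes = 0;
        while (true) {
            long seen = currentVersion.getAsLong();  // version observed before reloading
            reloadFromStore.run();
            passes++;
            if (tryReset.test(seen))                 // false: topology changed meanwhile
                return passes;
        }
    }

    public static void main(String[] args) {
        long[] version = {1};
        // Simulated reset: accepted only if the caller saw the current version.
        int passes = recover(() -> version[0], () -> { }, seen -> seen == version[0]);
        System.out.println("reload passes: " + passes);
    }
}
```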

Val, do you mind updating the ticket with new design?
https://issues.apache.org/jira/browse/IGNITE-7832

D.

On Tue, Mar 13, 2018 at 5:31 PM, Valentin Kulichenko <
valentin.kulichenko@gmail.com> wrote:

> This indeed looks like a bigger issue. Basically, there is no clear way (or
> no way at all) to synchronize the code that listens to the partition loss
> event with the code that calls the resetLostPartitions() method. Example
> scenario:
>
> 1. The cache is configured with 3rd party persistence.
> 2. One or more nodes fail, causing the loss of several partitions in memory.
> 3. Ignite blocks access to those partitions according to the partition loss
> policy and fires an event.
> 4. The application listens to the event and starts reloading the data from
> the store.
> 5. When reloading is complete, the application calls resetLostPartitions()
> to restore access.
> 6. Nodes fail again, causing another partition loss; a new event is fired.
>
> There is a race between steps 5 and 6. If the 2nd failure happens BEFORE
> resetLostPartitions() is called, we end up with inconsistent data.
>
> I believe the only way to fix this is to add the corresponding topology
> version to the partition loss event, and also add it as a parameter for
> resetLostPartitions().
> This way, if resetLostPartitions() is invoked with a version that is no
> longer the latest, the invocation will be ignored.
>
> The only problem with this approach is that the topology version itself is
> currently not part of the public API. It needs to be properly exposed there
> first.
>
> -Val

Re: Partition recovery issue on partition loss.

Posted by Valentin Kulichenko <va...@gmail.com>.

This indeed looks like a bigger issue. Basically, there is no clear way (or
no way at all) to synchronize the code that listens to the partition loss
event with the code that calls the resetLostPartitions() method. Example
scenario:

1. The cache is configured with 3rd party persistence.
2. One or more nodes fail, causing the loss of several partitions in memory.
3. Ignite blocks access to those partitions according to the partition loss
policy and fires an event.
4. The application listens to the event and starts reloading the data from
the store.
5. When reloading is complete, the application calls resetLostPartitions()
to restore access.
6. Nodes fail again, causing another partition loss; a new event is fired.

There is a race between steps 5 and 6. If the 2nd failure happens BEFORE
resetLostPartitions() is called, we end up with inconsistent data.

I believe the only way to fix this is to add the corresponding topology
version to the partition loss event, and also add it as a parameter for
resetLostPartitions().
This way, if resetLostPartitions() is invoked with a version that is no
longer the latest, the invocation will be ignored.

The only problem with this approach is that the topology version itself is
currently not part of the public API. It needs to be properly exposed there
first.
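
The idea can be sketched with a toy model in plain Java. This is not Ignite
API: VersionedResetDemo, the long-valued version and the boolean return value
are all assumptions made for illustration. Each failure bumps the version, the
loss event is assumed to carry it, and a reset with a stale version is ignored.

```java
import java.util.Set;
import java.util.TreeSet;

/** Toy model (not Ignite API) of a reset call guarded by a topology version. */
class VersionedResetDemo {
    private long topologyVersion;                        // bumped on every failure
    private final Set<Integer> lost = new TreeSet<>();

    /** Marks partitions lost and returns the version carried by the loss event. */
    synchronized long nodeFailed(int... partitions) {
        for (int p : partitions) lost.add(p);
        return ++topologyVersion;
    }

    /** Resets lost partitions only if the caller saw the latest topology version. */
    synchronized boolean resetLostPartitions(long seenVersion) {
        if (seenVersion != topologyVersion)
            return false;                                // stale: ignore the call
        lost.clear();
        return true;
    }

    synchronized Set<Integer> lostPartitions() {
        return new TreeSet<>(lost);
    }

    public static void main(String[] args) {
        VersionedResetDemo cache = new VersionedResetDemo();
        long v1 = cache.nodeFailed(1, 2);   // step 3: event carries version v1
        long v2 = cache.nodeFailed(3);      // step 6 happens before the reset
        System.out.println(cache.resetLostPartitions(v1)); // false, call ignored
        System.out.println(cache.lostPartitions());        // [1, 2, 3]
        System.out.println(cache.resetLostPartitions(v2)); // true
    }
}
```

In this model the application would still have to reload all partitions lost
up to v2 before the second reset call; the version check only guarantees that
a reset based on outdated knowledge does not reopen them.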

-Val

On Mon, Mar 12, 2018 at 1:07 PM, Denis Magda <dm...@apache.org> wrote:

> Just in case, here is where you can find the current documentation:
> https://apacheignite.readme.io/docs/cache-modes#partition-loss-policies
>
> Let us know what needs to be updated once the issues reported by you are
> addressed.
>
> --
> Denis

Re: Partition recovery issue on partition loss.

Posted by Denis Magda <dm...@apache.org>.

Just in case, here is where you can find the current documentation:
https://apacheignite.readme.io/docs/cache-modes#partition-loss-policies

Let us know what needs to be updated once the issues reported by you are
addressed.

--
Denis
