Posted to dev@ignite.apache.org by Mikhail Cherkasov <mc...@gridgain.com> on 2017/12/13 17:14:28 UTC

How properly handle IgniteOOM

Hi all,

I ran into a problem: if Ignite runs out of memory and IgniteOOM is
thrown, there's no way to continue working with the cluster.

You cannot remove part of the data to free up space, because during
removal Ignite tries to move pages to a free list, and the free list tries
to acquire more pages, but there's no space left for this.

Ignite cannot revert transactions properly for the same reason.
If IgniteOOM occurs during a transaction, Ignite will try to revert the
already applied changes and, as a result, will move some pages to the free
list. This hits the same problem as above: no space for the free list either.

You cannot even add more nodes, because after rebalancing Ignite will
try to evict pages, which again means we need space for the free list:
https://issues.apache.org/jira/browse/IGNITE-7019

Do you have any ideas on how we can handle this properly?

-- 
Thanks,
Mikhail.

Re: How properly handle IgniteOOM

Posted by Yakov Zhdanov <yz...@apache.org>.

I agree with Alex.

Mikhail, you will have to allocate this "safe buffer" during the prepare step.
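
A minimal sketch of what reserving during the prepare step could look like
when an update touches both a primary and a backup node. NodeBudget and
PrepareStep are hypothetical names used only for illustration, not Ignite
classes, and the sketch assumes every participating node can answer a
reservation request before any change is applied:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for one node's memory budget (not an Ignite class). */
final class NodeBudget {
    private final long capacity;
    private final AtomicLong reserved = new AtomicLong();

    NodeBudget(long capacity) {
        this.capacity = capacity;
    }

    /** Reserves the given number of bytes, or returns false if they do not fit. */
    boolean tryReserve(long bytes) {
        while (true) {
            long cur = reserved.get();
            if (bytes > capacity - cur)
                return false;
            if (reserved.compareAndSet(cur, cur + bytes))
                return true;
        }
    }

    /** Gives back a previously made reservation. */
    void release(long bytes) {
        reserved.addAndGet(-bytes);
    }

    long reserved() {
        return reserved.get();
    }
}

/** Hypothetical prepare step: reserve on every participant or on none. */
final class PrepareStep {
    static boolean prepare(List<NodeBudget> participants, long bytes) {
        List<NodeBudget> done = new ArrayList<>();
        for (NodeBudget node : participants) {
            if (!node.tryReserve(bytes)) {
                // Undo the reservations made so far; no data has been written yet.
                for (NodeBudget d : done)
                    d.release(bytes);
                return false;
            }
            done.add(node);
        }
        return true;
    }
}
```

In this sketch, if the backup cannot reserve, the primary's reservation is
simply released. Nothing has been written at that point, so the failure does
not require reverting applied changes.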

I would add to Alex's idea that each thread allocates its own "safe buffer",
and internal threads do not release this buffer but only enlarge it if
necessary. Of course, if a buffer occasionally grows too large, the thread
should release the extra chunk.
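
A rough sketch of the per-thread buffer idea, under the assumption that the
buffer is counted in pages taken from a shared pool. SafeBuffers, ensure and
trim are made-up names for illustration only and do not exist in Ignite:

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-thread "safe buffer" of reserved pages (illustrative only). */
final class SafeBuffers {
    private final AtomicLong freePages;
    private final long maxPerThread;
    // Each thread keeps its own counter of pages it currently holds.
    private final ThreadLocal<long[]> held = ThreadLocal.withInitial(() -> new long[1]);

    SafeBuffers(long totalPages, long maxPerThread) {
        this.freePages = new AtomicLong(totalPages);
        this.maxPerThread = maxPerThread;
    }

    /** Grows the calling thread's buffer to at least 'pages'; never shrinks it. */
    boolean ensure(long pages) {
        long[] h = held.get();
        long need = pages - h[0];
        if (need <= 0)
            return true; // Already large enough; the buffer is kept as is.
        while (true) {
            long free = freePages.get();
            if (free < need)
                return false; // Shared pool cannot cover the growth.
            if (freePages.compareAndSet(free, free - need)) {
                h[0] = pages;
                return true;
            }
        }
    }

    /** Returns the extra chunk to the pool if the buffer grew beyond the cap. */
    void trim() {
        long[] h = held.get();
        long extra = h[0] - maxPerThread;
        if (extra > 0) {
            h[0] = maxPerThread;
            freePages.addAndGet(extra);
        }
    }

    long heldByCurrentThread() {
        return held.get()[0];
    }

    long freePages() {
        return freePages.get();
    }
}
```

A thread would call ensure(...) before an operation that may need pages and
trim() afterwards; the cap passed to the constructor is an arbitrary tuning
knob in this sketch.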

--Yakov

Re: How properly handle IgniteOOM

Posted by Mikhail Cherkasov <mc...@gridgain.com>.

Alexey,

but what if we have enough memory to save the data on the primary node, but
the backup node does not have enough memory for it?
Then the update will fail on the backup, and we will again need to revert the
transaction on the primary, which means that we need to allocate extra memory
for the free list again.
Do you think your approach will handle this too?

Thanks,
Mike.



On Thu, Dec 14, 2017 at 12:30 PM, Alexey Goncharuk <
alexey.goncharuk@gmail.com> wrote:

> Mikhail,
>
> Here is the first idea that came to my mind. Before a transaction is
> committed (or an atomic update is applied), we have all the entries being
> written at hand. We can estimate the maximum amount of memory required for
> this to happen and make a reservation (one AtomicLong CAS) for this memory.
> If we cannot reserve the memory, we throw the OOME early. This way we should
> never get into a situation where it's too late to give up.
>
> However, this may not be a very easy task, so we probably need to make a
> fast prototype to prove the idea works before we start implementing it
> fully.
>
> --AG



-- 
Thanks,
Mikhail.

Re: How properly handle IgniteOOM

Posted by Alexey Goncharuk <al...@gmail.com>.

Mikhail,

Here is the first idea that came to my mind. Before a transaction is
committed (or an atomic update is applied), we have all the entries being
written at hand. We can estimate the maximum amount of memory required for
this to happen and make a reservation (one AtomicLong CAS) for this memory.
If we cannot reserve the memory, we throw the OOME early. This way we should
never get into a situation where it's too late to give up.

However, this may not be a very easy task, so we probably need to make a
fast prototype to prove the idea works before we start implementing it
fully.
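
A minimal sketch of the reservation step, assuming the region size is known
up front and the estimate is given in bytes. MemoryReservation and its methods
are hypothetical names, not an existing Ignite API:

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical reservation counter for one memory region (not an Ignite class). */
final class MemoryReservation {
    private final long capacity;
    private final AtomicLong reserved = new AtomicLong();

    MemoryReservation(long capacity) {
        this.capacity = capacity;
    }

    /**
     * Tries to reserve the estimated number of bytes before any entry is written.
     * Returns false if the region cannot hold them, so the caller can fail early.
     */
    boolean tryReserve(long bytes) {
        if (bytes < 0)
            throw new IllegalArgumentException("bytes must be non-negative");
        while (true) {
            long cur = reserved.get();
            if (bytes > capacity - cur)
                return false; // Not enough room: give up before changing anything.
            if (reserved.compareAndSet(cur, cur + bytes))
                return true;
            // Another thread changed the counter in between; re-read and retry.
        }
    }

    /** Gives the reservation back once the update is finished or abandoned. */
    void release(long bytes) {
        reserved.addAndGet(-bytes);
    }

    long reserved() {
        return reserved.get();
    }
}
```

Without contention this is a single CAS; the loop only retries when another
thread updates the counter at the same time. A caller would throw the OOME
when tryReserve returns false, before any page has been touched.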

--AG

2017-12-14 12:22 GMT+03:00 Mikhail Cherkasov <mc...@gridgain.com>:

> Hi Denis,
>
> but should we treat the current behavior as a bug that should be fixed ASAP, or
> should we treat it as a known limitation for now?
> Because right now, IgniteOOM means that the whole cluster has to be restarted.
>
> Thanks,
> Mikhail.

Re: How properly handle IgniteOOM

Posted by Mikhail Cherkasov <mc...@gridgain.com>.

Hi Denis,

but should we treat the current behavior as a bug that should be fixed ASAP, or
should we treat it as a known limitation for now?
Because right now, IgniteOOM means that the whole cluster has to be restarted.

Thanks,
Mikhail.

On Thu, Dec 14, 2017 at 2:03 AM, Denis Magda <dm...@apache.org> wrote:

> Hello Mikhail,
>
> This problem is related to the discussion around Ignite internal problems
> and their possible resolution:
> http://apache-ignite-developers.2346864.n4.nabble.com/Internal-problems-requiring-graceful-node-shutdown-reboot-etc-td24856.html
>
> Referring to that discussion, I would define a special IgniteFailureAction
> in response to IgniteOOM (IgniteFailureCause in terms of the new API). The
> action can purge or wipe out the page memory, or perform other extra steps.
>
> —
> Denis


-- 
Thanks,
Mikhail.

Re: How properly handle IgniteOOM

Posted by Denis Magda <dm...@apache.org>.

Hello Mikhail,

This problem is related to the discussion around Ignite internal problems and their possible resolution:
http://apache-ignite-developers.2346864.n4.nabble.com/Internal-problems-requiring-graceful-node-shutdown-reboot-etc-td24856.html

Referring to that discussion, I would define a special IgniteFailureAction in response to IgniteOOM (IgniteFailureCause in terms of the new API). The action can purge or wipe out the page memory, or perform other extra steps.
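
A rough illustration of how a cause could be mapped to an action. The enums and the FailurePolicy class below are hypothetical and only borrow the naming from the discussion (IgniteFailureCause, IgniteFailureAction); they are not an existing API, and the concrete causes and actions are assumptions:

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical failure causes, loosely named after IgniteFailureCause. */
enum FailureCause { IGNITE_OOM, OTHER }

/** Hypothetical actions, loosely named after IgniteFailureAction. */
enum FailureAction { PURGE_PAGE_MEMORY, STOP_NODE }

/** Maps each failure cause to the action that should be taken for it. */
final class FailurePolicy {
    private final Map<FailureCause, FailureAction> actions = new EnumMap<>(FailureCause.class);
    private final FailureAction defaultAction;

    FailurePolicy(FailureAction defaultAction) {
        this.defaultAction = defaultAction;
    }

    /** Registers a special action for one cause; returns this for chaining. */
    FailurePolicy on(FailureCause cause, FailureAction action) {
        actions.put(cause, action);
        return this;
    }

    /** Looks up the action for a cause, falling back to the default. */
    FailureAction actionFor(FailureCause cause) {
        return actions.getOrDefault(cause, defaultAction);
    }
}
```

With such a mapping, IgniteOOM could be bound to a page-memory purge while every other cause keeps the default action.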

—
Denis
