Apache Ignite Developers - Legacy Mail Archive

How properly handle IgniteOOM

Mikhail Cherkasov

How properly handle IgniteOOM

Hi all,

I've run into a problem: if Ignite runs out of memory and an IgniteOOM is
thrown, there's no way to continue working with the cluster.

You cannot remove part of the data to free up space, because during
removal Ignite tries to move pages to a free list, and the free list tries
to acquire more pages, but there's no space left for that.

Ignite cannot roll back transactions properly for the same reason.
If an IgniteOOM occurs during a transaction, Ignite will try to revert the
changes already applied and, as a result, will move some pages to the free
list. That hits the same problem as above: there's no space for the free
list either.

And you cannot even add more nodes, because after rebalancing Ignite will
try to evict pages, which again means we need space for the free list:
https://issues.apache.org/jira/browse/IGNITE-7019

Do you have any ideas on how we can handle this properly?

--
Thanks,
Mikhail.

dmagda

Re: How properly handle IgniteOOM

Hello Mikhail,

This problem is related to the discussion around Ignite internal problems and their possible resolution:
http://apache-ignite-developers.2346864.n4.nabble.com/Internal-problems-requiring-graceful-node-shutdown-reboot-etc-td24856.html

Referring to that discussion, I would define a special IgniteFailureAction in response to IgniteOOM (IgniteFailureCause in terms of the new API). The action could purge or wipe out the page memory, or take some other extra steps.

—
Denis

> On Dec 13, 2017, at 9:14 AM, Mikhail Cherkasov <[hidden email]> wrote:
> [...]

Mikhail Cherkasov

Re: How properly handle IgniteOOM

Hi Denis,

But should we treat the current behavior as a bug that should be fixed ASAP,
or should we treat it as a known limitation for now?
Because right now, an IgniteOOM means that the whole cluster has to be
restarted.

Thanks,
Mikhail.

On Thu, Dec 14, 2017 at 2:03 AM, Denis Magda <[hidden email]> wrote:

> [...]

Alexey Goncharuk

Re: How properly handle IgniteOOM

Mikhail,

Here is the first idea that came to my mind. Before a transaction is
committed (or an atomic update is applied), we have all the entries being
written at hand. We can estimate the maximum amount of memory required for
the write and make a reservation (one AtomicLong CAS) for this memory.
If we cannot reserve the memory, we throw the OOME early. This way we should
never get into a situation where it's too late to give up.
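
A minimal sketch of the reservation idea above, in Java. The class name
MemoryReservation and its methods are hypothetical and not part of Ignite;
the sketch only assumes that a byte estimate for the update is available and
shows how one AtomicLong CAS loop can refuse the update before anything is
written.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical reservation counter; not an Ignite class. */
final class MemoryReservation {
    /** Total number of bytes that may be reserved (assumed to be known up front). */
    private final long capacity;

    /** Bytes currently reserved by in-flight updates. */
    private final AtomicLong reserved = new AtomicLong();

    MemoryReservation(long capacity) {
        this.capacity = capacity;
    }

    /** Tries to reserve {@code bytes}; returns false instead of over-committing. */
    boolean tryReserve(long bytes) {
        if (bytes < 0)
            throw new IllegalArgumentException("bytes must be non-negative");

        while (true) {
            long cur = reserved.get();

            // Written as a subtraction so that cur + bytes cannot overflow.
            if (bytes > capacity - cur)
                return false; // Fail early: nothing has been written yet.

            if (reserved.compareAndSet(cur, cur + bytes))
                return true;
            // Another thread changed the counter; retry with the fresh value.
        }
    }

    /** Returns a reservation (or its unused part) to the pool. */
    void release(long bytes) {
        reserved.addAndGet(-bytes);
    }

    long reserved() {
        return reserved.get();
    }
}
```

A caller would invoke tryReserve with the estimated maximum size before
applying the update, throw the out-of-memory error if it returns false, and
call release once the real allocation is done or the update is abandoned.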

However, this may not be an easy task, so we should probably build a
quick prototype to prove the idea works before we start implementing it
fully.

--AG

2017-12-14 12:22 GMT+03:00 Mikhail Cherkasov <[hidden email]>:

> [...]

Mikhail Cherkasov

Re: How properly handle IgniteOOM

Alexey,

But what if we have enough memory to save the data on the primary node, but
the backup node does not?
Then the update will fail on the backup, and we again need to revert the
transaction on the primary, which means that we need to allocate extra
memory for the free list again.
Do you think your approach will handle this too?

Thanks,
Mike.

On Thu, Dec 14, 2017 at 12:30 PM, Alexey Goncharuk <[hidden email]> wrote:

> [...]

yzhdanov

Re: How properly handle IgniteOOM

I agree with Alex.

Mikhail, you will have to allocate this "safe buffer" during the prepare step.
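
One possible reading of reserving during the prepare step, sketched in Java
under the assumption that every node involved (primary and backups) exposes a
reservation counter. NodeBudget and PreparePhase are hypothetical names, not
Ignite classes. If any node refuses, the reservations already taken are
released and nothing has been applied anywhere, so there is nothing to revert.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-node memory budget; not an Ignite class. */
final class NodeBudget {
    private final long capacity;
    private final AtomicLong reserved = new AtomicLong();

    NodeBudget(long capacity) {
        this.capacity = capacity;
    }

    boolean tryReserve(long bytes) {
        while (true) {
            long cur = reserved.get();

            if (bytes > capacity - cur)
                return false;

            if (reserved.compareAndSet(cur, cur + bytes))
                return true;
        }
    }

    void release(long bytes) {
        reserved.addAndGet(-bytes);
    }

    long reserved() {
        return reserved.get();
    }
}

/** Hypothetical prepare step that reserves memory on all nodes or on none. */
final class PreparePhase {
    static boolean prepare(List<NodeBudget> nodes, long bytes) {
        List<NodeBudget> done = new ArrayList<>();

        for (NodeBudget node : nodes) {
            if (!node.tryReserve(bytes)) {
                // Undo the reservations taken so far; no data was written yet.
                for (NodeBudget d : done)
                    d.release(bytes);

                return false;
            }

            done.add(node);
        }

        return true;
    }
}
```

In this sketch the commit would only start after prepare returns true, which
is the case where both the primary and the backup have room for the update.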

I would add to Alex's idea that each thread allocates its own "safe buffer",
and that internal threads do not release this buffer and only enlarge it if
necessary. Of course, if a buffer occasionally grows too large, the thread
should release the extra chunk.
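
A rough Java sketch of the per-thread "safe buffer" described above. The
class SafeBuffer, its method names, and the MAX_RETAINED threshold are
assumptions made for illustration and are not part of Ignite. Each thread
keeps what it has reserved, enlarges it only when a bigger update arrives,
and gives back the extra chunk when asked to trim.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-thread "safe buffer" on top of a shared budget; not an Ignite class. */
final class SafeBuffer {
    /** Assumed cap on how many bytes a thread keeps between operations. */
    private static final long MAX_RETAINED = 1024;

    private final long capacity;
    private final AtomicLong reserved = new AtomicLong();

    /** Bytes currently held by the calling thread (one-element array used as a mutable cell). */
    private final ThreadLocal<long[]> held = ThreadLocal.withInitial(() -> new long[1]);

    SafeBuffer(long capacity) {
        this.capacity = capacity;
    }

    /** Grows the calling thread's buffer to at least {@code bytes}; false if the budget is exhausted. */
    boolean ensure(long bytes) {
        long[] h = held.get();
        long extra = bytes - h[0];

        if (extra <= 0)
            return true; // The existing buffer is large enough, reuse it.

        while (true) {
            long cur = reserved.get();

            if (extra > capacity - cur)
                return false;

            if (reserved.compareAndSet(cur, cur + extra)) {
                h[0] = bytes;

                return true;
            }
        }
    }

    /** Releases the chunk above MAX_RETAINED if the buffer has grown too large. */
    void trim() {
        long[] h = held.get();

        if (h[0] > MAX_RETAINED) {
            reserved.addAndGet(MAX_RETAINED - h[0]);
            h[0] = MAX_RETAINED;
        }
    }

    long heldByThisThread() {
        return held.get()[0];
    }

    long totalReserved() {
        return reserved.get();
    }
}
```

Keeping the buffer between operations means a thread normally touches the
shared counter only when it needs to grow, at the cost of some memory that
stays reserved while the thread is idle.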

--Yakov