Hi all,
I ran into a problem: if Ignite runs out of memory and an IgniteOOM is thrown, there is no way to continue working with the cluster.

You cannot remove part of the data to free some space, because during removal Ignite tries to move pages to a free list, and the free list tries to acquire more pages, but there is no space left for this.

Ignite cannot revert transactions properly for the same reason. If an IgniteOOM occurs during a transaction, Ignite tries to revert the changes already applied and, as a result, moves some pages to the free list. This hits the same problem as above: there is no space for the free list either.

You cannot even add more nodes, because after rebalancing Ignite tries to evict pages, which again requires space for the free list:
https://issues.apache.org/jira/browse/IGNITE-7019

Do you have any ideas on how we can handle this properly?

--
Thanks,
Mikhail.
Hello Mikhail,
This problem is related to the discussion about Ignite internal problems and their possible resolution:
http://apache-ignite-developers.2346864.n4.nabble.com/Internal-problems-requiring-graceful-node-shutdown-reboot-etc-td24856.html

Referring to that discussion, I would define a special IgniteFailureAction in response to IgniteOOM (an IgniteFailureCause in terms of the new API). The action can purge or wipe out the page memory, or take other extra steps.

--
Denis

(In reply to Mikhail Cherkasov <[hidden email]>, Dec 13, 2017, 9:14 AM)
Hi Denis,
But should we treat the current behavior as a bug that should be fixed ASAP, or should we treat it as a known limitation for now? As it stands, an IgniteOOM means that the whole cluster has to be restarted.

Thanks,
Mikhail.

(In reply to Denis Magda <[hidden email]>, Thu, Dec 14, 2017, 2:03 AM)
Mikhail,
Here is the first idea that came to my mind. Before a transaction is committed (or an atomic update is applied), we have all the entries being written at hand. We can estimate the maximum amount of memory the write requires and reserve that memory up front (one AtomicLong CAS). If we cannot reserve the memory, we throw the OOME early. This way we should never get into a situation where it is too late to give up.

However, this may not be a very easy task, so we probably need to build a quick prototype to prove that the idea works before we start implementing it fully.

--AG

(In reply to Mikhail Cherkasov <[hidden email]>, 2017-12-14 12:22 GMT+03:00)
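The reservation idea above can be illustrated with a minimal sketch. This is not Ignite code: the class `MemoryReservation` and its methods are hypothetical names invented for illustration, and the size estimate is assumed to be computed by the caller before commit. It only shows how a single AtomicLong CAS loop can refuse a write early instead of failing halfway through it.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper, not part of Ignite: tracks reserved bytes against a fixed capacity. */
final class MemoryReservation {
    private final long capacity;
    private final AtomicLong reserved = new AtomicLong();

    MemoryReservation(long capacity) {
        this.capacity = capacity;
    }

    /**
     * Tries to reserve the estimated number of bytes before any change is applied.
     * Returns false if the reservation does not fit, so the caller can fail early.
     */
    boolean tryReserve(long bytes) {
        if (bytes < 0)
            throw new IllegalArgumentException("bytes must be non-negative");

        while (true) {
            long cur = reserved.get();

            // Compare against the remaining space to avoid overflow in cur + bytes.
            if (bytes > capacity - cur)
                return false;

            if (reserved.compareAndSet(cur, cur + bytes))
                return true;
            // Another thread changed the counter in between: retry.
        }
    }

    /** Returns a previously reserved amount once the write is finished or abandoned. */
    void release(long bytes) {
        reserved.addAndGet(-bytes);
    }

    long reserved() {
        return reserved.get();
    }

    public static void main(String[] args) {
        MemoryReservation mem = new MemoryReservation(100);

        System.out.println(mem.tryReserve(60)); // fits
        System.out.println(mem.tryReserve(50)); // would exceed capacity, rejected early
        mem.release(60);
        System.out.println(mem.tryReserve(100)); // fits again after release
    }
}
```

In this sketch a rejected reservation leaves the counter untouched, so a caller that gets `false` has nothing to revert.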
Alexey,
But what if we have enough memory to save the data on the primary node, but the backup node does not? Then the update will fail on the backup, and we will again need to revert the transaction on the primary, which means allocating extra memory for the free list again. Do you think your approach will handle this too?

Thanks,
Mike.

(In reply to Alexey Goncharuk <[hidden email]>, Thu, Dec 14, 2017, 12:30 PM)
I agree with Alex.
Mikhail, you will have to allocate this "safe buffer" during the prepare step. I would add to Alex's idea that each thread allocates its own "safe buffer", and that internal threads do not release this buffer and only enlarge it if necessary. Of course, if a buffer occasionally grows too large, the thread should release the extra chunk.

--Yakov
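A rough sketch of the per-thread "safe buffer" variant follows. Everything here is an assumption made for illustration, not Ignite API: the class `ThreadSafeBuffers`, the `maxPerThread` threshold, and the idea of calling `trim()` after an operation are all hypothetical. Each thread keeps what it has reserved, grows it only when a larger amount is needed, and gives back only the part above the threshold.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper, not part of Ignite: per-thread reservations from a shared budget. */
final class ThreadSafeBuffers {
    private final long capacity;
    private final long maxPerThread;
    private final AtomicLong reserved = new AtomicLong();

    // Bytes currently held by each thread; a one-element array serves as a mutable cell.
    private final ThreadLocal<long[]> held = ThreadLocal.withInitial(() -> new long[1]);

    ThreadSafeBuffers(long capacity, long maxPerThread) {
        this.capacity = capacity;
        this.maxPerThread = maxPerThread;
    }

    /** Ensures the calling thread holds at least the given bytes, enlarging only if necessary. */
    boolean ensure(long bytes) {
        long[] mine = held.get();
        long extra = bytes - mine[0];

        if (extra <= 0)
            return true; // The existing buffer is already large enough.

        while (true) {
            long cur = reserved.get();

            if (extra > capacity - cur)
                return false; // Not enough shared budget: fail during prepare.

            if (reserved.compareAndSet(cur, cur + extra)) {
                mine[0] = bytes;
                return true;
            }
        }
    }

    /** Keeps the buffer for reuse, releasing only the chunk above maxPerThread. */
    void trim() {
        long[] mine = held.get();
        long excess = mine[0] - maxPerThread;

        if (excess > 0) {
            reserved.addAndGet(-excess);
            mine[0] = maxPerThread;
        }
    }

    long heldByCurrentThread() {
        return held.get()[0];
    }

    long totalReserved() {
        return reserved.get();
    }
}
```

A thread would call `ensure(estimate)` during the prepare step and `trim()` once the operation completes. The buffer is never handed back entirely, so a later revert on the same thread still has reserved space to work with.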