Vladimir Pligin created IGNITE-14248:
----------------------------------------

Summary: Handle exceptions in PartitionReservationManager.onDoneAfterTopologyUnlock properly
Key: IGNITE-14248
URL: https://issues.apache.org/jira/browse/IGNITE-14248
Project: Ignite
Issue Type: Improvement
Components: cache
Affects Versions: 2.9.1
Reporter: Vladimir Pligin

If an exception (or even an Error) is thrown inside the method, the node ends up in an unrecoverable state. Here's an example.
# An exchange is about to finish, so it's time to invalidate partition reservations.
# The exchange thread delegates this to a thread in the management pool.
# The management pool tries to allocate a new thread (maybe the pool is idle and therefore empty).
# Thread creation fails, for example because the ulimit is reached. The error is java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
# The error is logged and no further action is taken.
# The partitions stay reserved forever.

Message:

2021-02-25 05:52:03.242 [exchange-worker-#182] ERROR o.a.i.i.p.q.h.t.PartitionReservationManager - Unexpected exception on start reservations cleanup
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
    at java.base/java.lang.Thread.start0(Native Method)
    at java.base/java.lang.Thread.start(Thread.java:803)
    at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
    at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
    at org.apache.ignite.internal.processors.closure.GridClosureProcessor.runLocal(GridClosureProcessor.java:847)
    at org.apache.ignite.internal.processors.query.h2.twostep.PartitionReservationManager.onDoneAfterTopologyUnlock(PartitionReservationManager.java:323)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onDone(GridDhtPartitionsExchangeFuture.java:2617)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onDone(GridDhtPartitionsExchangeFuture.java:159)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:475)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:1064)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3375)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3194)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
    at java.base/java.lang.Thread.run(Thread.java:834)

Code of PartitionReservationManager.onDoneAfterTopologyUnlock:
{code:java}
@Override public void onDoneAfterTopologyUnlock(final GridDhtPartitionsExchangeFuture fut) {
    try {
        // Must not do anything at the exchange thread. Dispatch to the management thread pool.
        ctx.closure().runLocal(() -> {
            AffinityTopologyVersion topVer = ctx.cache().context().exchange()
                .lastAffinityChangedTopologyVersion(fut.topologyVersion());

            reservations.forEach((key, r) -> {
                if (r != REPLICATED_RESERVABLE && !F.eq(key.topologyVersion(), topVer)) {
                    assert r instanceof GridDhtPartitionsReservation;

                    ((GridDhtPartitionsReservation)r).invalidate();
                }
            });
        }, GridIoPolicy.MANAGEMENT_POOL);
    }
    catch (Throwable e) {
        log.error("Unexpected exception on start reservations cleanup", e);
    }
}
{code}

As I see it, there are two basic approaches:
* kill the node (it's already non-functional at this point)
* try to recover somehow (to be honest, it's not clear how exactly)

This particular OOM situation in fact seems unrecoverable. It's an environment misconfiguration. It would be great to investigate whether potentially recoverable exceptions can be raised inside this block.
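To make the two approaches a bit more concrete, here is a minimal standalone sketch of a dispatch wrapper that does not just log and continue. It is not Ignite code: the class ReservationCleanupSketch, the Outcome enum, and the failureHandler callback are all hypothetical stand-ins, and a plain java.util.concurrent.Executor plays the role of the management pool. Treating RejectedExecutionException as "potentially recoverable" is purely an assumption for illustration; whether any such case actually exists in this block is exactly the open question above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;
import java.util.concurrent.RejectedExecutionException;
import java.util.function.Consumer;

/** Hypothetical stand-in for the dispatch logic; none of these names exist in Ignite. */
class ReservationCleanupSketch {
    /** Hypothetical result of a dispatch attempt. */
    enum Outcome { DISPATCHED, RETRY_LATER, NODE_FAILED }

    /** Stands in for the management pool. */
    private final Executor mgmtPool;

    /** Stands in for whatever would stop or invalidate the node (assumption). */
    private final Consumer<Throwable> failureHandler;

    ReservationCleanupSketch(Executor mgmtPool, Consumer<Throwable> failureHandler) {
        this.mgmtPool = mgmtPool;
        this.failureHandler = failureHandler;
    }

    /** Tries to hand the cleanup task to the pool and reports what happened. */
    Outcome dispatchCleanup(Runnable cleanup) {
        try {
            mgmtPool.execute(cleanup);

            return Outcome.DISPATCHED;
        }
        catch (RejectedExecutionException e) {
            // Assumed to be recoverable here: the caller could reschedule the cleanup.
            return Outcome.RETRY_LATER;
        }
        catch (Throwable e) {
            // E.g. OutOfMemoryError: unable to create native thread.
            // Escalate instead of only logging, so reservations are not silently leaked.
            failureHandler.accept(e);

            return Outcome.NODE_FAILED;
        }
    }

    public static void main(String[] args) {
        List<Throwable> escalated = new ArrayList<>();

        // Simulates a pool that cannot start a worker thread.
        Executor broken = task -> {
            throw new OutOfMemoryError("unable to create native thread");
        };

        ReservationCleanupSketch sketch = new ReservationCleanupSketch(broken, escalated::add);

        Outcome res = sketch.dispatchCleanup(() -> { });

        System.out.println(res + ", escalated=" + escalated.size());
    }
}
```

In this sketch the dispatch never runs the cleanup on the calling thread, which keeps the "must not do anything at the exchange thread" constraint from the original code. What the failure handler actually does (stop the node, mark it as failed, something else) is deliberately left open.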
--
This message was sent by Atlassian Jira
(v8.3.4#803005)