Igniters,
I've created a new IEP [1] to address an important case when the Partition Map Exchange process (for more info on it refer to [2]) hangs for some reason.

If this happens, the user currently has to manually identify the nodes causing PME to hang and take the necessary actions (usually it is enough to stop the hanging nodes to unblock PME).

Identification and stopping of nodes blocking PME can be done automatically by the coordinator node; three scenarios are already described in the corresponding tickets on the IEP page.
But when stopping nodes we should remember the chance of losing partitions: if the nodes identified as blocking PME hold all copies of a partition, that partition will be lost if the coordinator decides to stop all of them unconditionally.

To give the user a choice, I propose to add a new policy to the configuration, PMEHangResolvePolicy, with three options:

- LOG_NOTIFICATION: the coordinator doesn't take any action but logs a clear message with information about the hanging nodes and suggestions on how to fix the situation;
- STOP_NODES_PARTITION_LOSS_SAFE: the coordinator stops hanging nodes only after it checks the affinity distribution and makes sure no partitions will be lost;
- STOP_ALL_HANGING_NODES: the coordinator stops all hanging nodes unconditionally, without any checks against the affinity distribution, so partition loss may happen.

What does the community think of the proposed change? Are there any additional cases not covered by the tickets, or comments about the new policy?

Thanks,
Sergey.

[1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
[2] https://cwiki.apache.org/confluence/display/IGNITE/%28Partition+Map%29+Exchange+-+under+the+hood
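A minimal sketch of how the proposed enum could look (the name and option names come from this proposal; the enum is hypothetical and not part of any existing Ignite API):

```java
// Hypothetical policy enum as proposed in IEP-25; not an existing Ignite API.
enum PMEHangResolvePolicy {
    /** Coordinator only logs hanging nodes and remediation hints; no action taken. */
    LOG_NOTIFICATION,

    /** Coordinator stops hanging nodes only after verifying, against the
     *  affinity distribution, that no partition would lose its last copy. */
    STOP_NODES_PARTITION_LOSS_SAFE,

    /** Coordinator stops all hanging nodes unconditionally; partition loss may happen. */
    STOP_ALL_HANGING_NODES
}
```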
Hi Sergey,
This is probably the most important IEP we have. I am assuming that after this gets fixed, an Ignite cluster will never come to a freezing state.

I propose to name the enum *PmeStopPolicy*. Here are my suggestions:

- NONE - will result in logging the state;
- STOP_PRESERVE_PARTITIONS - nodes will be stopped, as long as every partition has at least one copy in the cluster;
- STOP_ALL - all frozen nodes will be stopped; if partitions are lost, the cluster will enter read-only state and will not serve data for the lost partitions.

I also have some questions:

- Does this policy apply only to the server nodes, or to client nodes as well?
- Can the nodes be automatically restarted?

D.
Dmitriy,
Answering your questions:

1) the policy applies only to servers, as the coordinator never waits for clients in the PME protocol;
2) we cannot restart a hanging node automatically, only stop it. Node restart should be the responsibility of a monitoring system or the end user.

--
Thanks,
Sergey.
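The partition-loss-safe check discussed in this thread could be sketched roughly like this (a simplified illustration with assumed data structures; the actual coordinator logic would work against Ignite's affinity distribution, not a plain map):

```java
import java.util.*;

// Simplified sketch: stopping the hanging nodes is "safe" only if every
// partition keeps at least one owner among the surviving nodes.
class PartitionLossCheck {
    /**
     * @param owners map of partition id to the ids of nodes owning a copy of it.
     * @param hangingNodes ids of nodes the coordinator wants to stop.
     * @return true if no partition would lose all of its copies.
     */
    static boolean safeToStop(Map<Integer, Set<String>> owners, Set<String> hangingNodes) {
        for (Set<String> nodeIds : owners.values()) {
            Set<String> survivors = new HashSet<>(nodeIds);
            survivors.removeAll(hangingNodes);

            if (survivors.isEmpty())
                return false; // Every copy of this partition sits on a hanging node.
        }
        return true;
    }
}
```

Under STOP_NODES_PARTITION_LOSS_SAFE the coordinator would stop nodes only when a check like this returns true; STOP_ALL_HANGING_NODES skips the check entirely.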
On Fri, Jun 22, 2018 at 6:13 AM, Sergey Chugunov <[hidden email]> wrote:

> 1) policy applies only to servers as coordinator never waits for clients in
> PME protocol;

Sergey, to my knowledge, clients do participate in PME (although it would be interesting to find out why). This means that clients can also block and freeze the cluster. Does your solution support automatic unblocking of client nodes as well?

D.
Dmitriy,
I checked the code of *GridDhtPartitionsExchangeFuture*: in its *init* method only server nodes are added to the remaining set, so the coordinator will never wait for clients in order to complete the exchange. So clients are not a problem here.

--
Thanks,
Sergey
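The behavior Sergey describes can be illustrated with a toy version of that filtering step (assumed shapes for illustration only; the real logic lives in GridDhtPartitionsExchangeFuture.init):

```java
import java.util.*;

// Toy model of the exchange "remaining" set: only server nodes are awaited
// by the coordinator, so a slow or frozen client cannot stall PME.
class RemainingNodes {
    static class Node {
        final String id;
        final boolean client;

        Node(String id, boolean client) {
            this.id = id;
            this.client = client;
        }
    }

    /** Mirrors the idea that init() adds only remote server nodes to the remaining set. */
    static Set<String> remaining(List<Node> topology, String coordinatorId) {
        Set<String> rem = new HashSet<>();

        for (Node n : topology)
            if (!n.client && !n.id.equals(coordinatorId))
                rem.add(n.id); // Clients are never added, so they are never waited for.

        return rem;
    }
}
```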