Node-wide and cluster-wide health reflection and reporting


Ilya Kasnacheev
Hello Igniters,

A fair share of problems with Ignite deployments stems from the fact that an
Ignite node doesn't know whether it is working properly or experiencing
problems.
The cluster doesn't know either whether it has malfunctioning nodes and, if
it does, which ones.

For example, in a recent case, when a node fails to deserialize a cache
discovery message batch due to a missing class referenced by a cache store
policy, an exception is logged, and then the client is created without any
caches available to it. The application has no way to figure out the exact
problem, or indeed that there is a problem at all, until some developer looks
at the logs and greps for the Exception (a skill not available to everyone).

Another case: sometimes a Discovery connection is established with a new
node, but the node has trouble creating Communication connections. Right now,
such a node will hang the whole cluster on join because it will try to
re-partition the system cache while the join is in progress. If not for the
system cache, we could accept a node into the topology without checking that
we can open Communication connections to it.

Another trouble is, when node A cannot connect to node B, we can't know for
sure which one is faulty. We would really want to kick faulty nodes out, but
we may end up kicking A when B was the culprit, because we could only use the
internal state of A when making the decision.

My proposal - let's add a check list! It is an object with a number of
boolean fields (or, even better, Futures) which correspond to the various
stages that a node must accomplish during join.
It would look like:
- Discovered nodes and joined the ring - check.
- Could open Communication connections to all other nodes - check.
- Received and populated cache discovery messages - check.

When all check boxes are ticked, the node sends its checklist via the ring.
If a node fails to submit its checklist, it is eventually kicked out of the
topology.
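
A minimal sketch of what such a checklist could look like, in plain Java with
one CompletableFuture per stage (the class name, stage names and methods here
are illustrative assumptions, not existing Ignite API):

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

/** Hypothetical join checklist: one future per stage a node must pass. */
public class JoinChecklist {
    /** Stage name -> future completed (or failed) by the owning subsystem. */
    private final Map<String, CompletableFuture<Void>> stages = new ConcurrentHashMap<>();

    public JoinChecklist(String... stageNames) {
        for (String name : stageNames)
            stages.put(name, new CompletableFuture<>());
    }

    /** Called by the subsystem that finished a stage. */
    public void complete(String stage) {
        stages.get(stage).complete(null);
    }

    /** Called by the subsystem that failed a stage; the cause is kept for reporting. */
    public void fail(String stage, Throwable cause) {
        stages.get(stage).completeExceptionally(cause);
    }

    /** True if every stage completed in time; false on a failed stage or timeout. */
    public boolean allChecked(long timeout, TimeUnit unit) {
        try {
            CompletableFuture
                .allOf(stages.values().toArray(new CompletableFuture[0]))
                .get(timeout, unit);

            return true;
        }
        catch (Exception e) { // Failed stage, timeout or interrupt.
            return false;
        }
    }
}

A node would create it as, say, new JoinChecklist("ring-joined",
"communication-opened", "cache-messages-applied") and send the result via
the ring only when allChecked(...) returns true; otherwise it reports which
stages failed.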

We can have a configuration setting specifying whether a node which failed
some checks may still join.
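
A rough sketch of that knob (the allowJoinWithFailedChecks name is made up
for illustration, not an existing IgniteConfiguration property):

/** Hypothetical setting for admitting nodes with failed non-critical checks. */
public class HealthCheckConfiguration {
    /** If true, a node that failed some checks is still admitted to topology. */
    private boolean allowJoinWithFailedChecks;

    public boolean isAllowJoinWithFailedChecks() {
        return allowJoinWithFailedChecks;
    }

    public void setAllowJoinWithFailedChecks(boolean allowJoinWithFailedChecks) {
        this.allowJoinWithFailedChecks = allowJoinWithFailedChecks;
    }
}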

Moreover, we can continue to monitor various health-related metrics in one
place.

E.g., have any tasks run in the striped pool in the last five minutes? Do we
have alive threads in the striped pool? Is the discovery thread running? And
similarly for other kinds of threads and pools. We already have some of these
checks, but their results should be gathered in one place.
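
A sketch of how such a check could work, assuming the monitored pool can
report task executions to a monitor (all names here are illustrative, not
existing Ignite internals):

import java.util.Collection;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical liveness check for a worker pool, reporting to one place. */
public class PoolLivenessMonitor {
    /** Nano-timestamp of the last task seen in the monitored pool. */
    private final AtomicLong lastTaskNanos = new AtomicLong(System.nanoTime());

    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    /** The monitored pool calls this from every executed task. */
    public void onTaskExecuted() {
        lastTaskNanos.set(System.nanoTime());
    }

    /** Checks every minute that tasks ran recently and pool threads are alive. */
    public void start(Collection<Thread> poolThreads, long windowMinutes) {
        scheduler.scheduleAtFixedRate(() -> {
            boolean recentTasks = System.nanoTime() - lastTaskNanos.get()
                < TimeUnit.MINUTES.toNanos(windowMinutes);

            boolean allAlive = poolThreads.stream().allMatch(Thread::isAlive);

            if (!recentTasks || !allAlive) {
                // Feed the result into the single health-reporting place
                // proposed above instead of just logging it.
                System.err.println("Pool health check failed: recentTasks="
                    + recentTasks + ", allAlive=" + allAlive);
            }
        }, 1, 1, TimeUnit.MINUTES);
    }
}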

This will save us from the scenario where an Ignite instance throws an Error
in a thread, the thread is stopped and never restored, and the node is in an
unusable state, but that is not apparent to the cluster.

In the future we could even fix some of these problems, for example by
restarting failed threads. But we need to measure this first.
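
The detection half could use the standard Java uncaught-exception hook; a
sketch, with a made-up class name (the restart part would come later, after
measuring):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical reporter that surfaces worker threads killed by an Error. */
public class WorkerDeathReporter implements Thread.UncaughtExceptionHandler {
    /** Dead thread name -> the Throwable that killed it. */
    private final Map<String, Throwable> deadThreads = new ConcurrentHashMap<>();

    @Override public void uncaughtException(Thread t, Throwable e) {
        // Record the death so the health report can expose it to the cluster;
        // a later version could attempt to restart the worker here instead.
        deadThreads.put(t.getName(), e);
    }

    /** Checked by the node-wide health report. */
    public boolean hasDeadThreads() {
        return !deadThreads.isEmpty();
    }
}

Worker threads would be created with setUncaughtExceptionHandler pointing at
such a reporter, so a dead thread shows up in the node's health state instead
of only in the log.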

Maybe something like this already exists and just needs to be made more
visible?


I gather that this should be implemented by people above me in skill.


Regards,

--
Ilya Kasnacheev