Hello Igniters,
A fair share of problems with Ignite deployments stems from the fact that an Ignite node doesn't know whether it is working properly or experiencing problems. The cluster doesn't know either whether it has malfunctioning nodes and, if so, which ones.

For example, in a recent case a node failed to deserialize a cache discovery message batch because a class referenced by the cache store policy was missing: an exception was logged, and the client was then created without any caches available to it. The application has no way to figure out the exact problem, or even that there is a problem, until some developer looks at the logs and greps for the exception (which is also a skill not everyone has).

Another case: sometimes a Discovery connection is established with a new node, but that node has trouble creating Communication connections. Right now, such a node will hang the whole cluster on join, because re-partitioning of the system cache will be in progress. If not for the system cache, we could accept a node into the topology without checking that we can open Communication connections to it.

Another problem: when node A cannot connect to node B, we can't know for sure which one is faulty. We would really like to kick faulty nodes out, but we may end up kicking A when B was the culprit, because we can only use the internal state of A when making the decision.

My proposal: let's add a check list! It would be an object with a set of boolean fields (or, even better, futures) corresponding to the stages a node must complete during join. It would look like:

- Discovered nodes and joined the ring - check.
- Could open Communication connections to all other nodes - check.
- Received and processed cache discovery messages - check.

When all the boxes are ticked, the node sends its check list over the ring. If a node fails to submit its check list, it is eventually kicked out of the topology. We could have a configuration setting specifying whether a node that failed some checks may still join. (A rough sketch of such a check list object is at the end of this message.)

Moreover, we could keep monitoring various health-related metrics in one place. For example: have any tasks run in the striped pool in the last five minutes? Are there alive threads in the striped pool? Is the discovery thread running? Similar questions apply to other kinds of threads and pools. We already have some of these checks, but their results should be gathered in one place. This would save us from the scenario where an Ignite instance throws an Error in a thread, the thread is stopped and never restored, and the node is left in an unusable state that is not apparent to the cluster. In the future we could even fix some of these problems, for example by restarting failed threads, but we need to measure first.

Maybe there is already something like this, and it just needs to be more visible? I gather that this should be implemented by people more skilled than I am.

Regards,

-- 
Ilya Kasnacheev
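P.S. Purely for illustration, here is a minimal sketch of what such a check list object could look like. All names (JoinCheckList, the stage keys, allChecksPassed) are hypothetical and do not exist in Ignite today; the sketch just keeps one CompletableFuture per join stage, so subsystems can both tick the boxes and wait on them.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the proposed check list; not an existing Ignite API. */
public class JoinCheckList {
    /** One future per join stage. */
    private final Map<String, CompletableFuture<Void>> stages = new LinkedHashMap<>();

    public JoinCheckList() {
        stages.put("DISCOVERED_AND_JOINED_RING", new CompletableFuture<>());
        stages.put("COMMUNICATION_OPENED_TO_ALL_NODES", new CompletableFuture<>());
        stages.put("CACHE_DISCOVERY_MESSAGES_PROCESSED", new CompletableFuture<>());
    }

    /** Called by the subsystem that completed a stage (ticks the box). */
    public void complete(String stage) {
        stages.get(stage).complete(null);
    }

    /** Called by the subsystem that detected a failure in a stage. */
    public void fail(String stage, Throwable reason) {
        stages.get(stage).completeExceptionally(reason);
    }

    /**
     * Waits until every box is ticked. The result would then be sent over the
     * ring; a node that never submits its check list is eventually kicked out.
     */
    public boolean allChecksPassed(long timeout, TimeUnit unit) {
        try {
            CompletableFuture
                .allOf(stages.values().toArray(new CompletableFuture[0]))
                .get(timeout, unit);
            return true;
        }
        catch (Exception e) {
            return false; // Some stage failed, or the timeout expired.
        }
    }
}

The same object could later host the ongoing health checks (striped pool liveness, discovery thread, etc.), so the results are gathered in one place as suggested above.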