Guys,
We have an ongoing effort to implement a set of mechanisms for handling critical issues on nodes (IEP-14 [1]).

I have an idea: spread a message about a critical issue through the entire topology and put it into the logs of all nodes. In my view this will add much more clarity. Imagine every node outputs a message to its log - "Critical system thread failed on node XXX [details=...]". This should help a lot with investigations.

Andrey Gura, Alex Goncharuk, what do you think?

--Yakov

[1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling
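A minimal sketch of what is being proposed, shown here with the public IgniteMessaging API purely as a stand-in (a real implementation would more likely ride on internal discovery messages; the topic name and failure details below are made up for illustration):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteLogger;
    import org.apache.ignite.Ignition;

    public class CriticalFailureBroadcast {
        /** Illustrative topic name, not an actual Ignite constant. */
        private static final String TOPIC = "critical-failure";

        public static void main(String[] args) {
            Ignite ignite = Ignition.start();

            IgniteLogger log = ignite.log();

            // Every node registers a local listener at startup and mirrors
            // failures reported by other nodes into its own log.
            ignite.message().localListen(TOPIC, (nodeId, msg) -> {
                log.error("Critical system thread failed on node " + nodeId +
                    " [details=" + msg + ']');

                return true; // Keep listening.
            });

            // A failing node would publish its failure details before going down.
            ignite.message().send(TOPIC, "thread=partition-exchanger, err=OutOfMemoryError");
        }
    }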
On Thu, Apr 19, 2018 at 8:19 AM, Yakov Zhdanov <[hidden email]> wrote:
> I have an idea to spread a message about critical issues through the
> entire topology and put it into the logs of all nodes.

Yakov, even though you did not ask me what I think, I really like the idea :)
Sounds helpful and easy to implement.
2018-04-20 5:39 GMT+03:00 Dmitriy Setrakyan <[hidden email]>:

> Yakov, even though you did not ask me what I think, I really like the
> idea :)
P.s. Andrey Kuznetsov corrected me: we have no guarantee that a failed
node is able to notify the cluster.

But

    try {
        sendDiscoveryMessageWithFail(...);
    }
    catch (Exception e) {
        // No-op.
    }

is better than nothing, I think.
Of course, no guarantees, but at least an effort.
--Yakov
It might be useful if it's supported out of the box. However, DevOps engineers and admins usually rely on tools like DynaTrace or Splunk to monitor all the logs, arrange them in a meaningful way, and set up special hooks for particular events. That means that even if an event happens on only one node, the tool will still detect it.

Thus my question is: who is the primary user of this improvement?

--
Denis

On Fri, Apr 20, 2018 at 5:12 AM, Yakov Zhdanov <[hidden email]> wrote:

> Of course, no guarantees, but at least an effort.
On Fri, Apr 20, 2018 at 4:50 AM, Anton Vinogradov <[hidden email]> wrote:
> P.s. Andrey Kuznetsov corrected me: we have no guarantee that a failed
> node is able to notify the cluster. But try { ... } catch { // No-op; }
> is better than nothing, I think.

Agree about the "better than nothing" part, but I do not agree about the
"no-op" in the catch block. We should still log the fact that sending the
failure message failed, and provide the exception stack trace if there is
one.
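A sketch of what Dmitriy suggests, with the hypothetical helper name carried over from Anton's snippet (sendDiscoveryMessageWithFail is not a real Ignite method):

    import org.apache.ignite.IgniteLogger;

    public class FailureNotifier {
        private final IgniteLogger log;

        FailureNotifier(IgniteLogger log) {
            this.log = log;
        }

        /** Tries to notify the cluster; logs instead of silently swallowing on failure. */
        void notifyCluster(Object failureDetails) {
            try {
                sendDiscoveryMessageWithFail(failureDetails);
            }
            catch (Exception e) {
                // Record the failed attempt along with the stack trace, as suggested above.
                log.error("Failed to notify cluster about a local critical failure " +
                    "[details=" + failureDetails + ']', e);
            }
        }

        /** Placeholder for the hypothetical discovery send from Anton's snippet. */
        private void sendDiscoveryMessageWithFail(Object details) throws Exception {
            throw new UnsupportedOperationException("Sketch only.");
        }
    }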
Hi Igniters,
+1 to the idea of sending this failure to a 3rd-party monitoring tool. I also think most users have a favorite monitoring tool and connect all their systems to it. But I'm not sure it is easy to implement.

Sincerely,
Dmitriy Pavlov

Sat, Apr 21, 2018 at 13:09, Dmitriy Setrakyan <[hidden email]> wrote:

> Agree about the "better than nothing" part, but I do not agree about the
> "no-op" in the catch block. We should still log the fact that sending the
> failure message failed.
Hello Denis!
In my opinion, the primary users of this improvement will be developers who, at the testing and pre-production stage, encounter errors only when trying production-size clusters.

This means they end up with a dozen log files and no idea where to start looking. Since it's non-production, DevOps expertise is often unavailable or limited at this point.

This is the pattern we repeatedly see on this mailing list, on SO, and elsewhere.

Regards,

--
Ilya Kasnacheev

2018-04-21 1:20 GMT+03:00 Denis Magda <[hidden email]>:

> Thus my question is: who is the primary user of this improvement?
Ilya,
adding a message that is sent to all other nodes still doesn't make the mentioned task easier: you still need to understand where to find the problem description and what exactly to look for. The only case where it helps is when NoOpFailureHandler is used, because the node can just hang while remaining in the topology, so any diagnostics will be painful.

I'm not sure that we should send any cluster-wide message about critical errors. But I don't have enough arguments against such behaviour either.

On Mon, Apr 23, 2018 at 6:14 PM, Ilya Kasnacheev <[hidden email]> wrote:

> In my opinion, the primary users of this improvement will be developers
> who, at the testing and pre-production stage, encounter errors only when
> trying production-size clusters.
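For context, a minimal sketch of opting into that behaviour, assuming the IEP-14 configuration API as drafted (NoOpFailureHandler and setFailureHandler come from the IEP and may change before release):

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.failure.NoOpFailureHandler;

    public class NoOpHandlerStartup {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // With a no-op handler the node keeps running (and stays in the
            // topology) after a critical failure. This is exactly the
            // hard-to-diagnose case described above, where a cluster-wide
            // notification would be most valuable.
            cfg.setFailureHandler(new NoOpFailureHandler());

            Ignition.start(cfg);
        }
    }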