Topology-wide notification on critical errors


Topology-wide notification on critical errors

yzhdanov
Guys,

We have an activity underway to implement a set of mechanisms for handling
critical issues on nodes (IEP-14 [1]).

I have an idea: spread a message about a critical issue through the entire
topology and put it into the logs of all nodes. In my view, this will add
much more clarity. Imagine all nodes outputting a message like "Critical
system thread failed on node XXX [details=...]" to their logs. This should
help a lot with investigations.

Andrey Gura, Alex Goncharuk what do you think?

--Yakov

[1]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling
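The proposal above could be sketched roughly as follows. This is a hypothetical illustration only: `NodeLog`, `FailureBroadcaster`, and `broadcastCriticalFailure` are made-up names for the sketch, not actual Ignite APIs, and a real implementation would ride on the discovery SPI rather than a plain list of nodes.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: on a critical error, one node broadcasts a message and every
// node in the topology appends it to its own log.
class NodeLog {
    final String nodeId;
    final List<String> lines = new ArrayList<>();

    NodeLog(String nodeId) { this.nodeId = nodeId; }

    void log(String line) { lines.add(line); }
}

class FailureBroadcaster {
    private final List<NodeLog> topology;

    FailureBroadcaster(List<NodeLog> topology) { this.topology = topology; }

    // Deliver the critical-failure message to the log of every node,
    // so any single node's log shows where the failure happened.
    void broadcastCriticalFailure(String failedNodeId, String details) {
        String msg = "Critical system thread failed on node " + failedNodeId
            + " [details=" + details + "]";

        for (NodeLog node : topology)
            node.log(msg);
    }
}
```

With this shape, an investigation can start from whichever node's log happens to be at hand, since each one names the failed node.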

Re: Topology-wide notification on critical errors

dsetrakyan
On Thu, Apr 19, 2018 at 8:19 AM, Yakov Zhdanov <[hidden email]> wrote:

Yakov, even though you did not ask me what I think, I really like the
idea :)

Re: Topology-wide notification on critical errors

Anton Vinogradov-2
Sounds helpful and easy to implement.

2018-04-20 5:39 GMT+03:00 Dmitriy Setrakyan <[hidden email]>:

> On Thu, Apr 19, 2018 at 8:19 AM, Yakov Zhdanov <[hidden email]>
> wrote:
>
> > Guys,
> >
> > We have activity to implement a set of mechanisms to handle critical
> issues
> > on nodes (IEP-14 - [1]).
> >
> > I have an idea to spread message about critical issues to nodes through
> > entire topology and put it to logs of all nodes. In my view this will add
> > much more clarity. Imagine all nodes output message to log - "Critical
> > system thread failed on node XXX [details=...]". This should help a lot
> > with investigations.
> >
> > Andrey Gura, Alex Goncharuk what do you think?
> >
>
> Yakov, even though you did not ask me what I think, but I really like the
> idea :)
>
Reply | Threaded
Open this post in threaded view
|

Re: Topology-wide notification on critical errors

Anton Vinogradov-2
P.S. Andrey Kuznetsov corrected me: we have no guarantee that the failed
node will be able to notify the cluster.

But,

try {
    sendDiscoveryMessageWithFail(...);
}
catch (Exception ignored) {
    // No-op.
}

is better than nothing, I think.

2018-04-20 14:22 GMT+03:00 Anton Vinogradov <[hidden email]>:

> Sounds helpful and easy to implement.
>
> 2018-04-20 5:39 GMT+03:00 Dmitriy Setrakyan <[hidden email]>:
>
>> On Thu, Apr 19, 2018 at 8:19 AM, Yakov Zhdanov <[hidden email]>
>> wrote:
>>
>> > Guys,
>> >
>> > We have activity to implement a set of mechanisms to handle critical
>> issues
>> > on nodes (IEP-14 - [1]).
>> >
>> > I have an idea to spread message about critical issues to nodes through
>> > entire topology and put it to logs of all nodes. In my view this will
>> add
>> > much more clarity. Imagine all nodes output message to log - "Critical
>> > system thread failed on node XXX [details=...]". This should help a lot
>> > with investigations.
>> >
>> > Andrey Gura, Alex Goncharuk what do you think?
>> >
>>
>> Yakov, even though you did not ask me what I think, but I really like the
>> idea :)
>>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Topology-wide notification on critical errors

yzhdanov
Of course, no guarantees, but at least an effort.

--Yakov

Re: Topology-wide notification on critical errors

dmagda
It might be useful if it's supported out of the box. However, DevOps
engineers and admins usually use tools like Dynatrace or Splunk to monitor
all the logs, arrange them in a meaningful way, and set up hooks for
particular events. That means that even if an event happens on only one
node, the tool will still detect it.

So my question is: who is the primary user of this improvement?

--
Denis

On Fri, Apr 20, 2018 at 5:12 AM, Yakov Zhdanov <[hidden email]> wrote:

> Of course, no guarantees, but at least an effort.
>
> --Yakov
>
Reply | Threaded
Open this post in threaded view
|

Re: Topology-wide notification on critical errors

dsetrakyan
In reply to this post by Anton Vinogradov-2
On Fri, Apr 20, 2018 at 4:50 AM, Anton Vinogradov <[hidden email]> wrote:

> P.s. Andrey Kuznetsov, corrected me that we have no warranty that failed
> node able to notify cluster.
>
> But,
>
> try{
>    sendDiscoveryMessageWithFail(...);
> } catch(){
>    // No-op;
> }
>
> is better than nothing, I think.
>

I agree about the "better than nothing" part, but I do not agree about the
"no-op" in the catch block. We should still log the fact that sending the
failure message failed, and provide the exception stack trace if there is
one.
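Dmitriy's suggestion would look roughly like this. It is only a sketch: `sendDiscoveryMessageWithFail` comes from the thread but its signature is assumed, and `BestEffortNotifier` with its logger wiring is invented for illustration, not real Ignite code.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class BestEffortNotifier {
    private static final Logger log =
        Logger.getLogger(BestEffortNotifier.class.getName());

    // Hypothetical send operation: throws if the failed node
    // cannot reach the rest of the cluster.
    interface DiscoverySender {
        void sendDiscoveryMessageWithFail(String details) throws Exception;
    }

    private final DiscoverySender sender;

    BestEffortNotifier(DiscoverySender sender) { this.sender = sender; }

    // Best effort: try to notify the cluster; if the send fails,
    // log the failure with its stack trace instead of swallowing it.
    boolean notifyCluster(String details) {
        try {
            sender.sendDiscoveryMessageWithFail(details);

            return true;
        }
        catch (Exception e) {
            log.log(Level.WARNING,
                "Failed to send critical failure notification: " + details, e);

            return false;
        }
    }
}
```

The returned flag lets the caller know whether the notification went out, while the local log retains the evidence either way.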

Re: Topology-wide notification on critical errors

Dmitriy Pavlov
Hi Igniters,

+1 to the idea of sending this failure to a 3rd-party monitoring tool.

I also think most users have their favorite monitoring tool and connect
all their systems to it.

But I'm not sure it is easy to implement.

Sincerely,
Dmitriy Pavlov

On Sat, Apr 21, 2018 at 13:09, Dmitriy Setrakyan <[hidden email]> wrote:

> On Fri, Apr 20, 2018 at 4:50 AM, Anton Vinogradov <[hidden email]> wrote:
>
> > P.s. Andrey Kuznetsov, corrected me that we have no warranty that failed
> > node able to notify cluster.
> >
> > But,
> >
> > try{
> >    sendDiscoveryMessageWithFail(...);
> > } catch(){
> >    // No-op;
> > }
> >
> > is better than nothing, I think.
> >
>
> Agree about the "better than nothing" part, but do not agree about the
> "no-op" in the catch block. We should still log the fact that sending of
> the failure message failed and provide the exception stack trace if there
> is one.
>
Reply | Threaded
Open this post in threaded view
|

Re: Topology-wide notification on critical errors

Ilya Kasnacheev
In reply to this post by dmagda
Hello Denis!

In my opinion, the primary users of this improvement will be developers
who, at the testing and pre-production stages, encounter errors only when
trying production-size clusters.

This means they end up with a dozen log files and no idea where to start
looking. Since it's non-production, DevOps expertise is often unavailable
or limited at this point.

This is the pattern we repeatedly see on this mailing list, on SO, and
elsewhere.

Regards,

--
Ilya Kasnacheev

2018-04-21 1:20 GMT+03:00 Denis Magda <[hidden email]>:

> It might be useful if it's supported out of the box however usually DevOps
> and admins use tools like DynaTrace or Splunk to monitor all the logs,
> arrange logs in a meaningful way and set up special hooks for particular
> events. It means if an event happens only on 1 node the tool will still
> detect it.
>
> Thus my question is who is a primary user of this improvement?
>
> --
> Denis
>
> On Fri, Apr 20, 2018 at 5:12 AM, Yakov Zhdanov <[hidden email]>
> wrote:
>
> > Of course, no guarantees, but at least an effort.
> >
> > --Yakov
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Topology-wide notification on critical errors

agura
Ilya,

adding a message that is sent to all other nodes still doesn't make the
mentioned task any easier. You still need to understand where to find the
problem description and what exactly to look for.

The only helpful case here is using NoOpFailureHandler, because the node
can just hang while still remaining in the topology, so any diagnostics
will be painful.

I'm not sure that we should send any cluster-wide message about critical
errors, but I don't have enough arguments against such behaviour.

On Mon, Apr 23, 2018 at 6:14 PM, Ilya Kasnacheev
<[hidden email]> wrote:

> Hello Denis!
>
> In my opinion, the primary users of this improvement will be developers,
> who at testing and pre-production stage are encountering errors only when
> trying production-size clusters.
>
> This means they end up with a dozen of log files and have no idea where to
> start looking at. Since it's non-production, DevOps expertise is often
> unavailable or limited at this point.
>
> This is the pattern that we repeatedly see in this maillist and on SO, and
> elsewhere.
>
> Regards,
>
> --
> Ilya Kasnacheev
>
> 2018-04-21 1:20 GMT+03:00 Denis Magda <[hidden email]>:
>
>> It might be useful if it's supported out of the box however usually DevOps
>> and admins use tools like DynaTrace or Splunk to monitor all the logs,
>> arrange logs in a meaningful way and set up special hooks for particular
>> events. It means if an event happens only on 1 node the tool will still
>> detect it.
>>
>> Thus my question is who is a primary user of this improvement?
>>
>> --
>> Denis
>>
>> On Fri, Apr 20, 2018 at 5:12 AM, Yakov Zhdanov <[hidden email]>
>> wrote:
>>
>> > Of course, no guarantees, but at least an effort.
>> >
>> > --Yakov
>> >
>>