Hopeless looping in TcpCommunicationSpi

Ilya Kasnacheev
Hello Igniters,

In the last two weeks I have stumbled three times on looping behavior in
TcpCommunicationSpi.reserveClient(): while (true) {}

One case, for example, involved differing SSL certificates on two nodes,
which led to successful discovery followed by ever-failing communication
(which I fixed). The general problem is that a malfunctioning node never
abandons its attempts to connect, and the rest of the cluster waits forever
for partition map exchange.

Any persistent exception in TcpCommunicationSpi.createTcpClient() will
cause the whole cluster to hang. In degenerate cases the log fills with
megabytes of:

[2017-08-31 18:28:20,787][INFO ][grid-nio-worker-tcp-comm-0-#26%server1%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:45010, rmtAddr=/127.0.0.1:33002]
[2017-08-31 18:28:20,988][INFO ][grid-nio-worker-tcp-comm-1-#27%server1%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:45010, rmtAddr=/127.0.0.1:33004]
[2017-08-31 18:28:21,188][INFO ][grid-nio-worker-tcp-comm-0-#26%server1%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:45010, rmtAddr=/127.0.0.1:33006]

This is causing a lot of trouble, so I propose limiting reserveClient()
to several attempts, after which the last exception should be thrown and
the node should leave the cluster for good.
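
To make the proposal concrete, here is a rough sketch of a bounded retry
loop. It is not the actual TcpCommunicationSpi code: the class
BoundedReserve, the method reserve() and the maxAttempts parameter are
hypothetical names used only for illustration. A null result or an
exception counts as a failed attempt, and the last exception is attached
as the cause once the limit is reached.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper, not part of Ignite: caps the number of connection attempts. */
final class BoundedReserve {
    /**
     * Calls connect up to maxAttempts times. A null result or an exception
     * counts as a failed attempt. After the final failure an IOException is
     * thrown with the last caught exception (if any) as its cause.
     */
    static <T> T reserve(Callable<T> connect, int maxAttempts) throws IOException {
        if (maxAttempts < 1)
            throw new IllegalArgumentException("maxAttempts must be positive: " + maxAttempts);

        Exception last = null;

        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                T client = connect.call();

                if (client != null)
                    return client;
            }
            catch (InterruptedException e) {
                // Do not keep retrying if the thread was asked to stop.
                Thread.currentThread().interrupt();

                throw new IOException("Interrupted while connecting", e);
            }
            catch (Exception e) {
                last = e; // remember the most recent failure for diagnostics
            }
        }

        throw new IOException("Failed to connect after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) throws IOException {
        int[] calls = {0};

        // Simulated connection that returns null twice and succeeds on the third call.
        String client = reserve(() -> ++calls[0] < 3 ? null : "client", 5);

        System.out.println(client + " after " + calls[0] + " attempts");
    }
}
```

What the caller does with the final exception (for example, stopping the
node) is left out of the sketch.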

What do you think?

--
Ilya Kasnacheev

Re: Hopeless looping in TcpCommunicationSpi

yzhdanov
Ilya, can you please provide more details? Is it a client or a server that
fails to connect?

--Yakov

Re: Hopeless looping in TcpCommunicationSpi

Ilya Kasnacheev
Hello Yakov,

This is a client that repeatedly fails (for some spurious reason) after it
connects to the socket in createTcpClient().

createTcpClient() then returns null and reserveClient() retries. Forever.

I've prepared a sample pull request to reproduce and remedy the problem:
https://github.com/apache/ignite/pull/2575


--
Ilya Kasnacheev


Re: Hopeless looping in TcpCommunicationSpi

yzhdanov
Ilya, I reviewed your test and I now see the point.

Is the same possible with server-server? Or will the local node initiate
the failure process for the remote node?

Will the following work for the client? If a connection fails, the client
pauses for 1 sec with Thread.sleep() and makes another attempt. If
maxConnectionAttempts (3 by default) attempts fail, the client stops and
restarts itself.
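
A rough sketch of this scheme, for discussion only. The class PausedRetry,
the method connect() and the onGiveUp callback are hypothetical names and
not Ignite API; the values 3 and 1000 ms are the ones from the message
above. The stop-and-restart step is reduced to a callback because how a
client would restart itself is not specified here.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch, not Ignite code: pause between attempts, then give up. */
final class PausedRetry {
    /**
     * Runs tryConnect until it reports success or maxConnectionAttempts
     * attempts have failed, sleeping pauseMillis between attempts.
     * When all attempts fail, onGiveUp runs (for example, to stop and
     * restart the client).
     *
     * @return true if a connection attempt succeeded.
     */
    static boolean connect(BooleanSupplier tryConnect, int maxConnectionAttempts,
        long pauseMillis, Runnable onGiveUp) throws InterruptedException {
        for (int attempt = 1; attempt <= maxConnectionAttempts; attempt++) {
            if (tryConnect.getAsBoolean())
                return true;

            if (attempt < maxConnectionAttempts)
                Thread.sleep(pauseMillis); // no pause after the final failure
        }

        onGiveUp.run();

        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated connection that always fails: 3 attempts, 1 second pause.
        boolean ok = connect(() -> false, 3, 1000L,
            () -> System.out.println("giving up, restarting client"));

        System.out.println("connected=" + ok);
    }
}
```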

--Yakov

Re: Hopeless looping in TcpCommunicationSpi

Ilya Kasnacheev
Yes, the same problem is possible with server-server communication. I've
amended my pull request to include a test for that.

There are already Thread.sleep() calls in this loop, so I propose adding a
countdown to maxConnectionAttempts instead of while (true).

Ideally we should also find a way to log the last error, since such issues
are hard to debug otherwise.

--
Ilya Kasnacheev


Re: Hopeless looping in TcpCommunicationSpi

yzhdanov
Ilya, go ahead with the plan you suggested.

--Yakov