You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@ignite.apache.org by VeenaMithare <v....@cmcmarkets.com> on 2021/01/15 13:18:17 UTC

Regarding Connection timed out observed during client startup

Hello,

We see this behaviour in our client startup :
Client 1 -

Client1.txt
<http://apache-ignite-users.70518.x6.nabble.com/file/t2757/Client1.txt>  
Server003 log -

SERVER3.TXT
<http://apache-ignite-users.70518.x6.nabble.com/file/t2757/SERVER3.TXT>  

1. The client join the cluster around (2021-01-14T16:16:29) in both
client1.log and server003.log. This started the partition map exchange . 
2. The client1.log shows BLOCKED SYSTEM CRITICAL THREAD : 
2021-01-14T16:16:44,472 ERROR o.a.i.i.u.t.G
[grid-timeout-worker-#21%InstanceName%]: Blocked system-critical thread has
been detected. This can lead to cluster-wide undefined behaviour
[workerName=partition-exchanger,
threadName=exchange-worker-#37%InstanceName%, blockedFor=14s]

After this , we see the connection timesout  to machinename003. 
021-01-14T16:16:45,017 WARN  o.a.i.s.c.t.TcpCommunicationSpi
[exchange-worker-#37%InstanceName%]: Connection timed out (will stop
attempts to perform the connect) [node=7b9d8c6f-814c-4cb6-9822-8fa3d7f79eb7,
connTimeoutStgy=ExponentialBackoffTimeoutStrategy [maxTimeout=10000,
totalTimeout=15000, startNanos=14942834332684124, currTimeout=10000],
failureDetectionTimeoutEnabled=false, timeout=9938, err=null,
addr=/m.n.o.202:47130]*



3. If I look in to the logs in machinename003, the partition map exchange
finished in 8 seconds. 


2021-01-14T16:16:29,671 INFO 
o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture [exchange-worker-#566]:
Exchange timings [startVer=AffinityTopologyVersion [topVer=47,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=47, minorTopVer=0],
stage="Waiting in exchange queue" (0 ms), stage="Exchange parameters
initialization" (0 ms), stage="Determine exchange type" (4 ms),
stage="Exchange done" (4 ms), stage="Total time" (8 ms)]

If so, why was the partition-exchanger blocked on the client ?

4. Inspite of showing connection timeout however, it manages to successfully
connect to machinename003. ( Please note that m.n.o.202, x.y.z.202 are ip
addresses of the same server machinename003 ). 

2021-01-14T16:16:45,026 INFO  o.a.i.s.c.t.TcpCommunicationSpi
[grid-nio-worker-tcp-comm-0-#22%InstanceName%]: Established outgoing
communication connection [locAddr=/a.b.c.21:53607,
rmtAddr=machinename003.cmc.local/x.y.z.202:47130]



Kindly guide us what happened here.. 
=========================
5. Also we have configured TcpCommunicationSpi timeouts as below as per the
recommendation given in : 

http://apache-ignite-users.70518.x6.nabble.com/IgniteSpiOperationTimeoutException-Operation-timed-out-timeoutStrategy-ExponentialBackoffTimeoutStray-tp34196p34377.html

            <bean
class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi"
scope="prototype">
                <property name="connectTimeout" value="5000"/>
                <property name="maxConnectTimeout" value="10000"/>
  
Is the timeout observed because of this setting ?
===============================

6. Our TxTimeoutOnPartitionMapExchange is 1 second
                <property name="TxTimeoutOnPartitionMapExchange"
value="1000"/>

What is the ideal TxTimeoutOnPartitionMapExchange value that should be given
should it be something like 50 milliseconds ?

=====================================
A similar log captured during client 2 startup attached as well. 
Client2.txt
<http://apache-ignite-users.70518.x6.nabble.com/file/t2757/Client2.txt>  

regards,
Veena.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Regarding Connection timed out observed during client startup

Posted by Ilya Kasnacheev <il...@gmail.com>.

Hello!

I'm actually not sure. You can try running the same query in local mode
(query.setLocal(true)) to check if there would be any difference. However,
you will need to run it against a server node as well. When it is run
against client, the client will become reducer, connect to some server
nodes.

I don't think you should be diagnosing this issue by counting "Established
outgoing communication connection" messages.

Regards,
-- 
Ilya Kasnacheev


вт, 26 янв. 2021 г. в 23:38, VeenaMithare <v....@cmcmarkets.com>:

> Hi Ilya,
>
> >>Generally a reducer node will need to connect to all map nodes while
> doing
> SQL query.
>
> >>You will not see this message if connection was already up.
> *Here are examples of 'outgoing communication' when the count was incorrect
> :
> *Example 1 :
> 2021-01-15T17:04:35,905 INFO  o.a.i.s.c.t.TcpCommunicationSpi
> [grid-nio-worker-tcp-comm-0-#60%Instance%]: Established outgoing
> communication connection [locAddr=/a.b.c.153:59010,
> rmtAddr=/a.b.c.199:47130]
> 2021-01-15T17:04:51,682 INFO  o.a.i.s.c.t.TcpCommunicationSpi
> [grid-nio-worker-tcp-comm-1-#61%Instance%]: Established outgoing
> communication connection [locAddr=/a.b.c.153:61374,
> rmtAddr=/a.b.c.201:47130]
> /2021-01-15T17:04:51,840 INFO  o.a.i.s.c.t.TcpCommunicationSpi
> [grid-nio-worker-tcp-comm-2-#62%Instance%]: Established outgoing
> communication connection [locAddr=/x.y.z.153:52558,
> rmtAddr=machine003.cmc.local/x.y.z.202:47130]/
>
>
> Example 2 :
> 2021-01-15T17:17:25,876 INFO  o.a.i.s.c.t.TcpCommunicationSpi
> [grid-nio-worker-tcp-comm-0-#60%Instance%]: Established outgoing
> communication connection [locAddr=/a.b.c.153:59786,
> rmtAddr=/a.b.c.199:47130]
> 2021-01-15T17:17:41,720 INFO  o.a.i.s.c.t.TcpCommunicationSpi
> [grid-nio-worker-tcp-comm-1-#61%Instance%]: Established outgoing
> communication connection [locAddr=/a.b.c.153:62150,
> rmtAddr=/a.b.c.201:47130]
> /2021-01-15T17:17:41,878 INFO  o.a.i.s.c.t.TcpCommunicationSpi
> [grid-nio-worker-tcp-comm-2-#62%Instance%]: Established outgoing
> communication connection [locAddr=/x.y.z.153:53334,
> rmtAddr=machine003.cmc.local/x.y.z.202:47130]/
>
> *Here is an example of  'outgoing communication' when the count was correct
> :
> *Example 3 :
> 2021-01-15T17:30:12,638 INFO  o.a.i.s.c.t.TcpCommunicationSpi
> [grid-nio-worker-tcp-comm-0-#60%Instance%]: Established outgoing
> communication connection [locAddr=/a.b.c.153:60534,
> rmtAddr=/a.b.c.199:47130]
> 2021-01-15T17:30:28,403 INFO  o.a.i.s.c.t.TcpCommunicationSpi
> [grid-nio-worker-tcp-comm-1-#61%Instance%]: Established outgoing
> communication connection [locAddr=/a.b.c.153:62898,
> rmtAddr=/a.b.c.201:47130]
>
> 1. We are on a 3 node cluster ( a.b.c.199, a.b.c.201, x.y.z.202  ).  When
> the count is right it doesn't even try to make a connection to x.y.z.202 .
> It is happy with the connections made to a.b.c.199, a.b.c.201.
>
> As mentioned in the cases where the count is incorrect the connection to
> x.y.z.202 is made between cache.query and cursor.getall.
>
>
> 2. The cache that is used in the query here is a REPLICATED cache. As per
> this documentation here - the query on a REPLICATED CACHE would not be run
> against multiple server nodes -
> https://apacheignite-sql.readme.io/docs/select
> Would the reducer still be used ?
>
> regards,
> Veena.
>
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>

Re: Regarding Connection timed out observed during client startup

Posted by VeenaMithare <v....@cmcmarkets.com>.

Hi Ilya, 

>>Generally a reducer node will need to connect to all map nodes while doing
SQL query.

>>You will not see this message if connection was already up.
*Here are examples of 'outgoing communication' when the count was incorrect
: 
*Example 1 : 
2021-01-15T17:04:35,905 INFO  o.a.i.s.c.t.TcpCommunicationSpi
[grid-nio-worker-tcp-comm-0-#60%Instance%]: Established outgoing
communication connection [locAddr=/a.b.c.153:59010,
rmtAddr=/a.b.c.199:47130]
2021-01-15T17:04:51,682 INFO  o.a.i.s.c.t.TcpCommunicationSpi
[grid-nio-worker-tcp-comm-1-#61%Instance%]: Established outgoing
communication connection [locAddr=/a.b.c.153:61374,
rmtAddr=/a.b.c.201:47130]
/2021-01-15T17:04:51,840 INFO  o.a.i.s.c.t.TcpCommunicationSpi
[grid-nio-worker-tcp-comm-2-#62%Instance%]: Established outgoing
communication connection [locAddr=/x.y.z.153:52558,
rmtAddr=machine003.cmc.local/x.y.z.202:47130]/


Example 2 :
2021-01-15T17:17:25,876 INFO  o.a.i.s.c.t.TcpCommunicationSpi
[grid-nio-worker-tcp-comm-0-#60%Instance%]: Established outgoing
communication connection [locAddr=/a.b.c.153:59786,
rmtAddr=/a.b.c.199:47130]
2021-01-15T17:17:41,720 INFO  o.a.i.s.c.t.TcpCommunicationSpi
[grid-nio-worker-tcp-comm-1-#61%Instance%]: Established outgoing
communication connection [locAddr=/a.b.c.153:62150,
rmtAddr=/a.b.c.201:47130]
/2021-01-15T17:17:41,878 INFO  o.a.i.s.c.t.TcpCommunicationSpi
[grid-nio-worker-tcp-comm-2-#62%Instance%]: Established outgoing
communication connection [locAddr=/x.y.z.153:53334,
rmtAddr=machine003.cmc.local/x.y.z.202:47130]/

*Here is an example of  'outgoing communication' when the count was correct
: 
*Example 3 :
2021-01-15T17:30:12,638 INFO  o.a.i.s.c.t.TcpCommunicationSpi
[grid-nio-worker-tcp-comm-0-#60%Instance%]: Established outgoing
communication connection [locAddr=/a.b.c.153:60534,
rmtAddr=/a.b.c.199:47130]
2021-01-15T17:30:28,403 INFO  o.a.i.s.c.t.TcpCommunicationSpi
[grid-nio-worker-tcp-comm-1-#61%Instance%]: Established outgoing
communication connection [locAddr=/a.b.c.153:62898,
rmtAddr=/a.b.c.201:47130]

1. We are on a 3 node cluster ( a.b.c.199, a.b.c.201, x.y.z.202  ).  When
the count is right it doesn't even try to make a connection to x.y.z.202 .
It is happy with the connections made to a.b.c.199, a.b.c.201.  

As mentioned in the cases where the count is incorrect the connection to
x.y.z.202 is made between cache.query and cursor.getall. 


2. The cache that is used in the query here is a REPLICATED cache. As per
this documentation here - the query on a REPLICATED CACHE would not be run
against multiple server nodes - 
https://apacheignite-sql.readme.io/docs/select
Would the reducer still be used ?

regards,
Veena.






--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Regarding Connection timed out observed during client startup

Posted by Ilya Kasnacheev <il...@gmail.com>.

Hello!

Generally a reducer node will need to connect to all map nodes while doing
SQL query.

You will not see this message if connection was already up.

Regards,
-- 
Ilya Kasnacheev


вт, 26 янв. 2021 г. в 17:11, VeenaMithare <v....@cmcmarkets.com>:

> Hi Ilya,
>
> >>I don't think it could have affected CQ initial query, since
> queries/cache
> operations are supposed to block until they are successful, but not fail
> silently.
>
> I have another observation related to this
>
> http://apache-ignite-users.70518.x6.nabble.com/2-8-1-Continuous-Query-Initial-query-not-returning-any-result-sometimes-td35167.html
>
> On all the nodes where I got the wrong count :
> Between the time it executes cache.query and cur.getall, I see this log :
> 2021-01-15T15:37:15,167 INFO  o.a.i.s.c.t.TcpCommunicationSpi
> [grid-nio-worker-tcp-comm-2-#62%InstanceName%]: Established outgoing
> communication connection [locAddr=/a.b.c.153:50546,
> rmtAddr=machinename003.cmc.local/a.b.c.202:47130 ]
>
>
> On the nodes where the count has been right, I dont see that line.
>
> Why it is trying to connect to another server from the cluster between the
> cache.query and cur.getall ?
>
> >>The main factor is that you expect your transactions to finish
> successfully. Setting low TxTimeoutOnPartitionMapExchange will lead to
> aborting of all your transactions when PME happens. Why would you want
> that,
> as opposed to the default when transactions finish and PME happens?
>
> Okay. Let me try a few values on this.
>
> regards,
> Veena.
>
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>

Re: Regarding Connection timed out observed during client startup

Posted by VeenaMithare <v....@cmcmarkets.com>.

Hi Ilya, 

>>I don't think it could have affected CQ initial query, since queries/cache
operations are supposed to block until they are successful, but not fail
silently.

I have another observation related to this
http://apache-ignite-users.70518.x6.nabble.com/2-8-1-Continuous-Query-Initial-query-not-returning-any-result-sometimes-td35167.html

On all the nodes where I got the wrong count :
Between the time it executes cache.query and cur.getall, I see this log :
2021-01-15T15:37:15,167 INFO  o.a.i.s.c.t.TcpCommunicationSpi
[grid-nio-worker-tcp-comm-2-#62%InstanceName%]: Established outgoing
communication connection [locAddr=/a.b.c.153:50546,
rmtAddr=machinename003.cmc.local/a.b.c.202:47130 ]


On the nodes where the count has been right, I dont see that line.

Why it is trying to connect to another server from the cluster between the
cache.query and cur.getall ?

>>The main factor is that you expect your transactions to finish
successfully. Setting low TxTimeoutOnPartitionMapExchange will lead to
aborting of all your transactions when PME happens. Why would you want that,
as opposed to the default when transactions finish and PME happens?

Okay. Let me try a few values on this.

regards,
Veena.






--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Regarding Connection timed out observed during client startup

Posted by Ilya Kasnacheev <il...@gmail.com>.

Hello!

I don't think it could have affected CQ initial query, since queries/cache
operations are supposed to block until they are successful, but not fail
silently.

I'm not sure about connectTimeout, please try it and see.

The main factor is that you expect your transactions to finish
successfully. Setting low TxTimeoutOnPartitionMapExchange will lead to
aborting of all your transactions when PME happens. Why would you want
that, as opposed to the default when transactions finish and PME happens?

Regards,
-- 
Ilya Kasnacheev


ср, 20 янв. 2021 г. в 17:50, VeenaMithare <v....@cmcmarkets.com>:

> >>It's possible that you have more than one network/interface and some
> combinations are causing problems. Please try to specify localHost property
> on every node pointing to a current actual external IP address of the node.
>
> This does seem to help the points 1,2,3,4.
> Do you think this could have affected this issue :
>
>
> http://apache-ignite-users.70518.x6.nabble.com/2-8-1-Continuous-Query-Initial-query-not-returning-any-result-sometimes-td35167.html
>
> >>5. Yes, it will do exponential backout based on those values.
> Can the connectTimeout be set up for client side as well ? Would you know
> why the default value is 600,000.
>
> >>6. I'm not sure what is the optimal value, it depends on your env.
> Thanks. My understanding is that if a partition map exchange is detected
> while a transaction is going on , timeout the transaction . In this case
> shouldnt the TxTimeoutOnPartitionMapExchange be kept as small as possible ?
> What are the other factors to consider here.
>
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>

Re: Regarding Connection timed out observed during client startup

Posted by VeenaMithare <v....@cmcmarkets.com>.

>>It's possible that you have more than one network/interface and some
combinations are causing problems. Please try to specify localHost property
on every node pointing to a current actual external IP address of the node.

This does seem to help the points 1,2,3,4. 
Do you think this could have affected this issue : 

http://apache-ignite-users.70518.x6.nabble.com/2-8-1-Continuous-Query-Initial-query-not-returning-any-result-sometimes-td35167.html

>>5. Yes, it will do exponential backout based on those values.
Can the connectTimeout be set up for client side as well ? Would you know
why the default value is 600,000. 

>>6. I'm not sure what is the optimal value, it depends on your env.
Thanks. My understanding is that if a partition map exchange is detected
while a transaction is going on , timeout the transaction . In this case
shouldnt the TxTimeoutOnPartitionMapExchange be kept as small as possible ?
What are the other factors to consider here. 






--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Regarding Connection timed out observed during client startup

Posted by Ilya Kasnacheev <il...@gmail.com>.

Hello!

5. Yes, it will do exponential backout based on those values.
6. I'm not sure what is the optimal value, it depends on your env.

Regards,
-- 
Ilya Kasnacheev


ср, 20 янв. 2021 г. в 14:23, VeenaMithare <v....@cmcmarkets.com>:

> Thanks Ilya,
>
> I will check if adding localhost improves the issue mentioned in point 1,
> 2,
> 3 , 4.
>
> Could you guide on point 5,6 ,
>
> regards,
> Veena.
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>

Re: Regarding Connection timed out observed during client startup

Posted by VeenaMithare <v....@cmcmarkets.com>.

Thanks Ilya, 

I will check if adding localhost improves the issue mentioned in point 1, 2,
3 , 4.

Could you guide on point 5,6 ,

regards,
Veena.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Regarding Connection timed out observed during client startup

Posted by Ilya Kasnacheev <il...@gmail.com>.

Hello!

It seems that you have network problems.

It's possible that you have more than one network/interface and some
combinations are causing problems. Please try to specify localHost property
on every node pointing to a current actual external IP address of the node.

Regards,
-- 
Ilya Kasnacheev


пт, 15 янв. 2021 г. в 16:19, VeenaMithare <v....@cmcmarkets.com>:

> Hello,
>
> We see this behaviour in our client startup :
> Client 1 -
>
> Client1.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/t2757/Client1.txt>
> Server003 log -
>
> SERVER3.TXT
> <http://apache-ignite-users.70518.x6.nabble.com/file/t2757/SERVER3.TXT>
>
> 1. The client join the cluster around (2021-01-14T16:16:29) in both
> client1.log and server003.log. This started the partition map exchange .
> 2. The client1.log shows BLOCKED SYSTEM CRITICAL THREAD :
> 2021-01-14T16:16:44,472 ERROR o.a.i.i.u.t.G
> [grid-timeout-worker-#21%InstanceName%]: Blocked system-critical thread has
> been detected. This can lead to cluster-wide undefined behaviour
> [workerName=partition-exchanger,
> threadName=exchange-worker-#37%InstanceName%, blockedFor=14s]
>
> After this , we see the connection timesout  to machinename003.
> 021-01-14T16:16:45,017 WARN  o.a.i.s.c.t.TcpCommunicationSpi
> [exchange-worker-#37%InstanceName%]: Connection timed out (will stop
> attempts to perform the connect)
> [node=7b9d8c6f-814c-4cb6-9822-8fa3d7f79eb7,
> connTimeoutStgy=ExponentialBackoffTimeoutStrategy [maxTimeout=10000,
> totalTimeout=15000, startNanos=14942834332684124, currTimeout=10000],
> failureDetectionTimeoutEnabled=false, timeout=9938, err=null,
> addr=/m.n.o.202:47130]*
>
>
>
> 3. If I look in to the logs in machinename003, the partition map exchange
> finished in 8 seconds.
>
>
> 2021-01-14T16:16:29,671 INFO
> o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture [exchange-worker-#566]:
> Exchange timings [startVer=AffinityTopologyVersion [topVer=47,
> minorTopVer=0], resVer=AffinityTopologyVersion [topVer=47, minorTopVer=0],
> stage="Waiting in exchange queue" (0 ms), stage="Exchange parameters
> initialization" (0 ms), stage="Determine exchange type" (4 ms),
> stage="Exchange done" (4 ms), stage="Total time" (8 ms)]
>
> If so, why was the partition-exchanger blocked on the client ?
>
> 4. Inspite of showing connection timeout however, it manages to
> successfully
> connect to machinename003. ( Please note that m.n.o.202, x.y.z.202 are ip
> addresses of the same server machinename003 ).
>
> 2021-01-14T16:16:45,026 INFO  o.a.i.s.c.t.TcpCommunicationSpi
> [grid-nio-worker-tcp-comm-0-#22%InstanceName%]: Established outgoing
> communication connection [locAddr=/a.b.c.21:53607,
> rmtAddr=machinename003.cmc.local/x.y.z.202:47130]
>
>
>
> Kindly guide us what happened here..
> =========================
> 5. Also we have configured TcpCommunicationSpi timeouts as below as per the
> recommendation given in :
>
>
> http://apache-ignite-users.70518.x6.nabble.com/IgniteSpiOperationTimeoutException-Operation-timed-out-timeoutStrategy-ExponentialBackoffTimeoutStray-tp34196p34377.html
>
>             <bean
> class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi"
> scope="prototype">
>                 <property name="connectTimeout" value="5000"/>
>                 <property name="maxConnectTimeout" value="10000"/>
>
> Is the timeout observed because of this setting ?
> ===============================
>
> 6. Our TxTimeoutOnPartitionMapExchange is 1 second
>                 <property name="TxTimeoutOnPartitionMapExchange"
> value="1000"/>
>
> What is the ideal TxTimeoutOnPartitionMapExchange value that should be
> given
> should it be something like 50 milliseconds ?
>
> =====================================
> A similar log captured during client 2 startup attached as well.
> Client2.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/t2757/Client2.txt>
>
> regards,
> Veena.
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>