Posted to user@ignite.apache.org by Ray <ra...@cisco.com> on 2018/07/25 10:03:51 UTC

"Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

I have a three-node Ignite 2.6 cluster set up with the following config.

    <bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="segmentationPolicy" value="RESTART_JVM"/>
        <property name="peerClassLoadingEnabled" value="true"/>
        <property name="failureDetectionTimeout" value="60000"/>
        <property name="dataStorageConfiguration">
            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
                <property name="storagePath" value="/data/ignite/persistence"/>
                <property name="walPath" value="/wal"/>
                <property name="walArchivePath" value="/wal/archive"/>
                <property name="defaultDataRegionConfiguration">
                    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                        <property name="name" value="default_Region"/>
                        <property name="initialSize" value="#{100L * 1024 * 1024 * 1024}"/>
                        <property name="maxSize" value="#{460L * 1024 * 1024 * 1024}"/>
                        <property name="persistenceEnabled" value="true"/>
                        <property name="checkpointPageBufferSize" value="#{8L * 1024 * 1024 * 1024}"/>
                    </bean>
                </property>
                <property name="walMode" value="BACKGROUND"/>
                <property name="walFlushFrequency" value="5000"/>
                <property name="checkpointFrequency" value="600000"/>
            </bean>
        </property>
        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="localPort" value="49500"/>
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                        <property name="addresses">
                            <list>
                                <value>node1:49500</value>
                                <value>node2:49500</value>
                                <value>node3:49500</value>
                            </list>
                        </property>
                    </bean>
                </property>
            </bean>
        </property>
        <property name="gridLogger">
            <bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
                <constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
            </bean>
        </property>
    </bean>
</beans>

And I used this command to start the Ignite service on the three nodes.

./ignite.sh -J-Xmx32000m -J-Xms32000m -J-XX:+UseG1GC 
-J-XX:+ScavengeBeforeFullGC -J-XX:+DisableExplicitGC -J-XX:+AlwaysPreTouch 
-J-XX:+PrintGCDetails -J-XX:+PrintGCTimeStamps -J-XX:+PrintGCDateStamps 
-J-XX:+PrintAdaptiveSizePolicy -XX:+PrintGCApplicationStoppedTime 
-XX:+PrintGCApplicationConcurrentTime 
-J-Xloggc:/spare/ignite/log/ignitegc-$(date +%Y_%m_%d-%H_%M).log 
config/persistent-config.xml 
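
For reference, the same configuration can also be expressed through Ignite's
programmatic Java API. The sketch below simply mirrors the XML above and is
not taken from the original post, so treat it as an approximation:

import java.util.Arrays;

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.WALMode;
import org.apache.ignite.logger.log4j2.Log4J2Logger;
import org.apache.ignite.plugin.segmentation.SegmentationPolicy;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class PersistentConfig {
    public static void main(String[] args) throws Exception {
        IgniteConfiguration cfg = new IgniteConfiguration();

        cfg.setSegmentationPolicy(SegmentationPolicy.RESTART_JVM);
        cfg.setPeerClassLoadingEnabled(true);
        cfg.setFailureDetectionTimeout(60_000L);

        // Persistent data region: 100 GB initial, 460 GB max, 8 GB checkpoint page buffer.
        DataRegionConfiguration region = new DataRegionConfiguration()
            .setName("default_Region")
            .setInitialSize(100L * 1024 * 1024 * 1024)
            .setMaxSize(460L * 1024 * 1024 * 1024)
            .setPersistenceEnabled(true)
            .setCheckpointPageBufferSize(8L * 1024 * 1024 * 1024);

        DataStorageConfiguration storage = new DataStorageConfiguration()
            .setStoragePath("/data/ignite/persistence")
            .setWalPath("/wal")
            .setWalArchivePath("/wal/archive")
            .setDefaultDataRegionConfiguration(region)
            .setWalMode(WALMode.BACKGROUND)
            .setWalFlushFrequency(5000)
            .setCheckpointFrequency(600_000L);

        cfg.setDataStorageConfiguration(storage);

        // Static IP finder on the non-default discovery port 49500.
        TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
        ipFinder.setAddresses(Arrays.asList("node1:49500", "node2:49500", "node3:49500"));

        TcpDiscoverySpi disco = new TcpDiscoverySpi();
        disco.setLocalPort(49500);
        disco.setIpFinder(ipFinder);
        cfg.setDiscoverySpi(disco);

        cfg.setGridLogger(new Log4J2Logger("config/ignite-log4j2.xml"));

        Ignition.start(cfg);
    }
}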

When I'm using the Spark DataFrame API to ingest data into this cluster, the
cluster freezes after some time and no new data can be ingested into Ignite.
Both the client (Spark executor) and the servers show the "Unable to await
partitions release latch within timeout: ServerLatch" exception, starting from
line 51834 of the full log, like this:

[2018-07-25T09:45:42,177][WARN
][exchange-worker-#162][GridDhtPartitionsExchangeFuture] Unable to await
partitions release latch within timeout: ServerLatch [permits=2,
pendingAcks=[429edc2b-eb14-414f-a978-9bfe35443c8c,
6783732c-9a13-466f-800a-ad4c8d9be3bf], super=CompletableLatch
[id=exchange, topVer=AffinityTopologyVersion [topVer=239, minorTopVer=0]]]

Here's the full log from the server node that shows the exception.
07-25.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/07-25.zip>  
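
For context, the write path in question (the client being a Spark executor)
typically looks roughly like the sketch below. This is a generic illustration
only; the table name, key field and input path are placeholders, not values
from this thread:

import org.apache.ignite.spark.IgniteDataFrameSettings;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class IgniteDataFrameIngestSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("ignite-ingest-sketch")
            .getOrCreate();

        // Source data; the path is a placeholder.
        Dataset<Row> df = spark.read().json("/path/to/input.json");

        // Write the DataFrame into an Ignite SQL table through the Ignite data source.
        df.write()
            .format(IgniteDataFrameSettings.FORMAT_IGNITE())
            .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), "config/persistent-config.xml")
            .option(IgniteDataFrameSettings.OPTION_TABLE(), "my_table")
            .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS(), "id")
            .mode(SaveMode.Append)
            .save();

        spark.stop();
    }
}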






Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

Posted by Ray <ra...@cisco.com>.
Hello Pavel, 

The PME got stuck again; here are the detailed logs and thread dumps.
node1.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/node1.zip>  
node2.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/node2.zip>  
node3.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/node3.zip>  

This time, the client printed the following log and got stuck.

18/07/31 06:40:48 INFO exchange.time: Started exchange init
[topVer=AffinityTopologyVersion [topVer=98, minorTopVer=0], crd=false,
evt=NODE_JOINED, evtNode=e6b3704f-0e12-44bb-b14f-af9b0b49df4e,
customEvt=null, allowMerge=true]
18/07/31 06:40:49 INFO cache.GridCacheProcessor: Started cache
[name=ignite-sys-cache, id=-2100569601, memoryPolicyName=sysMemPlc,
mode=REPLICATED, atomicity=TRANSACTIONAL, backups=2147483647]
18/07/31 06:40:49 INFO tcp.TcpCommunicationSpi: Established outgoing
communication connection [locAddr=/10.255.201.110:59811,
rmtAddr=001/10.240.161.91:47100]
18/07/31 06:40:49 INFO exchange.time: Finished exchange init
[topVer=AffinityTopologyVersion [topVer=98, minorTopVer=0], crd=false]
18/07/31 06:40:59 WARN internal.diagnostic: Failed to wait for partition map
exchange [topVer=AffinityTopologyVersion [topVer=98, minorTopVer=0],
node=e6b3704f-0e12-44bb-b14f-af9b0b49df4e]. Consider changing
TransactionConfiguration.txTimeoutOnPartitionMapSynchronization to non
default value to avoid this message. Dumping pending objects that might be
the cause:
18/07/31 06:40:59 WARN internal.diagnostic: Ready affinity version:
AffinityTopologyVersion [topVer=-1, minorTopVer=0]
18/07/31 06:40:59 WARN internal.diagnostic: Last exchange future:
GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent
[evtNode=TcpDiscoveryNode [id=e6b3704f-0e12-44bb-b14f-af9b0b49df4e,
addrs=[10.255.201.110, 127.0.0.1], sockAddrs=[rpsj2hsn132/10.255.201.110:0,
/127.0.0.1:0], discPort=0, order=98, intOrder=0,
lastExchangeTime=1533019247506, loc=true, ver=2.6.0#20180709-sha1:5faffcee,
isClient=true], topVer=98, nodeId8=e6b3704f, msg=null, type=NODE_JOINED,
tstamp=1533019248647], crd=TcpDiscoveryNode
[id=0b3ced74-9cc4-4b37-9eea-e0b519bc26fd, addrs=[10.240.161.91, 127.0.0.1],
sockAddrs=[/127.0.0.1:49500, 001/10.240.161.91:49500], discPort=49500,
order=1, intOrder=1, lastExchangeTime=1533019248041, loc=false,
ver=2.6.0#20180709-sha1:5faffcee, isClient=false],
exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion
[topVer=98, minorTopVer=0], discoEvt=DiscoveryEvent
[evtNode=TcpDiscoveryNode [id=e6b3704f-0e12-44bb-b14f-af9b0b49df4e,
addrs=[10.255.201.110, 127.0.0.1], sockAddrs=[rpsj2hsn132/10.255.201.110:0,
/127.0.0.1:0], discPort=0, order=98, intOrder=0,
lastExchangeTime=1533019247506, loc=true, ver=2.6.0#20180709-sha1:5faffcee,
isClient=true], topVer=98, nodeId8=e6b3704f, msg=null, type=NODE_JOINED,
tstamp=1533019248647], nodeId=e6b3704f, evt=NODE_JOINED], added=true,
initFut=GridFutureAdapter [ignoreInterrupts=false, state=DONE, res=true,
hash=118882705], init=true, lastVer=null, partReleaseFut=null,
exchActions=ExchangeActions [startCaches=null, stopCaches=null,
startGrps=[], stopGrps=[], resetParts=null, stateChangeRequest=null],
affChangeMsg=null, initTs=1533019248697, centralizedAff=false,
forceAffReassignment=false, changeGlobalStateE=null, done=false,
state=CLIENT, evtLatch=0, remaining=[07a24ed6-c6c7-4ee1-94cd-5c8ff95d956d,
0b3ced74-9cc4-4b37-9eea-e0b519bc26fd, 8967f541-5627-43a2-ab74-bb822d12f02b],
super=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null,
hash=1778222816]]
18/07/31 06:40:59 WARN cache.GridCachePartitionExchangeManager: First 10
pending exchange futures [total=0]
18/07/31 06:40:59 WARN internal.diagnostic: Last 10 exchange futures (total:
1):
18/07/31 06:40:59 WARN internal.diagnostic: >>>
GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=98,
minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode
[id=e6b3704f-0e12-44bb-b14f-af9b0b49df4e, addrs=[10.255.201.110, 127.0.0.1],
sockAddrs=[rpsj2hsn132/10.255.201.110:0, /127.0.0.1:0], discPort=0,
order=98, intOrder=0, lastExchangeTime=1533019247506, loc=true,
ver=2.6.0#20180709-sha1:5faffcee, isClient=true], done=false]
18/07/31 06:40:59 WARN internal.diagnostic: Latch manager state:
ExchangeLatchManager [serverLatches={}, clientLatches={}]
18/07/31 06:40:59 WARN internal.diagnostic: Pending transactions:
18/07/31 06:40:59 WARN internal.diagnostic: Pending explicit locks:
18/07/31 06:40:59 WARN internal.diagnostic: Pending cache futures:
18/07/31 06:40:59 WARN internal.diagnostic: Pending atomic cache futures:
18/07/31 06:40:59 WARN internal.diagnostic: Pending data streamer futures:
18/07/31 06:40:59 WARN internal.diagnostic: Pending transaction deadlock
detection futures:
18/07/31 06:41:08 WARN cache.GridCachePartitionExchangeManager: Failed to
wait for initial partition map exchange. Possible reasons are:
  ^-- Transactions in deadlock.
  ^-- Long running transactions (ignore if this is the case).
  ^-- Unreleased explicit locks.
18/07/31 06:41:19 WARN internal.diagnostic: Failed to wait for partition map
exchange [topVer=AffinityTopologyVersion [topVer=98, minorTopVer=0],
node=e6b3704f-0e12-44bb-b14f-af9b0b49df4e]. Consider changing
TransactionConfiguration.txTimeoutOnPartitionMapSynchronization to non
default value to avoid this message. Dumping pending objects that might be
the cause:
18/07/31 06:41:19 WARN internal.diagnostic: Ready affinity version:
AffinityTopologyVersion [topVer=-1, minorTopVer=0]
18/07/31 06:41:19 WARN internal.diagnostic: Last exchange future:
GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent
[evtNode=TcpDiscoveryNode [id=e6b3704f-0e12-44bb-b14f-af9b0b49df4e,
addrs=[10.255.201.110, 127.0.0.1], sockAddrs=[rpsj2hsn132/10.255.201.110:0,
/127.0.0.1:0], discPort=0, order=98, intOrder=0,
lastExchangeTime=1533019247506, loc=true, ver=2.6.0#20180709-sha1:5faffcee,
isClient=true], topVer=98, nodeId8=e6b3704f, msg=null, type=NODE_JOINED,
tstamp=1533019248647], crd=TcpDiscoveryNode
[id=0b3ced74-9cc4-4b37-9eea-e0b519bc26fd, addrs=[10.240.161.91, 127.0.0.1],
sockAddrs=[/127.0.0.1:49500, 001/10.240.161.91:49500], discPort=49500,
order=1, intOrder=1, lastExchangeTime=1533019248041, loc=false,
ver=2.6.0#20180709-sha1:5faffcee, isClient=false],
exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion
[topVer=98, minorTopVer=0], discoEvt=DiscoveryEvent
[evtNode=TcpDiscoveryNode [id=e6b3704f-0e12-44bb-b14f-af9b0b49df4e,
addrs=[10.255.201.110, 127.0.0.1], sockAddrs=[rpsj2hsn132/10.255.201.110:0,
/127.0.0.1:0], discPort=0, order=98, intOrder=0,
lastExchangeTime=1533019247506, loc=true, ver=2.6.0#20180709-sha1:5faffcee,
isClient=true], topVer=98, nodeId8=e6b3704f, msg=null, type=NODE_JOINED,
tstamp=1533019248647], nodeId=e6b3704f, evt=NODE_JOINED], added=true,
initFut=GridFutureAdapter [ignoreInterrupts=false, state=DONE, res=true,
hash=118882705], init=true, lastVer=null, partReleaseFut=null,
exchActions=ExchangeActions [startCaches=null, stopCaches=null,
startGrps=[], stopGrps=[], resetParts=null, stateChangeRequest=null],
affChangeMsg=null, initTs=1533019248697, centralizedAff=false,
forceAffReassignment=false, changeGlobalStateE=null, done=false,
state=CLIENT, evtLatch=0, remaining=[07a24ed6-c6c7-4ee1-94cd-5c8ff95d956d,
0b3ced74-9cc4-4b37-9eea-e0b519bc26fd, 8967f541-5627-43a2-ab74-bb822d12f02b],
super=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null,
hash=1778222816]]
18/07/31 06:41:19 WARN cache.GridCachePartitionExchangeManager: First 10
pending exchange futures [total=0]
18/07/31 06:41:19 WARN internal.diagnostic: Last 10 exchange futures (total:
1):
18/07/31 06:41:19 WARN internal.diagnostic: >>>
GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=98,
minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode
[id=e6b3704f-0e12-44bb-b14f-af9b0b49df4e, addrs=[10.255.201.110, 127.0.0.1],
sockAddrs=[rpsj2hsn132/10.255.201.110:0, /127.0.0.1:0], discPort=0,
order=98, intOrder=0, lastExchangeTime=1533019247506, loc=true,
ver=2.6.0#20180709-sha1:5faffcee, isClient=true], done=false]
18/07/31 06:41:19 WARN internal.diagnostic: Latch manager state:
ExchangeLatchManager [serverLatches={}, clientLatches={}]
18/07/31 06:41:19 WARN internal.diagnostic: Pending transactions:
18/07/31 06:41:19 WARN internal.diagnostic: Pending explicit locks:
18/07/31 06:41:19 WARN internal.diagnostic: Pending cache futures:
18/07/31 06:41:19 WARN internal.diagnostic: Pending atomic cache futures:
18/07/31 06:41:19 WARN internal.diagnostic: Pending data streamer futures:
18/07/31 06:41:19 WARN internal.diagnostic: Pending transaction deadlock
detection futures:
18/07/31 06:41:28 WARN cache.GridCachePartitionExchangeManager: Still
waiting for initial partition map exchange
[fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent
[evtNode=TcpDiscoveryNode [id=e6b3704f-0e12-44bb-b14f-af9b0b49df4e,
addrs=[10.255.201.110, 127.0.0.1], sockAddrs=[rpsj2hsn132/10.255.201.110:0,
/127.0.0.1:0], discPort=0, order=98, intOrder=0,
lastExchangeTime=1533019247506, loc=true, ver=2.6.0#20180709-sha1:5faffcee,
isClient=true], topVer=98, nodeId8=e6b3704f, msg=null, type=NODE_JOINED,
tstamp=1533019248647], crd=TcpDiscoveryNode
[id=0b3ced74-9cc4-4b37-9eea-e0b519bc26fd, addrs=[10.240.161.91, 127.0.0.1],
sockAddrs=[/127.0.0.1:49500, 001/10.240.161.91:49500], discPort=49500,
order=1, intOrder=1, lastExchangeTime=1533019248041, loc=false,
ver=2.6.0#20180709-sha1:5faffcee, isClient=false],
exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion
[topVer=98, minorTopVer=0], discoEvt=DiscoveryEvent
[evtNode=TcpDiscoveryNode [id=e6b3704f-0e12-44bb-b14f-af9b0b49df4e,
addrs=[10.255.201.110, 127.0.0.1], sockAddrs=[rpsj2hsn132/10.255.201.110:0,
/127.0.0.1:0], discPort=0, order=98, intOrder=0,
lastExchangeTime=1533019247506, loc=true, ver=2.6.0#20180709-sha1:5faffcee,
isClient=true], topVer=98, nodeId8=e6b3704f, msg=null, type=NODE_JOINED,
tstamp=1533019248647], nodeId=e6b3704f, evt=NODE_JOINED], added=true,
initFut=GridFutureAdapter [ignoreInterrupts=false, state=DONE, res=true,
hash=118882705], init=true, lastVer=null, partReleaseFut=null,
exchActions=ExchangeActions [startCaches=null, stopCaches=null,
startGrps=[], stopGrps=[], resetParts=null, stateChangeRequest=null],
affChangeMsg=null, initTs=1533019248697, centralizedAff=false,
forceAffReassignment=false, changeGlobalStateE=null, done=false,
state=CLIENT, evtLatch=0, remaining=[07a24ed6-c6c7-4ee1-94cd-5c8ff95d956d,
0b3ced74-9cc4-4b37-9eea-e0b519bc26fd, 8967f541-5627-43a2-ab74-bb822d12f02b],
super=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null,
hash=1778222816]]]

And the servers print these logs on node1:
[2018-07-31T05:43:04,781][WARN
][exchange-worker-#162][GridDhtPartitionsExchangeFuture] Unable to await
partitions release latch within timeout: ServerLatch [permits=2,
pendingAcks=[07a24ed6-c6c7-4ee1-94cd-5c8ff95d956d,
8967f541-5627-43a2-ab74-bb822d12f02b], super=CompletableLatch [id=exchange,
topVer=AffinityTopologyVersion [topVer=58, minorTopVer=0]]]
and these logs on node2 and node3
Unable to await partitions release latch within timeout: ClientLatch
[coordinator=TcpDiscoveryNode [id=0b3ced74-9cc4-4b37-9eea-e0b519bc26fd,
addrs=[10.240.161.91, 127.0.0.1], sockAddrs=[/127.0.0.1:49500],
discPort=49500, order=1, intOrder=1, lastExchangeTime=1532940317188,
loc=false, ver=2.6.0#20180709-sha1:5faffcee, isClient=false], ackSent=true,
super=CompletableLatch [id=exchange,
topVer=AffinityTopologyVersion [topVer=90, minorTopVer=0]]]

I did a little research and found that both ClientLatch and CompletableLatch
in ExchangeLatchManager were introduced by this ticket:
https://issues.apache.org/jira/browse/IGNITE-7871

Could this be the cause of PME freezing?
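
For reference, the diagnostic warning above mentions
TransactionConfiguration.txTimeoutOnPartitionMapSynchronization; as far as I
know, the closest setter in the 2.6 Java API is
setTxTimeoutOnPartitionMapExchange. A minimal sketch (not from the original
post) of setting it:

import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.TransactionConfiguration;

public class TxPmeTimeout {
    /** Sketch: bound how long running transactions may block partition map exchange. */
    public static IgniteConfiguration applyTxPmeTimeout(IgniteConfiguration cfg) {
        TransactionConfiguration txCfg = new TransactionConfiguration();

        // Transactions blocking PME longer than 20 seconds get rolled back (value is illustrative).
        txCfg.setTxTimeoutOnPartitionMapExchange(20_000L);

        cfg.setTransactionConfiguration(txCfg);

        return cfg;
    }
}

Note that this only limits how long blocking transactions can delay the
exchange; it does not by itself resolve a latch that is stuck because of a
network problem.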




Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

Posted by Ray <ra...@cisco.com>.
Hello Pavel,

I was able to reproduce this issue and I've attached the DEBUG logs and
thread dumps for the three nodes as you suggested.
Archive.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/Archive.zip>  

This time, there's no "no route to host" exception between server and client
nodes.

Node2 and node3 log "Unable to await partitions release latch within
timeout: ClientLatch" shortly after the cluster starts; node1 doesn't have any
explicit errors.

And the cluster begins to freeze about 20 minutes after the data ingestion
starts.

The attached pictures show the running/parked time slices of the data
streamer threads on each of the three nodes.
You can see that node3 freezes first, then node2 freezes.
So the client can only write to node1, which triggered a lot of rebalancing.

node1.png
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/node1.png>    
node2.png
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/node2.png>  
node3.png
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/node3.png>  

By the time I wrote this post, the data ingestion that usually takes 5 minutes
was still not finished after 1.1 hours.





Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

Posted by Pavel Kovalenko <jo...@gmail.com>.
Hello Ray,

Without explicit errors in the log, it's not so easy to guess what that was.
Because I don't see any errors, it should be a recoverable failure (even if
it takes a long time).
If you have such an option, could you please enable the DEBUG log level
for org.apache.ignite.internal.util.nio.GridTcpNioCommunicationClient
and org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi on the server
nodes?
If such a long PME happens again and again, debug logs from these
classes will give us a lot of useful information to find the exact cause of
such a long communication process.
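
For reference, assuming the log4j2 backend configured earlier in this thread,
those two loggers can simply be added at DEBUG level to
config/ignite-log4j2.xml. A hypothetical programmatic equivalent (it has to
run inside the server node's JVM to take effect) is sketched below:

import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.core.config.Configurator;

public class CommunicationDebugLogging {
    /** Sketch: raise only the communication-related loggers to DEBUG at runtime. */
    public static void enable() {
        Configurator.setLevel("org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi", Level.DEBUG);
        Configurator.setLevel("org.apache.ignite.internal.util.nio.GridTcpNioCommunicationClient", Level.DEBUG);
    }
}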

If a client node has a stable connection to the cluster, it should wait for
the PME until it ends. My message about reconnecting was mostly about the case
where the client's connection to the cluster breaks.
But if the client still doesn't send any data after the PME ends, a thread
dump from the client will be very useful for analyzing why that happened.

Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

Posted by Ray <ra...@cisco.com>.
Hello Pavel, 

Thanks for the explanation, it's been a great help.

Can you take a guess why the PME took a long time due to communication
issues between the server nodes?
From the logs, the "no route to host" exception happened because the servers
can't connect to the client's ports.
But I didn't see any logs indicating network issues between the server
nodes.
I tested connectivity of the communication SPI port (47100 in this case) and
the discovery SPI port (49500 in this case) between the server nodes, and it's
all good.

And on the client (Spark executor) side, there's no exception in the log when
the PME takes a long time to finish.
It just hangs forever.
Spark.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/Spark.log>  




Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

Posted by Pavel Kovalenko <jo...@gmail.com>.
Hello Ray,

It's hard to say that the issue you mentioned is the cause of your problem.
To determine that, it would be very good if you could get thread dumps on the
next such network glitch, from both the server and the client nodes (using
jstack, for example).
I'm not aware of the Ignite Spark DataFrames implementation details, but in
general, whenever any node joins or leaves, the Ignite cluster triggers a
process named PME (Partition Map Exchange).
During this process, all write operations to the cluster are frozen until the
PME ends. In your case, the PME took a long time due to communication
issues between the server nodes (the server nodes were unable to send
acknowledgments to each other to continue the PME, while Discovery worked
well, which is a bit strange). This can explain why you didn't see updates.
In case of a client connection exception, the client node should try to
reconnect to another server node and complete its data write futures with
an exception, so the Spark executor that uses the client node to stream data
to Ignite should catch this network exception and perform a reconnect and
retry of the data batch write.
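
As an aside, besides running jstack <pid> or kill -3 <pid> against the Ignite
and Spark executor processes, a dump can also be captured from inside a JVM
with the standard java.lang.management API. A small self-contained sketch
(not from this thread):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpUtil {
    /** Builds a jstack-like dump of all live threads in the current JVM. */
    public static String dumpThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        StringBuilder sb = new StringBuilder();

        // true/true: include locked monitors and ownable synchronizers so blocked
        // threads show what they are waiting on. ThreadInfo.toString() truncates very
        // deep stack traces, so jstack still gives the most complete picture.
        for (ThreadInfo info : threads.dumpAllThreads(true, true))
            sb.append(info);

        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpThreads());
    }
}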

For more details about the Spark DataFrames implementation in Ignite, you may
ask Nikolay Izhikov (I added his email as a recipient of this letter).


Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

Posted by Ray <ra...@cisco.com>.
Hello Pavel,

Here are the logs for node ids = [429edc2b-eb14-414f-a978-9bfe35443c8c,
6783732c-9a13-466f-800a-ad4c8d9be3bf]. 
6783732c-9a13-466f-800a-ad4c8d9be3bf.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/6783732c-9a13-466f-800a-ad4c8d9be3bf.zip>  
429edc2b-eb14-414f-a978-9bfe35443c8c.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/429edc2b-eb14-414f-a978-9bfe35443c8c.zip>  
I examined the logs and it looks like there's a network issue here, because
there are a lot of "java.net.NoRouteToHostException: No route to host"
exceptions.

I did a little research and found that this ticket may be the cause:
https://issues.apache.org/jira/browse/IGNITE-8739

Will the client (Spark executor in this case) retry the data insert once the
network glitch is resolved, if I apply this patch?





Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

Posted by Pavel Kovalenko <jo...@gmail.com>.
Hello Ray,

According to your attached log, it seems that you have some network problems.
Could you please also share logs from the nodes with temporary ids =
[429edc2b-eb14-414f-a978-9bfe35443c8c, 6783732c-9a13-466f-800a-ad4c8d9be3bf]?
The root cause should be on those nodes.


Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

Posted by userx <ga...@gmail.com>.
Hi Pavel

I am encountering the same issue: it seems like the server has entered an
infinite loop, and every 10 seconds I am seeing the following message.


2018-12-06 15:49:23,188 WARN
[exchange-worker-#122%5b9b0820-ec94-493c-ae58-bc31aac873c6%] {}
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture
- Unable to await partitions release latch within timeout: ClientLatch
[coordinator=TcpDiscoveryNode [id=a8b0f10f-8ad9-45a4-aab3-a0562fd0d202,
addrs=[10.62.21.54, 10.62.44.22, 10.63.216.22, 127.0.0.1],
sockAddrs=[ueu-ip-lapp0002.coresit.msci.org/10.63.216.22:47500,
ueu-ip-lapp0002.mgmt.msci.org/10.62.44.22:47500, /10.62.21.54:47500,
/127.0.0.1:47500], discPort=47500, order=1, intOrder=1,
lastExchangeTime=1544102518093, loc=false, ver=2.6.0#20180710-sha1:669feacc,
isClient=false], ackSent=true, super=CompletableLatch [id=exchange,
topVer=AffinityTopologyVersion [topVer=24, minorTopVer=1]]]


2018-12-06 15:49:33,189 WARN
[exchange-worker-#122%5b9b0820-ec94-493c-ae58-bc31aac873c6%] {}
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture
- Unable to await partitions release latch within timeout: ClientLatch
[coordinator=TcpDiscoveryNode [id=a8b0f10f-8ad9-45a4-aab3-a0562fd0d202,
addrs=[10.62.21.54, 10.62.44.22, 10.63.216.22, 127.0.0.1],
sockAddrs=[ueu-ip-lapp0002.coresit.msci.org/10.63.216.22:47500,
ueu-ip-lapp0002.mgmt.msci.org/10.62.44.22:47500, /10.62.21.54:47500,
/127.0.0.1:47500], discPort=47500, order=1, intOrder=1,
lastExchangeTime=1544102518093, loc=false, ver=2.6.0#20180710-sha1:669feacc,
isClient=false], ackSent=true, super=CompletableLatch [id=exchange,
topVer=AffinityTopologyVersion [topVer=24, minorTopVer=1]]]


The code where it seems to be stuck in an infinite loop is:


if (!localJoinExchange()) {
    try {
        while (true) {
            try {
                releaseLatch.await(waitTimeout, TimeUnit.MILLISECONDS);

                if (log.isInfoEnabled())
                    log.info("Finished waiting for partitions release latch: " + releaseLatch);

                break;
            }
            catch (IgniteFutureTimeoutCheckedException ignored) {
                U.warn(log, "Unable to await partitions release latch within timeout: " + releaseLatch);

                // Try to resend ack.
                releaseLatch.countDown();
            }
        }
    }
    catch (IgniteCheckedException e) {
        U.warn(log, "Stop waiting for partitions release latch: " + e.getMessage());
    }
}

Interestingly, the PME only seems to be stuck on topology version 24,
although the last topology version was 83 because of a new node being added.

Whereas for topology version 23 there was no such 10-second iteration,
so that PME would have completed just fine.

Can you help me out here?




Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

Posted by Pavel Kovalenko <jo...@gmail.com>.
Hello Ray,

I'm glad that your problem was resolved. I just want to add that in the PME
beginning phase we wait for all current client operations to finish; new
operations are frozen until the PME ends. After a node finishes all ongoing
client operations, it counts down the latch that you see in the "Unable to
await" message in the logs. When all nodes finish all their operations, the
exchange latch completes and the PME continues. This latch was added to reach
data consistency on all nodes during the main PME phase (partition information
exchange, affinity calculation, etc.). If you have network throttling between
the client and the servers, it becomes hard to notify a client that its data
streamer operation has finished, and the latch completion process is
slowed down.
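
As a loose illustration of the mechanism described above (this is only a
single-JVM model using a plain CountDownLatch, not Ignite's actual
distributed, acknowledgment-based ExchangeLatchManager):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class PartitionsReleaseLatchModel {
    /** One permit per server node; a node counts down once its ongoing client operations drain. */
    private final CountDownLatch releaseLatch;

    public PartitionsReleaseLatchModel(int serverNodes) {
        releaseLatch = new CountDownLatch(serverNodes);
    }

    /** Called by each node after it has finished all in-flight client operations. */
    public void onNodeOperationsFinished() {
        releaseLatch.countDown();
    }

    /** The exchange worker blocks here; writes stay frozen until every node has counted down. */
    public void awaitBeforePme(long waitTimeoutMs) throws InterruptedException {
        while (!releaseLatch.await(waitTimeoutMs, TimeUnit.MILLISECONDS)) {
            // Mirrors the "Unable to await partitions release latch within timeout" warning:
            // some node has not yet acknowledged that its operations are done.
            System.out.println("Still waiting, pending permits: " + releaseLatch.getCount());
        }
    }
}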


Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

Posted by Ray <ra...@cisco.com>.
The root cause of this issue is network throttling between the client and
the servers.

When I move the clients to run in the same cluster as the servers, there's
no such problem anymore.


