Posted to user@ignite.apache.org by pg31 <si...@gmail.com> on 2020/02/05 14:03:31 UTC

Ignite Load Hangs at the end while using DataStreamer

Hi

Cluster Configuration:
3 Nodes (112 GB Memory / 512 GB Disk)

Ignite Configuration:
1. Persistence Enabled
2. Version: 2.6.0
3. Configuration is as follows:
            <property name="authenticationEnabled" value="true"/>
            <property name="failureDetectionTimeout" value="30000"/>
            <property name="workDirectory" value="/persistence/work"/>
            <property name="dataStorageConfiguration">
                <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
                    <property name="defaultDataRegionConfiguration">
                        <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                            <property name="name" value="Default_Region"/>
                            <property name="persistenceEnabled" value="true"/>
                            <property name="maxSize" value="#{80 * 1024 * 1024 * 1024}"/>
                        </bean>
                    </property>
                    <property name="storagePath" value="/persistence"/>
                    <property name="walPath" value="/wal"/>
                    <property name="walArchivePath" value="/wal/archive"/>
                    <property name="walMode" value="LOG_ONLY"/>
                    <property name="walCompactionEnabled" value="true"/>
                    <property name="walHistorySize" value="2"/>
                </bean>
            </property>


Issue: Data Loading with Spark gets stuck at the end. 
Description:
I am trying to load 65M (million) rows into the Ignite cluster. Everything
runs well until about 64.5 million rows, and then, all of a sudden, the data
ingestion just hangs. (I am able to ingest those 64.5 million rows in about
10 minutes.)

There is still plenty of free memory on all the nodes (approximately 80 GB
remains free on each node).

I am using the following code to ingest data into ignite. 

dataFrame
  .write.format(IgniteDataFrameSettings.FORMAT_IGNITE)
  .option(IgniteDataFrameSettings.OPTION_TABLE, tableName)
  .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE, igniteConfig)
  .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS, primaryKey)
  .option("user", igniteUsername)
  .option("password", ignitePassword)
  .mode(SaveMode.Overwrite) // Overwriting the entire table.
  .save()

I am not sure, but it looks like an Ignite thread is hanging somewhere.

This is what I can see in the thread dumps of all the executors. 


<http://apache-ignite-users.70518.x6.nabble.com/file/t2770/Spark_Thread_Stuck.png> 



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Ignite Load Hangs at the end while using DataStreamer

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

I can see that some data processing is happening in thread dumps, but also
this:

[11:16:11,637][INFO][grid-nio-worker-tcp-comm-2-#26][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.16.1.7:47100,
rmtAddr=/10.139.0.10:38624]
[11:16:12,686][SEVERE][grid-nio-worker-tcp-comm-2-#26][TcpCommunicationSpi]
Failed to process selector key [ses=GridSelectorNioSessionImpl
[worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=2,
bytesRcvd=430031923, bytesSent=2154539, bytesRcvd0=6974058,
bytesSent0=1976, select=true, super=GridWorker
[name=grid-nio-worker-tcp-comm-2, igniteInstanceName=null, finished=false,
heartbeatTs=1581074171663, hashCode=1764437028, interrupted=false,
runner=grid-nio-worker-tcp-comm-2-#26]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
inRecovery=GridNioRecoveryDescriptor [acked=384, resendCnt=0, rcvCnt=422,
sentCnt=413, reserved=true, lastAck=416, nodeLeft=false,
node=TcpDiscoveryNode [id=a66a573a-43dc-48d2-8ee5-232e727acbc9,
addrs=[10.139.64.10, 127.0.0.1], sockAddrs=[/10.139.64.10:0, /127.0.0.1:0],
discPort=0, order=19, intOrder=19, lastExchangeTime=1581073961809,
loc=false, ver=2.7.6#20190911-sha1:21f7ca41, isClient=true],
connected=true, connectCnt=0, queueLimit=4096, reserveCnt=1,
pairedConnections=false], outRecovery=GridNioRecoveryDescriptor [acked=384,
resendCnt=0, rcvCnt=422, sentCnt=413, reserved=true, lastAck=416,
nodeLeft=false, node=TcpDiscoveryNode
[id=a66a573a-43dc-48d2-8ee5-232e727acbc9, addrs=[10.139.64.10, 127.0.0.1],
sockAddrs=[/10.139.64.10:0, /127.0.0.1:0], discPort=0, order=19,
intOrder=19, lastExchangeTime=1581073961809, loc=false,
ver=2.7.6#20190911-sha1:21f7ca41, isClient=true], connected=true,
connectCnt=0, queueLimit=4096, reserveCnt=1, pairedConnections=false],
super=GridNioSessionImpl [locAddr=/172.16.1.7:47100, rmtAddr=/
10.139.0.10:37846, createTime=1581073963095, closeTime=0, bytesSent=78611,
bytesRcvd=104294928, bytesSent0=561, bytesRcvd0=916098,
sndSchedTime=1581073963095, lastSndTime=1581074171592,
lastRcvTime=1581074171612, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParser@672e22f0, directMode=true],
GridConnectionBytesVerifyFilter], accepted=true, markedForClose=false]]]
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:377)
        at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1282)
        at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2386)
        at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2153)
        at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
        at java.lang.Thread.run(Thread.java:748)

[11:16:46,612][INFO][grid-nio-worker-tcp-comm-2-#26][TcpCommunicationSpi]
Received incoming connection from remote node while connecting to this
node, rejecting [locNode=c7e6fc55-d367-43d5-94e9-79ef1d984601,
locNodeOrder=1, rmtNode=a66a573a-43dc-48d2-8ee5-232e727acbc9,
rmtNodeOrder=19]
[11:16:46,928][INFO][grid-nio-worker-tcp-comm-3-#27][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.16.1.7:47100,
rmtAddr=/10.139.0.10:38900]
[11:16:46,985][INFO][grid-nio-worker-tcp-comm-3-#27][TcpCommunicationSpi]
Received incoming connection from remote node while connecting to this
node, rejecting [locNode=c7e6fc55-d367-43d5-94e9-79ef1d984601,
locNodeOrder=1, rmtNode=a66a573a-43dc-48d2-8ee5-232e727acbc9,
rmtNodeOrder=19]
[11:16:47,301][INFO][grid-nio-worker-tcp-comm-0-#24][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.16.1.7:47100,
rmtAddr=/10.139.0.10:38902]
[11:16:47,359][INFO][grid-nio-worker-tcp-comm-0-#24][TcpCommunicationSpi]
Received incoming connection from remote node while connecting to this
node, rejecting [locNode=c7e6fc55-d367-43d5-94e9-79ef1d984601,
locNodeOrder=1, rmtNode=a66a573a-43dc-48d2-8ee5-232e727acbc9,
rmtNodeOrder=19]
[11:16:47,675][INFO][grid-nio-worker-tcp-comm-1-#25][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.16.1.7:47100,
rmtAddr=/10.139.0.10:38904]
[11:16:47,733][INFO][grid-nio-worker-tcp-comm-1-#25][TcpCommunicationSpi]
Received incoming connection from remote node while connecting to this
node, rejecting [locNode=c7e6fc55-d367-43d5-94e9-79ef1d984601,
locNodeOrder=1, rmtNode=a66a573a-43dc-48d2-8ee5-232e727acbc9,
rmtNodeOrder=19]
[11:16:48,049][INFO][grid-nio-worker-tcp-comm-2-#26][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.16.1.7:47100,
rmtAddr=/10.139.0.10:38916]
[11:16:48,106][INFO][grid-nio-worker-tcp-comm-2-#26][TcpCommunicationSpi]
Received incoming connection from remote node while connecting to this
node, rejecting [locNode=c7e6fc55-d367-43d5-94e9-79ef1d984601,
locNodeOrder=1, rmtNode=a66a573a-43dc-48d2-8ee5-232e727acbc9,
rmtNodeOrder=19]
[11:16:48,423][INFO][grid-nio-worker-tcp-comm-3-#27][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.16.1.7:47100,
rmtAddr=/10.139.0.10:38918]
[11:16:48,481][INFO][grid-nio-worker-tcp-comm-3-#27][TcpCommunicationSpi]
Received incoming connection from remote node while connecting to this
node, rejecting [locNode=c7e6fc55-d367-43d5-94e9-79ef1d984601,
locNodeOrder=1, rmtNode=a66a573a-43dc-48d2-8ee5-232e727acbc9,
rmtNodeOrder=19]

This is a bad sign. I think you either have network problems or have maxed
out your communication layer.

I recommend the following configuration changes to TcpCommunicationSpi:

1. socketWriteTimeout: 5000
2. usePairedConnections: true
3. connectionsPerNode: 4

You may also want to assign localAddress to a known-good (reachable) IP
address of the node, on each node.
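As an illustrative sketch, the settings suggested above would look like this
in the Spring XML configuration (the address 172.16.1.7 is only a placeholder
taken from the logs in this thread; substitute each node's actual reachable
address):

```xml
<property name="communicationSpi">
    <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
        <!-- Fail a write instead of blocking forever on a dead connection. -->
        <property name="socketWriteTimeout" value="5000"/>
        <!-- Use separate inbound and outbound connections per node pair. -->
        <property name="usePairedConnections" value="true"/>
        <!-- Allow up to 4 communication connections per remote node. -->
        <property name="connectionsPerNode" value="4"/>
        <!-- Placeholder: bind to this node's known-good, reachable address. -->
        <property name="localAddress" value="172.16.1.7"/>
    </bean>
</property>
```

This fragment goes inside the IgniteConfiguration bean, next to the existing
dataStorageConfiguration property.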

Regards,
-- 
Ilya Kasnacheev


Fri, 7 Feb 2020 at 14:34, pg31 <si...@gmail.com>:


Re: Ignite Load Hangs at the end while using DataStreamer

Posted by pg31 <si...@gmail.com>.
Thanks Ilya. 

I have changed the client-side machine to prefer the IPv4 stack, and that
error went away. But the data-streamer-stripes and tcp-comm-worker threads
still keep getting stuck.

I am attaching the logs again. (These contain the thread dumps themselves.)
log.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t2770/log.zip>  




Re: Ignite Load Hangs at the end while using DataStreamer

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

I think your problem is as follows:

[15:09:53,418][WARNING][disco-event-worker-#37][GridDiscoveryManager] Local
node's value of 'java.net.preferIPv4Stack' system property differs from
remote node's (all nodes in topology should have identical value)
[locPreferIpV4=true, rmtPreferIpV4=null, locId8=7e2a36bf, rmtId8=c311b665,
rmtAddrs=[10.139.64.39/0:0:0:0:0:0:0:1%lo, /10.139.64.39, /127.0.0.1],
rmtNode=ClusterNode [id=c311b665-2392-4542-b98e-f599242c2946, order=20,
addr=[0:0:0:0:0:0:0:1%lo, 10.139.64.39, 127.0.0.1], daemon=false]]

Caused by: java.lang.NullPointerException
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.isLocalNodeAddress(TcpCommunicationSpi.java:3498)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3243)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2987)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2870)
        ... 17 more

I.e., some nodes are advertising IPv6 addresses while other nodes can't use
them at all due to a JVM setting. Please set this system property to 'true'
on all your JVMs.
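As a sketch of that change (JVM_OPTS is the environment variable read by
ignite.sh; for Spark executors the analogous knob would be
spark.executor.extraJavaOptions — verify both against your own launch
scripts):

```shell
# Make every JVM in the topology prefer the IPv4 stack, so all nodes
# advertise and resolve IPv4 addresses consistently.
export JVM_OPTS="$JVM_OPTS -Djava.net.preferIPv4Stack=true"
```

The same flag must be set on server nodes and on every client (including
Spark executors that embed an Ignite client), since the check compares the
property across all nodes in the topology.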

I also think this is related to
https://issues.apache.org/jira/browse/IGNITE-8343 which is regrettably not
fixed.

Regards,
-- 
Ilya Kasnacheev


Thu, 6 Feb 2020 at 18:48, pg31 <si...@gmail.com>:


Re: Ignite Load Hangs at the end while using DataStreamer

Posted by pg31 <si...@gmail.com>.
Hi

I thought I was hitting: (https://issues.apache.org/jira/browse/IGNITE-8564)
which got fixed in 2.7.
I changed the version to 2.7.6 and retried, but it still does not work. 

Here are the logs for 2.7.6
1. Logs.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t2770/Logs.zip>  

Here is the thread dump. (Because of the file size constraint, I had to break it up.)
1.  T11.pdf
<http://apache-ignite-users.70518.x6.nabble.com/file/t2770/T11.pdf>  
2.  T12.pdf
<http://apache-ignite-users.70518.x6.nabble.com/file/t2770/T12.pdf>  
3.  T13.pdf
<http://apache-ignite-users.70518.x6.nabble.com/file/t2770/T13.pdf>  

Please let me know if anything else is required. 




Re: Ignite Load Hangs at the end while using DataStreamer

Posted by pg31 <si...@gmail.com>.
Hi Denis

Thank you for the response.

Please find the thread dumps. (I had to split them into 3 files since they were large.)
1. T1.pdf <http://apache-ignite-users.70518.x6.nabble.com/file/t2770/T1.pdf>  
2. T2.pdf <http://apache-ignite-users.70518.x6.nabble.com/file/t2770/T2.pdf>  
3. T3.pdf <http://apache-ignite-users.70518.x6.nabble.com/file/t2770/T3.pdf>  


Please find the logs attached. (Zipped the file, since it was large)
Logs.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t2770/Logs.zip>  




Re: Ignite Load Hangs at the end while using DataStreamer

Posted by Denis Magda <dm...@apache.org>.
Please attach all the thread dumps and log files for the investigation.

-
Denis


On Wed, Feb 5, 2020 at 8:03 AM pg31 <si...@gmail.com> wrote:
