Posted to user@ignite.apache.org by Jia Zou <ja...@gmail.com> on 2018/11/15 16:33:29 UTC

Run Spark with Ignite Shared RDD on Large Volume of Data

Recently I have been running Spark MLlib KMeans with an Apache Ignite 2.6.0
shared RDD on ten AWS r4.2xlarge workers.
It runs to completion on 1 billion points (which fit in memory), but fails
with 2 billion points (which exceed available memory).
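
For scale, here is a back-of-envelope estimate of why the 2-billion-point
run may not fit. The dimensionality and per-entry overhead below are
illustrative assumptions, not figures from my actual run; the only hard
number is that an r4.2xlarge has 61 GiB of RAM.

```java
// Rough estimate of in-memory footprint for 2 billion points in an
// Ignite cache. Assumptions (NOT from the original post): 100-dimensional
// double vectors and ~200 bytes of per-entry Ignite overhead
// (key, entry header, per-partition bookkeeping).
public class MemoryEstimate {
    public static void main(String[] args) {
        long points = 2_000_000_000L;
        int dims = 100;                      // assumed dimensionality
        long payloadPerPoint = dims * 8L;    // a double is 8 bytes
        long overheadPerEntry = 200L;        // assumed per-entry overhead

        long totalBytes = points * (payloadPerPoint + overheadPerEntry);
        double totalGiB = totalBytes / (1024.0 * 1024 * 1024);

        double clusterGiB = 10 * 61.0;       // ten r4.2xlarge workers, 61 GiB each
        System.out.printf("estimated data: %.0f GiB, cluster RAM: %.0f GiB%n",
                totalGiB, clusterGiB);
        // Under these assumptions the data alone (~1863 GiB) is roughly
        // three times the cluster's total RAM, so off-heap memory runs out
        // well before all 2 billion points are loaded.
    }
}
```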

My code for loading data into the Ignite shared RDD is here:

https://github.com/jiazou-bigdata/SparkBench/blob/master/perf-bench/src/main/scala/edu/rice/bench/KMeansDataGenerator.scala#L64

My code for running Spark MLlib KMeans on the shared RDD is here:

https://github.com/jiazou-bigdata/SparkBench/blob/master/perf-bench/src/main/scala/edu/rice/bench/IgniteRDDKMeans.scala

To run 2 billion points, I enabled swap; the configuration file for the
Ignite server is here:
https://github.com/jiazou-bigdata/SparkBench/blob/master/ignite/server/example-cache.xml

I have run the program to load 2 billion points into memory several times,
and it failed every time.
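
For reference, the swap-related part of the server configuration looks
roughly like this (a sketch using Ignite 2.6's DataStorageConfiguration
bean; the maxSize and swapPath values here are illustrative, not the ones
I actually used):

```xml
<property name="dataStorageConfiguration">
    <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
        <property name="defaultDataRegionConfiguration">
            <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                <!-- Cap the off-heap region; value is illustrative -->
                <property name="maxSize" value="#{48L * 1024 * 1024 * 1024}"/>
                <!-- Back the region with a memory-mapped swap file -->
                <property name="swapPath" value="/mnt/ignite/swap"/>
            </bean>
        </property>
    </bean>
</property>
```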

One error I hit repeatedly while loading the 2 billion points into the
Ignite shared RDD is that one Ignite worker fails for no obvious reason;
the screen output is the same as the one in this post:
http://apache-ignite-users.70518.x6.nabble.com/Node-pause-for-no-obvious-reason-td21923.html

The end of the log file looks like this:

[14:17:13,231][INFO][grid-timeout-worker-#23][IgniteKernal] FreeList
[name=null, buckets=256, dataPages=12247613, reusePages=0]
[14:17:28,710][WARNING][jvm-pause-detector-worker][] Possible too long JVM
pause: 11193 milliseconds.
[14:17:28,834][INFO][tcp-disco-sock-reader-#4][TcpDiscoverySpi] Finished
serving remote node connection [rmtAddr=/172.31.88.4:45550, rmtPort=45550
[14:17:28,834][INFO][tcp-disco-sock-reader-#9][TcpDiscoverySpi] Finished
serving remote node connection [rmtAddr=/172.31.81.91:42661, rmtPort=42661
[14:17:28,948][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery
accepted incoming connection [rmtAddr=/172.31.81.91, rmtPort=59539]
[14:17:28,948][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery
spawning a new thread for connection [rmtAddr=/172.31.81.91, rmtPort=59539]
[14:17:29,039][INFO][tcp-disco-sock-reader-#11][TcpDiscoverySpi] Started
serving remote node connection [rmtAddr=/172.31.81.91:59539, rmtPort=59539]
[14:17:29,167][WARNING][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Node is
out of topology (probably, due to short-time network problems).
[14:17:29,167][INFO][tcp-disco-sock-reader-#11][TcpDiscoverySpi] Finished
serving remote node connection [rmtAddr=/172.31.81.91:59539, rmtPort=59539
[14:17:29,167][WARNING][disco-event-worker-#41][GridDiscoveryManager] Local
node SEGMENTED: TcpDiscoveryNode [id=0c1716fe-3b94-440e-905c-36fdca708ea4,
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.90.9],
sockAddrs=[ip-172-31-90-9/172.31.90.9:47500, /0:0:0:0:0:0:0:1%lo:47500,
/127.0.0.1:47500], discPort=47500, order=8, intOrder=8,
lastExchangeTime=1542291449160, loc=true, ver=2.6.0#20180710-sha1:669feacc,
isClient=false]
[14:17:29,393][SEVERE][tcp-disco-srvr-#2][] Critical system error detected.
Will be handled accordingly to configured handler [hnd=class
o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext
[type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread
tcp-disco-srvr-#2 is terminated unexpectedly.]]
java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated
unexpectedly.
        at
org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:5686)
        at
org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
[14:17:29,439][SEVERE][tcp-disco-srvr-#2][] JVM will be halted immediately
due to the failure: [failureCtx=FailureContext
[type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread
tcp-disco-srvr-#2 is terminated unexpectedly.]]
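
I notice the 11-second JVM pause in the log above is longer than Ignite's
default failure detection timeout (10 seconds), which would explain why the
node got segmented. One mitigation I am considering is raising that timeout
(a sketch of the Spring XML config; the 30-second value is an assumption,
not a tuned setting):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Default is 10 000 ms; raising it tolerates longer GC/swap pauses
         at the cost of slower detection of genuinely failed nodes. -->
    <property name="failureDetectionTimeout" value="#{30L * 1000}"/>
</bean>
```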


If I disable swap and enable persistence instead, I cannot start the Ignite
server; it complains that a node with the same consistent ID has already
been added to the topology:

Caused by: class org.apache.ignite.spi.IgniteSpiException: Failed to add
node to topology because it has the same hash code for partitioned affinity
as one of existing nodes 
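
One thing I have not yet tried is assigning each server node an explicit,
unique consistentId, which should make node identity unambiguous across
restarts with persistence (a sketch; "worker-01" is a placeholder, not a
value from my setup):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Each node would get its own distinct value, e.g. worker-01..worker-10 -->
    <property name="consistentId" value="worker-01"/>
</bean>
```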


Can Apache Ignite run with Spark on data that exceeds available memory,
and if so, how? Any suggestions are highly appreciated!


Thanks,
Jia


--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Run Spark with Ignite Shared RDD on Large Volume of Data

Posted by zaleslaw <za...@gmail.com>.
I can offer only one simple piece of advice: try Ignite's own KMeans
clustering over the data stored directly in Ignite.
Please have a look at the KMeans example:
<https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/ml/clustering/KMeansClusterizationExample.java>

If you have any questions, write to me or create a topic here on the user
list.
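
For intuition, the core of what that example runs over an Ignite cache is
plain Lloyd iteration. A minimal self-contained sketch of the idea (1-D
data, two clusters, nothing Ignite-specific; data values are made up):

```java
import java.util.Arrays;

// Minimal Lloyd's-iteration KMeans on 1-D data: the same alternating
// assign/update scheme the Ignite KMeansClusterizationExample applies
// to data stored in an Ignite cache.
public class TinyKMeans {
    public static double[] fit(double[] points, double[] centroids, int iters) {
        for (int it = 0; it < iters; it++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (double p : points) {
                // Assign each point to its nearest centroid.
                int best = 0;
                for (int c = 1; c < centroids.length; c++)
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best]))
                        best = c;
                sum[best] += p;
                count[best]++;
            }
            // Move each centroid to the mean of its assigned points.
            for (int c = 0; c < centroids.length; c++)
                if (count[c] > 0) centroids[c] = sum[c] / count[c];
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.2, 0.8, 10.0, 10.2, 9.8};
        double[] centroids = fit(points, new double[]{0.0, 5.0}, 10);
        System.out.println(Arrays.toString(centroids)); // prints [1.0, 10.0]
    }
}
```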




Re: Run Spark with Ignite Shared RDD on Large Volume of Data

Posted by Jia Zou <ja...@gmail.com>.
Does anyone have a clue about this?
Thanks a lot!


Jia


