Posted to user@ignite.apache.org by pragmaticbigdata <am...@gmail.com> on 2016/12/05 13:43:28 UTC

Re: Apache Spark & Ignite Integration

I have tried to capture my understanding in the two diagrams below. Kindly let
me know whether they correctly depict the Ignite-Spark integration in terms of
data visibility and persistence.

<http://apache-ignite-users.70518.x6.nabble.com/file/n9393/Pbq53oL.png> 

<http://apache-ignite-users.70518.x6.nabble.com/file/n9393/q7e83SI.png> 



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Apache-Spark-Ignite-Integration-tp8556p9393.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.

Re: Apache Spark & Ignite Integration

Posted by vkulichenko <va...@gmail.com>.
First of all, I'm not sure how this works with DataFrames. Since we don't have
DataFrame support yet, only RDD, using a DataFrame can potentially not work as
we expect (I don't have enough Spark expertise to tell whether this is the
case). The only way to check is to write tests.

Other than that, just keep in mind that IgniteRDD is basically another API for
an Ignite cache, i.e. everything that is true for caches is true here as well.
In particular, answering your questions:

a. It depends on what you use as keys. If a pair is saved with the same key
twice, the second write overwrites the first one.
b. Ignite works in a concurrent fashion without locking the world. While you
save new data, you can still read it and run queries.
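A minimal sketch of point (a), assuming the Ignite 1.x ignite-spark API; the
cache name and data are made up, and `overwrite = true` is passed so the
underlying data streamer replaces existing keys rather than skipping them:

```scala
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}

object OverwriteExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ignite-overwrite").setMaster("local[*]"))

    // IgniteContext starts (or connects to) Ignite on each Spark worker.
    val ic = new IgniteContext[Int, String](sc, () => new IgniteConfiguration())

    // IgniteRDD is a live view over the "persons" cache (hypothetical name).
    val cache = ic.fromCache("persons")

    cache.savePairs(sc.parallelize(Seq(1 -> "Alice", 2 -> "Bob")),
      overwrite = true)

    // Saving the same key again simply overwrites the previous value,
    // exactly as a cache.put() would; no duplicate entry is created.
    cache.savePairs(sc.parallelize(Seq(1 -> "Alicia")), overwrite = true)

    cache.collect().foreach(println)
    sc.stop()
  }
}
```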

-Val



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Apache-Spark-Ignite-Integration-tp8556p9544.html

Re: Apache Spark & Ignite Integration

Posted by pragmaticbigdata <am...@gmail.com>.
Sure.

1. The first diagram illustrates the data visibility aspect of the Spark
integration. Given that a cache exists on the Ignite node, Spark creates a
data frame from the IgniteRDD and performs an action (df.show()) on it. If
changes are made to the cache concurrently (either by another Spark
application or by another application using the Ignite API), would the Spark
worker see those changes? My understanding from our discussion so far is that
the df.show() action would not display the latest changes in the cache:
even though the underlying IgniteRDD might be updated, the data frame is
another layer on top of it.
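To make the scenario concrete, here is a sketch of what I mean (Ignite 1.x
ignite-spark API assumed; the cache name and Person type are hypothetical).
The open question is at which point Spark actually pulls the partitions from
the cache:

```scala
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(id: Int, name: String)

object VisibilityScenario {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("visibility").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val ic = new IgniteContext[Int, Person](sc, () => new IgniteConfiguration())
    val cache = ic.fromCache("persons") // hypothetical cache name

    // Build a DataFrame from the IgniteRDD. This is lazy: no data is read
    // from the cache until an action runs.
    val df = cache.map { case (_, p) => p }.toDF()

    // ... another application updates the "persons" cache here ...

    // Whether show() sees those updates depends on when Spark actually
    // reads the partitions from Ignite (and on any caching of df).
    df.show()

    sc.stop()
  }
}
```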

2. The second diagram illustrates the locking and concurrency behavior of the
Spark integration. Given that a cache exists on the Ignite node, Spark creates
a data frame from the IgniteRDD and adds a new column to the data (the email
column in the diagram). If changes are made to the cache concurrently (either
by another Spark application or by another application using the Ignite API),
then:

a. What happens when Spark persists the RDD back to the Ignite cache through
the savePairs() API? Would the changes made to the cache in the meantime be
lost?
b. What is the locking behavior when updating the Ignite cache? Does Ignite
lock all partitions of the cache, preventing read/write access, or can it
determine which partitions are going to be updated and lock only those?
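For reference, this is roughly the write-back flow I have in mind (a sketch
against the Ignite 1.x ignite-spark API; the cache name, Person type, and
email derivation are all made up). Per the earlier answer, savePairs() only
touches the keys it is given, so concurrent updates to other keys survive,
while writes to the same keys race with last-writer-wins, as with
cache.put(); no cache-wide lock is taken:

```scala
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}

case class Person(id: Int, name: String, email: String)

object WriteBackScenario {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("write-back").setMaster("local[*]"))
    val ic = new IgniteContext[Int, Person](sc, () => new IgniteConfiguration())
    val cache = ic.fromCache("persons") // hypothetical cache name

    // Derive the new "email" column for every entry.
    val withEmail = cache.map { case (id, p) =>
      (id, p.copy(email = s"${p.name.toLowerCase}@example.com"))
    }

    // Stream the derived pairs back; entries for keys not present in
    // withEmail are left untouched.
    cache.savePairs(withEmail, overwrite = true)

    sc.stop()
  }
}
```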

Thanks.



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Apache-Spark-Ignite-Integration-tp8556p9502.html

Re: Apache Spark & Ignite Integration

Posted by vkulichenko <va...@gmail.com>.
Hi,

To be honest, I don't quite understand these diagrams :) Can you add some
comments?

-Val



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Apache-Spark-Ignite-Integration-tp8556p9494.html