Posted to user@ignite.apache.org by pragmaticbigdata <am...@gmail.com> on 2017/01/17 12:26:14 UTC

Consistency Guarantees & Smart Updates with Spark Integration

I have a couple of questions about the Ignite-Spark integration:
1. What consistency guarantees does Ignite provide when saving an RDD to the
data grid? For example, assume the Spark RDD holds 1 million records and I
call the sharedRdd.savePairs() API.

a. What happens if a Spark worker crashes after a few thousand records have
already been saved? Would the data uploaded to the data grid before the crash
be rolled back?
b. What happens if one of the Ignite server nodes that is loading some of
the data crashes? Would the data for this RDD on the other nodes in the data
grid be rolled back?

2. When updating data in the data grid through the RDD API, can Ignite
smartly determine which data has changed (by comparing with the previous data
in the data grid) and update only the affected partitions? Assume the RDD was
previously loaded from the data grid through the
igniteContext.fromCache("partitioned") API.

Thanks.




Re: Consistency Guarantees & Smart Updates with Spark Integration

Posted by vkulichenko <va...@gmail.com>.
1.a. Help with what? Do you know how Spark behaves in this case and what
guarantees it provides? To be honest, I'm still struggling to understand
why you don't want to use the Ignite API directly for updates. Is there a use
case that you tried to implement, but it didn't work for some reason?

1.b. Whether or not you need a transaction depends on what you're trying to
achieve, not on the number of backups. Backups help you avoid losing data in
case of node failures. Again, it's very hard to discuss without a particular
use case in mind.
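
For reference, a minimal sketch of enabling backups on the cache (the cache
name and backup count here are just placeholders):

import org.apache.ignite.cache.CacheMode
import org.apache.ignite.configuration.CacheConfiguration

// Keep one extra copy of every partition so a single node failure loses no data.
val cacheCfg = new CacheConfiguration[Int, String]("partitioned")
cacheCfg.setCacheMode(CacheMode.PARTITIONED)
cacheCfg.setBackups(1)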

2. Ignite can't do this, of course, but it sounds like you can filter the RDD
first, and then map it. This way the modified RDD will be smaller and you
will have fewer updates.
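
Roughly, the filter-then-save approach could look like the sketch below (the
Employee class, field names, and config path are assumptions on my side, not
your actual schema):

import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical value class; the real cache schema is not known here.
case class Employee(name: String, region: String, salary: Double)

val sc = new SparkContext(new SparkConf().setAppName("ignite-filtered-update"))
val igniteContext = new IgniteContext[Int, Employee](sc, "config/example-cache.xml")
val employeesRDD = igniteContext.fromCache("partitioned")

// Narrow down to the pairs that actually change before writing back,
// so only those keys are sent to the grid.
val employeesModifiedRDD = employeesRDD
  .filter { case (_, e) => e.region == "California" }
  .map { case (k, e) => (k, e.copy(salary = e.salary * 1.1)) }

employeesRDD.savePairs(employeesModifiedRDD)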

-Val




Re: Consistency Guarantees & Smart Updates with Spark Integration

Posted by pragmaticbigdata <am...@gmail.com>.
1.a. So basically Ignite cannot help much here. Would wrapping the save in an
Ignite transaction help? When a Spark node crashes, I can roll back the
transaction so that the data in the data grid stays consistent. This also
means I would have to use the IgniteTransaction API for all operations
performed on the data grid. Is my understanding right?
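
To make the question concrete, this is the kind of transactional update I
have in mind, done through the direct Ignite API (cache name and values are
placeholders, and the cache would have to be configured as TRANSACTIONAL; as
far as I understand, IgniteRDD.savePairs streams data and does not take part
in a transaction):

import org.apache.ignite.Ignition
import org.apache.ignite.transactions.{TransactionConcurrency, TransactionIsolation}

val ignite = Ignition.ignite()
val cache = ignite.cache[Int, String]("partitioned")

val tx = ignite.transactions().txStart(
  TransactionConcurrency.PESSIMISTIC, TransactionIsolation.REPEATABLE_READ)
try {
  cache.put(1, "updated-value")
  cache.put(2, "another-updated-value")
  tx.commit()       // both puts become visible atomically
} catch {
  case e: Exception =>
    tx.rollback()   // nothing is written if any step fails
    throw e
} finally {
  tx.close()
}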

1.b. Ok. Assuming I have backups configured, I do not need to wrap the
operations in a transaction to keep the data consistent in the data grid.

2. Right, I understand that the data is not loaded into Spark memory when the
igniteContext.fromCache("partitioned") call is made. Assume the data flow is
as below:

a. val employeesRDD = igniteContext.fromCache("partitioned")
b. // Work on the RDD by modifying the salaries of all employees in
California. This step generates a new RDD named employeesModifiedRDD.
c. employeesRDD.savePairs(employeesModifiedRDD) // save the modified pairs
back to the data grid

Assume the employees cache is partitioned by region. In step c, can Ignite
determine that only the partition holding the employee records for the
California region was updated, and therefore execute the update only on the
server node that holds those records, instead of blindly updating all records
in the cache?

Thanks for your input.




Re: Consistency Guarantees & Smart Updates with Spark Integration

Posted by vkulichenko <va...@gmail.com>.
Hi,

1.a. I think this depends on Spark and how it handles failover in such
cases. Basically, loading data into Ignite from a Spark RDD is a simple
iteration through all partitions of that RDD.
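
Roughly, the loading path is equivalent to something like the sketch below
(this is an illustration, not the actual IgniteRDD implementation; cache name
and types are placeholders):

import org.apache.ignite.Ignition
import org.apache.spark.rdd.RDD

// Iterate the RDD partitions and stream each pair into the cache.
// If a task fails partway through, whatever was already streamed stays in
// the cache; there is no rollback.
def saveToIgnite(rdd: RDD[(Int, String)]): Unit =
  rdd.foreachPartition { partition =>
    val ignite = Ignition.ignite()   // Ignite node already running on the worker
    val streamer = ignite.dataStreamer[Int, String]("partitioned")
    try {
      partition.foreach { case (k, v) => streamer.addData(k, v) }
    } finally {
      streamer.close()               // flushes any remaining buffered entries
    }
  }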

1.b. You will not lose any data if you have at least one backup.

2. Can you clarify this? igniteContext.fromCache("partitioned") does not
load any data; it just wraps an Ignite cache in the RDD API.

-Val


