Posted to user@spark.apache.org by SaiPrasanna <sa...@siemens.com> on 2014/01/16 10:42:22 UTC

Consistency between RDD's and Native File System

Hello, I am a novice to Spark.

Say we have created an RDD1 from the native file system/HDFS, and some
transformations and actions on it resulted in an RDD2. Let's assume RDD1
and RDD2 are persisted, cached in memory. If the block from which RDD1 was
created is modified at time T1, and RDD1/RDD2 is accessed later at T2 > T1,
does Spark somehow ensure consistency, or is it up to the programmer
to handle this explicitly?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Consistency-between-RDD-s-and-Native-File-System-tp583.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Consistency between RDD's and Native File System

Posted by Sai Prasanna <an...@gmail.com>.
Christopher, things are much clearer now. I did go through the journaling material.

Thanks...




-- 
*Sai Prasanna. AN*
*II M.Tech (CS), SSSIHL*


*Entire water in the ocean can never sink a ship, Unless it gets inside. All
the pressures of life can never hurt you, Unless you let them in.*

Re: Consistency between RDD's and Native File System

Posted by Christopher Nguyen <ct...@adatao.com>.
Sai, to be sure, what Mark said regarding lineage and recomputation is
exactly correct, so if it matters in your use case, you shouldn't ignore
this behavior, even as a side effect.

It just isn't what I think you were expecting in terms of RDD guarantees,
e.g., that a signal is somehow sent to your driver or workers when the
"original source" has changed. Further, there is no guarantee that Spark
hasn't decided to checkpoint the lineage somewhere and is no longer going
back to the "original source" to pick up the latest data. The design goal of
the recomputation (read "journaling") is reliability, not "data refresh".

Hope that is clear. I do sympathize with a possible reading of your design
goal; we are working on perhaps a similar one, where streaming data
deltas are automatically reflected into a data structure on which the user
holds a single reference (*).

(*) Yes, this is based on DStream/TD's work and will be available soon.
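
To make the checkpointing caveat concrete, here is a minimal spark-shell
sketch (the directory and file names are illustrative, not from this thread):

scala> sc.setCheckpointDir("/tmp/spark-checkpoints")
scala> val data = sc.textFile("input.txt")
scala> data.checkpoint()   // request a checkpoint; lineage is truncated after materialization
scala> data.count          // triggers computation and writes the checkpoint files

After this, recomputations of data read the checkpoint files rather than
input.txt, so later edits to input.txt are no longer picked up.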
--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen




Re: Consistency between RDD's and Native File System

Posted by Christopher Nguyen <ct...@adatao.com>.
Mark, that's precisely why I brought up lineage, in order to say I didn't
want to muddy the issue there :)

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen




Re: Consistency between RDD's and Native File System

Posted by Mark Hamstra <ma...@clearstorydata.com>.
I don't agree entirely, Christopher.  Without persisting or checkpointing
RDDs, re-evaluation of the lineage will pick up source changes.  I'm not
saying that working this way is a good idea (in fact, it's generally not),
but you can do things like this:

1) Create file silliness.txt containing:

one line
two line
red line
blue line

2) Fire up spark-shell and do this:

scala> val lines = sc.textFile("silliness.txt")
scala> println(lines.collect.mkString(", "))
.
.
.
one line, two line, red line, blue line

3) Edit silliness.txt so that it is now:

and now
for something
completely
different

4) Continue on with spark-shell:

scala> println(lines.collect.mkString(", "))
.
.
.
and now, for something, completely, different
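
For contrast, a sketch of the persisted case (the behavior shown assumes the
cached partitions are not evicted under memory pressure; if they are, Spark
recomputes from the file and would see the new contents):

scala> val lines = sc.textFile("silliness.txt").cache()
scala> lines.count   // materializes and caches the partitions
scala> // ...now edit silliness.txt on disk as in step 3...
scala> println(lines.collect.mkString(", "))
one line, two line, red line, blue line

Here the old contents come back from the cache; the file edit is not picked up.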



Re: Consistency between RDD's and Native File System

Posted by Sai Prasanna <an...@gmail.com>.
Wow, that clears things up. Thanks, Christopher!





Re: Consistency between RDD's and Native File System

Posted by Christopher Nguyen <ct...@adatao.com>.
Sai, from your question, I infer that you have an interpretation that RDDs
are somehow an in-memory/cached copy of the underlying data source, and so
there is some expectation of a synchronization model between the two.

That would not be what the RDD model is. RDDs are first-class, stand-alone
(distributed, immutable) datasets. Once created, an RDD exists on its own
and isn't expected to somehow automatically realize that some underlying
source has changed. (There is the concept of lineage or provenance for
recomputation of RDDs, but that's orthogonal to this interpretation so I
won't muddy the issue here).
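
A tiny sketch of that stand-alone behavior: once an RDD is created, getting
fresh data is simply a matter of creating a new RDD from the source (the
file path here is illustrative):

scala> val rdd1 = sc.textFile("hdfs:///data/students.txt")      // snapshot semantics once materialized/persisted
scala> // ...the file changes on HDFS...
scala> val rdd1Fresh = sc.textFile("hdfs:///data/students.txt") // a new, separate RDD that reads the current data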

If you're looking for a mutable data table model, we will soon be releasing
to open source something called Distributed DataFrame (DDF, based on R's
data.frame) on top of RDDs that allows you to, among other useful things,
load a dataset, perform transformations on it, and save it back, all the
while holding on to a single DDF reference.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen




Re: Consistency between RDD's and Native File System

Posted by Sai Prasanna <an...@gmail.com>.
Thanks, Patrick, but I think I didn't put my question clearly...

The question is: say in the native file system or HDFS, I have data
describing students who passed, who failed, and whose results are withheld
for some reason.
*Time T1:*
x - Pass
y - Fail
z - Withheld.

*Time T2:*
I create an RDD1 reflecting this data and run a query to find how many
candidates have passed: RESULT = 1. RDD1 is cached, or stored in the file
system depending on the availability of space.

*Time T3:*
In the native file system, the results for z are now out and z is declared
passed, so HDFS needs to be modified:
x - Pass
y - Fail
z - Pass.

*Time T4:*
If I now take the RDD1 that is in the file system, or its cached copy, and
run the same query, I get RESULT = 1, but ideally RESULT should be 2.

So I was asking: is there a way Spark hints that RDD1 is no longer
consistent with the file system, or is it up to the programmer to recreate
RDD1 if the block from which the RDD was created changed at a later point
in time?
[T1 < T2 < T3 < T4]

Thanks in advance...
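
The timeline above, sketched in spark-shell (file name and record format
are illustrative):

scala> // T2: build and cache the RDD, then count passes
scala> val students = sc.textFile("students.txt").cache()
scala> students.filter(_.endsWith("Pass")).count   // 1
scala> // T3: students.txt is edited on disk so that z is now "Pass"
scala> // T4: the same query against the cached RDD
scala> students.filter(_.endsWith("Pass")).count   // still 1 while the cached copy is used; 2 only if recomputed from the source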






Re: Consistency between RDD's and Native File System

Posted by Patrick Wendell <pw...@gmail.com>.
RDDs are immutable, so there isn't really such a thing as modifying a
block in place inside an RDD. As a result, this particular
consistency issue doesn't come up in Spark.

- Patrick
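
Immutability here means every transformation returns a new RDD rather than
modifying one in place, e.g.:

scala> val rdd1 = sc.parallelize(Seq(1, 2, 3))
scala> val rdd2 = rdd1.map(_ * 2)   // a new RDD; rdd1 is untouched
scala> rdd1.collect   // Array(1, 2, 3)
scala> rdd2.collect   // Array(2, 4, 6)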
