Posted to user@spark.apache.org by Mukesh Jha <me...@gmail.com> on 2014/11/25 13:02:20 UTC

Lifecycle of RDD in spark-streaming

Hey Experts,

I wanted to understand in detail about the lifecycle of rdd(s) in a
streaming app.

From my current understanding:
- An RDD gets created out of the realtime input stream.
- Transformation functions are applied in a lazy fashion on the RDD to
transform it into other RDD(s).
- Actions are taken on the final transformed RDDs to get the data out of
the system.
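
For reference, the kind of app I have in mind looks roughly like this (just a
minimal sketch; the socket source and the word counting are placeholder logic,
not my actual job):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LifecycleSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-lifecycle-sketch")
    val ssc = new StreamingContext(conf, Seconds(60)) // one RDD per 60-second batch

    // 1. An RDD is created out of the realtime input stream, once per batch.
    val lines = ssc.socketTextStream("localhost", 9999)

    // 2. Transformations are applied lazily, producing new RDDs.
    val words = lines.flatMap(_.split(" "))

    // 3. The action inside the output operation pulls the data out of the system.
    words.foreachRDD { rdd =>
      println("Words in this batch: " + rdd.count())
    }

    ssc.start()
    ssc.awaitTermination()
  }
}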

Also, RDD(s) are stored in the cluster's RAM (or on disk, if so configured) and
are cleaned up in LRU fashion.

So I have the following questions on the same:
- How does Spark (Streaming) guarantee that all the actions are taken on each
input RDD/batch?
- How does Spark determine that the life-cycle of an RDD is complete? Is
there any chance that an RDD will be cleaned out of RAM before all actions
are taken on it?

Thanks in advance for all your help. Also, I'm relatively new to Scala &
Spark, so pardon me in case these are naive questions/assumptions.

-- 
Thanks & Regards,

*Mukesh Jha <me...@gmail.com>*

Re: Lifecycle of RDD in spark-streaming

Posted by Harihar Nahak <hn...@wynyardgroup.com>.
When new data comes in on a stream, Spark uses the streaming classes to
convert it into an RDD, and as you mention it then goes through transformations
and finally actions. As far as I have experienced, all RDDs remain in memory
until the user destroys them or the application terminates.





-- 
Regards,
Harihar Nahak
BigData Developer
Wynyard
Email:hnahak@wynyardgroup.com | Extn: 8019





Re: Lifecycle of RDD in spark-streaming

Posted by lalit1303 <la...@sigmoidanalytics.com>.
Hi Mukesh,

Once you create a streaming job, a DAG is created which contains your job
plan, i.e. all the map transformations and all the action operations to be
performed on each batch of the streaming application.

So, once your job is started, the input dstream takes its data from the
specified source and all the transformations/actions are performed according
to the DAG that was created. Once all the operations on a batch's dstream are
performed, its RDDs are cleaned up in LRU fashion.




-----
Lalit Yadav
lalit@sigmoidanalytics.com



Re: Lifecycle of RDD in spark-streaming

Posted by Mukesh Jha <me...@gmail.com>.
Any pointers guys?




-- 


Thanks & Regards,

*Mukesh Jha <me...@gmail.com>*

Re: Lifecycle of RDD in spark-streaming

Posted by Tathagata Das <ta...@gmail.com>.
If it regularly fails after 8 hours, could you get me the log4j logs?
To limit their size, set the default log level to WARN and the level of logs
for all classes in the package o.a.s.streaming to DEBUG. Then I can take a look.
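
In log4j.properties terms that would be roughly the following (assuming the
console appender from Spark's default conf/log4j.properties.template):

# keep everything at WARN so the logs stay small
log4j.rootCategory=WARN, console
# ...but log the Spark Streaming classes at DEBUG
log4j.logger.org.apache.spark.streaming=DEBUG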

Re: Lifecycle of RDD in spark-streaming

Posted by Bill Jay <bi...@gmail.com>.
Gerard,

That is a good observation. However, the strange thing I see is that if I use
"MEMORY_AND_DISK_SER", the job fails even earlier. In my case, it takes 10
seconds to process each batch of data, and the batch interval is one minute.
It fails after 10 hours with the "cannot compute split" error.

Bill


Re: Lifecycle of RDD in spark-streaming

Posted by Gerard Maas <ge...@gmail.com>.
Hi TD,

We also struggled with this error for a long while. The recurring scenario
is when the job takes longer to compute than the job interval and a backlog
starts to pile up.
Hint: check whether the DStream storage level is set to "MEMORY_ONLY_SER"; if
it is and memory runs out, then you will get a 'Cannot compute split: Missing
block ...'.

What I don't know at the moment is whether the new data is dropped or the LRU
policy removes data already in the system in favor of the incoming data.
In any case, the DStream processing still thinks the data is there at the
moment the job is scheduled to run, and the job fails.

In our case, changing storage to "MEMORY_AND_DISK_SER" solved the problem
and our streaming job can get through tough times without issues.
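
For reference, switching the level is a one-liner. A minimal sketch (the names
"lines" and "ssc" stand for whatever DStream/StreamingContext you already have;
the socket source is just an example, and for receiver-based sources like Kafka
the storage level is usually a parameter of the create call):

import org.apache.spark.storage.StorageLevel

// explicitly persist the DStream serialized in memory, spilling to disk
lines.persist(StorageLevel.MEMORY_AND_DISK_SER)

// or pass the level when creating a receiver-based stream, e.g.
val lines2 = ssc.socketTextStream("host", 9999, StorageLevel.MEMORY_AND_DISK_SER)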

Regularly checking 'scheduling delay' and 'total delay' on the Streaming
tab in the UI is a must.  (And soon we will have that on the metrics report
as well!! :-) )

-kr, Gerard.




Re: Lifecycle of RDD in spark-streaming

Posted by Bill Jay <bi...@gmail.com>.
Hi TD,

I am using Spark Streaming to consume data from Kafka, do some aggregation,
and ingest the results into RDS. I do use foreachRDD in the program. I am
planning to use Spark Streaming in our production pipeline, and it performs
well in generating the results. Unfortunately, we need the pipeline to run
24/7, and the Spark Streaming job usually fails after 8-20 hours with the
exception "cannot compute split". In other cases, the Kafka receiver fails
and the program runs without producing any result.

In my pipeline, the batch size is 1 minute and the data volume per minute
from Kafka is 3G. I have been struggling with this issue for more than a
month. It would be great if you could provide some solutions for this. Thanks!
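
For context, the shape of the job is roughly the following (a simplified
sketch, not the real code; the ZooKeeper address, group id, topic name and the
stand-in write are all placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-aggregation-sketch")
val ssc = new StreamingContext(conf, Minutes(1)) // 1-minute batches

// receiver-based Kafka stream of (key, message) pairs
val stream = KafkaUtils.createStream(ssc, "zk1:2181", "agg-group", Map("events" -> 1))

stream.map { case (_, value) => (value, 1L) }
      .reduceByKey(_ + _)
      .foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          // in the real job this opens a JDBC connection and writes the rows to RDS
          partition.foreach { case (key, count) => println(key + " -> " + count) }
        }
      }

ssc.start()
ssc.awaitTermination()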

Bill



Re: Lifecycle of RDD in spark-streaming

Posted by Tathagata Das <ta...@gmail.com>.
Can you elaborate on the usage pattern that led to "cannot compute split"?
Are you using the RDDs generated by a DStream outside the DStream logic?
Something like running interactive Spark jobs (independent of the Spark
Streaming ones) on RDDs generated by DStreams? If that is the case, what is
happening is that Spark Streaming is not aware that some of the RDDs (and the
raw input data they will need) will be used by Spark jobs unrelated to Spark
Streaming. Hence Spark Streaming will actively clear off the raw data, leading
to failures in the unrelated Spark jobs using that data.

In case this is your use case, the cleanest way to solve this is by asking
Spark Streaming to "remember" stuff for longer, using
streamingContext.remember(<duration>). This will ensure that Spark Streaming
will keep around all the stuff for at least that duration.
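For example (a minimal sketch; the five minutes is just an illustration):

import org.apache.spark.streaming.Minutes
// keep the RDDs generated by the DStreams around for at least 5 minutes
ssc.remember(Minutes(5))
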
Hope this helps.

TD



Re: Lifecycle of RDD in spark-streaming

Posted by Bill Jay <bi...@gmail.com>.
Just to add one more point: if Spark Streaming knows when an RDD will not be
used any more, I believe Spark will not try to retrieve data it no longer
needs. However, in practice I often encounter the "cannot compute split"
error. Based on my understanding, this is because Spark cleared out data that
was still going to be used. In my case, the data volume (30M/s, with a batch
size of 60 seconds) is much smaller than the memory (20G per executor). If
Spark only kept the RDDs that are in use, I would expect this error not to
happen.

Bill


Re: Lifecycle of RDD in spark-streaming

Posted by Tathagata Das <ta...@gmail.com>.
Let me further clarify Lalit's point on when RDDs generated by
DStreams are destroyed, and hopefully that will answer your original
questions.

1. How does Spark (Streaming) guarantee that all the actions are taken on
each input RDD/batch?
This isn't hard! By the time you call streamingContext.start(), you have
already set up the output operations (foreachRDD, saveAs***Files, etc.) that
you want to do with the DStream. There are RDD actions inside the DStream
output operations that need to be done every batch interval. So all the system
does is this: after every batch interval, it puts all the output operations
(which will call RDD actions) in a job queue, and then keeps executing jobs
from the queue. If there is any failure in running the jobs, the streaming
context will stop.

2. How does Spark determine that the life-cycle of an RDD is complete? Is
there any chance that an RDD will be cleaned out of RAM before all actions
are taken on it?
Spark Streaming knows when all the processing related to batch T has been
completed. It also keeps track of how much of the previous RDDs it needs to
remember and keep around in the cache, based on what DStream operations have
been done. For example, if you are using a 1-minute window, the system knows
that it needs to keep around at least the last 1 minute of data in memory.
Accordingly, it cleans up the input data (actively unpersisted) and the cached
RDDs (simply dereferenced from the DStream metadata, and then unpersisted by
Spark as the RDD objects get garbage-collected by the JVM).
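
To make the window example concrete (a minimal sketch; "lines" stands for any
DStream you already have and the durations are just placeholders):

import org.apache.spark.streaming.Seconds

// A 60-second window sliding every 10 seconds: Spark Streaming knows it must
// keep roughly the last six batches of "lines" cached until the windowed job
// for each interval has run, and only then lets them be cleaned up.
val windowed = lines.window(Seconds(60), Seconds(10))
windowed.foreachRDD(rdd => println("events in the last minute: " + rdd.count()))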

TD





Re: Lifecycle of RDD in spark-streaming

Posted by tian zhang <tz...@yahoo.com.INVALID>.
I have found this paper, which seems to answer most of the questions about
life duration:
https://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf

Tian 
