Posted to user@spark.apache.org by RodrigoB <ro...@aspect.com> on 2014/09/22 23:10:14 UTC

RDD data checkpoint cleaning

Hi all,

I've just started to take Spark Streaming recovery more seriously as the project
roll-out approaches. We need to ensure full recovery at all Spark levels - driver,
receiver and worker.

I started doing some tests today and became concerned with the findings.

I have an RDD in memory that gets updated through the updateStateByKey
function, which is fed by an actor stream. Checkpointing is done with the
default interval - 10 seconds.
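For reference, the relevant part of my setup looks roughly like the sketch
below - the paths and names are made up for illustration and the feeder actor
is heavily simplified, so this isn't the exact code:

    import akka.actor.{Actor, Props}
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._ // pair-DStream implicits on older Spark
    import org.apache.spark.streaming.receiver.ActorHelper

    // Hypothetical feeder: anything sent to this actor is pushed into the stream.
    class FeederActor extends Actor with ActorHelper {
      def receive = {
        case (key: String, count: Int) => store((key, count))
      }
    }

    object StateApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("state-recovery-test")
        val ssc = new StreamingContext(conf, Seconds(1))
        ssc.checkpoint("hdfs:///tmp/streaming-checkpoints") // made-up path

        val events = ssc.actorStream[(String, Int)](Props[FeederActor], "feeder")

        // Running count per key, kept as Spark Streaming state.
        def updateFunc(values: Seq[Int], state: Option[Int]): Option[Int] =
          Some(values.sum + state.getOrElse(0))
        val counts = events.updateStateByKey[Int](updateFunc)
        counts.checkpoint(Seconds(10)) // matches the 10 second default mentioned above

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }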

Using the recipe in RecoverableNetworkWordCount I'm recovering that same
RDD. My initial expectation was that Spark Streaming would be clever
enough to regularly delete old checkpoints, as TD mentions in the thread
below:

http://apache-spark-user-list.1001560.n3.nabble.com/checkpoint-and-not-running-out-of-disk-space-td1525.html
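
In other words, I'm creating the context with the getOrCreate pattern from
that example - roughly this sketch (the checkpoint path is again made up):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/streaming-checkpoints" // made-up path

    def createContext(): StreamingContext = {
      // Build the full DStream graph here, exactly as on the first run.
      val ssc = new StreamingContext(
        new SparkConf().setAppName("state-recovery-test"), Seconds(1))
      ssc.checkpoint(checkpointDir)
      // ... actorStream + updateStateByKey as in the previous sketch ...
      ssc
    }

    // On a restart this rebuilds the context (and the state RDD) from the
    // checkpoint directory instead of calling createContext() again.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()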

Instead I'm seeing the data checkpoints continuously increase, which means the
recovery process takes a huge amount of time to complete, as the state-based
RDD gets overwritten once for every checkpoint written since the application
first started.
In fact, the only version I need is the one from the latest checkpoint.

I'd rather not have to implement all the recovery outside of Spark Streaming
(a few other challenges, like avoiding IO re-execution and event stream
recovery, will already need to be handled outside), so I really hope to have
some solid control over this part.

How does RDD data checkpoint cleaning happen? Would updateStateByKey be a
particular case where there is no cleaning? Would I have to write code to
delete the old checkpoints outside of Spark? That sounds dangerous... I haven't
looked at the code yet, but if someone already has that knowledge I would
greatly appreciate some insight.
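
Just to make concrete what I mean by deleting outside of Spark - purely a
sketch, not something I'm actually running, and the rdd-* folder layout under
the checkpoint directory is an assumption on my part - it would be something
like this against the Hadoop FileSystem API:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Sketch only: keep the newest `keep` rdd-* folders under the checkpoint
    // directory and delete the rest. Racing against Spark's own writes is
    // exactly what worries me here.
    def pruneOldRddCheckpoints(checkpointDir: String, keep: Int = 2): Unit = {
      val fs = FileSystem.get(new URI(checkpointDir), new Configuration())
      val rddDirs = fs.listStatus(new Path(checkpointDir))
        .filter(s => s.isDirectory && s.getPath.getName.startsWith("rdd-"))
        .sortBy(_.getModificationTime)
      rddDirs.dropRight(keep).foreach(s => fs.delete(s.getPath, true))
    }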

Note: I'm solely referring to the data checkpoints, not the metadata
checkpoints.

Many Thanks,
Rod






Re: RDD data checkpoint cleaning

Posted by Luis Ángel Vicente Sánchez <la...@gmail.com>.
Is there any news about this issue? I was using a local folder on Linux
for checkpointing, "file:///opt/sparkfolders/checkpoints". I think that
being able to use the ReliableKafkaReceiver in a 24x7 system, without having
to worry about the disk getting full, is a reasonable expectation.
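
For context, my setup is roughly the following sketch (topic, group and
ZooKeeper address are made up; my understanding is that the reliable receiver
is used when the receiver write-ahead log flag is enabled, but correct me if
that's wrong):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf()
      .setAppName("kafka-24x7") // made-up name
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("file:///opt/sparkfolders/checkpoints")

    // Receiver-based Kafka stream; with the flag above this should go
    // through the reliable receiver path.
    val stream = KafkaUtils.createStream(
      ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
    stream.count().print()

    ssc.start()
    ssc.awaitTermination()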

Regards,

Luis


Re: RDD data checkpoint cleaning

Posted by Luis Ángel Vicente Sánchez <la...@gmail.com>.
I have seen the same behaviour while testing the latest Spark 1.2.0
snapshot.

I'm trying the ReliableKafkaReceiver and it works quite well, but the
checkpoints folder keeps growing in size. The receivedMetaData folder stays
almost constant in size, but the receivedData folder keeps growing even if I
set spark.cleaner.ttl to 300 seconds.
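
For what it's worth, this is how I'm setting the TTL (just the relevant bit;
the app name is made up):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kafka-streaming")
      .set("spark.cleaner.ttl", "300") // value in seconds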

Regards,

Luis


Re: RDD data checkpoint cleaning

Posted by RodrigoB <ro...@aspect.com>.
Just a follow-up.

To make sure the RDDs really weren't being cleaned up, I replayed the app
both on the remote Windows laptop and then on the Linux machine, and at the
same time watched the RDD checkpoint folders in HDFS.

This confirmed the observed behavior: running from the laptop I could see the
RDD folders continuously accumulating. When I ran on Linux, only two RDD
folders were present and they were continuously being recycled.

Metadata checkpoints were being cleaned up in both scenarios.

Thanks,
Rod
 





Re: RDD data checkpoint cleaning

Posted by RodrigoB <ro...@aspect.com>.
Hi TD, thanks for getting back on this.

Yes, that's what I was experiencing - data checkpoints were being recovered
from a considerable time before the last data checkpoint, probably since the
very first writes (I would have to confirm). I do have some new findings on
this though.

I see these results when I run the application from my Windows laptop, where I
have IntelliJ, while the HDFS file system is on a Linux box (with very
reasonable latency!). I couldn't find any exceptions in the Spark logs, and I
did see that metadata checkpoints were being recycled in the HDFS folder.

Upon recovery I could see the usual Spark Streaming timestamp prints on the
console jumping from one data checkpoint time to the next very slowly.

Once I moved the app to the Linux box where I have HDFS, this problem seemed
to go away. If this issue only happens when running from Windows I won't be so
concerned and can go back to testing everything on Linux.
My only concern is whether, because of the substantial HDFS latency from the
Spark app, there is some kind of race condition between writes and cleanups of
the HDFS files that could have led to this finding.

I hope this description helps.

Thanks again,
Rod










Re: RDD data checkpoint cleaning

Posted by Tathagata Das <ta...@gmail.com>.
I am not sure what you mean by the data checkpoints continuously increasing,
leading to the recovery process taking a long time. Do you mean that in HDFS
you are seeing RDD checkpoint files being continuously written but never
deleted?
