Posted to user@spark.apache.org by Emmanuel <fo...@gmail.com> on 2015/07/21 17:43:30 UTC

Spark Streaming Checkpointing solutions

Hi,

I'm working on a Spark Streaming application and I would like to know which
storage is best for checkpointing.

For testing purposes we are using NFS shared between the worker, the master and
the driver program (in client mode),
but we have some issues with the CheckpointWriter (1 thread dedicated). *My
understanding is that NFS is not a good candidate for this usage.*

1. What is the best solution for checkpointing, and what are the alternatives?

2. Do the checkpoint directories also need to be shared between the driver
application and the workers?

Thanks for your replies



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Checkpointing-solutions-tp23932.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Spark Streaming Checkpointing solutions

Posted by Emmanuel Fortin <fo...@gmail.com>.
Thank you for your reply. I will consider HDFS for the checkpoint storage.




Re: Spark Streaming Checkpointing solutions

Posted by Dean Wampler <de...@gmail.com>.
TD's Spark Summit talk offers suggestions (
https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/).
He recommends using HDFS, because you get the resiliency of its three-way
replication, albeit with extra overhead. I believe the driver doesn't need
visibility into the checkpoint directory, e.g., if you're running in
client mode, but all the cluster nodes would need to see it to recover
a lost stage, which might be restarted on a different node. Hence, I
would think NFS could work if all nodes have the same mount, although
there would be a lot of network overhead. In some situations, a
high-performance file system appliance, e.g., NAS, could suffice.
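
For illustration only, here is a minimal PySpark sketch of pointing a
streaming application at an HDFS checkpoint directory and recovering from it
on restart. The namenode address, path, batch interval, socket source and the
is_hdfs_uri helper are all hypothetical and not taken from this thread; treat
it as a sketch under those assumptions rather than a recommended setup.

```python
from urllib.parse import urlparse

# Hypothetical values for illustration; adjust to your own cluster.
CHECKPOINT_DIR = "hdfs://namenode:8020/user/spark/checkpoints/my-app"
BATCH_SECONDS = 10


def is_hdfs_uri(uri):
    """Return True if the URI uses the hdfs:// scheme (hypothetical helper)."""
    return urlparse(uri).scheme == "hdfs"


def create_context():
    """Build a fresh StreamingContext; only called when no checkpoint exists."""
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="CheckpointSketch")
    ssc = StreamingContext(sc, BATCH_SECONDS)
    ssc.checkpoint(CHECKPOINT_DIR)

    # Hypothetical source: a text stream on localhost:9999, printed per batch.
    lines = ssc.socketTextStream("localhost", 9999)
    lines.pprint()
    return ssc


def main():
    from pyspark.streaming import StreamingContext

    if not is_hdfs_uri(CHECKPOINT_DIR):
        raise ValueError("expected an hdfs:// checkpoint directory")

    # Recover from the checkpoint if one exists, otherwise call create_context.
    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
    ssc.start()
    ssc.awaitTermination()
```

To try it, call main() from a script submitted with spark-submit. Note that
in this sketch getOrCreate runs in the driver and reads the checkpoint
directory, so with this pattern the path would need to be reachable from the
driver as well as from the cluster nodes.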

My $0.02,
dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com
