Posted to issues@spark.apache.org by "Xiangrui Meng (JIRA)" <ji...@apache.org> on 2014/05/16 13:01:39 UTC

[jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

Xiangrui Meng created SPARK-1855:
------------------------------------

             Summary: Provide memory-and-local-disk RDD checkpointing
                 Key: SPARK-1855
                 URL: https://issues.apache.org/jira/browse/SPARK-1855
             Project: Spark
          Issue Type: New Feature
          Components: MLlib, Spark Core
    Affects Versions: 1.0.0
            Reporter: Xiangrui Meng


Checkpointing is used to cut long lineage while maintaining fault tolerance. The current implementation is HDFS-based. Using BlockRDD, we can create in-memory-and-local-disk checkpoints (with replication) that are less reliable than the HDFS-based solution but faster.

It can help applications that require many iterations.
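
To illustrate why lineage length matters for iterative applications, here is a toy Python model. It is only a sketch and not Spark code: the ToyRDD class, its methods, and the iterate helper are hypothetical names invented for this example. It shows that each iteration adds one step that would have to be replayed after a failure, and that a checkpoint materializes the data and drops the reference to the parent, so the replay chain restarts from zero.

```python
class ToyRDD:
    """A toy dataset node: holds materialized data or derives it from a parent."""

    def __init__(self, parent=None, fn=None, data=None):
        self.parent = parent  # upstream node in the lineage, if any
        self.fn = fn          # transformation applied to the parent's data
        self.data = data      # materialized data (a source or a checkpoint)

    def map(self, fn):
        # Record the transformation lazily; nothing is computed here.
        return ToyRDD(parent=self, fn=lambda xs: [fn(x) for x in xs])

    def lineage_depth(self):
        """Count the transformations to replay to rebuild this node."""
        depth, node = 0, self
        while node.data is None:
            depth += 1
            node = node.parent
        return depth

    def compute(self):
        """Replay the lineage from the nearest materialized ancestor."""
        if self.data is not None:
            return self.data
        return self.fn(self.parent.compute())

    def checkpoint(self):
        """Materialize the data and truncate the lineage."""
        self.data = self.compute()
        self.parent = None
        self.fn = None
        return self


def iterate(rdd, steps, checkpoint_every=None):
    """Apply `steps` map transformations, optionally checkpointing periodically."""
    for i in range(1, steps + 1):
        rdd = rdd.map(lambda x: x + 1)
        if checkpoint_every and i % checkpoint_every == 0:
            rdd.checkpoint()
    return rdd


source = ToyRDD(data=[0, 1, 2])
plain = iterate(source, 55)
cut = iterate(source, 55, checkpoint_every=10)

print("depth without checkpoints:", plain.lineage_depth())
print("depth with a checkpoint every 10 steps:", cut.lineage_depth())
```

In this toy model the checkpoint lives only in the node itself, so losing it would leave nothing to recompute from. That is the trade-off the proposal accepts in exchange for speed.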



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

Posted by Matei Zaharia <ma...@gmail.com>.

BTW, for what it’s worth, I agree this is a good option to add. The only tricky thing will be making sure the checkpoint blocks are not garbage-collected by the block store. I don’t think they will be, though.

Matei
On May 17, 2014, at 2:20 PM, Matei Zaharia <ma...@gmail.com> wrote:

> We do actually have replicated StorageLevels in Spark. You can use MEMORY_AND_DISK_2 or construct your own StorageLevel with your own custom replication factor.
> 
> BTW you guys should probably have this discussion on the JIRA rather than the dev list; I think the replies somehow ended up on the dev list.
> 
> Matei

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

Posted by Mridul Muralidharan <mr...@gmail.com>.

My bad... I was replying via mobile, and I did not realize that responses
to JIRA mails are not mirrored to JIRA, unlike PR responses!


Regards,
Mridul

On Sun, May 18, 2014 at 2:50 AM, Matei Zaharia <ma...@gmail.com> wrote:
> We do actually have replicated StorageLevels in Spark. You can use MEMORY_AND_DISK_2 or construct your own StorageLevel with your own custom replication factor.
>
> BTW you guys should probably have this discussion on the JIRA rather than the dev list; I think the replies somehow ended up on the dev list.
>
> Matei

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

Posted by Matei Zaharia <ma...@gmail.com>.

BTW in Spark the consensus so far was that we’d use the dev@ list for high-level discussions (e.g. change in the development process, major features, proposals of new components, release votes) and keep lower-level issue tracking in JIRA. This is just how the project operated before so it was the easiest way for people to continue.

Matei

On May 18, 2014, at 4:01 PM, Matei Zaharia <ma...@gmail.com> wrote:

> Ah, maybe it’s just different in other Apache projects. All the ones I’ve participated in have had their design discussions on JIRA. For example take a look at https://issues.apache.org/jira/browse/HDFS-4949. (Most design discussions in Hadoop are also on JIRA).
> 
> Hosting it this way is more convenient because most users come in looking at the issue tracker, not at mailing list archives (if only because the issue tracker is much more searchable for issues).
> 
> Matei

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

Posted by Matei Zaharia <ma...@gmail.com>.

Ah, maybe it’s just different in other Apache projects. All the ones I’ve participated in have had their design discussions on JIRA. For example take a look at https://issues.apache.org/jira/browse/HDFS-4949. (Most design discussions in Hadoop are also on JIRA).

Hosting it this way is more convenient because most users come in looking at the issue tracker, not at mailing list archives (if only because the issue tracker is much more searchable for issues).

Matei

On May 18, 2014, at 2:19 PM, Jacek Laskowski <ja...@japila.pl> wrote:

> On Sun, May 18, 2014 at 8:28 PM, Andrew Ash <an...@andrewash.com> wrote:
>> The nice thing about putting discussion on the Jira is that everything
>> about the bug is in one place.  So people looking to understand the
>> discussion a few years from now only have to look on the jira ticket rather
>> than also search the mailing list archives and hope commenters all put the
>> string "SPARK-1855" into the messages.
> 
> My understanding is that JIRA is not for discussions. In a sense it
> could be used for a few opinions, but I have never seen it elsewhere and
> am curious if it's an approach for the project (that I might accept
> ultimately, but that would require some adoption time).
> 
> What's wrong with linking a discussion thread to a JIRA issue?
> 
> Jacek
> 
> -- 
> Jacek Laskowski | http://blog.japila.pl
> "Never discourage anyone who continually makes progress, no matter how
> slow." Plato

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

Posted by Jacek Laskowski <ja...@japila.pl>.

On Sun, May 18, 2014 at 8:28 PM, Andrew Ash <an...@andrewash.com> wrote:
> The nice thing about putting discussion on the Jira is that everything
> about the bug is in one place.  So people looking to understand the
> discussion a few years from now only have to look on the jira ticket rather
> than also search the mailing list archives and hope commenters all put the
> string "SPARK-1855" into the messages.

My understanding is that JIRA is not for discussions. In a sense it
could be used for a few opinions, but I have never seen it elsewhere and
am curious if it's an approach for the project (that I might accept
ultimately, but that would require some adoption time).

What's wrong with linking a discussion thread to a JIRA issue?

Jacek

-- 
Jacek Laskowski | http://blog.japila.pl
"Never discourage anyone who continually makes progress, no matter how
slow." Plato

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

Posted by Matei Zaharia <ma...@gmail.com>.

JIRA comments are mirrored to the issues@spark.apache.org list, so people who want to get them by email can do so. In theory one should also be able to reply to one of those emails and have the message show up in JIRA, but I don’t think ours is configured that way. I’m not sure why it wouldn’t be the “ASF way” when the JIRA instance is hosted by the ASF and mirrored on ASF lists.

Matei

On May 18, 2014, at 11:28 AM, Andrew Ash <an...@andrewash.com> wrote:

> The nice thing about putting discussion on the Jira is that everything
> about the bug is in one place.  So people looking to understand the
> discussion a few years from now only have to look on the jira ticket rather
> than also search the mailing list archives and hope commenters all put the
> string "SPARK-1855" into the messages.

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

Posted by Andrew Ash <an...@andrewash.com>.

The nice thing about putting discussion on the Jira is that everything
about the bug is in one place.  So people looking to understand the
discussion a few years from now only have to look on the jira ticket rather
than also search the mailing list archives and hope commenters all put the
string "SPARK-1855" into the messages.


On Sun, May 18, 2014 at 10:34 AM, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi,
>
> I'm curious if it's a common approach to have discussions in JIRA not here.
> I don't think it's the ASF way.
>
> Regards,
> Jacek Laskowski
> http://blog.japila.pl

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

Posted by Jacek Laskowski <ja...@japila.pl>.

Hi,

I'm curious whether it's a common approach to have discussions in JIRA rather than here.
I don't think it's the ASF way.

Regards,
Jacek Laskowski
http://blog.japila.pl
On 17 May 2014 at 23:55, "Matei Zaharia" <ma...@gmail.com> wrote:

> We do actually have replicated StorageLevels in Spark. You can use
> MEMORY_AND_DISK_2 or construct your own StorageLevel with your own custom
> replication factor.
>
> BTW you guys should probably have this discussion on the JIRA rather than
> the dev list; I think the replies somehow ended up on the dev list.
>
> Matei

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

Posted by Matei Zaharia <ma...@gmail.com>.

We do actually have replicated StorageLevels in Spark. You can use MEMORY_AND_DISK_2 or construct your own StorageLevel with your own custom replication factor.

BTW you guys should probably have this discussion on the JIRA rather than the dev list; I think the replies somehow ended up on the dev list.

Matei

On May 17, 2014, at 1:36 AM, Mridul Muralidharan <mr...@gmail.com> wrote:

> We don't have 3x replication in spark :-)
> And if we use replicated storagelevel, while decreasing odds of failure, it
> does not eliminate it (since we are not doing a great job with replication
> anyway from fault tolerance point of view).
> Also it does take a nontrivial performance hit with replicated levels.
> 
> Regards,
> Mridul

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

Posted by Mridul Muralidharan <mr...@gmail.com>.

We don't have 3x replication in Spark :-)
And if we use a replicated StorageLevel, it decreases the odds of failure but
does not eliminate them (since we are not doing a great job with replication
anyway from a fault tolerance point of view).
Also, replicated levels take a nontrivial performance hit.
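
To put rough numbers on the point that replication lowers the odds of loss without removing them, here is a back-of-the-envelope Python sketch. It rests on assumptions that are not from this thread: node failures are independent, each replica of a block sits on a different node, and blocks are lost independently of each other. The function names and the sample failure probability are hypothetical.

```python
def block_loss_probability(node_failure_prob, replicas):
    """Probability that every replica of one block is lost.

    Assumes independent node failures and replicas on distinct nodes.
    """
    if not 0.0 <= node_failure_prob <= 1.0:
        raise ValueError("node_failure_prob must be within [0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return node_failure_prob ** replicas


def any_block_lost_probability(node_failure_prob, replicas, num_blocks):
    """Probability that at least one of num_blocks blocks loses all replicas.

    Treats blocks as independent, which is a crude simplification.
    """
    per_block = block_loss_probability(node_failure_prob, replicas)
    return 1.0 - (1.0 - per_block) ** num_blocks


# Illustrative inputs only: a 1% chance that a given node fails during the job.
for replicas in (1, 2, 3):
    p = any_block_lost_probability(0.01, replicas, num_blocks=1000)
    print(f"replicas={replicas}: P(at least one block lost) = {p:.6f}")
```

Under these assumptions the risk falls quickly with each extra replica but never reaches zero, and once the lineage has been truncated a single unrecoverable block cannot be recomputed.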

Regards,
Mridul
 On 17-May-2014 8:16 am, "Xiangrui Meng" <me...@gmail.com> wrote:

> With 3x replication, we should be able to achieve fault tolerance.
> This checkPointed RDD can be cleared if we have another in-memory
> checkPointed RDD down the line. It can avoid hitting disk if we have
> enough memory to use. We need to investigate more to find a good
> solution. -Xiangrui

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

Posted by Xiangrui Meng <me...@gmail.com>.

With 3x replication, we should be able to achieve fault tolerance.
This checkpointed RDD can be cleared if we have another in-memory
checkpointed RDD down the line. It can avoid hitting disk if we have
enough memory to use. We need to investigate more to find a good
solution. -Xiangrui

On Fri, May 16, 2014 at 4:00 PM, Mridul Muralidharan <mr...@gmail.com> wrote:
> Effectively this is persist without fault tolerance.
> Failure of any node means complete lack of fault tolerance.
> I would be very skeptical of truncating lineage if it is not reliable.

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

Posted by Mridul Muralidharan <mr...@gmail.com>.

Effectively this is persist without fault tolerance.
Failure of any node means complete lack of fault tolerance.
I would be very skeptical of truncating lineage if it is not reliable.
 On 17-May-2014 3:49 am, "Xiangrui Meng (JIRA)" <ji...@apache.org> wrote:

> Xiangrui Meng created SPARK-1855:
> ------------------------------------
>
>              Summary: Provide memory-and-local-disk RDD checkpointing
>                  Key: SPARK-1855
>                  URL: https://issues.apache.org/jira/browse/SPARK-1855
>              Project: Spark
>           Issue Type: New Feature
>           Components: MLlib, Spark Core
>     Affects Versions: 1.0.0
>             Reporter: Xiangrui Meng
>
>
> Checkpointing is used to cut long lineage while maintaining fault
> tolerance. The current implementation is HDFS-based. Using the BlockRDD we
> can create in-memory-and-local-disk (with replication) checkpoints that are
> not as reliable as HDFS-based solution but faster.
>
> It can help applications that require many iterations.