Posted to user@spark.apache.org by Harut Martirosyan <ha...@gmail.com> on 2015/03/29 10:07:01 UTC

RDD Persistence synchronization

Hi.

rdd.persist()
rdd.count()

rdd.transform()...

is there a chance transform() runs before persist() is complete?

-- 
RGRDZ Harut

Re: RDD Persistence synchronization

Posted by Sean Owen <so...@cloudera.com>.
I don't think you can guarantee that there is no recomputation. Even
if you persist(), you might lose the block and have to recompute it.

You can persist your UUIDs to storage like HDFS; they won't change
then, of course. I suppose you still face a much narrower problem:
the act of computing the UUIDs in order to immediately save them may
fail and restart. Downstream processes would only ever observe one
set of UUIDs, though, even if inside the process some UUIDs were
created and lost. I suppose that might only matter if you need
sequential IDs or something, but then you're getting into territory
where you need a different model of computation from Spark.
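
A minimal sketch of that approach, assuming hypothetical HDFS paths and
app name, and using saveAsObjectFile / objectFile for the round trip
through stable storage:

import java.util.UUID
import org.apache.spark.{SparkConf, SparkContext}

object UuidToHdfs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("uuid-to-hdfs"))

    // Tag each record with a UUID and write the result out exactly once.
    val records = sc.textFile("hdfs:///data/input")            // hypothetical path
    val withIds = records.map(r => (UUID.randomUUID().toString, r))
    withIds.saveAsObjectFile("hdfs:///data/records-with-ids")  // hypothetical path

    // Downstream jobs read the saved copy, so they observe a single fixed
    // set of UUIDs even if the job above had to retry some tasks.
    val stable = sc.objectFile[(String, String)]("hdfs:///data/records-with-ids")
    println(stable.count())
    sc.stop()
  }
}

If the save job itself fails and reruns, new UUIDs get generated, but
only the set that finally lands in HDFS is ever visible downstream,
which is the point made above.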

On Sun, Mar 29, 2015 at 10:02 AM, Harut Martirosyan
<ha...@gmail.com> wrote:
> Thanks to you again, Sean.
>
> The thing is, we persist and count that RDD in the hope that all later
> actions on it won't trigger recomputation. It's not really about
> performance here; the recomputation involves UUID generation, and the
> UUIDs should stay the same for all further actions.
>
> I understand that the RDD concept is based on lineage, which somewhat
> contradicts our goal, but is there any way to guarantee that it's
> persisted, or to make it fail when persisting fails?
>
> On 29 March 2015 at 12:51, Sean Owen <so...@cloudera.com> wrote:
>>
>> persist() completes immediately since it only marks the RDD for
>> persistence. count() triggers computation of rdd, and as rdd is
>> computed it will be persisted. The following transform should
>> therefore only start after count(), and hence after the persistence
>> completes. I think there might be corner cases where you still see
>> some of rdd recomputed, for example if a persisted block is lost or
>> otherwise unavailable later.
>>
>> On Sun, Mar 29, 2015 at 9:07 AM, Harut Martirosyan
>> <ha...@gmail.com> wrote:
>> > Hi.
>> >
>> > rdd.persist()
>> > rdd.count()
>> >
>> > rdd.transform()...
>> >
>> > is there a chance transform() runs before persist() is complete?
>> >
>> > --
>> > RGRDZ Harut
>
>
>
>
> --
> RGRDZ Harut



Re: RDD Persistence synchronization

Posted by Harut Martirosyan <ha...@gmail.com>.
Thanks to you again, Sean.

The thing is, we persist and count that RDD in the hope that all later
actions on it won't trigger recomputation. It's not really about
performance here; the recomputation involves UUID generation, and the
UUIDs should stay the same for all further actions.

I understand that the RDD concept is based on lineage, which somewhat
contradicts our goal, but is there any way to guarantee that it's
persisted, or to make it fail when persisting fails?
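
One mechanism not named in this thread is reliable checkpointing:
RDD.checkpoint() writes the RDD to a checkpoint directory on stable
storage and truncates its lineage. A sketch with hypothetical paths;
note that persisting before checkpointing matters here, because the
checkpoint job would otherwise recompute the lineage and mint fresh
UUIDs:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // hypothetical directory

    val rdd = sc.textFile("hdfs:///data/input")      // hypothetical path
      .map(r => (java.util.UUID.randomUUID().toString, r))

    // Persist first so the checkpoint job reads cached blocks instead of
    // recomputing the lineage (recomputing would generate new UUIDs).
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.checkpoint()
    rdd.count()  // first action materializes the cache, then writes the checkpoint

    // From here on the lineage is truncated at the checkpoint files, so
    // later actions read them back rather than regenerating UUIDs.
    sc.stop()
  }
}

This doesn't make an in-memory persist() fail loudly when blocks are
dropped, but it does move the authoritative copy to reliable storage,
which sidesteps the problem.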

On 29 March 2015 at 12:51, Sean Owen <so...@cloudera.com> wrote:

> persist() completes immediately since it only marks the RDD for
> persistence. count() triggers computation of rdd, and as rdd is
> computed it will be persisted. The following transform should
> therefore only start after count(), and hence after the persistence
> completes. I think there might be corner cases where you still see
> some of rdd recomputed, for example if a persisted block is lost or
> otherwise unavailable later.
>
> On Sun, Mar 29, 2015 at 9:07 AM, Harut Martirosyan
> <ha...@gmail.com> wrote:
> > Hi.
> >
> > rdd.persist()
> > rdd.count()
> >
> > rdd.transform()...
> >
> > is there a chance transform() runs before persist() is complete?
> >
> > --
> > RGRDZ Harut
>



-- 
RGRDZ Harut

Re: RDD Persistence synchronization

Posted by Sean Owen <so...@cloudera.com>.
persist() completes immediately since it only marks the RDD for
persistence. count() triggers computation of rdd, and as rdd is
computed it will be persisted. The following transform should
therefore only start after count(), and hence after the persistence
completes. I think there might be corner cases where you still see
some of rdd recomputed, for example if a persisted block is lost or
otherwise unavailable later.
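
To make the ordering concrete, a sketch of the pattern under discussion
(names and sizes are illustrative, not from the thread):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistThenCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("persist-then-count"))

    val rdd = sc.parallelize(1 to 1000000)
      .map(i => (i, java.util.UUID.randomUUID().toString))

    rdd.persist(StorageLevel.MEMORY_AND_DISK)  // only marks the RDD; nothing runs yet
    rdd.count()                                // computes and caches every partition

    // Runs strictly after count() returns, so it normally reads cached blocks.
    // The corner case: if an executor died and a block was lost, that
    // partition is silently recomputed from lineage, yielding new UUIDs.
    rdd.take(5).foreach(println)
    sc.stop()
  }
}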

On Sun, Mar 29, 2015 at 9:07 AM, Harut Martirosyan
<ha...@gmail.com> wrote:
> Hi.
>
> rdd.persist()
> rdd.count()
>
> rdd.transform()...
>
> is there a chance transform() runs before persist() is complete?
>
> --
> RGRDZ Harut
