Posted to user@spark.apache.org by Shay Seng <sh...@urbanengines.com> on 2014/09/24 21:26:44 UTC

persist before or after checkpoint?

Hey,

I actually have 2 questions.

(1) I want to generate a unique ID for each RDD element, and I want to
assign them in parallel, so I do:

rdd.mapPartitionsWithIndex((index, s) => {
      s.zipWithIndex.map {
        case (t, i) =>
          // id = partition offset + position within the partition;
          // toLong avoids Int overflow for large offsets
          (index.toLong * GLOBAL.MAX_PARTITION_SIZE + i, t)
      }
  })

This works OK, but we noticed that unless we checkpoint, the IDs get
scrambled whenever a partition is recomputed.

Question 1: is there a better way to create unique IDs in a distributed way?
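(Skimming the docs for newer releases, Spark 1.0 seems to add RDD.zipWithUniqueId, which does this partition-offset arithmetic internally -- element j of partition k gets id j*n + k, where n is the number of partitions -- so no MAX_PARTITION_SIZE assumption is needed. A sketch, assuming an upgrade past 0.9.2:

```scala
// zipWithUniqueId: no extra job; ids are unique but not consecutive.
val withIds = rdd.zipWithUniqueId().map { case (t, id) => (id, t) }

// zipWithIndex: consecutive ids 0..count-1, but it triggers an extra
// Spark job first to compute per-partition counts.
val consecutive = rdd.zipWithIndex().map { case (t, id) => (id, t) }
```

Presumably the same caveat applies: the ids are only stable across recomputation if the upstream partitions are deterministic, so checkpointing would still matter for us.)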

(2) To solve the stability issue in (1), we did:

rdd.persist.checkpoint

The Spark logs suggest that checkpointed RDDs should be persisted. Should
the persist happen before or after the checkpoint?
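To be concrete, this is the ordering I assume is intended -- persist before checkpoint, both before the first action -- so that the checkpoint job can write the cached partitions instead of recomputing the whole lineage (the path and makeRdd are placeholders for our real pipeline):

```scala
import org.apache.spark.storage.StorageLevel

sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // placeholder path

val rdd = makeRdd()                        // placeholder upstream RDD
rdd.persist(StorageLevel.MEMORY_AND_DISK)  // mark for caching first
rdd.checkpoint()                           // mark for checkpointing; nothing written yet
rdd.count()                                // first action: computes, caches, checkpoints
```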

OK, I lied, I actually have 3 questions.
(3) We are checkpointing to HDFS. We've noticed that sometimes the
checkpointing works and we see /RDD-1 etc. written to HDFS, but other times
only the checkpoint directory is created, with no data ... I suspect (2) is
related, but I'm not certain what is really happening.
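If I'm reading the checkpointing behaviour right, the files are only written at the end of the first job that runs the RDD after checkpoint() has been called; if no action ever runs afterwards, only the empty directory appears. A sketch of what I mean:

```scala
rdd.persist().checkpoint()
// Nothing is written to the checkpoint dir at this point...
rdd.count()
// ...after this job finishes, an /RDD-<id> directory with the data
// should appear under the checkpoint dir in HDFS.
```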

Any pointers would be appreciated.
I'm using AWS r3.4xlarge machines with Spark 0.9.2

tks
shay