Posted to user@spark.apache.org by Shay Seng <sh...@urbanengines.com> on 2014/09/24 21:26:44 UTC
persist before or after checkpoint?
Hey,
I actually have 2 questions.
(1) I want to generate a unique ID for each RDD element, and I want to
assign the IDs in parallel, so I do:
rdd.mapPartitionsWithIndex((index, s) =>
  // Offset each within-partition index by the partition's base range.
  s.zipWithIndex.map {
    case (t, i) => (index * GLOBAL.MAX_PARTITION_SIZE + i, t)
  }
)
This works OK, but we noticed that unless we checkpoint, recomputing a
partition can assign its elements different IDs.
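The scheme above can be illustrated without Spark. A minimal sketch on plain collections (names are mine: MaxPartitionSize stands in for GLOBAL.MAX_PARTITION_SIZE, an assumed upper bound on partition size):

```scala
// Spark-free sketch of the scheme above: the i-th element of partition
// `index` gets id index * MaxPartitionSize + i. MaxPartitionSize is a
// stand-in for GLOBAL.MAX_PARTITION_SIZE (an assumption here); ids stay
// unique only while every partition holds < MaxPartitionSize elements.
object PartitionIdSketch {
  val MaxPartitionSize = 1000L

  def assignIds[T](partitions: Seq[Seq[T]]): Seq[(Long, T)] =
    partitions.zipWithIndex.flatMap { case (part, index) =>
      part.zipWithIndex.map { case (t, i) =>
        (index * MaxPartitionSize + i, t)
      }
    }
}
```

Note the formula itself is deterministic; the instability we see comes from a recomputed partition holding its elements in a different order.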
Question 1: is there a better way to create unique IDs in a distributed way?
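One candidate answer (an assumption on my part, not something we have tried): later Spark releases, 1.0 onward if I'm not mistaken, add rdd.zipWithUniqueId(), which interleaves IDs across partitions instead of reserving a fixed range per partition, so no size bound is needed. Its scheme, simulated on plain collections:

```scala
// Sketch of the id scheme used by Spark's rdd.zipWithUniqueId()
// (assumed available only in Spark 1.0+, not our 0.9.2): the k-th item
// of partition n gets id k * numPartitions + n, so ids are unique with
// no per-partition size bound.
object UniqueIdSketch {
  def zipWithUniqueId[T](partitions: Seq[Seq[T]]): Seq[(T, Long)] = {
    val numPartitions = partitions.size
    partitions.zipWithIndex.flatMap { case (part, n) =>
      part.zipWithIndex.map { case (t, k) =>
        (t, k.toLong * numPartitions + n)
      }
    }
  }
}
```

This still depends on each partition's element order, so the recomputation concern in (2) would seem to apply either way.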
(2) To work around the stability issue in (1) we did:
rdd.persist.checkpoint
The Spark logs suggested that checkpointed RDDs should be persisted.
Should the persist come before or after the checkpoint call?
OK, I lied, I have 3 questions.
(3) We are checkpointing to HDFS. We've noticed that sometimes the
checkpointing works and I see /RDD-1 etc. written in HDFS, but other times
we only see the checkpoint dir created and no data ... I suspect (2) but
I'm not certain what is really happening.
Any pointers would be appreciated.
I'm using AWS r3.4xlarge machines with Spark 0.9.2
tks
shay