You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Aureliano Buendia <bu...@gmail.com> on 2014/01/05 20:52:30 UTC

Why does saveAfObjectFile() serialize Array[T] instead of T?

Hi,

Given an RDD[T] instance, saveAfObjectFile() passes an instance of Array[T]
to serialize(), and not and instance of T:

  def saveAsObjectFile(path: String) {
    this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
      .map(*x* => (NullWritable.get(), new BytesWritable(
*Utils.serialize(x)*)))
      .saveAsSequenceFile(path)
  }

Is this array mapping for efficiency reasons, or are there other reasons
for this?

I'm trying to use saveAfObjectFile() to serialize protobuf messages.
Protobuf messages already come with a method that turns them into
Array[Byte] (see
here<https://github.com/osmandapp/Osmand/blob/master/OsmAnd-java/src/com/google/protobuf/AbstractMessageLite.java#L60>),
that is, toByteArray() can be clled on an instance of T. Considering this,
how can a protobuf message instance be serialized in a custom version of
saveAsObjectFile()?

Is it a good idea to drop array mapping?:

def saveAsObjectFile(path: String) {
    this.map(x => (NullWritable.get(), new BytesWritable(*x.toByteArray()*
)))
      .saveAsSequenceFile(path)
  }