You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Shay Seng <sh...@1618labs.com> on 2013/10/09 03:16:39 UTC

How would I start writing a RDD[ProtoBuf] and/or sc.newAPIHadoopFile??

Hi,

I would like to store some data as a seq of protobuf objects. I would of
course need to beable to read that into an RDD and write the RDD back out
in some binary format.

First of all, is this supported natively (or through some download)?

If not, are there examples on how I might write my own RDDs? I was hoping I
would be able to accomplish this using some invokation of
sparkContext.newAPIHadoopFile , but the comments there are just too terse.
Are there more verbose examples out there? Either on how to write new RDD
inputFormats, or how to make use of newAPIHadoopFile

tks
Shay

Re: How would I start writing a RDD[ProtoBuf] and/or sc.newAPIHadoopFile??

Posted by Gary Malouf <ma...@gmail.com>.
We built a custom assembly jar with helper functions for reading our
protobuf data.

First of all, serialized protobuf data should be stored as sequence files
in HDFS.  You would then do the following to read the serialized data in:

val myDataRDD = context.sequenceFile[LongWritable, BytesWritable](HdfsPath)


All that is left to do is map the RDD to a collection of the object type
you are deserializing to:

myDataRDD.map {

case (_, b: BytesWritable) => MyMessage.parse(b)

}


Hope this helps.



On Tue, Oct 8, 2013 at 9:16 PM, Shay Seng <sh...@1618labs.com> wrote:

>
> Hi,
>
> I would like to store some data as a seq of protobuf objects. I would of
> course need to beable to read that into an RDD and write the RDD back out
> in some binary format.
>
> First of all, is this supported natively (or through some download)?
>
> If not, are there examples on how I might write my own RDDs? I was hoping
> I would be able to accomplish this using some invokation of
> sparkContext.newAPIHadoopFile , but the comments there are just too terse.
> Are there more verbose examples out there? Either on how to write new RDD
> inputFormats, or how to make use of newAPIHadoopFile
>
> tks
> Shay
>