Posted to user@spark.apache.org by Vipul Pandey <vi...@gmail.com> on 2014/03/31 07:11:48 UTC
batching the output
Hi,
I need to batch the values in my final RDD before writing out to HDFS. The idea is to batch multiple "rows" into a protobuf and write those batches out - mostly to save some space, as a lot of the metadata is the same.
e.g. given 1,2,3,4,5,6, batch them as (1,2), (3,4), (5,6) and save three records instead of six.
What I'm doing is using mapPartitions with the iterator's grouped function, giving it a groupSize.
val protoRDD: RDD[MyProto] = rdd.mapPartitions[MyProto](_.grouped(groupSize).map { seq =>
  val profiles = MyProto(...)
  seq.foreach { x =>
    val row = new Row(x._1.toString)
    row.setFloatValue(x._2)
    profiles.addRow(row)
  }
  profiles
})
I haven't been able to test it out because of a separate issue (protobuf version mismatch - in a different thread) - but I'm hoping it will work.
Is there a better or more straightforward way of doing this?
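[For reference, a minimal Spark-free sketch of the same grouped-iterator batching. `Row` and `Batch` here are hypothetical stand-ins for the protobuf classes (MyProto etc.); the point is only the shape of the `grouped(groupSize).map` pipeline that would run inside mapPartitions:]

```scala
// Hypothetical stand-ins for the protobuf types in the post.
case class Row(key: String, value: Float)
case class Batch(rows: Seq[Row])

// Batch (key, value) pairs into fixed-size groups, one Batch per group -
// the same transformation the mapPartitions closure applies per partition.
def batchPairs(it: Iterator[(Int, Float)], groupSize: Int): Iterator[Batch] =
  it.grouped(groupSize).map { seq =>
    Batch(seq.map { case (k, v) => Row(k.toString, v) })
  }

// Usage: batchPairs(Iterator((1, 1.0f), (2, 2.0f), (3, 3.0f)), 2)
// yields a batch of two rows followed by a batch of one (grouped keeps
// the trailing partial group).
```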
Thanks
Vipul
Re: batching the output
Posted by Patrick Wendell <pw...@gmail.com>.
Ya this is a good way to do it.
On Sun, Mar 30, 2014 at 10:11 PM, Vipul Pandey <vi...@gmail.com> wrote: