Posted to user@spark.apache.org by Vikash Kumar <vi...@gmail.com> on 2016/09/07 17:58:06 UTC

Split RDD by key and save to different files

I need to split an RDD[(Key, Iterable[Value])] and save each key's records to a
different file.

e.g. I have records like: customerId, name, age, sex

111,abc,34,M
122,xyz,32,F
111,def,31,F
122,trp,30,F
133,jkl,35,M

I need to write 3 different files based on customerId:
file1:
111,abc,34,M
111,def,31,F

file2:
122,xyz,32,F
122,trp,30,F

file3:
133,jkl,35,M

How can I achieve this in Spark Scala code?

Re: Split RDD by key and save to different files

Posted by Dhaval Patel <ma...@gmail.com>.
To do that, first key the RDD by the field you want to split on, and then call
saveAsHadoopFile like this:

saveAsHadoopFile(location, classOf[KeyClass], classOf[ValueClass],
classOf[PartitionOutputFormat])

where PartitionOutputFormat extends MultipleTextOutputFormat.

Sample for that is below:

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class PartitionOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // Return the key you want written alongside each record, or
  // NullWritable.get() to write only the value to the output file.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  // Derive the output file name from the key and value. The third
  // parameter, `name`, is the default part-file name (e.g. "part-00000");
  // prefixing it with the key puts each key's records in its own directory.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name
}
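
Putting it together, a minimal driver could look like the sketch below. The
input path ("customers.csv"), the output directory ("out"), and the app name
are placeholders for illustration, not values from the thread:

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Routes each record into a per-key directory under the output path.
class PartitionOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name
}

object SplitByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SplitByKey"))
    sc.textFile("customers.csv")                // lines like "111,abc,34,M"
      .map(line => (line.split(",")(0), line))  // key each line by customerId
      .saveAsHadoopFile("out", classOf[String], classOf[String],
        classOf[PartitionOutputFormat])
    sc.stop()
  }
}

With the sample data this should produce out/111/, out/122/ and out/133/,
each holding the full original lines for that customerId. Keeping the default
part-file name as a suffix avoids file-name collisions when several tasks
write records for the same key.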
