Posted to user@spark.apache.org by Arpan Ghosh <ar...@automatic.com> on 2014/11/25 20:30:29 UTC

using MultipleOutputFormat to ensure one output file per key

Hi,

How can I implement a custom MultipleOutputFormat and specify it as the
output of my Spark job so that I can ensure that there is a unique output
file per key (instead of a unique output file per reducer)?

Thanks

Arpan

Re: using MultipleOutputFormat to ensure one output file per key

Posted by Rafal Kwasny <ma...@entropy.be>.
Hi,

Arpan Ghosh wrote:
> Hi,
>
> How can I implement a custom MultipleOutputFormat and specify it as
> the output of my Spark job so that I can ensure that there is a unique
> output file per key (instead of a unique output file per reducer)?
>

I use something like this:

import java.io.IOException

import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.mapred.{FileAlreadyExistsException, InvalidJobConfException, JobConf}
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class KeyBasedOutput[T >: Null, V <: AnyRef]
    extends MultipleTextOutputFormat[T, V] {

  // Route each record to a directory named after its key; "leaf" is the
  // usual part-file name (e.g. part-00000), so records land under
  // <key>/<leaf>.
  override protected def generateFileNameForKeyValue(key: T, value: V, leaf: String) =
    key.toString + "/" + leaf

  // Drop the key from the written records; only the value is emitted.
  override protected def generateActualKey(key: T, value: V) = null

  // Disable the "output directory already exists" check.
  // This could be dangerous and overwrite files.
  @throws(classOf[FileAlreadyExistsException])
  @throws(classOf[InvalidJobConfException])
  @throws(classOf[IOException])
  override def checkOutputSpecs(ignored: FileSystem, job: JobConf) = {}
}

and then just set up a JobConf:

      val jobConf = new JobConf(self.context.hadoopConfiguration)
      jobConf.setOutputKeyClass(classOf[String])
      jobConf.setOutputValueClass(classOf[String])
      jobConf.setOutputFormat(classOf[KeyBasedOutput[String, String]])
      // An output path must also be set on the JobConf before saving,
      // e.g. FileOutputFormat.setOutputPath(jobConf, new Path(...))
      rdd.saveAsHadoopDataset(jobConf)
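
For completeness, a minimal end-to-end sketch of how this could be wired
together; the SparkContext setup, the partition count, and the
/tmp/key-based-output path are illustrative assumptions, not from the
original post. Note that partitionBy is needed if you want exactly one
file per key rather than one file per key per partition:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(new SparkConf().setAppName("KeyBasedOutputExample"))

val rdd = sc
  .parallelize(Seq(("a", "1"), ("b", "2"), ("a", "3")))
  // Co-locate all records for a key in one partition, so each key's
  // directory holds exactly one part file; without this, a key that
  // appears in several partitions gets several files.
  .partitionBy(new HashPartitioner(4))

val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.setOutputKeyClass(classOf[String])
jobConf.setOutputValueClass(classOf[String])
jobConf.setOutputFormat(classOf[KeyBasedOutput[String, String]])
FileOutputFormat.setOutputPath(jobConf, new Path("/tmp/key-based-output"))

// Writes /tmp/key-based-output/a/part-XXXXX and /tmp/key-based-output/b/part-XXXXX
rdd.saveAsHadoopDataset(jobConf)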


/Rafal

> Thanks
>
> Arpan


-- 
Regards
Rafał Kwasny
mailto:/jabberid: mag@entropy.be

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org