Posted to user@spark.apache.org by Arpan Ghosh <ar...@automatic.com> on 2014/11/25 20:30:29 UTC
using MultipleOutputFormat to ensure one output file per key
Hi,
How can I implement a custom MultipleOutputFormat and specify it as the
output of my Spark job so that I can ensure that there is a unique output
file per key (instead of a unique output file per reducer)?
Thanks
Arpan
Re: using MultipleOutputFormat to ensure one output file per key
Posted by Rafal Kwasny <ma...@entropy.be>.
Hi,
Arpan Ghosh wrote:
> Hi,
>
> How can I implement a custom MultipleOutputFormat and specify it as
> the output of my Spark job so that I can ensure that there is a unique
> output file per key (instead of a unique output file per reducer)?
>
I use something like this:
class KeyBasedOutput[T >: Null, V <: AnyRef]
    extends MultipleTextOutputFormat[T, V] {

  // Route each record to a directory named after its key,
  // keeping the original part-file name ("leaf") inside it.
  override protected def generateFileNameForKeyValue(key: T, value: V,
      leaf: String) = {
    key.toString() + "/" + leaf
  }

  // Return null so the key is not written into the output file;
  // only the value appears in each line.
  override protected def generateActualKey(key: T, value: V) = {
    null
  }

  // this could be dangerous and overwrite files:
  // the no-op override disables the usual "output dir already exists" check
  @throws(classOf[FileAlreadyExistsException])
  @throws(classOf[InvalidJobConfException])
  @throws(classOf[IOException])
  override def checkOutputSpecs(ignored: FileSystem, job: JobConf) = {
  }
}
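For intuition, the path logic in generateFileNameForKeyValue can be sketched in plain Scala with no Hadoop dependency (pathFor is a hypothetical stand-in for that method, not part of the API):

```scala
// Sketch of the naming scheme: each record's output path is a
// directory named after its key, plus the original part-file
// name ("leaf") that Hadoop would have used anyway.
def pathFor(key: Any, leaf: String): String =
  key.toString + "/" + leaf

println(pathFor("2014-11-25", "part-00000")) // 2014-11-25/part-00000
```

So two records with different keys written by the same reducer still land in different directories, which is what gives one output location per key.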
and then just set a jobconf:
val jobConf = new JobConf(self.context.hadoopConfiguration)
jobConf.setOutputKeyClass(classOf[String])
jobConf.setOutputValueClass(classOf[String])
jobConf.setOutputFormat(classOf[KeyBasedOutput[String, String]])
rdd.saveAsHadoopDataset(jobConf)
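Assuming the RDD holds (key, value) pairs, the net effect is that all values sharing a key end up in the same per-key file. A plain-Scala sketch of that grouping (the record contents here are made up for illustration):

```scala
// Hypothetical (key, line) pairs as they might exist in the RDD.
val records = Seq(
  ("2014-11-24", "event A"),
  ("2014-11-25", "event B"),
  ("2014-11-24", "event C")
)

// KeyBasedOutput effectively partitions the output by key,
// analogous to grouping the values under each key:
val byKey: Map[String, Seq[String]] =
  records.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }

println(byKey("2014-11-24")) // List(event A, event C)
```

Note that records for one key may still be spread across multiple part files (one per partition that contains the key), so repartitioning by key first is worth considering if a single file per key is strictly required.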
/Rafal
> Thanks
>
> Arpan
--
Regards
Rafał Kwasny
mailto:/jabberid: mag@entropy.be