Posted to issues@spark.apache.org by "Corey J. Nolet (JIRA)" <ji...@apache.org> on 2014/11/10 17:42:33 UTC

[jira] [Created] (SPARK-4320) JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object

Corey J. Nolet created SPARK-4320:
-------------------------------------

             Summary: JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object 
                 Key: SPARK-4320
                 URL: https://issues.apache.org/jira/browse/SPARK-4320
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output, Spark Core
            Reporter: Corey J. Nolet
             Fix For: 1.1.1, 1.2.0


I am outputting data to Accumulo using a custom OutputFormat. I have tried using saveAsNewAPIHadoopFile() and that works, though passing an empty path is a bit weird. Since it isn't really a file I'm storing, but rather a dataset, I'd be inclined to use the saveAsHadoopDataset() method, though I'm not at all interested in using the legacy mapred API.
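For reference, the workaround described above looks roughly like the following sketch. It assumes a JavaPairRDD of (table name, mutation) pairs and Accumulo's new-API AccumuloOutputFormat; the empty path argument is the awkward part:

```java
import org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat;
import org.apache.accumulo.core.data.Mutation;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;

public class AccumuloSave {
  // Writes (table name, mutation) pairs to Accumulo through the
  // new-API OutputFormat. The caller's Configuration must already
  // carry the Accumulo connection settings.
  public static void save(JavaPairRDD<Text, Mutation> rdd, Configuration conf) {
    rdd.saveAsNewAPIHadoopFile(
        "",                         // unused "path" — meaningless for a non-file dataset
        Text.class,
        Mutation.class,
        AccumuloOutputFormat.class,
        conf);
  }
}
```

Running this requires a live Spark context and an Accumulo instance, so it is illustrative only.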

Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think there should be two ways of calling into this method. Instead of needing to set up the Job object explicitly, I'm in the camp of having the following method signature:

saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofClass : Class[_ <: OutputFormat[K, V]], conf : Configuration)

This way, if I'm writing Spark jobs that are going from Hadoop back into Hadoop, I can construct my Configuration once.

Perhaps an overloaded method signature could be:

saveAsNewHadoopDataset(job : Job)
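Sketched in Java terms, the two proposed overloads could relate as follows. These methods follow the ticket's proposal and are not existing Spark API; the Configuration-based variant builds the Job internally and delegates to the Job-based one:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;

// Hypothetical additions to JavaPairRDD<K, V> — names taken from the
// proposal above, not from the current Spark codebase.
public void saveAsNewHadoopDataset(Class<K> keyClass,
                                   Class<V> valueClass,
                                   Class<? extends OutputFormat> ofClass,
                                   Configuration conf) throws IOException {
  Job job = Job.getInstance(conf);   // wrap the caller's Configuration
  job.setOutputKeyClass(keyClass);
  job.setOutputValueClass(valueClass);
  job.setOutputFormatClass(ofClass);
  saveAsNewHadoopDataset(job);       // delegate to the Job-based overload
}

public void saveAsNewHadoopDataset(Job job) {
  // Would hand job.getConfiguration() to Spark's new-API Hadoop write path,
  // with no file path involved.
}
```

The delegation keeps a single write path: callers who already hold a configured Job use it directly, while callers coming from a Configuration get the Job built for them.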




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org