Posted to dev@spark.apache.org by CodingCat <gi...@git.apache.org> on 2014/02/23 08:30:57 UTC

[GitHub] incubator-spark pull request: [SPARK-1102] Create a saveAsNewAPIHa...

GitHub user CodingCat opened a pull request:

    https://github.com/apache/incubator-spark/pull/636

    [SPARK-1102] Create a saveAsNewAPIHadoopDataset method

    Create a saveAsNewAPIHadoopDataset method
    
    By @mateiz: "Right now RDDs can only be saved as files using the new Hadoop API, not as "datasets" with no filename and just a JobConf. See http://codeforhire.com/2014/02/18/using-spark-with-mongodb/ for an example of how you have to give a bogus filename. For the old Hadoop API, we have saveAsHadoopDataset."
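The workaround Matei describes can be sketched as follows. This is illustrative only: `MongoOutputFormat` and the `mongo.output.uri` key follow the pattern in the linked blog post, and the class names are assumptions, not a tested integration.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local", "bogus-filename-example")
val pairs = sc.parallelize(Seq((1, "a"), (2, "b")))

val conf = new Configuration()
conf.set("mongo.output.uri", "mongodb://localhost:27017/test.collection")

// With only saveAsNewAPIHadoopFile available, a bogus path must be
// supplied even though MongoOutputFormat never reads it:
pairs.saveAsNewAPIHadoopFile(
  "file:///bogus",  // ignored by the OutputFormat
  classOf[Object], classOf[Object],
  classOf[com.mongodb.hadoop.MongoOutputFormat[Object, Object]],
  conf)
```

A `saveAsNewAPIHadoopDataset` method would drop the meaningless path argument, mirroring what `saveAsHadoopDataset` already does for the old API.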

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/CodingCat/incubator-spark SPARK-1102

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-spark/pull/636.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #636
    
----
commit fac89212f5b964eabfb316256daef774dffc7a5f
Author: CodingCat <zh...@gmail.com>
Date:   2014-02-23T07:18:36Z

    Create a saveAsNewAPIHadoopDataset method

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. To do so, please top-post your response.
If your project does not have this feature enabled and wishes so, or if the
feature is enabled but not working, please contact infrastructure at
infrastructure@apache.org or file a JIRA ticket with INFRA.
---

[GitHub] incubator-spark pull request: [SPARK-1102] Create a saveAsNewAPIHa...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/incubator-spark/pull/636#issuecomment-35972327
  
    any feedback?



[GitHub] incubator-spark pull request: [SPARK-1102] Create a saveAsNewAPIHa...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/incubator-spark/pull/636#discussion_r9988672
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -686,6 +649,47 @@ class PairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)])
       }
     
       /**
    +   * Output the RDD to any Hadoop-supported storage system with new Hadoop API, using a Hadoop
    +   * Job object for that storage system. The Job should set an OutputFormat and any output paths
    +   * required (e.g. a table name to write to) in the same way as it would be configured for a Hadoop
    +   * MapReduce job.
    +   */
    +  def saveAsNewAPIHadoopDataset(job: NewAPIHadoopJob) {
    --- End diff --
    
    Hi @mateiz, in the new API the old JobConf is replaced by mapreduce.Job (note: this is different from mapred.Job). I got this from http://www.slideshare.net/sh1mmer/upgrading-to-the-new-map-reduce-api (page 10).
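Against the signature quoted above (where `NewAPIHadoopJob` presumably aliases `org.apache.hadoop.mapreduce.Job`), a call might look like the sketch below. The `TextOutputFormat` choice and the `/tmp/counts` path are illustrative assumptions; this requires a Spark/Hadoop classpath to run.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local", "new-api-dataset-example")
val counts = sc.parallelize(Seq(("spark", 1), ("hadoop", 2)))
  .map { case (k, v) => (new Text(k), new IntWritable(v)) }

// mapreduce.Job wraps a Configuration and carries the OutputFormat,
// key/value classes, and whatever output location the format needs.
val job = new Job(sc.hadoopConfiguration)
job.setOutputKeyClass(classOf[Text])
job.setOutputValueClass(classOf[IntWritable])
job.setOutputFormatClass(classOf[TextOutputFormat[Text, IntWritable]])
FileOutputFormat.setOutputPath(job, new Path("/tmp/counts"))

// The method proposed in this PR:
counts.saveAsNewAPIHadoopDataset(job)
```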



[GitHub] incubator-spark pull request: [SPARK-1102] Create a saveAsNewAPIHa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/636#issuecomment-35826394
  
    Can one of the admins verify this patch?



[GitHub] incubator-spark pull request: [SPARK-1102] Create a saveAsNewAPIHa...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/incubator-spark/pull/636#issuecomment-35905921
  
    Jenkins....are you OK?



[GitHub] incubator-spark pull request: [SPARK-1102] Create a saveAsNewAPIHa...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/incubator-spark/pull/636#issuecomment-35863485
  
    Jenkins, this is OK to test



[GitHub] incubator-spark pull request: [SPARK-1102] Create a saveAsNewAPIHa...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/incubator-spark/pull/636#discussion_r9983025
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -686,6 +649,47 @@ class PairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)])
       }
     
       /**
    +   * Output the RDD to any Hadoop-supported storage system with new Hadoop API, using a Hadoop
    +   * Job object for that storage system. The Job should set an OutputFormat and any output paths
    +   * required (e.g. a table name to write to) in the same way as it would be configured for a Hadoop
    +   * MapReduce job.
    +   */
    +  def saveAsNewAPIHadoopDataset(job: NewAPIHadoopJob) {
    --- End diff --
    
    In the new Hadoop API, does this really require a Job or just a Configuration? In the old API we only needed a configuration.
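For comparison, the old-API `saveAsHadoopDataset` that Matei refers to takes only a `JobConf` (which extends `Configuration`). A minimal sketch, again with an illustrative `TextOutputFormat` and output path, needing a Spark/Hadoop classpath to run:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, TextOutputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local", "old-api-dataset-example")
val counts = sc.parallelize(Seq(("spark", 1)))
  .map { case (k, v) => (new Text(k), new IntWritable(v)) }

// Old API: a JobConf (a Configuration subclass) is all that is needed.
val conf = new JobConf(sc.hadoopConfiguration)
conf.setOutputKeyClass(classOf[Text])
conf.setOutputValueClass(classOf[IntWritable])
conf.setOutputFormat(classOf[TextOutputFormat[Text, IntWritable]])
FileOutputFormat.setOutputPath(conf, new Path("/tmp/old-api-counts"))

counts.saveAsHadoopDataset(conf)
```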

