You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by chenghao-intel <gi...@git.apache.org> on 2014/12/24 04:14:24 UTC

[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

GitHub user chenghao-intel opened a pull request:

    https://github.com/apache/spark/pull/3780

    [SPARK-4945] [SQL] Add overwrite option support for SchemaRDD.saveAsParquetFile

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chenghao-intel/spark saveAsParquetFile

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3780.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3780
    
----
commit b7762840a3fc23ae205ea7352378472822d0bb39
Author: Cheng Hao <ha...@intel.com>
Date:   2014-12-24T03:03:17Z

    Add overwrite option support for saveAsParquetFile

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-68083563
  
      [Test build #24802 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24802/consoleFull) for   PR 3780 at commit [`a7a380f`](https://github.com/apache/spark/commit/a7a380f16535667e1d2d4738bc790fec47784d0a).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-68022423
  
      [Test build #24755 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24755/consoleFull) for   PR 3780 at commit [`b776284`](https://github.com/apache/spark/commit/b7762840a3fc23ae205ea7352378472822d0bb39).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3780#discussion_r22765228
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDDLike.scala ---
    @@ -68,12 +68,29 @@ private[sql] trait SchemaRDDLike {
       /**
        * Saves the contents of this `SchemaRDD` as a parquet file, preserving the schema.  Files that
        * are written out using this method can be read back in as a SchemaRDD using the `parquetFile`
    -   * function.
    +   * function. It will raise exception if the specified path already existed.
        *
    +   * @param path   The destination path.
        * @group schema
        */
       def saveAsParquetFile(path: String): Unit = {
    -    sqlContext.executePlan(WriteToFile(path, logicalPlan)).toRdd
    +    // We provide override functions for the ability of default function argument value,
    +    // which is not naturely supported by Java
    +    saveAsParquetFile(path, false)
    +  }
    +
    +  /**
    +   * Saves the contents of this `SchemaRDD` as a parquet file, preserving the schema.  Files that
    +   * are written out using this method can be read back in as a SchemaRDD using the `parquetFile`
    +   * function.
    +   * @param path      The destination path.
    +   * @param overwrite If it's false, an exception will raise if the path already existed,
    --- End diff --
    
    does this save the entire schema rdd as a single parquet file or multiple?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-70767371
  
      [Test build #25860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25860/consoleFull) for   PR 3780 at commit [`8782fef`](https://github.com/apache/spark/commit/8782feff1c1ff521a6b9d055e2b8713bfd728a06).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3780#discussion_r22762698
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDDLike.scala ---
    @@ -68,12 +68,29 @@ private[sql] trait SchemaRDDLike {
       /**
        * Saves the contents of this `SchemaRDD` as a parquet file, preserving the schema.  Files that
        * are written out using this method can be read back in as a SchemaRDD using the `parquetFile`
    -   * function.
    +   * function. It will raise exception if the specified path already existed.
        *
    +   * @param path   The destination path.
        * @group schema
        */
       def saveAsParquetFile(path: String): Unit = {
    -    sqlContext.executePlan(WriteToFile(path, logicalPlan)).toRdd
    +    // We provide override functions for the ability of default function argument value,
    +    // which is not naturely supported by Java
    +    saveAsParquetFile(path, false)
    +  }
    +
    +  /**
    +   * Saves the contents of this `SchemaRDD` as a parquet file, preserving the schema.  Files that
    +   * are written out using this method can be read back in as a SchemaRDD using the `parquetFile`
    +   * function.
    +   * @param path      The destination path.
    +   * @param overwrite If it's false, an exception will raise if the path already existed,
    --- End diff --
    
    I'm not sure if I agree with this semantic as its explained.  I think we actually have to blow away the directory if is it there.  What about when there is a larger dataset being replaced by a smaller one.  You don't want to end up with half and half.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-68028612
  
      [Test build #24758 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24758/consoleFull) for   PR 3780 at commit [`52adf46`](https://github.com/apache/spark/commit/52adf4622017c977863a3fa023feaca13bf67df2).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel closed the pull request at:

    https://github.com/apache/spark/pull/3780


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3780#discussion_r23207210
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDDLike.scala ---
    @@ -68,12 +68,28 @@ private[sql] trait SchemaRDDLike {
       /**
        * Saves the contents of this `SchemaRDD` as a parquet file, preserving the schema.  Files that
        * are written out using this method can be read back in as a SchemaRDD using the `parquetFile`
    -   * function.
    +   * function. It will raise exception if the specified path already existed.
        *
    +   * @param path   The destination path.
        * @group schema
        */
       def saveAsParquetFile(path: String): Unit = {
    -    sqlContext.executePlan(WriteToFile(path, logicalPlan)).toRdd
    +    saveAsParquetFile(path, false)
    +  }
    +
    +  /**
    +   * Saves the contents of this `SchemaRDD` as a parquet file, preserving the schema.  Files that
    +   * are written out using this method can be read back in as a SchemaRDD using the `parquetFile`
    +   * function.
    +   * @param path      The destination path.
    --- End diff --
    
    don't do vertical alignment here, and add a blank line before param starts


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3780#discussion_r23206838
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
    @@ -179,9 +184,24 @@ private[sql] object ParquetRelation {
           throw new IllegalArgumentException(
             s"Unable to create ParquetRelation: incorrectly formatted path $pathStr")
         }
    +
         val path = origPath.makeQualified(fs)
    -    if (!allowExisting && fs.exists(path)) {
    -      sys.error(s"File $pathStr already exists.")
    +    val pathExisted = fs.exists(path)
    +
    +    if (pathExisted) {
    +      if (overwrite) {
    +        try {
    +          fs.delete(path, true)
    +        } catch {
    +          case e: IOException =>
    +            throw new IOException(
    +              s"Unable to clear output directory ${path}")
    --- End diff --
    
    put this on the previous line?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-68028615
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24758/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-68086501
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24802/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-70767766
  
    Or probably we can cleanup those APIs, kind of like:
    ```
    trait Mode
    object Append extends Mode
    object Create extends Mode
    object CreateOrOverwrite extends Mode
    object CreateOrAppend extends Mode
    
    def outputFile(path: String, format: FileFormat=defaultFormat, mode: Mode = CreateOrOverwrite)
    def insertIntoFile(path: String, format: FileFormat) = outputFile(path, format, CreateOrOverwrite)
    def toCsv(path: String, mode: Mode) = outputFile(path, CSV, mode)
    def toParquet(path: String, mode: Mode) = outputFile(path, Parquet, mode)
    ```
    Any ideas?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3780#discussion_r23207087
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
    @@ -151,15 +151,20 @@ private[sql] object ParquetRelation {
        *
        * @param pathString The directory the Parquetfile will be stored in.
        * @param attributes The schema of the relation.
    +   * @param overwrite  Overwrite the existed file path,
    +   *                      If it's false, an exception will raise if the path already existed,
    --- End diff --
    
    the alignment is off here. Just add a space after overwrite, and don't do vertical alignment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3780#discussion_r22278028
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
    @@ -199,11 +199,15 @@ private[sql] abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
       object ParquetOperations extends Strategy {
         def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
           // TODO: need to support writing to other types of files.  Unify the below code paths.
    -      case logical.WriteToFile(path, child) =>
    +      case logical.WriteToFile(path, child, overwrite) =>
             val relation =
    -          ParquetRelation.create(path, child, sparkContext.hadoopConfiguration, sqlContext)
    -        // Note: overwrite=false because otherwise the metadata we just created will be deleted
    -        InsertIntoParquetTable(relation, planLater(child), overwrite = false) :: Nil
    +          ParquetRelation.createEmpty(
    --- End diff --
    
    Is it safe to replace `create` with `createEmpty`? I see `create` will do some check and then call `createEmpty`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-70921844
  
    I kind of like a proposal like that, but we need to make it Java friendly as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-68024602
  
      [Test build #24758 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24758/consoleFull) for   PR 3780 at commit [`52adf46`](https://github.com/apache/spark/commit/52adf4622017c977863a3fa023feaca13bf67df2).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-68022467
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24755/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-70776319
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25860/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-68086500
  
      [Test build #24802 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24802/consoleFull) for   PR 3780 at commit [`a7a380f`](https://github.com/apache/spark/commit/a7a380f16535667e1d2d4738bc790fec47784d0a).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3780#discussion_r22278489
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
    @@ -199,11 +199,15 @@ private[sql] abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
       object ParquetOperations extends Strategy {
         def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
           // TODO: need to support writing to other types of files.  Unify the below code paths.
    -      case logical.WriteToFile(path, child) =>
    +      case logical.WriteToFile(path, child, overwrite) =>
             val relation =
    -          ParquetRelation.create(path, child, sparkContext.hadoopConfiguration, sqlContext)
    -        // Note: overwrite=false because otherwise the metadata we just created will be deleted
    -        InsertIntoParquetTable(relation, planLater(child), overwrite = false) :: Nil
    +          ParquetRelation.createEmpty(
    --- End diff --
    
    It should be safe, as we believe the logical plan has been resolved already during the analysis phase.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-70776314
  
      [Test build #25860 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25860/consoleFull) for   PR 3780 at commit [`8782fef`](https://github.com/apache/spark/commit/8782feff1c1ff521a6b9d055e2b8713bfd728a06).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-70765307
  
    For my understanding, `insertInto` works in the ways of `create new`, `append` or `overwrite` already, but it doesn't give user an option, which allow throwing exception when the file already existed. Probably that's why we  need the `saveAsParquetFile`, which in the semantic of `create new` if the file doesn't exist, or `overwrite` provided by this PR.
    
    I can close this PR if you feel this change is worthless.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-72568306
  
    OK, I will close this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-69551518
  
      [Test build #25409 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25409/consoleFull) for   PR 3780 at commit [`72c4a4b`](https://github.com/apache/spark/commit/72c4a4b0109845033070d243e5544d02a551cd8b).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-70617731
  
    The changes lgtm. I left couple small comments. Do you mind fixing those?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-69551534
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25409/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-69541870
  
      [Test build #25409 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25409/consoleFull) for   PR 3780 at commit [`72c4a4b`](https://github.com/apache/spark/commit/72c4a4b0109845033070d243e5544d02a551cd8b).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-70731684
  
    @rxin what about the question of `append`, `overwrite`, `error`?  Do we want to expose all three for these types of interfaces?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-69475850
  
    /cc @rxin 
    
    More API questions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3780#discussion_r22795027
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDDLike.scala ---
    @@ -68,12 +68,29 @@ private[sql] trait SchemaRDDLike {
       /**
        * Saves the contents of this `SchemaRDD` as a parquet file, preserving the schema.  Files that
        * are written out using this method can be read back in as a SchemaRDD using the `parquetFile`
    -   * function.
    +   * function. It will raise exception if the specified path already existed.
        *
    +   * @param path   The destination path.
        * @group schema
        */
       def saveAsParquetFile(path: String): Unit = {
    -    sqlContext.executePlan(WriteToFile(path, logicalPlan)).toRdd
    +    // We provide override functions for the ability of default function argument value,
    +    // which is not naturely supported by Java
    +    saveAsParquetFile(path, false)
    +  }
    +
    +  /**
    +   * Saves the contents of this `SchemaRDD` as a parquet file, preserving the schema.  Files that
    +   * are written out using this method can be read back in as a SchemaRDD using the `parquetFile`
    +   * function.
    +   * @param path      The destination path.
    +   * @param overwrite If it's false, an exception will raise if the path already existed,
    --- End diff --
    
    @marmbrus this comment actually is my intention. Not like the API `insertInto`, which work as `overwrite` or `append` mode, `saveAsParquetFile` originally throws exception when the specified path is existed, but in some ETL application, we need to overwrite the existed path without any caution, that's why I add a new API for this purpose. And you're right, the implementation that you were reviewing contains a bug, now it has been fixed.
    
    @rxin We save the entire schema rdd as multiple files under the specified path. 
    And from the API design, I am not sure if we need to merge these 2 APIs `saveAsParquetFile` and `insertInto`, by specifying the mode (`append`,`overwrite`) and file types (parquet format, rcfile etc.). 
    
    What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4945] [SQL] Add overwrite option suppor...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3780#issuecomment-68022466
  
      [Test build #24755 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24755/consoleFull) for   PR 3780 at commit [`b776284`](https://github.com/apache/spark/commit/b7762840a3fc23ae205ea7352378472822d0bb39).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org