Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2016/05/03 01:29:18 UTC

[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/12855

    [SPARK-10216][SQL] Avoid creating empty files during overwrite into Hive table with group by query

    ## What changes were proposed in this pull request?
    
    Currently, an `INSERT INTO` with a `GROUP BY` query tries to make at least 200 files (the default value of `spark.sql.shuffle.partitions`), which results in many empty files.
    
    This PR avoids creating empty files when overwriting a Hive table with a `GROUP BY` query.
    
    ## How was this patch tested?
    
    Unit tests in `InsertIntoHiveTableSuite`. The patch checks whether the given partition has data in it
    and creates/writes a file only when it actually has data.
    
    Closes #8411
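
The description above attributes the empty files to the number of shuffle partitions exceeding the number of groups. A minimal sketch of that arithmetic follows. The `EmptyPartitionCount` object and its hashCode-modulo partitioner are assumptions made for illustration only; Spark SQL's actual shuffle uses a different hash function, so the specific partition ids differ, but the count of empty partitions follows the same logic.

```scala
object EmptyPartitionCount {
  // Assumption: a plain hashCode-modulo partitioner, used only for illustration.
  // Spark SQL's real shuffle partitioning uses a different hash function.
  def partitionOf(key: String, numPartitions: Int): Int =
    java.lang.Math.floorMod(key.hashCode, numPartitions)

  // Number of partitions that receive no group at all.
  def countEmpty(groupKeys: Seq[String], numPartitions: Int): Int = {
    val occupied = groupKeys.map(partitionOf(_, numPartitions)).toSet
    numPartitions - occupied.size
  }

  def main(args: Array[String]): Unit = {
    // Two groups spread over the default of 200 shuffle partitions.
    val empty = countEmpty(Seq("1", "2"), numPartitions = 200)
    println(s"empty partitions: $empty")
  }
}
```

With two groups, at most two partitions can hold data, so every remaining partition would yield an empty file if each task wrote one unconditionally.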

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark pr/8411

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12855.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12855
    
----
commit 80184556a48b22793ef2bdf991595cc845a9aded
Author: Keuntae Park <si...@apache.org>
Date:   2015-08-25T01:21:22Z

    do not make empty file when insert overwrite into Hive table

commit 46f085a2a13dbea616d2620bf8437e05cf99a573
Author: Keuntae Park <si...@apache.org>
Date:   2015-08-25T01:46:01Z

    Merge branch 'master' into NoEmptyInsert

commit 689252a6d531037ba74d998caca0991ad1798d4c
Author: Keuntae Park <si...@apache.org>
Date:   2015-08-25T04:06:04Z

    Merge remote-tracking branch 'upstream/master' into SPARK-10216

commit e2749d7bc78d6b64644cd7fcbf89fbda27e5688b
Author: Keuntae Park <si...@apache.org>
Date:   2015-08-25T04:07:37Z

    change test name to reflect issue number and name

commit acdc53778eb5f752d3a9462d7152aaa36df3ac41
Author: hyukjinkwon <gu...@gmail.com>
Date:   2016-05-03T00:55:26Z

    Stash chagnes

commit 9a89ed149300f1cd1699d52d8b0138d1c84fb468
Author: hyukjinkwon <gu...@gmail.com>
Date:   2016-05-03T00:56:56Z

    Rebase upstream

commit 57f2eccdcb2b8249142da22509068ceefbef04e1
Author: hyukjinkwon <gu...@gmail.com>
Date:   2016-05-03T01:22:59Z

    Add the function in hiveWriterContainers and polish test codes

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-219624016
  
    **[Test build #58666 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58666/consoleFull)** for PR 12855 at commit [`5f780a7`](https://github.com/apache/spark/commit/5f780a7ba60b2f0518897a9a369d27455cab47cb).



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12855#discussion_r63466408
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala ---
    @@ -216,6 +215,33 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef
         sql("DROP TABLE hiveTableWithStructValue")
       }
     
    +  test("SPARK-10216: Avoid empty files during overwrite into Hive table with group by query") {
    +    val testDataset = hiveContext.sparkContext.parallelize(
    +      (1 to 2).map(i => TestData(i, i.toString))).toDF()
    +    testDataset.registerTempTable("testDataset")
    +
    +    val tmpDir = Utils.createTempDir()
    +    sql(
    +      s"""
    +        |CREATE TABLE table1(key int,value string)
    +        |location '${tmpDir.toURI.toString}'
    +      """.stripMargin)
    +    sql(
    +      """
    +        |INSERT OVERWRITE TABLE table1
    +        |SELECT count(key), value FROM testDataset GROUP BY value
    --- End diff --
    
    Ah, yes. Thank you.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-219616350
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58662/
    Test FAILed.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216421830
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57579/
    Test PASSed.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216422115
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57581/
    Test PASSed.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-218027650
  
    **[Test build #58186 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58186/consoleFull)** for PR 12855 at commit [`24e16b7`](https://github.com/apache/spark/commit/24e16b794f61dc629bbeced274cbcab02d044bbc).



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-218946730
  
    ping @marmbrus 



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12855#discussion_r61863848
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala ---
    @@ -363,84 +365,87 @@ private[sql] class DynamicPartitionWriterContainer(
       }
     
       def writeRows(taskContext: TaskContext, iterator: Iterator[InternalRow]): Unit = {
    -    executorSideSetup(taskContext)
    -
    -    // We should first sort by partition columns, then bucket id, and finally sorting columns.
    -    val sortingExpressions: Seq[Expression] = partitionColumns ++ bucketIdExpression ++ sortColumns
    -    val getSortingKey = UnsafeProjection.create(sortingExpressions, inputSchema)
    -
    -    val sortingKeySchema = StructType(sortingExpressions.map {
    -      case a: Attribute => StructField(a.name, a.dataType, a.nullable)
    -      // The sorting expressions are all `Attribute` except bucket id.
    -      case _ => StructField("bucketId", IntegerType, nullable = false)
    -    })
    -
    -    // Returns the data columns to be written given an input row
    -    val getOutputRow = UnsafeProjection.create(dataColumns, inputSchema)
    -
    -    // Returns the partition path given a partition key.
    -    val getPartitionString =
    -      UnsafeProjection.create(Concat(partitionStringExpression) :: Nil, partitionColumns)
    -
    -    // Sorts the data before write, so that we only need one writer at the same time.
    -    // TODO: inject a local sort operator in planning.
    -    val sorter = new UnsafeKVExternalSorter(
    -      sortingKeySchema,
    -      StructType.fromAttributes(dataColumns),
    -      SparkEnv.get.blockManager,
    -      SparkEnv.get.serializerManager,
    -      TaskContext.get().taskMemoryManager().pageSizeBytes)
    -
    -    while (iterator.hasNext) {
    -      val currentRow = iterator.next()
    -      sorter.insertKV(getSortingKey(currentRow), getOutputRow(currentRow))
    -    }
    -    logInfo(s"Sorting complete. Writing out partition files one at a time.")
    -
    -    val getBucketingKey: InternalRow => InternalRow = if (sortColumns.isEmpty) {
    -      identity
    -    } else {
    -      UnsafeProjection.create(sortingExpressions.dropRight(sortColumns.length).zipWithIndex.map {
    -        case (expr, ordinal) => BoundReference(ordinal, expr.dataType, expr.nullable)
    +    if (iterator.hasNext) {
    --- End diff --
    
    Here as well. Simply added `iterator.hasNext` check.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12855#discussion_r63462262
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala ---
    @@ -216,6 +215,33 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef
         sql("DROP TABLE hiveTableWithStructValue")
       }
     
    +  test("SPARK-10216: Avoid empty files during overwrite into Hive table with group by query") {
    +    val testDataset = hiveContext.sparkContext.parallelize(
    +      (1 to 2).map(i => TestData(i, i.toString))).toDF()
    +    testDataset.registerTempTable("testDataset")
    +
    +    val tmpDir = Utils.createTempDir()
    +    sql(
    +      s"""
    +        |CREATE TABLE table1(key int,value string)
    +        |location '${tmpDir.toURI.toString}'
    +      """.stripMargin)
    +    sql(
    +      """
    +        |INSERT OVERWRITE TABLE table1
    +        |SELECT count(key), value FROM testDataset GROUP BY value
    --- End diff --
    
    It seems you want to explicitly control the number of shuffle partitions? Otherwise, this test will not test anything if the number of shuffle partitions happens to be set to 2.
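
The concern above can be sketched in plain Scala: with two groups and two shuffle partitions there may be no empty partition at all, so the test would pass with or without the fix. Everything in this sketch is hypothetical: the `conf` map stands in for a SQL configuration, `withConf` is an illustrative scoped-override helper, and the hashCode-modulo placement is an assumption rather than Spark's actual partitioner.

```scala
import scala.collection.mutable

object ShufflePartitionsInTest {
  val Key = "spark.sql.shuffle.partitions"

  // Hypothetical in-memory stand-in for a SQL configuration; not Spark's API.
  val conf: mutable.Map[String, String] = mutable.Map(Key -> "200")

  // Sets a key for the duration of `body`, then restores the previous value.
  def withConf[T](key: String, value: String)(body: => T): T = {
    val previous = conf.get(key)
    conf(key) = value
    try body
    finally {
      previous match {
        case Some(v) => conf(key) = v
        case None    => conf.remove(key)
      }
      ()
    }
  }

  // Assumption: hashCode-modulo placement, for illustration only.
  def emptyPartitions(groupKeys: Seq[String]): Int = {
    val n = conf(Key).toInt
    n - groupKeys.map(k => java.lang.Math.floorMod(k.hashCode, n)).toSet.size
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq("1", "2")
    // With 2 partitions nothing is empty, so the test would check nothing.
    println(withConf(Key, "2")(emptyPartitions(keys)))
    // Pinning a larger value guarantees some empty partitions to check.
    println(withConf(Key, "5")(emptyPartitions(keys)))
  }
}
```

Pinning the partition count inside the test keeps its outcome independent of whatever value the surrounding test environment happens to use.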



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-219614807
  
    test this please



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-219635881
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58666/
    Test PASSed.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216502973
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57624/
    Test PASSed.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216434479
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57587/
    Test PASSed.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-219635725
  
    **[Test build #58666 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58666/consoleFull)** for PR 12855 at commit [`5f780a7`](https://github.com/apache/spark/commit/5f780a7ba60b2f0518897a9a369d27455cab47cb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216492627
  
    **[Test build #57625 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57625/consoleFull)** for PR 12855 at commit [`b595b7f`](https://github.com/apache/spark/commit/b595b7f84eb7567f68ed78c42cc2d94f2173fb2c).



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216411891
  
    I submitted this PR because #8411 looks abandoned, and the author has not responded since the last comment by a committer. (It has been inactive for almost half a year.)



[GitHub] spark issue #12855: [SPARK-10216][SQL] Avoid creating empty files during ove...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/12855
  
    @DanielMe Yes, it actually seems to be a different issue when you use `emptyRDD[Row]`. Apparently, that case does not produce any partitions, whereas the code provided by @jurriaan produces some empty partitions.
    
    This was reverted because the latter case fails. So it seems the former case has been failing since older versions, and the latter no longer fails now that this change is reverted.
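
The distinction drawn above, between an input with no partitions and an input whose partitions are all empty, can be sketched as follows. The `EmptyInputs` object and its two counting functions are hypothetical, and the link to the reported failures is an assumption not confirmed in this thread: the sketch only shows how many files each write strategy would produce.

```scala
object EmptyInputs {
  // Unguarded write: every partition produces a file, even an empty one.
  def filesWithoutGuard[T](partitions: Seq[Iterator[T]]): Int =
    partitions.size

  // Guarded write: only partitions with at least one row produce a file.
  def filesWithGuard[T](partitions: Seq[Iterator[T]]): Int =
    partitions.count(_.hasNext)

  def main(args: Array[String]): Unit = {
    // No partitions at all, as described for `emptyRDD[Row]`.
    val noPartitions = Seq.empty[Iterator[Int]]
    // Some partitions, all of them empty.
    val emptyOnly = Seq.fill[Iterator[Int]](3)(Iterator.empty)

    println(s"no partitions:    ${filesWithoutGuard(noPartitions)} vs ${filesWithGuard(noPartitions)}")
    println(s"empty partitions: ${filesWithoutGuard(emptyOnly)} vs ${filesWithGuard(emptyOnly)}")
  }
}
```

Under these assumptions the zero-partition input never yields a file with either strategy, while the all-empty input yields files only without the guard, which matches the observation that one case behaved the same before and after the change and the other did not.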



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12855#discussion_r61863832
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala ---
    @@ -239,48 +239,50 @@ private[sql] class DefaultWriterContainer(
       extends BaseWriterContainer(relation, job, isAppend) {
     
       def writeRows(taskContext: TaskContext, iterator: Iterator[InternalRow]): Unit = {
    -    executorSideSetup(taskContext)
    -    val configuration = taskAttemptContext.getConfiguration
    -    configuration.set("spark.sql.sources.output.path", outputPath)
    -    var writer = newOutputWriter(getWorkPath)
    -    writer.initConverter(dataSchema)
    -
    -    // If anything below fails, we should abort the task.
    -    try {
    -      Utils.tryWithSafeFinallyAndFailureCallbacks {
    -        while (iterator.hasNext) {
    -          val internalRow = iterator.next()
    -          writer.writeInternal(internalRow)
    -        }
    -        commitTask()
    -      }(catchBlock = abortTask())
    -    } catch {
    -      case t: Throwable =>
    -        throw new SparkException("Task failed while writing rows", t)
    -    }
    +    if (iterator.hasNext) {
    --- End diff --
    
    Simply added `iterator.hasNext` check.
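
The shape of that guard can be sketched outside Spark. This is a simplified stand-in, not the code in `WriterContainer.scala`: `GuardedWrite`, its `writeRows` signature, and the use of `java.nio.file` are all assumptions chosen to keep the example self-contained. The point it illustrates is that setup and file creation happen only inside the `hasNext` branch.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object GuardedWrite {
  // Hypothetical stand-in for a per-task writer: it opens (and thus creates)
  // the output file only when the partition has at least one row.
  // Returns true when a file was written.
  def writeRows(target: Path, rows: Iterator[String]): Boolean = {
    if (rows.hasNext) {
      val writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)
      try {
        rows.foreach { row =>
          writer.write(row)
          writer.newLine()
        }
      } finally {
        writer.close()
      }
      true
    } else {
      false // empty partition: no setup and no file
    }
  }

  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDirectory("guarded-write")
    val partitions: Seq[Iterator[String]] =
      Seq(Iterator("a", "b"), Iterator.empty, Iterator("c"))
    val written = partitions.zipWithIndex.count { case (rows, i) =>
      writeRows(dir.resolve(f"part-$i%05d"), rows)
    }
    println(s"$written of ${partitions.size} partitions produced a file")
  }
}
```

Calling `hasNext` does not consume a row, so the check is safe to perform before any writer is constructed.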



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216435343
  
    Should we have the same logic for data sources?




[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216509271
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-218039550
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12855#discussion_r61863843
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala ---
    @@ -363,84 +365,87 @@ private[sql] class DynamicPartitionWriterContainer(
       }
     
       def writeRows(taskContext: TaskContext, iterator: Iterator[InternalRow]): Unit = {
    -    executorSideSetup(taskContext)
    -
    -    // We should first sort by partition columns, then bucket id, and finally sorting columns.
    -    val sortingExpressions: Seq[Expression] = partitionColumns ++ bucketIdExpression ++ sortColumns
    -    val getSortingKey = UnsafeProjection.create(sortingExpressions, inputSchema)
    -
    -    val sortingKeySchema = StructType(sortingExpressions.map {
    -      case a: Attribute => StructField(a.name, a.dataType, a.nullable)
    -      // The sorting expressions are all `Attribute` except bucket id.
    -      case _ => StructField("bucketId", IntegerType, nullable = false)
    -    })
    -
    -    // Returns the data columns to be written given an input row
    -    val getOutputRow = UnsafeProjection.create(dataColumns, inputSchema)
    -
    -    // Returns the partition path given a partition key.
    -    val getPartitionString =
    -      UnsafeProjection.create(Concat(partitionStringExpression) :: Nil, partitionColumns)
    -
    -    // Sorts the data before write, so that we only need one writer at the same time.
    -    // TODO: inject a local sort operator in planning.
    -    val sorter = new UnsafeKVExternalSorter(
    -      sortingKeySchema,
    -      StructType.fromAttributes(dataColumns),
    -      SparkEnv.get.blockManager,
    -      SparkEnv.get.serializerManager,
    -      TaskContext.get().taskMemoryManager().pageSizeBytes)
    -
    -    while (iterator.hasNext) {
    -      val currentRow = iterator.next()
    -      sorter.insertKV(getSortingKey(currentRow), getOutputRow(currentRow))
    -    }
    -    logInfo(s"Sorting complete. Writing out partition files one at a time.")
    -
    -    val getBucketingKey: InternalRow => InternalRow = if (sortColumns.isEmpty) {
    -      identity
    -    } else {
    -      UnsafeProjection.create(sortingExpressions.dropRight(sortColumns.length).zipWithIndex.map {
    -        case (expr, ordinal) => BoundReference(ordinal, expr.dataType, expr.nullable)
    +    if (iterator.hasNext) {
    +      executorSideSetup(taskContext)
    --- End diff --
    
    Here as well. Simply added `iterator.hasNext` check.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-220188177
  
    @jurriaan Oh, thank you. @marmbrus Yes, please. You mean reopening the JIRA? (It seems I can't reopen a merged PR.)



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-220192667
  
    Sure. I thought you could reopen PRs you created, but if not, feel free to create a new one and link it.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-219614959
  
    **[Test build #58662 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58662/consoleFull)** for PR 12855 at commit [`24e16b7`](https://github.com/apache/spark/commit/24e16b794f61dc629bbeced274cbcab02d044bbc).



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216411956
  
    @yhuai Could you please take a look?



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216502802
  
    **[Test build #57624 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57624/consoleFull)** for PR 12855 at commit [`dee6a4e`](https://github.com/apache/spark/commit/dee6a4ebf11a28c958d27162e897641db4ac0aec).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216435745
  
    @rxin I thought so, but I haven't tested it yet. Could I look into that after this one is merged, maybe?



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-219635877
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216421574
  
    **[Test build #57579 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57579/consoleFull)** for PR 12855 at commit [`57f2ecc`](https://github.com/apache/spark/commit/57f2eccdcb2b8249142da22509068ceefbef04e1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216502970
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-219797464
  
    Thanks, merging to master and 2.0.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216421905
  
    **[Test build #57581 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57581/consoleFull)** for PR 12855 at commit [`294b447`](https://github.com/apache/spark/commit/294b4474477449fdac192320d022fa28960d87ca).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216412075
  
    **[Test build #57579 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57579/consoleFull)** for PR 12855 at commit [`57f2ecc`](https://github.com/apache/spark/commit/57f2eccdcb2b8249142da22509068ceefbef04e1).



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216509082
  
    **[Test build #57625 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57625/consoleFull)** for PR 12855 at commit [`b595b7f`](https://github.com/apache/spark/commit/b595b7f84eb7567f68ed78c42cc2d94f2173fb2c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-217757354
  
    Hi @marmbrus , Could you please take a look?



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-218039422
  
    **[Test build #58186 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58186/consoleFull)** for PR 12855 at commit [`24e16b7`](https://github.com/apache/spark/commit/24e16b794f61dc629bbeced274cbcab02d044bbc).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216428980
  
    **[Test build #57587 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57587/consoleFull)** for PR 12855 at commit [`ab2d092`](https://github.com/apache/spark/commit/ab2d0922da40cc4c7377b14cb7175dcd242a8608).



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216434477
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216412600
  
    **[Test build #57581 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57581/consoleFull)** for PR 12855 at commit [`294b447`](https://github.com/apache/spark/commit/294b4474477449fdac192320d022fa28960d87ca).



[GitHub] spark issue #12855: [SPARK-10216][SQL] Avoid creating empty files during ove...

Posted by DanielMe <gi...@git.apache.org>.

Github user DanielMe commented on the issue:

    https://github.com/apache/spark/pull/12855
  
    I can reproduce the issue that @jurriaan reports on 1.6.0 and on 1.5.2. The issue does not occur on 1.3.1.
    
    I have added a comment to the JIRA issue with more detailed instructions on how to reproduce it: https://issues.apache.org/jira/browse/SPARK-15393
    
    Note that this might mean that this PR did not cause the issue.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216436047
  
    @rxin Sure, I will. Thanks.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216435961
  
    Can you look at it together with this? Seems like a good logical grouping and arguably data sources are more important than the Hive ones.




[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216491865
  
    **[Test build #57624 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57624/consoleFull)** for PR 12855 at commit [`dee6a4e`](https://github.com/apache/spark/commit/dee6a4ebf11a28c958d27162e897641db4ac0aec).



[GitHub] spark issue #12855: [SPARK-10216][SQL] Avoid creating empty files during ove...

Posted by milanvdmria <gi...@git.apache.org>.

Github user milanvdmria commented on the issue:

    https://github.com/apache/spark/pull/12855
  
    The issue that @jurriaan reported is still there in Spark 2.1.0.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-220193521
  
    No worries!



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/12855



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216509273
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57625/
    Test PASSed.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-220177425
  
    I'm going to revert this until we figure out the issues. @HyukjinKwon, can you reopen?



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216422113
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-219616331
  
    **[Test build #58662 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58662/consoleFull)** for PR 12855 at commit [`24e16b7`](https://github.com/apache/spark/commit/24e16b794f61dc629bbeced274cbcab02d044bbc).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216627572
  
    cc @marmbrus 



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12855#discussion_r63462283
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala ---
    @@ -879,6 +879,24 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
           }
         }
       }
    +
    +  test("SPARK-10216: Avoid empty files during overwriting with group by query") {
    +    withTempPath { path =>
    +      val df = sqlContext.range(0, 5)
    +      val groupedDF = df.groupBy("id").count()
    +      groupedDF.write
    +        .format(dataSourceName)
    +        .mode(SaveMode.Overwrite)
    +        .save(path.getCanonicalPath)
    --- End diff --
    
    Same as https://github.com/apache/spark/pull/12855/files#r63462262
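
The quoted diff stops before the test's assertions, so the following is only a sketch of one way such a check could be expressed, not the assertion used in the PR. `emptyPartFiles` is a hypothetical helper, and the `part-` prefix is an assumption about how task output files are named; the demo directory is built by hand rather than written by Spark.

```scala
import java.io.File
import java.nio.file.Files

// Hypothetical helper: names of zero-byte task output files under `dir`.
// Assumes task output files carry a "part-" prefix; marker files such as
// _SUCCESS are ignored.
def emptyPartFiles(dir: File): Seq[String] =
  Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
    .filter(f => f.isFile && f.getName.startsWith("part-") && f.length() == 0L)
    .map(_.getName)
    .sorted

// Hand-built output directory: one data file, one empty file, one marker.
val outDir = Files.createTempDirectory("empty-part-sketch").toFile
Files.write(new File(outDir, "part-00000").toPath, "0,1\n".getBytes("UTF-8"))
Files.createFile(new File(outDir, "part-00001").toPath)
Files.createFile(new File(outDir, "_SUCCESS").toPath)

val offenders = emptyPartFiles(outDir)
```

In a test along the lines of the one in the diff, the helper would be pointed at `path` after the `save` call and the result compared against an empty sequence.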



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-218039551
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58186/
    Test PASSed.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-220193132
  
    @marmbrus Sorry for making you revert this. I should have thought it through further before opening this PR. I will think it over and test more carefully.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-219616349
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by jurriaan <gi...@git.apache.org>.

Github user jurriaan commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-220150712
  
    This breaks writing empty dataframes for me.
    
    Before this PR I could write empty dataframes without any problems. 
    
    Now it only writes a _SUCCESS file, and no metadata. Also, it sometimes throws a NullPointerException:
    
    ```
    8-May-2016 22:37:14 WARNING: org.apache.parquet.hadoop.ParquetOutputCommitter: could not write summary file for file:/Users/jurriaanpruis/Downloads/spark-2.0.0-SNAPSHOT-bin-hadoop2.7-2/test
    java.lang.NullPointerException
    	at org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:456)
    	at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
    	at org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
    	at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
    	at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:220)
    	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:144)
    	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
    	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
    	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
    	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:115)
    	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
    	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
    	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
    	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
    	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
    	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
    	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
    	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
    	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
    	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
    	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417)
    	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:252)
    	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:234)
    	at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:626)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    	at py4j.Gateway.invoke(Gateway.java:280)
    	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    	at py4j.commands.CallCommand.execute(CallCommand.java:79)
    	at py4j.GatewayConnection.run(GatewayConnection.java:211)
    	at java.lang.Thread.run(Thread.java:745)```
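
The reported behavior can be illustrated with a minimal sketch in plain Python. This is not Spark's writer code: `write_partitions`, `skip_empty`, and the `id,name` header line (standing in for the schema or footer a real data file carries) are hypothetical names chosen for illustration. The sketch only assumes what the PR description states, namely that a file is written only when its partition actually has data.

```python
import os
import tempfile


def write_partitions(partitions, out_dir, skip_empty):
    """Write one part file per partition, then a _SUCCESS marker.

    With skip_empty=True, a partition without rows produces no file at all
    (hypothetical stand-in for the behavior described in the PR).
    """
    os.makedirs(out_dir, exist_ok=True)
    for index, rows in enumerate(partitions):
        if skip_empty and not rows:
            continue  # nothing to write for this partition
        path = os.path.join(out_dir, f"part-{index:05d}")
        with open(path, "w") as handle:
            handle.write("id,name\n")  # header stands in for schema metadata
            for row in rows:
                handle.write(",".join(map(str, row)) + "\n")
    # The job-level marker is written regardless of how many parts exist.
    open(os.path.join(out_dir, "_SUCCESS"), "w").close()
    return sorted(os.listdir(out_dir))


def demo():
    grouped = [[(1, "a")], [], []]  # one populated partition, two empty
    empty = [[], [], []]            # an entirely empty dataset
    with tempfile.TemporaryDirectory() as tmp:
        return {
            "grouped_old": write_partitions(grouped, os.path.join(tmp, "g0"), False),
            "grouped_new": write_partitions(grouped, os.path.join(tmp, "g1"), True),
            "empty_old": write_partitions(empty, os.path.join(tmp, "e0"), False),
            "empty_new": write_partitions(empty, os.path.join(tmp, "e1"), True),
        }


if __name__ == "__main__":
    for name, files in demo().items():
        print(name, files)
```

Skipping empty partitions removes the surplus files in the grouped case, but when every partition is empty the output directory holds only `_SUCCESS`, so nothing is left that records the schema. That matches the symptom described in this comment.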



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216491871
  
    @rxin I found the same issue in the internal data sources. I just added the same logic and a test in `HadoopFsRelationTest`.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-220176423
  
    Thanks for reporting.




[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216434410
  
    **[Test build #57587 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57587/consoleFull)** for PR 12855 at commit [`ab2d092`](https://github.com/apache/spark/commit/ab2d0922da40cc4c7377b14cb7175dcd242a8608).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-10216][SQL] Avoid creating empty files ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12855#issuecomment-216421825
  
    Merged build finished. Test PASSed.

