Posted to reviews@spark.apache.org by liutang123 <gi...@git.apache.org> on 2017/10/10 10:22:22 UTC

[GitHub] spark pull request #19464: Spark 22233

GitHub user liutang123 opened a pull request:

    https://github.com/apache/spark/pull/19464

    Spark 22233

    ## What changes were proposed in this pull request?
    Add a spark.hadoop.filterOutEmptySplit configuration that allows users to filter out empty splits in HadoopRDD.
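
    A minimal sketch of how the proposed flag would be used (using the spark.hadoop.filterOutEmptySplit name from this version of the PR; the name changed later during review):

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    // Opt in to dropping empty input splits when HadoopRDD computes its partitions.
    val conf = new SparkConf()
      .setAppName("filter-empty-splits-example")
      .setMaster("local")
      .set("spark.hadoop.filterOutEmptySplit", "true")
    val sc = new SparkContext(conf)

    // A directory containing empty part files no longer yields empty partitions.
    val rdd = sc.textFile("/somepath/*")
    println(rdd.partitions.length)
    ```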

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liutang123/spark SPARK-22233

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19464.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19464
    
----
commit 2317bfdf18fc1a7b21cd43e0ec12f5e957fb1895
Author: liutang123 <li...@yeah.net>
Date:   2017-06-21T04:27:42Z

    Merge pull request #1 from apache/master
    
    20170521 pull request

commit e3f993959fabdb80b966a42bf40d1cb5c6b44d95
Author: liulijia <li...@meituan.com>
Date:   2017-09-28T06:12:04Z

    Merge branch 'master' of https://github.com/apache/spark

commit 8f57d43b6bf127fc67e3e391d851efae3a859206
Author: liulijia <li...@meituan.com>
Date:   2017-10-10T02:16:18Z

    Merge branch 'master' of https://github.com/apache/spark

commit 39999610f78837f4a5623f6d47b9feab1e565ed6
Author: liulijia <li...@meituan.com>
Date:   2017-10-10T10:19:29Z

    allow user to filter empty split in HadoopRDD

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    **[Test build #82694 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82694/testReport)** for PR 19464 at commit [`4dcfd83`](https://github.com/apache/spark/commit/4dcfd83612661ce47e8c2c1f33590c61dfe4e473).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    **[Test build #82716 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82716/testReport)** for PR 19464 at commit [`25f98d0`](https://github.com/apache/spark/commit/25f98d0d89e4566339d9ba7701975af4e175c918).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: Spark 22233

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144031728
  
    --- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
    @@ -510,4 +510,16 @@ class FileSuite extends SparkFunSuite with LocalSparkContext {
         }
       }
     
    +  test("spark.hadoop.filterOutEmptySplit") {
    +    val sf = new SparkConf()
    +    sf.setAppName("test").setMaster("local").set("spark.hadoop.filterOutEmptySplit", "true")
    +    sc = new SparkContext(sf)
    +    val emptyRDD = sc.parallelize(Array.empty[Tuple2[String, String]], 1)
    +    emptyRDD.saveAsHadoopFile[TextOutputFormat[String, String]](tempDir.getPath + "/output")
    +    assert(new File(tempDir.getPath + "/output/part-00000").exists() === true)
    +
    +    val hadoopRDD = sc.textFile(tempDir.getPath + "/output/part-00000")
    --- End diff --
    
    We should also add the following test cases:
    1. Ensure that if no split is empty, we don't lose any splits;
    2. Ensure that if part of the splits are empty, we remove the splits correctly (a sketch of both cases follows below).
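
    A sketch of what those two cases could look like, reusing the style of the test already in this diff (the data, slice counts, and expected partition counts here are illustrative, not the final merged test):

    ```scala
    // 1. No split is empty: we should not lose any splits.
    val nonEmptyRDD = sc.parallelize(Array(("key1", "a"), ("key2", "a"), ("key3", "b")), 2)
    nonEmptyRDD.saveAsHadoopFile[TextOutputFormat[String, String]](tempDir.getPath + "/output1")
    assert(sc.textFile(tempDir.getPath + "/output1/part-*").partitions.length === 2)

    // 2. Part of the splits are empty: only the empty ones should be removed.
    val partlyEmptyRDD = sc.parallelize(Array(("key1", "a"), ("key2", "a")), 5)
    partlyEmptyRDD.saveAsHadoopFile[TextOutputFormat[String, String]](tempDir.getPath + "/output2")
    assert(sc.textFile(tempDir.getPath + "/output2/part-*").partitions.length === 2)
    ```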


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: Spark 22233

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    Could you please update the title of this PR appropriately? e.g. `[SPARK-22233][core] ...`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by liutang123 <gi...@git.apache.org>.
Github user liutang123 commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    @kiszk Any other suggestions, and can this PR be merged?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    **[Test build #82696 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82696/testReport)** for PR 19464 at commit [`527b367`](https://github.com/apache/spark/commit/527b367ea482261f6afbb7cdf339495f77c4e7f2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82672/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82752/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144353093
  
    --- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
    @@ -510,4 +510,54 @@ class FileSuite extends SparkFunSuite with LocalSparkContext {
         }
       }
     
    +  test("spark.files.ignoreEmptySplits work correctly (old Hadoop API)") {
    +    val conf = new SparkConf()
    +    conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true)
    +    sc = new SparkContext(conf)
    +
    +    def testIgnoreEmptySplits(data: Array[Tuple2[String, String]], numSlices: Int,
    +                              outputSuffix: Int, checkPart: String, partitionLength: Int): Unit = {
    +      val dataRDD = sc.parallelize(data, numSlices)
    +      val output = new File(tempDir, "output" + outputSuffix)
    +      dataRDD.saveAsHadoopFile[TextOutputFormat[String, String]](output.getPath)
    +      assert(new File(output, checkPart).exists() === true)
    +      val hadoopRDD = sc.textFile(new File(output, "part-*").getPath)
    +      assert(hadoopRDD.partitions.length === partitionLength)
    +    }
    +
    +    // Ensure that if all of the splits are empty, we remove the splits correctly
    +    testIgnoreEmptySplits(Array.empty[Tuple2[String, String]], 1, 0, "part-00000", 0)
    --- End diff --
    
    I'd call it with named arguments, for example,
    
    ```scala
    testIgnoreEmptySplits(
      Array.empty[Tuple2[String, String]],
      numSlices = 1,
      outputSuffix = 0,
      checkPart = "part-00000",
      expectedPartitionNum = 0)
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    **[Test build #82726 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82726/testReport)** for PR 19464 at commit [`534d8fb`](https://github.com/apache/spark/commit/534d8fbcd7dfbdc9af06a4d926f6a353f429fce8).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82694/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    I think the optimisation controlled by the SQL-specific conf `spark.sql.files.maxPartitionBytes` already covers this concept in `FileScanRDD`, and it looks like it already partially does this by combining input splits. I'd suggest avoiding putting this conf in `FileScanRDD` for now, if I didn't miss something.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144604878
  
    --- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
    @@ -510,4 +510,87 @@ class FileSuite extends SparkFunSuite with LocalSparkContext {
         }
       }
     
    +  test("spark.files.ignoreEmptySplits work correctly (old Hadoop API)") {
    +    val conf = new SparkConf()
    +    conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true)
    +    sc = new SparkContext(conf)
    +
    +    def testIgnoreEmptySplits(
    +      data: Array[Tuple2[String, String]],
    +      actualPartitionNum: Int,
    +      expectedPart: String,
    +      expectedPartitionNum: Int): Unit = {
    +      val output = new File(tempDir, "output")
    +      sc.parallelize(data, actualPartitionNum)
    +        .saveAsHadoopFile[TextOutputFormat[String, String]](output.getPath)
    +      assert(new File(output, expectedPart).exists() === true)
    --- End diff --
    
    I don't think we need the `expectedPart` parameter, just
    ```
    for (i <- 0 until actualPartitionNum) {
      assert(new File(output, s"part-0000$i").exists() === true)
    }
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144030646
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -196,7 +196,10 @@ class HadoopRDD[K, V](
         // add the credentials here as this can be called before SparkContext initialized
         SparkHadoopUtil.get.addCredentials(jobConf)
         val inputFormat = getInputFormat(jobConf)
    -    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    +    var inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    +    if (sparkContext.getConf.getBoolean("spark.hadoop.filterOutEmptySplit", false)) {
    +      inputSplits = inputSplits.filter(_.getLength>0)
    --- End diff --
    
    nit: add spaces around the operator.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    **[Test build #82696 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82696/testReport)** for PR 19464 at commit [`527b367`](https://github.com/apache/spark/commit/527b367ea482261f6afbb7cdf339495f77c4e7f2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    IIUC this issue also exists in `NewHadoopRDD` and possibly `FileScanRDD`; we'd better fix them as well.
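
    For reference, the `NewHadoopRDD` side would be roughly symmetric to the `HadoopRDD` change (a sketch based on the diff under review here, with `ignoreEmptySplits` standing in for however the flag ends up being read from the conf):

    ```scala
    val allRawSplits = inputFormat.getSplits(jobContext).toArray(Array.empty[InputSplit])
    val rawSplits = if (ignoreEmptySplits) {
      // Drop zero-length splits so they don't turn into empty partitions.
      allRawSplits.filter(_.getLength > 0)
    } else {
      allRawSplits
    }
    ```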


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82696/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144324123
  
    --- Diff: docs/configuration.md ---
    @@ -1192,6 +1192,14 @@ Apart from these, the following properties are also available, and may be useful
       </td>
     </tr>
     <tr>
    +    <td><code>spark.files.filterOutEmptySplit</code></td>
    --- End diff --
    
    yea we can make this conf an internal conf.
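
    A sketch of what the internal declaration in `core/src/main/scala/org/apache/spark/internal/config/package.scala` could look like (using the `ignoreEmptySplits` name suggested elsewhere in this review; the doc text is illustrative):

    ```scala
    private[spark] val IGNORE_EMPTY_SPLITS = ConfigBuilder("spark.files.ignoreEmptySplits")
      .internal()
      .doc("If true, methods that use HadoopRDD and NewHadoopRDD, such as SparkContext.textFile, " +
        "will not create a partition for input splits that are empty.")
      .booleanConf
      .createWithDefault(false)
    ```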


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144365104
  
    --- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
    @@ -510,4 +510,83 @@ class FileSuite extends SparkFunSuite with LocalSparkContext {
         }
       }
     
    +  test("spark.files.ignoreEmptySplits work correctly (old Hadoop API)") {
    +    val conf = new SparkConf()
    +    conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true)
    +    sc = new SparkContext(conf)
    +
    +    def testIgnoreEmptySplits(data: Array[Tuple2[String, String]], numSlices: Int,
    +      outputSuffix: Int, checkPart: String, expectedPartitionNum: Int): Unit = {
    +      val dataRDD = sc.parallelize(data, numSlices)
    +      val output = new File(tempDir, "output" + outputSuffix)
    +      dataRDD.saveAsHadoopFile[TextOutputFormat[String, String]](output.getPath)
    +      assert(new File(output, checkPart).exists() === true)
    +      val hadoopRDD = sc.textFile(new File(output, "part-*").getPath)
    +      assert(hadoopRDD.partitions.length === expectedPartitionNum)
    +    }
    --- End diff --
    
    Actually, the previous tests were okay to me as well.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144357376
  
    --- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
    @@ -510,4 +510,83 @@ class FileSuite extends SparkFunSuite with LocalSparkContext {
         }
       }
     
    +  test("spark.files.ignoreEmptySplits work correctly (old Hadoop API)") {
    +    val conf = new SparkConf()
    +    conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true)
    +    sc = new SparkContext(conf)
    +
    +    def testIgnoreEmptySplits(data: Array[Tuple2[String, String]], numSlices: Int,
    +      outputSuffix: Int, checkPart: String, expectedPartitionNum: Int): Unit = {
    +      val dataRDD = sc.parallelize(data, numSlices)
    +      val output = new File(tempDir, "output" + outputSuffix)
    +      dataRDD.saveAsHadoopFile[TextOutputFormat[String, String]](output.getPath)
    +      assert(new File(output, checkPart).exists() === true)
    +      val hadoopRDD = sc.textFile(new File(output, "part-*").getPath)
    +      assert(hadoopRDD.partitions.length === expectedPartitionNum)
    +    }
    +
    +    // Ensure that if all of the splits are empty, we remove the splits correctly
    +    testIgnoreEmptySplits(
    +      data = Array.empty[Tuple2[String, String]],
    +      numSlices = 1,
    +      outputSuffix = 0,
    +      checkPart = "part-00000",
    +      expectedPartitionNum = 0)
    +
    +    // Ensure that if no split is empty, we don't lose any splits
    +    testIgnoreEmptySplits(
    +      data = Array(("key1", "a"), ("key2", "a"), ("key3", "b")),
    +      numSlices = 2,
    +      outputSuffix = 1,
    +      checkPart = "part-00001",
    +      expectedPartitionNum = 2)
    +
    +    // Ensure that if part of the splits are empty, we remove the splits correctly
    +    testIgnoreEmptySplits(
    +      data = Array(("key1", "a"), ("key2", "a")),
    +      numSlices = 5,
    +      outputSuffix = 2,
    +      checkPart = "part-00004",
    +      expectedPartitionNum = 2)
    +  }
    +
    +  test("spark.files.ignoreEmptySplits work correctly (new Hadoop API)") {
    +    val conf = new SparkConf()
    +    conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true)
    +    sc = new SparkContext(conf)
    +
    +    def testIgnoreEmptySplits(data: Array[Tuple2[String, String]], numSlices: Int,
    +      outputSuffix: Int, checkPart: String, expectedPartitionNum: Int): Unit = {
    +      val dataRDD = sc.parallelize(data, numSlices)
    +      val output = new File(tempDir, "output" + outputSuffix)
    +      dataRDD.saveAsNewAPIHadoopFile[NewTextOutputFormat[String, String]](output.getPath)
    +      assert(new File(output, checkPart).exists() === true)
    +      val hadoopRDD = sc.textFile(new File(output, "part-r-*").getPath)
    --- End diff --
    
    I think we should _read_ it with the new Hadoop API to test `NewHadoopRDD`?
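
    For example, something like this (a sketch; the mapreduce `TextInputFormat` is assumed to be imported under an alias so it doesn't clash with the old-API class of the same name):

    ```scala
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.{TextInputFormat => NewTextInputFormat}

    // Read back through NewHadoopRDD rather than HadoopRDD.
    val newHadoopRDD = sc.newAPIHadoopFile[LongWritable, Text, NewTextInputFormat](
      new File(output, "part-r-*").getPath)
    assert(newHadoopRDD.partitions.length === expectedPartitionNum)
    ```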


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82726/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144311831
  
    --- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
    @@ -510,4 +510,65 @@ class FileSuite extends SparkFunSuite with LocalSparkContext {
         }
       }
     
    +  test("allow user to filter out empty split (old Hadoop API)") {
    +    val sf = new SparkConf()
    +    sf.setAppName("test").setMaster("local").set(FILTER_OUT_EMPTY_SPLIT, true)
    +    sc = new SparkContext(sf)
    +
    +    // Ensure that if all of the splits are empty, we remove the splits correctly
    +    val emptyRDD = sc.parallelize(Array.empty[Tuple2[String, String]], 1)
    +    emptyRDD.saveAsHadoopFile[TextOutputFormat[String, String]](tempDir.getPath + "/output")
    --- End diff --
    
    don't hardcode the path separator, use `new File(tempDir, output)`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144029852
  
    --- Diff: docs/configuration.md ---
    @@ -1211,6 +1211,14 @@ Apart from these, the following properties are also available, and may be useful
         data may need to be rewritten to pre-existing output directories during checkpoint recovery.</td>
     </tr>
     <tr>
    +    <td><code>spark.hadoop.filterOutEmptySplit</code></td>
    --- End diff --
    
    We should add the config to `internal/config`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144745063
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
    @@ -270,6 +270,12 @@ package object config {
         .longConf
         .createWithDefault(4 * 1024 * 1024)
     
    +  private[spark] val IGNORE_EMPTY_SPLITS = ConfigBuilder("spark.files.ignoreEmptySplits")
    --- End diff --
    
    I'll send a follow-up PR to fix this.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r143984294
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
    @@ -122,7 +122,10 @@ class NewHadoopRDD[K, V](
           case _ =>
         }
         val jobContext = new JobContextImpl(_conf, jobId)
    -    val rawSplits = inputFormat.getSplits(jobContext).toArray
    +    var rawSplits = inputFormat.getSplits(jobContext).toArray(Array.empty[InputSplit])
    +    if (sparkContext.getConf.getBoolean("spark.hadoop.filterOutEmptySplit", false)) {
    +      rawSplits = rawSplits.filter(_.getLength>0)
    --- End diff --
    
    Space around operator.
    You should filter before making an array.
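
    For example (a sketch, with `ignoreEmptySplits` standing in for reading the flag from the conf):

    ```scala
    import scala.collection.JavaConverters._

    // Filter the split list first, then materialize the array once.
    val allSplits = inputFormat.getSplits(jobContext).asScala
    val rawSplits = (if (ignoreEmptySplits) allSplits.filter(_.getLength > 0) else allSplits).toArray
    ```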


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144357089
  
    --- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
    @@ -510,4 +510,83 @@ class FileSuite extends SparkFunSuite with LocalSparkContext {
         }
       }
     
    +  test("spark.files.ignoreEmptySplits work correctly (old Hadoop API)") {
    +    val conf = new SparkConf()
    +    conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true)
    +    sc = new SparkContext(conf)
    +
    +    def testIgnoreEmptySplits(data: Array[Tuple2[String, String]], numSlices: Int,
    +      outputSuffix: Int, checkPart: String, expectedPartitionNum: Int): Unit = {
    +      val dataRDD = sc.parallelize(data, numSlices)
    +      val output = new File(tempDir, "output" + outputSuffix)
    +      dataRDD.saveAsHadoopFile[TextOutputFormat[String, String]](output.getPath)
    +      assert(new File(output, checkPart).exists() === true)
    +      val hadoopRDD = sc.textFile(new File(output, "part-*").getPath)
    +      assert(hadoopRDD.partitions.length === expectedPartitionNum)
    +    }
    --- End diff --
    
    Could we maybe do this something like ... as below? (not tested)
    
    ```scala
    def testIgnoreEmptySplits(
        data: Array[Tuple2[String, String]],
        actualPartitionNum: Int,
        expectedName: String,
        expectedPartitionNum: Int): Unit = {
      val output = new File(tempDir, "output")
      sc.parallelize(data, actualPartitionNum)
        .saveAsHadoopFile[TextOutputFormat[String, String]](output.getAbsolutePath)
      assert(new File(output, expectedName).exists())
      val hadoopRDD = sc.textFile(new File(output, "part-*").getAbsolutePath)
      assert(hadoopRDD.partitions.length === expectedPartitionNum)
    }
    
    ...
    
    testIgnoreEmptySplits(
      data = Array.empty[Tuple2[String, String]],
      actualPartitionNum = 1,
      expectedName = "part-00000",
      expectedPartitionNum = 0)
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    **[Test build #82752 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82752/testReport)** for PR 19464 at commit [`a6818b6`](https://github.com/apache/spark/commit/a6818b60adef7bec35b002846a3a504ae53dd9f9).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144617310
  
    --- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
    @@ -510,4 +510,87 @@ class FileSuite extends SparkFunSuite with LocalSparkContext {
         }
       }
     
    +  test("spark.files.ignoreEmptySplits work correctly (old Hadoop API)") {
    +    val conf = new SparkConf()
    +    conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true)
    +    sc = new SparkContext(conf)
    +
    +    def testIgnoreEmptySplits(
    +      data: Array[Tuple2[String, String]],
    +      actualPartitionNum: Int,
    +      expectedPart: String,
    +      expectedPartitionNum: Int): Unit = {
    --- End diff --
    
    Indentation..
    
    ```scala
    def testIgnoreEmptySplits(
        data: Array[Tuple2[String, String]],
        ...
        expectedPartitionNum: Int): Unit = {
      val output = new File(tempDir, "output")
      ...
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    **[Test build #82694 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82694/testReport)** for PR 19464 at commit [`4dcfd83`](https://github.com/apache/spark/commit/4dcfd83612661ce47e8c2c1f33590c61dfe4e473).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144030461
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -196,7 +196,10 @@ class HadoopRDD[K, V](
         // add the credentials here as this can be called before SparkContext initialized
         SparkHadoopUtil.get.addCredentials(jobConf)
         val inputFormat = getInputFormat(jobConf)
    -    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    +    var inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    --- End diff --
    
    How about:
    ```
    val inputSplits = if (......) {
        inputFormat.getSplits(jobConf, minPartitions).filter(_.getLength > 0)
    } else {
       inputFormat.getSplits(jobConf, minPartitions)
    }
    ```
    We should always try not to use `var`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144319836
  
    --- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
    @@ -510,4 +510,65 @@ class FileSuite extends SparkFunSuite with LocalSparkContext {
         }
       }
     
    +  test("allow user to filter out empty split (old Hadoop API)") {
    +    val sf = new SparkConf()
    --- End diff --
    
    sf -> conf. You can fix it above too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144319668
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -196,7 +196,11 @@ class HadoopRDD[K, V](
         // add the credentials here as this can be called before SparkContext initialized
         SparkHadoopUtil.get.addCredentials(jobConf)
         val inputFormat = getInputFormat(jobConf)
    -    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    +    val inputSplits = if (sparkContext.getConf.get(FILTER_OUT_EMPTY_SPLIT)) {
    --- End diff --
    
    You can avoid duplicating  `inputFormat.getSplits(jobConf, minPartitions)`
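
    For example (a sketch of the old-API path, using the `FILTER_OUT_EMPTY_SPLIT` constant from this diff):

    ```scala
    val allInputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val inputSplits = if (sparkContext.getConf.get(FILTER_OUT_EMPTY_SPLIT)) {
      allInputSplits.filter(_.getLength > 0)
    } else {
      allInputSplits
    }
    ```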


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144330429
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
    @@ -270,6 +270,15 @@ package object config {
         .longConf
         .createWithDefault(4 * 1024 * 1024)
     
    +  private [spark] val FILTER_OUT_EMPTY_SPLIT = ConfigBuilder("spark.files.filterOutEmptySplit")
    --- End diff --
    
    nit: how about `ignoreEmptySplits`, to match `ignoreCorruptFiles`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    **[Test build #82752 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82752/testReport)** for PR 19464 at commit [`a6818b6`](https://github.com/apache/spark/commit/a6818b60adef7bec35b002846a3a504ae53dd9f9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    **[Test build #82672 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82672/testReport)** for PR 19464 at commit [`31a5d30`](https://github.com/apache/spark/commit/31a5d303f91839124f8957f75d4077be5410524c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    **[Test build #82658 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82658/testReport)** for PR 19464 at commit [`cf0c350`](https://github.com/apache/spark/commit/cf0c350daf12ce80fc781fd17fd15506d83c6d02).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19464


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144319503
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
    @@ -270,6 +270,15 @@ package object config {
         .longConf
         .createWithDefault(4 * 1024 * 1024)
     
    +  private [spark] val FILTER_OUT_EMPTY_SPLIT = ConfigBuilder("spark.files.filterOutEmptySplit")
    --- End diff --
    
    Nit: no space after private
    This doc is much too verbose for a flag. Just say, "If true, methods that use HadoopRDD and NewHadoopRDD, such as SparkContext.textFile, will not create a partition for input splits that are empty."


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by liutang123 <gi...@git.apache.org>.
Github user liutang123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144235417
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
    @@ -122,7 +122,10 @@ class NewHadoopRDD[K, V](
           case _ =>
         }
         val jobContext = new JobContextImpl(_conf, jobId)
    -    val rawSplits = inputFormat.getSplits(jobContext).toArray
    +    var rawSplits = inputFormat.getSplits(jobContext).toArray(Array.empty[InputSplit])
    +    if (sparkContext.getConf.getBoolean("spark.hadoop.filterOutEmptySplit", false)) {
    +      rawSplits = rawSplits.filter(_.getLength>0)
    --- End diff --
    
    Does anyone use empty files to do something?
    For example:
    sc.textFile("/somepath/*").mapPartitions(....)
    Setting this flag to true by default may change the behavior of users' applications.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144031167
  
    --- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
    @@ -510,4 +510,16 @@ class FileSuite extends SparkFunSuite with LocalSparkContext {
         }
       }
     
    +  test("spark.hadoop.filterOutEmptySplit") {
    +    val sf = new SparkConf()
    +    sf.setAppName("test").setMaster("local").set("spark.hadoop.filterOutEmptySplit", "true")
    +    sc = new SparkContext(sf)
    +    val emptyRDD = sc.parallelize(Array.empty[Tuple2[String, String]], 1)
    +    emptyRDD.saveAsHadoopFile[TextOutputFormat[String, String]](tempDir.getPath + "/output")
    +    assert(new File(tempDir.getPath + "/output/part-00000").exists() === true)
    +
    +    val hadoopRDD = sc.textFile(tempDir.getPath + "/output/part-00000")
    +    assert(hadoopRDD.partitions.length === 0)
    --- End diff --
    
    You should release the resources you acquire in the test case.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144197159
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -196,7 +196,10 @@ class HadoopRDD[K, V](
         // add the credentials here as this can be called before SparkContext initialized
         SparkHadoopUtil.get.addCredentials(jobConf)
         val inputFormat = getInputFormat(jobConf)
    -    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    +    var inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    +    if (sparkContext.getConf.getBoolean("spark.hadoop.filterOutEmptySplit", false)) {
    --- End diff --
    
    I'd use `spark.files` prefix, taken after `spark.files.ignoreCorruptFiles`, `spark.files.maxPartitionBytes` and `spark.files.openCostInBytes`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    **[Test build #82672 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82672/testReport)** for PR 19464 at commit [`31a5d30`](https://github.com/apache/spark/commit/31a5d303f91839124f8957f75d4077be5410524c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144745022
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
    @@ -270,6 +270,12 @@ package object config {
         .longConf
         .createWithDefault(4 * 1024 * 1024)
     
    +  private[spark] val IGNORE_EMPTY_SPLITS = ConfigBuilder("spark.files.ignoreEmptySplits")
    --- End diff --
    
    This config should be made internal, and the name should be improved because it's not about spark files. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144472244
  
    --- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
    @@ -510,4 +510,86 @@ class FileSuite extends SparkFunSuite with LocalSparkContext {
         }
       }
     
    +  test("spark.files.ignoreEmptySplits work correctly (old Hadoop API)") {
    +    val conf = new SparkConf()
    +    conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true)
    +    sc = new SparkContext(conf)
    +
    +    def testIgnoreEmptySplits(data: Array[Tuple2[String, String]], numSlices: Int,
    --- End diff --
    
    nit: one argument per line.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    Merged to master.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by liutang123 <gi...@git.apache.org>.
Github user liutang123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144683771
  
    --- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
    @@ -510,4 +510,87 @@ class FileSuite extends SparkFunSuite with LocalSparkContext {
         }
       }
     
    +  test("spark.files.ignoreEmptySplits work correctly (old Hadoop API)") {
    +    val conf = new SparkConf()
    +    conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true)
    +    sc = new SparkContext(conf)
    +
    +    def testIgnoreEmptySplits(
    +      data: Array[Tuple2[String, String]],
    +      actualPartitionNum: Int,
    +      expectedPart: String,
    +      expectedPartitionNum: Int): Unit = {
    +      val output = new File(tempDir, "output")
    +      sc.parallelize(data, actualPartitionNum)
    +        .saveAsHadoopFile[TextOutputFormat[String, String]](output.getPath)
    +      assert(new File(output, expectedPart).exists() === true)
    +      val hadoopRDD = sc.textFile(new File(output, "part-*").getPath)
    +      assert(hadoopRDD.partitions.length === expectedPartitionNum)
    +      Utils.deleteRecursively(output)
    --- End diff --
    
    I think we don't need `try... finally` here, because `Utils.deleteRecursively(output)` is only there to ensure that
    the next invocation of `testIgnoreEmptySplits` succeeds. When the test finishes, whether it passes or not, the `tempDir` will be deleted in `FileSuite.afterEach()`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144320027
  
    --- Diff: docs/configuration.md ---
    @@ -1192,6 +1192,14 @@ Apart from these, the following properties are also available, and may be useful
       </td>
     </tr>
     <tr>
    +    <td><code>spark.files.filterOutEmptySplit</code></td>
    --- End diff --
    
    I don't think I'd document this. It should be just a safety valve flag


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82716/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    **[Test build #82726 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82726/testReport)** for PR 19464 at commit [`534d8fb`](https://github.com/apache/spark/commit/534d8fbcd7dfbdc9af06a4d926f6a353f429fce8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by liutang123 <gi...@git.apache.org>.
Github user liutang123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144235787
  
    --- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
    @@ -510,4 +510,16 @@ class FileSuite extends SparkFunSuite with LocalSparkContext {
         }
       }
     
    +  test("spark.hadoop.filterOutEmptySplit") {
    +    val sf = new SparkConf()
    +    sf.setAppName("test").setMaster("local").set("spark.hadoop.filterOutEmptySplit", "true")
    +    sc = new SparkContext(sf)
    +    val emptyRDD = sc.parallelize(Array.empty[Tuple2[String, String]], 1)
    +    emptyRDD.saveAsHadoopFile[TextOutputFormat[String, String]](tempDir.getPath + "/output")
    +    assert(new File(tempDir.getPath + "/output/part-00000").exists() === true)
    +
    +    val hadoopRDD = sc.textFile(tempDir.getPath + "/output/part-00000")
    +    assert(hadoopRDD.partitions.length === 0)
    --- End diff --
    
    The resources are released by default in the `afterEach` function.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    **[Test build #82716 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82716/testReport)** for PR 19464 at commit [`25f98d0`](https://github.com/apache/spark/commit/25f98d0d89e4566339d9ba7701975af4e175c918).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144617011
  
    --- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
    @@ -510,4 +510,87 @@ class FileSuite extends SparkFunSuite with LocalSparkContext {
         }
       }
     
    +  test("spark.files.ignoreEmptySplits work correctly (old Hadoop API)") {
    +    val conf = new SparkConf()
    +    conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true)
    +    sc = new SparkContext(conf)
    +
    +    def testIgnoreEmptySplits(
    +      data: Array[Tuple2[String, String]],
    +      actualPartitionNum: Int,
    +      expectedPart: String,
    +      expectedPartitionNum: Int): Unit = {
    +      val output = new File(tempDir, "output")
    +      sc.parallelize(data, actualPartitionNum)
    +        .saveAsHadoopFile[TextOutputFormat[String, String]](output.getPath)
    +      assert(new File(output, expectedPart).exists() === true)
    +      val hadoopRDD = sc.textFile(new File(output, "part-*").getPath)
    +      assert(hadoopRDD.partitions.length === expectedPartitionNum)
    +      Utils.deleteRecursively(output)
    --- End diff --
    
    Maybe:
    
    ```scala
    try {
      ...
    } finally {
      Utils.deleteRecursively(output)
    }
    ```
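    
    Filled in against the helper from the diff above, that could look like this (a sketch; only the `try`/`finally` wrapping is new, the body is taken from the diff):
    
    ```scala
    def testIgnoreEmptySplits(
        data: Array[Tuple2[String, String]],
        actualPartitionNum: Int,
        expectedPart: String,
        expectedPartitionNum: Int): Unit = {
      val output = new File(tempDir, "output")
      try {
        sc.parallelize(data, actualPartitionNum)
          .saveAsHadoopFile[TextOutputFormat[String, String]](output.getPath)
        assert(new File(output, expectedPart).exists() === true)
        val hadoopRDD = sc.textFile(new File(output, "part-*").getPath)
        assert(hadoopRDD.partitions.length === expectedPartitionNum)
      } finally {
        // Delete the output directory even when an assertion fails, so the
        // next call to this helper starts from a clean directory.
        Utils.deleteRecursively(output)
      }
    }
    ```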



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82658/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    Interesting. On the one hand I don't like adding yet another flag that changes behavior, when the user often can't meaningfully decide whether to set it. There is probably no value in processing an empty partition, sure. Then again it does change behavior slightly, and I wonder whether that breaks assumptions apps rely on.
    
    If there's no reason to expect a downside, we could do this in Spark 3.x, or make the change now and introduce a flag as a safety valve for going back to the old behavior, leaving the default set to true.
    
    But first: are there any known impacts of skipping the empty partitions?
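    
    A rough sketch of that "safety valve" shape, assuming the config name from this PR (the default and doc text here are only illustrative, not what the patch does):
    
    ```scala
    import org.apache.spark.internal.config.ConfigBuilder
    
    // New behavior on by default; the flag exists only so users can flip back
    // to the old behavior if skipping empty splits breaks an assumption.
    private[spark] val IGNORE_EMPTY_SPLITS = ConfigBuilder("spark.files.ignoreEmptySplits")
      .doc("If false, keep the previous behavior of creating a partition for " +
        "every input split, including splits of length zero.")
      .booleanConf
      .createWithDefault(true)
    ```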


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    **[Test build #82658 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82658/testReport)** for PR 19464 at commit [`cf0c350`](https://github.com/apache/spark/commit/cf0c350daf12ce80fc781fd17fd15506d83c6d02).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144687101
  
    --- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
    @@ -510,4 +510,83 @@ class FileSuite extends SparkFunSuite with LocalSparkContext {
         }
       }
     
    +  test("spark.files.ignoreEmptySplits work correctly (old Hadoop API)") {
    +    val conf = new SparkConf()
    +    conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true)
    +    sc = new SparkContext(conf)
    +
    +    def testIgnoreEmptySplits(
    +        data: Array[Tuple2[String, String]],
    +        actualPartitionNum: Int,
    +        expectedPartitionNum: Int): Unit = {
    +      val output = new File(tempDir, "output")
    +      sc.parallelize(data, actualPartitionNum)
    +        .saveAsHadoopFile[TextOutputFormat[String, String]](output.getPath)
    +      for (i <- 0 until actualPartitionNum) {
    +        assert(new File(output, s"part-0000$i").exists() === true)
    +      }
    +      val hadoopRDD = sc.textFile(new File(output, "part-*").getPath)
    +      assert(hadoopRDD.partitions.length === expectedPartitionNum)
    +      Utils.deleteRecursively(output)
    +    }
    +
    +    // Ensure that if all of the splits are empty, we remove the splits correctly
    +    testIgnoreEmptySplits(
    +      data = Array.empty[Tuple2[String, String]],
    +      actualPartitionNum = 1,
    +      expectedPartitionNum = 0)
    +
    +    // Ensure that if no split is empty, we don't lose any splits
    +    testIgnoreEmptySplits(
    +      data = Array(("key1", "a"), ("key2", "a"), ("key3", "b")),
    +      actualPartitionNum = 2,
    +      expectedPartitionNum = 2)
    +
    +    // Ensure that if part of the splits are empty, we remove the splits correctly
    +    testIgnoreEmptySplits(
    +      data = Array(("key1", "a"), ("key2", "a")),
    +      actualPartitionNum = 5,
    +      expectedPartitionNum = 2)
    +  }
    +
    +  test("spark.files.ignoreEmptySplits work correctly (new Hadoop API)") {
    +    val conf = new SparkConf()
    +    conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true)
    +    sc = new SparkContext(conf)
    +
    +    def testIgnoreEmptySplits(
    +        data: Array[Tuple2[String, String]],
    +        actualPartitionNum: Int,
    +        expectedPartitionNum: Int): Unit = {
    +      val output = new File(tempDir, "output")
    +      sc.parallelize(data, actualPartitionNum)
    +        .saveAsNewAPIHadoopFile[NewTextOutputFormat[String, String]](output.getPath)
    +      for (i <- 0 until actualPartitionNum) {
    +        assert(new File(output, s"part-r-0000$i").exists() === true)
    +      }
    +      val hadoopRDD = sc.newAPIHadoopFile(new File(output, "part-r-*").getPath,
    +        classOf[NewTextInputFormat], classOf[LongWritable], classOf[Text])
    +        .asInstanceOf[NewHadoopRDD[_, _]]
    --- End diff --
    
    nit:
    
    ```scala
    val hadoopRDD = sc.newAPIHadoopFile(
      new File(output, "part-r-*").getPath,
      classOf[NewTextInputFormat],
      classOf[LongWritable],
      classOf[Text]).asInstanceOf[NewHadoopRDD[_, _]]
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144452771
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
    @@ -270,6 +270,12 @@ package object config {
         .longConf
         .createWithDefault(4 * 1024 * 1024)
     
    +  private[spark] val IGNORE_EMPTY_SPLITS = ConfigBuilder("spark.files.ignoreEmptySplits")
    +    .doc("If true, methods like that use HadoopRDD and NewHadoopRDD such as " +
    --- End diff --
    
    `like that` -> `that`
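    
    For context, a minimal sketch of the behavior this entry gates when `HadoopRDD` builds its partitions (the helper below is illustrative; the wiring in the actual patch differs):
    
    ```scala
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapred.{FileSplit, InputSplit}
    
    // Keep only splits that report a non-zero length; an empty split yields no
    // records, so there is no point scheduling a task for it.
    def dropEmptySplits(splits: Array[InputSplit], ignoreEmptySplits: Boolean): Array[InputSplit] = {
      if (ignoreEmptySplits) splits.filter(_.getLength > 0) else splits
    }
    
    // Example: a zero-length FileSplit like this one would be filtered out.
    val emptySplit: InputSplit = new FileSplit(new Path("/tmp/part-00000"), 0L, 0L, Array.empty[String])
    ```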


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19464
  
    I can't think of any downside, but it's always safer to avoid behavior changes. LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144181321
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -196,7 +196,10 @@ class HadoopRDD[K, V](
         // add the credentials here as this can be called before SparkContext initialized
         SparkHadoopUtil.get.addCredentials(jobConf)
         val inputFormat = getInputFormat(jobConf)
    -    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    +    var inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    +    if (sparkContext.getConf.getBoolean("spark.hadoop.filterOutEmptySplit", false)) {
    --- End diff --
    
    I would suggest not using a name that starts with "spark.hadoop": configurations with this prefix are treated as Hadoop configuration and copied into the Hadoop `Configuration`, so it might be better to choose another name.
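    
    A minimal sketch of why that prefix is special (the helper name here is assumed for illustration; Spark has its own internal equivalent):
    
    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.SparkConf
    
    // Keys starting with "spark.hadoop." are copied, prefix stripped, into the
    // Hadoop Configuration used for I/O, so a Spark-only switch named
    // "spark.hadoop.filterOutEmptySplit" would leak into Hadoop's config too.
    def copySparkHadoopKeys(sparkConf: SparkConf, hadoopConf: Configuration): Unit = {
      sparkConf.getAll.foreach { case (key, value) =>
        if (key.startsWith("spark.hadoop.")) {
          hadoopConf.set(key.stripPrefix("spark.hadoop."), value)
        }
      }
    }
    ```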


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19464#discussion_r144472362
  
    --- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
    @@ -510,4 +510,86 @@ class FileSuite extends SparkFunSuite with LocalSparkContext {
         }
       }
     
    +  test("spark.files.ignoreEmptySplits work correctly (old Hadoop API)") {
    +    val conf = new SparkConf()
    +    conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true)
    +    sc = new SparkContext(conf)
    +
    +    def testIgnoreEmptySplits(data: Array[Tuple2[String, String]], numSlices: Int,
    +      outputSuffix: Int, checkPart: String, expectedPartitionNum: Int): Unit = {
    +      val dataRDD = sc.parallelize(data, numSlices)
    +      val output = new File(tempDir, "output" + outputSuffix)
    +      dataRDD.saveAsHadoopFile[TextOutputFormat[String, String]](output.getPath)
    +      assert(new File(output, checkPart).exists() === true)
    +      val hadoopRDD = sc.textFile(new File(output, "part-*").getPath)
    +      assert(hadoopRDD.partitions.length === expectedPartitionNum)
    +    }
    +
    +    // Ensure that if all of the splits are empty, we remove the splits correctly
    +    testIgnoreEmptySplits(
    +      data = Array.empty[Tuple2[String, String]],
    +      numSlices = 1,
    +      outputSuffix = 0,
    +      checkPart = "part-00000",
    +      expectedPartitionNum = 0)
    +
    +    // Ensure that if no split is empty, we don't lose any splits
    +    testIgnoreEmptySplits(
    +      data = Array(("key1", "a"), ("key2", "a"), ("key3", "b")),
    +      numSlices = 2,
    +      outputSuffix = 1,
    +      checkPart = "part-00001",
    +      expectedPartitionNum = 2)
    +
    +    // Ensure that if part of the splits are empty, we remove the splits correctly
    +    testIgnoreEmptySplits(
    +      data = Array(("key1", "a"), ("key2", "a")),
    +      numSlices = 5,
    +      outputSuffix = 2,
    +      checkPart = "part-00004",
    +      expectedPartitionNum = 2)
    +  }
    +
    +  test("spark.files.ignoreEmptySplits work correctly (new Hadoop API)") {
    +    val conf = new SparkConf()
    +    conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true)
    +    sc = new SparkContext(conf)
    +
    +    def testIgnoreEmptySplits(data: Array[Tuple2[String, String]], numSlices: Int,
    --- End diff --
    
    ditto.
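    
    (Assuming the "ditto" points at the same multi-line signature formatting as the earlier comment, the reformatted declaration would look roughly like this sketch:)
    
    ```scala
    // One parameter per line, indented four spaces past the `def`.
    def testIgnoreEmptySplits(
        data: Array[Tuple2[String, String]],
        numSlices: Int,
        outputSuffix: Int,
        checkPart: String,
        expectedPartitionNum: Int): Unit = {
      // body as in the diff above
    }
    ```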


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org