You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by jinxing64 <gi...@git.apache.org> on 2017/12/03 07:10:43 UTC

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

GitHub user jinxing64 opened a pull request:

    https://github.com/apache/spark/pull/19868

    [SPARK-22676] Avoid iterating all partition paths when spark.sql.hive.verifyPartitionPath=true

    ## What changes were proposed in this pull request?
    
    In current code, it will scanning all partition paths when spark.sql.hive.verifyPartitionPath=true.
    e.g. table like below:
    ```
    CREATE TABLE `test`(
    `id` int,
    `age` int,
    `name` string)
    PARTITIONED BY (
    `A` string,
    `B` string)
    load data local inpath '/tmp/data0' into table test partition(A='00', B='00')
    load data local inpath '/tmp/data1' into table test partition(A='01', B='01')
    load data local inpath '/tmp/data2' into table test partition(A='10', B='10')
    load data local inpath '/tmp/data3' into table test partition(A='11', B='11')
    ```
    If I query with SQL – "select * from test where year=2017 and month=12 and day=03", current code will scan all partition paths including '/data/A=00/B=00', '/data/A=00/B=00', '/data/A=01/B=01', '/data/A=10/B=10', '/data/A=11/B=11'. It costs much time and memory cost.
    
    This pr proposes to avoid iterating all partition paths. Convert  /demo/data/year/month/day  to  /demo/data/year/month/*/ when generating path pattern.
    
    ## How was this patch tested?
    
    Manually test.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jinxing64/spark SPARK-22676

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19868.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19868
    
----
commit 57676609faed4512291979a8d639e3be1ec80578
Author: jinxing <ji...@126.com>
Date:   2017-12-03T07:07:12Z

    [SPARK-22676] Avoid iterating all partition paths when spark.sql.hive.verifyPartitionPath=true

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by jinxing64 <gi...@git.apache.org>.

Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    @cloud-fan 
    Thanks for ping~ I updated the description. Let me know if I should refine it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by jinxing64 <gi...@git.apache.org>.

Github user jinxing64 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r181716009
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -197,17 +200,24 @@ class HadoopRDD[K, V](
         val jobConf = getJobConf()
         // add the credentials here as this can be called before SparkContext initialized
         SparkHadoopUtil.get.addCredentials(jobConf)
    -    val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
    -    val inputSplits = if (ignoreEmptySplits) {
    -      allInputSplits.filter(_.getLength > 0)
    -    } else {
    -      allInputSplits
    -    }
    -    val array = new Array[Partition](inputSplits.size)
    -    for (i <- 0 until inputSplits.size) {
    -      array(i) = new HadoopPartition(id, i, inputSplits(i))
    +    try {
    +      val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
    +      val inputSplits = if (ignoreEmptySplits) {
    +        allInputSplits.filter(_.getLength > 0)
    +      } else {
    +        allInputSplits
    +      }
    +      val array = new Array[Partition](inputSplits.size)
    +      for (i <- 0 until inputSplits.size) {
    +        array(i) = new HadoopPartition(id, i, inputSplits(i))
    +      }
    +      array
    +    } catch {
    +      case e: InvalidInputException if ignoreMissingFiles =>
    --- End diff --
    
    Yes, when data dir doesn't exist.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    **[Test build #89424 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89424/testReport)** for PR 19868 at commit [`bf4277b`](https://github.com/apache/spark/commit/bf4277bf862000bb000d0cbf11092d8565f42afb).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r182075222
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
    @@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
         }
       }
     
    +  test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
    +    withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
    +      sparkContext.conf.set(IGNORE_MISSING_FILES.key, "true")
    +      val testData = sparkContext.parallelize(
    +        (1 to 10).map(i => TestData(i, i.toString))).toDF()
    +      testData.createOrReplaceTempView("testData")
    +
    +      val tmpDir = Files.createTempDir()
    --- End diff --
    
    we should call `withTempDir`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2330/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r181607797
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -260,6 +270,10 @@ class HadoopRDD[K, V](
                 logWarning(s"Skipped the rest content in the corrupted file: ${split.inputSplit}", e)
                 finished = true
                 null
    +          case e: FileNotFoundException if ignoreMissingFiles =>
    --- End diff --
    
    `FileNotFoundException` extends `IOException`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Basically we need to introduce this new `spark.sql.files.ignoreMissingFiles` config in detail. And them explain how can we use it to replace `spark.sql.hive.verifyPartitionPath`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    can somebody explain to me what the pr description has to do with missingFiles? I'm probably missing something but i feel the implementation is very different from the pr description.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84399/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by jiangxb1987 <gi...@git.apache.org>.

Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r180961987
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
    @@ -176,12 +176,13 @@ class HadoopTableReader(
                   val matches = fs.globStatus(pathPattern)
                   matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
                 }
    -            // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
    +            // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
    --- End diff --
    
    Em... It seems we have to check all the levels unless we have specified a value for each partition column. We can make some improvement here but seems that require more complicated approach.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    **[Test build #84399 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84399/testReport)** for PR 19868 at commit [`5767660`](https://github.com/apache/spark/commit/57676609faed4512291979a8d639e3be1ec80578).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89438/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r182077103
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
    @@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
         }
       }
     
    +  test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
    +    withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
    --- End diff --
    
    nit: we usually write `withSQLConf(SQLConf.HIVE_VERIFY_PARTITION_PATH.key -> "false")`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by jinxing64 <gi...@git.apache.org>.

Github user jinxing64 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r181014951
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
    @@ -176,12 +176,13 @@ class HadoopTableReader(
                   val matches = fs.globStatus(pathPattern)
                   matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
                 }
    -            // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
    +            // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
    --- End diff --
    
    @cloud-fan @jiangxb1987
    Thanks a lot for review.
    > Em... It seems we have to check all the levels unless we have specified a value for each partition column. We can make some improvement here but seems that require more complicated approach.
    Yes, true. In this change, I only optimize when user specify for each partition column, which is very common in the production -- as our user always did: `select xxx from yyy where year=yy and month=mm and day=dd`
    
    I'm not sure about you guys idea:  leave the current logic as it is(at least the code logic now is very simple)? or implement a more complicated approach and defend as many cases as possible? or do some improvement based on this pr and cover some very common cases?
    
    Thanks again for review :)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89370/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89371/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r180742133
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
    @@ -176,12 +176,13 @@ class HadoopTableReader(
                   val matches = fs.globStatus(pathPattern)
                   matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
                 }
    -            // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
    +            // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
    --- End diff --
    
    and more trickily, `select * from test where B='01'`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by jinxing64 <gi...@git.apache.org>.

Github user jinxing64 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r181571746
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
    @@ -124,17 +126,25 @@ class NewHadoopRDD[K, V](
             configurable.setConf(_conf)
           case _ =>
         }
    -    val allRowSplits = inputFormat.getSplits(new JobContextImpl(_conf, jobId)).asScala
    -    val rawSplits = if (ignoreEmptySplits) {
    -      allRowSplits.filter(_.getLength > 0)
    -    } else {
    -      allRowSplits
    -    }
    -    val result = new Array[Partition](rawSplits.size)
    -    for (i <- 0 until rawSplits.size) {
    -      result(i) = new NewHadoopPartition(id, i, rawSplits(i).asInstanceOf[InputSplit with Writable])
    +    try {
    --- End diff --
    
    Put original logic in try-catch


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by jinxing64 <gi...@git.apache.org>.

Github user jinxing64 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r181718520
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -279,6 +293,10 @@ class HadoopRDD[K, V](
               case e: IOException if ignoreCorruptFiles =>
                 logWarning(s"Skipped the rest content in the corrupted file: ${split.inputSplit}", e)
                 finished = true
    +          case e: FileNotFoundException if ignoreMissingFiles =>
    +            logWarning(s"Skipped missing file: ${split.inputSplit}", e)
    +            finished = true
    +            null
    --- End diff --
    
    Same logic with FileScanRDD -- `finished=true`, thus `NextIterator.hasNext` returns false.
    
    https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/NextIterator.scala#L31


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89424/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by jinxing64 <gi...@git.apache.org>.

Github user jinxing64 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r181294385
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
    @@ -176,12 +176,13 @@ class HadoopTableReader(
                   val matches = fs.globStatus(pathPattern)
                   matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
                 }
    -            // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
    +            // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
    --- End diff --
    
    I'm not sure. `spark.sql.hive.verifyPartitionPath` is only for hive table scan and `` is only for `spark.sql.files.ignoreMissingFiles` is only for datasource  scan, is it ? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Can we also update the title?
    ```
    Avoid iterating all partition paths when spark.sql.hive.verifyPartitionPath=true
    ```
    This is not true, we didn't fix the problem of `spark.sql.hive.verifyPartitionPath`, but add a new config to replace it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r181787857
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -276,6 +292,12 @@ class HadoopRDD[K, V](
             try {
               finished = !reader.next(key, value)
             } catch {
    +          case e: FileNotFoundException if ignoreMissingFiles =>
    +            logWarning(s"Skipped missing file: ${split.inputSplit}", e)
    +            finished = true
    +            null
    --- End diff --
    
    the return value is not read by anyone


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r182075854
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
    @@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
         }
       }
     
    +  test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
    +    withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
    +      sparkContext.conf.set(IGNORE_MISSING_FILES.key, "true")
    +      val testData = sparkContext.parallelize(
    +        (1 to 10).map(i => TestData(i, i.toString))).toDF()
    +      testData.createOrReplaceTempView("testData")
    +
    +      val tmpDir = Files.createTempDir()
    +      // create the table for test
    +      sql(s"CREATE TABLE table_with_partition(key int,value string) " +
    +          s"PARTITIONED by (ds string) location '${tmpDir.toURI}' ")
    +      sql("INSERT OVERWRITE TABLE table_with_partition  partition (ds='1') " +
    +          "SELECT key,value FROM testData")
    +      sql("INSERT OVERWRITE TABLE table_with_partition  partition (ds='2') " +
    +          "SELECT key,value FROM testData")
    +      sql("INSERT OVERWRITE TABLE table_with_partition  partition (ds='3') " +
    +          "SELECT key,value FROM testData")
    +      sql("INSERT OVERWRITE TABLE table_with_partition  partition (ds='4') " +
    +          "SELECT key,value FROM testData")
    +
    +      // test for the exist path
    +      checkAnswer(sql("select key,value from table_with_partition"),
    +        testData.toDF.collect ++ testData.toDF.collect
    --- End diff --
    
    `testData.union(testData).union(testData).union(testData)`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    **[Test build #84399 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84399/testReport)** for PR 19868 at commit [`5767660`](https://github.com/apache/spark/commit/57676609faed4512291979a8d639e3be1ec80578).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19868


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r181788194
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
    @@ -195,6 +205,10 @@ class NewHadoopRDD[K, V](
                   e)
                 finished = true
                 null
    +          case e: FileNotFoundException if ignoreMissingFiles =>
    --- End diff --
    
    shall we also apply https://github.com/apache/spark/pull/19868/files#diff-83eb37f7b0ebed3c14ccb7bff0d577c2R273 here?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by jinxing64 <gi...@git.apache.org>.

Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    cc @cloud-fan @jerryshao  @jiangxb1987 would you take a look at this?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r181607746
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -197,17 +200,24 @@ class HadoopRDD[K, V](
         val jobConf = getJobConf()
         // add the credentials here as this can be called before SparkContext initialized
         SparkHadoopUtil.get.addCredentials(jobConf)
    -    val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
    -    val inputSplits = if (ignoreEmptySplits) {
    -      allInputSplits.filter(_.getLength > 0)
    -    } else {
    -      allInputSplits
    -    }
    -    val array = new Array[Partition](inputSplits.size)
    -    for (i <- 0 until inputSplits.size) {
    -      array(i) = new HadoopPartition(id, i, inputSplits(i))
    +    try {
    +      val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
    +      val inputSplits = if (ignoreEmptySplits) {
    +        allInputSplits.filter(_.getLength > 0)
    +      } else {
    +        allInputSplits
    +      }
    +      val array = new Array[Partition](inputSplits.size)
    +      for (i <- 0 until inputSplits.size) {
    +        array(i) = new HadoopPartition(id, i, inputSplits(i))
    +      }
    +      array
    +    } catch {
    +      case e: InvalidInputException if ignoreMissingFiles =>
    --- End diff --
    
    when will this happen? the root path doesn't exist?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r181272430
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
    @@ -176,12 +176,13 @@ class HadoopTableReader(
                   val matches = fs.globStatus(pathPattern)
                   matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
                 }
    -            // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
    +            // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
    --- End diff --
    
    Actually do we still need this check? now we have something like `spark.sql.files.ignoreMissingFiles`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by jinxing64 <gi...@git.apache.org>.

Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Sure, let me do it today.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r181607863
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -279,6 +293,10 @@ class HadoopRDD[K, V](
               case e: IOException if ignoreCorruptFiles =>
                 logWarning(s"Skipped the rest content in the corrupted file: ${split.inputSplit}", e)
                 finished = true
    +          case e: FileNotFoundException if ignoreMissingFiles =>
    +            logWarning(s"Skipped missing file: ${split.inputSplit}", e)
    +            finished = true
    +            null
    --- End diff --
    
    why returning null here?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    **[Test build #89424 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89424/testReport)** for PR 19868 at commit [`bf4277b`](https://github.com/apache/spark/commit/bf4277bf862000bb000d0cbf11092d8565f42afb).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by jinxing64 <gi...@git.apache.org>.

Github user jinxing64 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r181571713
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -197,17 +200,24 @@ class HadoopRDD[K, V](
         val jobConf = getJobConf()
         // add the credentials here as this can be called before SparkContext initialized
         SparkHadoopUtil.get.addCredentials(jobConf)
    -    val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
    -    val inputSplits = if (ignoreEmptySplits) {
    -      allInputSplits.filter(_.getLength > 0)
    -    } else {
    -      allInputSplits
    -    }
    -    val array = new Array[Partition](inputSplits.size)
    -    for (i <- 0 until inputSplits.size) {
    -      array(i) = new HadoopPartition(id, i, inputSplits(i))
    +    try {
    --- End diff --
    
    Put original logic in try-catch


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    We switched to a totally different approach in the middle and forgot to update the PR description... @jinxing64 can you update it? thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by jinxing64 <gi...@git.apache.org>.

Github user jinxing64 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r181939704
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
    @@ -195,6 +205,10 @@ class NewHadoopRDD[K, V](
                   e)
                 finished = true
                 null
    +          case e: FileNotFoundException if ignoreMissingFiles =>
    --- End diff --
    
    Yes, I should change this.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    **[Test build #89397 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89397/testReport)** for PR 19868 at commit [`5893a2d`](https://github.com/apache/spark/commit/5893a2d992d6ad47e30465c0092579c006a605b2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by jinxing64 <gi...@git.apache.org>.

Github user jinxing64 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r181939661
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -276,6 +292,12 @@ class HadoopRDD[K, V](
             try {
               finished = !reader.next(key, value)
             } catch {
    +          case e: FileNotFoundException if ignoreMissingFiles =>
    +            logWarning(s"Skipped missing file: ${split.inputSplit}", e)
    +            finished = true
    +            null
    --- End diff --
    
    I removed this.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r180741999
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
    @@ -176,12 +176,13 @@ class HadoopTableReader(
                   val matches = fs.globStatus(pathPattern)
                   matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
                 }
    -            // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
    +            // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
    --- End diff --
    
    how about `select * from test where A='00'`? Shall we check one more level?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2366/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by jinxing64 <gi...@git.apache.org>.

Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    @cloud-fan
    Thanks a lot for merging. 
    I will address the left comments today.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by jiangxb1987 <gi...@git.apache.org>.

Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r180962121
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
    @@ -176,12 +176,13 @@ class HadoopTableReader(
                   val matches = fs.globStatus(pathPattern)
                   matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
                 }
    -            // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
    +            // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
    --- End diff --
    
    If the partition columns are `A/B/C/D`, unless you specify a value for `A`, you have to check that level. The same case for `B`/`C`/`D`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by jinxing64 <gi...@git.apache.org>.

Github user jinxing64 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r181538066
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
    @@ -176,12 +176,13 @@ class HadoopTableReader(
                   val matches = fs.globStatus(pathPattern)
                   matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
                 }
    -            // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
    +            // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
    --- End diff --
    
    Sure, that would be great. Is there some existing pr/jira working on that? if not, I can make the change :)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r181371382
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
    @@ -176,12 +176,13 @@ class HadoopTableReader(
                   val matches = fs.globStatus(pathPattern)
                   matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
                 }
    -            // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
    +            // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
    --- End diff --
    
    can we make `spark.sql.files.ignoreMissingFiles` work for hive scan? it seems a better solution than this check.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    **[Test build #89370 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89370/testReport)** for PR 19868 at commit [`d8d4096`](https://github.com/apache/spark/commit/d8d40966e42e826a37a8f06ce46e55cbdbb529db).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r181788258
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -276,6 +292,12 @@ class HadoopRDD[K, V](
             try {
               finished = !reader.next(key, value)
             } catch {
    +          case e: FileNotFoundException if ignoreMissingFiles =>
    +            logWarning(s"Skipped missing file: ${split.inputSplit}", e)
    +            finished = true
    +            null
    --- End diff --
    
    the return value is not read by anyone


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by jinxing64 <gi...@git.apache.org>.

Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Sure, updated.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2331/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Replace spark.sql.hive.verifyPartitionPath...

Posted by jinxing64 <gi...@git.apache.org>.

Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Sure, updated.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    thanks, merging to master!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    **[Test build #89372 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89372/testReport)** for PR 19868 at commit [`921cfc5`](https://github.com/apache/spark/commit/921cfc507236461ec11e0d94facf7447a08a662e).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by jinxing64 <gi...@git.apache.org>.

Github user jinxing64 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r180733361
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
    @@ -176,12 +176,13 @@ class HadoopTableReader(
                   val matches = fs.globStatus(pathPattern)
                   matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
                 }
    -            // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
    +            // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
    --- End diff --
    
    @cloud-fan 
    Thanks a lot for review;
    Like I mentioned in description, "/demo/data/year/month/day" will not be converted to "/demo/data/*/*/*/", instead it's converted to  "/demo/data/year/month/*/" as a path pattern. When this path pattern is passed to `updateExistPathSetByPathPattern`, less matched paths will be returned. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84408/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    **[Test build #84408 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84408/testReport)** for PR 19868 at commit [`5767660`](https://github.com/apache/spark/commit/57676609faed4512291979a8d639e3be1ec80578).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    **[Test build #84408 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84408/testReport)** for PR 19868 at commit [`5767660`](https://github.com/apache/spark/commit/57676609faed4512291979a8d639e3be1ec80578).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    **[Test build #89438 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89438/testReport)** for PR 19868 at commit [`bf4277b`](https://github.com/apache/spark/commit/bf4277bf862000bb000d0cbf11092d8565f42afb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    **[Test build #89438 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89438/testReport)** for PR 19868 at commit [`bf4277b`](https://github.com/apache/spark/commit/bf4277bf862000bb000d0cbf11092d8565f42afb).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    **[Test build #89370 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89370/testReport)** for PR 19868 at commit [`d8d4096`](https://github.com/apache/spark/commit/d8d40966e42e826a37a8f06ce46e55cbdbb529db).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    **[Test build #89372 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89372/testReport)** for PR 19868 at commit [`921cfc5`](https://github.com/apache/spark/commit/921cfc507236461ec11e0d94facf7447a08a662e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2345/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19868
  
     retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r182075379
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
    @@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
         }
       }
     
    +  test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
    +    withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
    +      sparkContext.conf.set(IGNORE_MISSING_FILES.key, "true")
    +      val testData = sparkContext.parallelize(
    +        (1 to 10).map(i => TestData(i, i.toString))).toDF()
    +      testData.createOrReplaceTempView("testData")
    +
    +      val tmpDir = Files.createTempDir()
    +      // create the table for test
    +      sql(s"CREATE TABLE table_with_partition(key int,value string) " +
    --- End diff --
    
    we should call `withTable`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by jinxing64 <gi...@git.apache.org>.

Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Jenkins, retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    @jinxing64 Let's not talk too much about the problem of `spark.sql.hive.verifyPartitionPath`. We should just introduce `spark.sql.files.ignoreMissingFiles` and say this can replace `spark.sql.hive.verifyPartitionPath`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by jinxing64 <gi...@git.apache.org>.

Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    @cloud-fan @jiangxb1987 
    I updated and add a config `spark.files.ignoreMissingFiles`. It works for HadoopRDD and NewHadoopRDD in two cases: 
    1. "file not found" when `getPartitions`
    2. "file not found" when `compute`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89372/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r182075129
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
    @@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
         }
       }
     
    +  test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
    +    withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
    +      sparkContext.conf.set(IGNORE_MISSING_FILES.key, "true")
    +      val testData = sparkContext.parallelize(
    +        (1 to 10).map(i => TestData(i, i.toString))).toDF()
    +      testData.createOrReplaceTempView("testData")
    --- End diff --
    
    we should call `withTempView`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by jinxing64 <gi...@git.apache.org>.

Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    @cloud-fan Thanks again for review;
    I updated according to your comments  and please take another look.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    **[Test build #89397 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89397/testReport)** for PR 19868 at commit [`5893a2d`](https://github.com/apache/spark/commit/5893a2d992d6ad47e30465c0092579c006a605b2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2379/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    nvm, the new test follows the existing code style in this file. Feel free to address them in followup PR.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    **[Test build #89371 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89371/testReport)** for PR 19868 at commit [`921cfc5`](https://github.com/apache/spark/commit/921cfc507236461ec11e0d94facf7447a08a662e).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r181607994
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
    @@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
         }
       }
     
    +  test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
    --- End diff --
    
    we should add some document for `spark.sql.hive.verifyPartitionPath`, say it's replaced by `spark.files.ignoreMissingFiles` and will be removed in future releases.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r180713699
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
    @@ -176,12 +176,13 @@ class HadoopTableReader(
                   val matches = fs.globStatus(pathPattern)
                   matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
                 }
    -            // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
    +            // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
    --- End diff --
    
    This is a pretty old logic. Can you explain what's going on here and why your change works? It can help other people to understand your change quickly,


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19868#discussion_r181787994
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
    @@ -218,6 +232,10 @@ class NewHadoopRDD[K, V](
                     s"Skipped the rest content in the corrupted file: ${split.serializableHadoopSplit}",
                     e)
                   finished = true
    +            case e: FileNotFoundException if ignoreMissingFiles =>
    +              logWarning(s"Skipped missing file: ${split.serializableHadoopSplit}", e)
    +              finished = true
    +              null
    --- End diff --
    
    ditto


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    **[Test build #89371 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89371/testReport)** for PR 19868 at commit [`921cfc5`](https://github.com/apache/spark/commit/921cfc507236461ec11e0d94facf7447a08a662e).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89397/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19868
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org