You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by jinxing64 <gi...@git.apache.org> on 2017/12/03 07:10:43 UTC
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
GitHub user jinxing64 opened a pull request:
https://github.com/apache/spark/pull/19868
[SPARK-22676] Avoid iterating all partition paths when spark.sql.hive.verifyPartitionPath=true
## What changes were proposed in this pull request?
In current code, it will scanning all partition paths when spark.sql.hive.verifyPartitionPath=true.
e.g. table like below:
```
CREATE TABLE `test`(
`id` int,
`age` int,
`name` string)
PARTITIONED BY (
`A` string,
`B` string)
load data local inpath '/tmp/data0' into table test partition(A='00', B='00')
load data local inpath '/tmp/data1' into table test partition(A='01', B='01')
load data local inpath '/tmp/data2' into table test partition(A='10', B='10')
load data local inpath '/tmp/data3' into table test partition(A='11', B='11')
```
If I query with SQL – "select * from test where year=2017 and month=12 and day=03", current code will scan all partition paths including '/data/A=00/B=00', '/data/A=00/B=00', '/data/A=01/B=01', '/data/A=10/B=10', '/data/A=11/B=11'. It costs much time and memory cost.
This pr proposes to avoid iterating all partition paths. Convert /demo/data/year/month/day to /demo/data/year/month/*/ when generating path pattern.
## How was this patch tested?
Manually test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jinxing64/spark SPARK-22676
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19868.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19868
----
commit 57676609faed4512291979a8d639e3be1ec80578
Author: jinxing <ji...@126.com>
Date: 2017-12-03T07:07:12Z
[SPARK-22676] Avoid iterating all partition paths when spark.sql.hive.verifyPartitionPath=true
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:
https://github.com/apache/spark/pull/19868
@cloud-fan
Thanks for ping~ I updated the description. Let me know if I should refine it.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r181716009
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -197,17 +200,24 @@ class HadoopRDD[K, V](
val jobConf = getJobConf()
// add the credentials here as this can be called before SparkContext initialized
SparkHadoopUtil.get.addCredentials(jobConf)
- val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
- val inputSplits = if (ignoreEmptySplits) {
- allInputSplits.filter(_.getLength > 0)
- } else {
- allInputSplits
- }
- val array = new Array[Partition](inputSplits.size)
- for (i <- 0 until inputSplits.size) {
- array(i) = new HadoopPartition(id, i, inputSplits(i))
+ try {
+ val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
+ val inputSplits = if (ignoreEmptySplits) {
+ allInputSplits.filter(_.getLength > 0)
+ } else {
+ allInputSplits
+ }
+ val array = new Array[Partition](inputSplits.size)
+ for (i <- 0 until inputSplits.size) {
+ array(i) = new HadoopPartition(id, i, inputSplits(i))
+ }
+ array
+ } catch {
+ case e: InvalidInputException if ignoreMissingFiles =>
--- End diff --
Yes, when data dir doesn't exist.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19868
**[Test build #89424 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89424/testReport)** for PR 19868 at commit [`bf4277b`](https://github.com/apache/spark/commit/bf4277bf862000bb000d0cbf11092d8565f42afb).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r182075222
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
@@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
}
}
+ test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
+ withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
+ sparkContext.conf.set(IGNORE_MISSING_FILES.key, "true")
+ val testData = sparkContext.parallelize(
+ (1 to 10).map(i => TestData(i, i.toString))).toDF()
+ testData.createOrReplaceTempView("testData")
+
+ val tmpDir = Files.createTempDir()
--- End diff --
we should call `withTempDir`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2330/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r181607797
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -260,6 +270,10 @@ class HadoopRDD[K, V](
logWarning(s"Skipped the rest content in the corrupted file: ${split.inputSplit}", e)
finished = true
null
+ case e: FileNotFoundException if ignoreMissingFiles =>
--- End diff --
`FileNotFoundException` extends `IOException`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19868
Basically we need to introduce this new `spark.sql.files.ignoreMissingFiles` config in detail. And them explain how can we use it to replace `spark.sql.hive.verifyPartitionPath`.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/19868
can somebody explain to me what the pr description has to do with missingFiles? I'm probably missing something but i feel the implementation is very different from the pr description.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84399/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r180961987
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
val matches = fs.globStatus(pathPattern)
matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
}
- // convert /demo/data/year/month/day to /demo/data/*/*/*/
+ // convert /demo/data/year/month/day to /demo/data/year/month/*/
--- End diff --
Em... It seems we have to check all the levels unless we have specified a value for each partition column. We can make some improvement here but seems that require more complicated approach.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19868
**[Test build #84399 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84399/testReport)** for PR 19868 at commit [`5767660`](https://github.com/apache/spark/commit/57676609faed4512291979a8d639e3be1ec80578).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89438/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19868
retest this please
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r182077103
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
@@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
}
}
+ test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
+ withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
--- End diff --
nit: we usually write `withSQLConf(SQLConf.HIVE_VERIFY_PARTITION_PATH.key -> "false")`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r181014951
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
val matches = fs.globStatus(pathPattern)
matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
}
- // convert /demo/data/year/month/day to /demo/data/*/*/*/
+ // convert /demo/data/year/month/day to /demo/data/year/month/*/
--- End diff --
@cloud-fan @jiangxb1987
Thanks a lot for review.
> Em... It seems we have to check all the levels unless we have specified a value for each partition column. We can make some improvement here but seems that require more complicated approach.
Yes, true. In this change, I only optimize when user specify for each partition column, which is very common in the production -- as our user always did: `select xxx from yyy where year=yy and month=mm and day=dd`
I'm not sure about you guys idea: leave the current logic as it is(at least the code logic now is very simple)? or implement a more complicated approach and defend as many cases as possible? or do some improvement based on this pr and cover some very common cases?
Thanks again for review :)
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89370/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89371/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r180742133
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
val matches = fs.globStatus(pathPattern)
matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
}
- // convert /demo/data/year/month/day to /demo/data/*/*/*/
+ // convert /demo/data/year/month/day to /demo/data/year/month/*/
--- End diff --
and more trickily, `select * from test where B='01'`?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r181571746
--- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
@@ -124,17 +126,25 @@ class NewHadoopRDD[K, V](
configurable.setConf(_conf)
case _ =>
}
- val allRowSplits = inputFormat.getSplits(new JobContextImpl(_conf, jobId)).asScala
- val rawSplits = if (ignoreEmptySplits) {
- allRowSplits.filter(_.getLength > 0)
- } else {
- allRowSplits
- }
- val result = new Array[Partition](rawSplits.size)
- for (i <- 0 until rawSplits.size) {
- result(i) = new NewHadoopPartition(id, i, rawSplits(i).asInstanceOf[InputSplit with Writable])
+ try {
--- End diff --
Put original logic in try-catch
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r181718520
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -279,6 +293,10 @@ class HadoopRDD[K, V](
case e: IOException if ignoreCorruptFiles =>
logWarning(s"Skipped the rest content in the corrupted file: ${split.inputSplit}", e)
finished = true
+ case e: FileNotFoundException if ignoreMissingFiles =>
+ logWarning(s"Skipped missing file: ${split.inputSplit}", e)
+ finished = true
+ null
--- End diff --
Same logic with FileScanRDD -- `finished=true`, thus `NextIterator.hasNext` returns false.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/NextIterator.scala#L31
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89424/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r181294385
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
val matches = fs.globStatus(pathPattern)
matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
}
- // convert /demo/data/year/month/day to /demo/data/*/*/*/
+ // convert /demo/data/year/month/day to /demo/data/year/month/*/
--- End diff --
I'm not sure. `spark.sql.hive.verifyPartitionPath` is only for hive table scan and `` is only for `spark.sql.files.ignoreMissingFiles` is only for datasource scan, is it ?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19868
Can we also update the title?
```
Avoid iterating all partition paths when spark.sql.hive.verifyPartitionPath=true
```
This is not true, we didn't fix the problem of `spark.sql.hive.verifyPartitionPath`, but add a new config to replace it.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r181787857
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -276,6 +292,12 @@ class HadoopRDD[K, V](
try {
finished = !reader.next(key, value)
} catch {
+ case e: FileNotFoundException if ignoreMissingFiles =>
+ logWarning(s"Skipped missing file: ${split.inputSplit}", e)
+ finished = true
+ null
--- End diff --
the return value is not read by anyone
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r182075854
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
@@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
}
}
+ test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
+ withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
+ sparkContext.conf.set(IGNORE_MISSING_FILES.key, "true")
+ val testData = sparkContext.parallelize(
+ (1 to 10).map(i => TestData(i, i.toString))).toDF()
+ testData.createOrReplaceTempView("testData")
+
+ val tmpDir = Files.createTempDir()
+ // create the table for test
+ sql(s"CREATE TABLE table_with_partition(key int,value string) " +
+ s"PARTITIONED by (ds string) location '${tmpDir.toURI}' ")
+ sql("INSERT OVERWRITE TABLE table_with_partition partition (ds='1') " +
+ "SELECT key,value FROM testData")
+ sql("INSERT OVERWRITE TABLE table_with_partition partition (ds='2') " +
+ "SELECT key,value FROM testData")
+ sql("INSERT OVERWRITE TABLE table_with_partition partition (ds='3') " +
+ "SELECT key,value FROM testData")
+ sql("INSERT OVERWRITE TABLE table_with_partition partition (ds='4') " +
+ "SELECT key,value FROM testData")
+
+ // test for the exist path
+ checkAnswer(sql("select key,value from table_with_partition"),
+ testData.toDF.collect ++ testData.toDF.collect
--- End diff --
`testData.union(testData).union(testData).union(testData)`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19868
**[Test build #84399 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84399/testReport)** for PR 19868 at commit [`5767660`](https://github.com/apache/spark/commit/57676609faed4512291979a8d639e3be1ec80578).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/19868
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r181788194
--- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
@@ -195,6 +205,10 @@ class NewHadoopRDD[K, V](
e)
finished = true
null
+ case e: FileNotFoundException if ignoreMissingFiles =>
--- End diff --
shall we also apply https://github.com/apache/spark/pull/19868/files#diff-83eb37f7b0ebed3c14ccb7bff0d577c2R273 here?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:
https://github.com/apache/spark/pull/19868
cc @cloud-fan @jerryshao @jiangxb1987 would you take a look at this?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r181607746
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -197,17 +200,24 @@ class HadoopRDD[K, V](
val jobConf = getJobConf()
// add the credentials here as this can be called before SparkContext initialized
SparkHadoopUtil.get.addCredentials(jobConf)
- val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
- val inputSplits = if (ignoreEmptySplits) {
- allInputSplits.filter(_.getLength > 0)
- } else {
- allInputSplits
- }
- val array = new Array[Partition](inputSplits.size)
- for (i <- 0 until inputSplits.size) {
- array(i) = new HadoopPartition(id, i, inputSplits(i))
+ try {
+ val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
+ val inputSplits = if (ignoreEmptySplits) {
+ allInputSplits.filter(_.getLength > 0)
+ } else {
+ allInputSplits
+ }
+ val array = new Array[Partition](inputSplits.size)
+ for (i <- 0 until inputSplits.size) {
+ array(i) = new HadoopPartition(id, i, inputSplits(i))
+ }
+ array
+ } catch {
+ case e: InvalidInputException if ignoreMissingFiles =>
--- End diff --
when will this happen? the root path doesn't exist?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r181272430
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
val matches = fs.globStatus(pathPattern)
matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
}
- // convert /demo/data/year/month/day to /demo/data/*/*/*/
+ // convert /demo/data/year/month/day to /demo/data/year/month/*/
--- End diff --
Actually do we still need this check? now we have something like `spark.sql.files.ignoreMissingFiles`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:
https://github.com/apache/spark/pull/19868
Sure, let me do it today.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r181607863
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -279,6 +293,10 @@ class HadoopRDD[K, V](
case e: IOException if ignoreCorruptFiles =>
logWarning(s"Skipped the rest content in the corrupted file: ${split.inputSplit}", e)
finished = true
+ case e: FileNotFoundException if ignoreMissingFiles =>
+ logWarning(s"Skipped missing file: ${split.inputSplit}", e)
+ finished = true
+ null
--- End diff --
why returning null here?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19868
**[Test build #89424 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89424/testReport)** for PR 19868 at commit [`bf4277b`](https://github.com/apache/spark/commit/bf4277bf862000bb000d0cbf11092d8565f42afb).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r181571713
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -197,17 +200,24 @@ class HadoopRDD[K, V](
val jobConf = getJobConf()
// add the credentials here as this can be called before SparkContext initialized
SparkHadoopUtil.get.addCredentials(jobConf)
- val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
- val inputSplits = if (ignoreEmptySplits) {
- allInputSplits.filter(_.getLength > 0)
- } else {
- allInputSplits
- }
- val array = new Array[Partition](inputSplits.size)
- for (i <- 0 until inputSplits.size) {
- array(i) = new HadoopPartition(id, i, inputSplits(i))
+ try {
--- End diff --
Put original logic in try-catch
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19868
We switched to a totally different approach in the middle and forgot to update the PR description... @jinxing64 can you update it? thanks!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r181939704
--- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
@@ -195,6 +205,10 @@ class NewHadoopRDD[K, V](
e)
finished = true
null
+ case e: FileNotFoundException if ignoreMissingFiles =>
--- End diff --
Yes, I should change this.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19868
**[Test build #89397 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89397/testReport)** for PR 19868 at commit [`5893a2d`](https://github.com/apache/spark/commit/5893a2d992d6ad47e30465c0092579c006a605b2).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r181939661
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -276,6 +292,12 @@ class HadoopRDD[K, V](
try {
finished = !reader.next(key, value)
} catch {
+ case e: FileNotFoundException if ignoreMissingFiles =>
+ logWarning(s"Skipped missing file: ${split.inputSplit}", e)
+ finished = true
+ null
--- End diff --
I removed this.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r180741999
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
val matches = fs.globStatus(pathPattern)
matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
}
- // convert /demo/data/year/month/day to /demo/data/*/*/*/
+ // convert /demo/data/year/month/day to /demo/data/year/month/*/
--- End diff --
how about `select * from test where A='00'`? Shall we check one more level?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2366/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:
https://github.com/apache/spark/pull/19868
@cloud-fan
Thanks a lot for merging.
I will address the left comments today.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r180962121
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
val matches = fs.globStatus(pathPattern)
matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
}
- // convert /demo/data/year/month/day to /demo/data/*/*/*/
+ // convert /demo/data/year/month/day to /demo/data/year/month/*/
--- End diff --
If the partition columns are `A/B/C/D`, unless you specify a value for `A`, you have to check that level. The same case for `B`/`C`/`D`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r181538066
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
val matches = fs.globStatus(pathPattern)
matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
}
- // convert /demo/data/year/month/day to /demo/data/*/*/*/
+ // convert /demo/data/year/month/day to /demo/data/year/month/*/
--- End diff --
Sure, that would be great. Is there some existing pr/jira working on that? if not, I can make the change :)
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r181371382
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
val matches = fs.globStatus(pathPattern)
matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
}
- // convert /demo/data/year/month/day to /demo/data/*/*/*/
+ // convert /demo/data/year/month/day to /demo/data/year/month/*/
--- End diff --
can we make `spark.sql.files.ignoreMissingFiles` work for hive scan? it seems a better solution than this check.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19868
**[Test build #89370 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89370/testReport)** for PR 19868 at commit [`d8d4096`](https://github.com/apache/spark/commit/d8d40966e42e826a37a8f06ce46e55cbdbb529db).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r181788258
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -276,6 +292,12 @@ class HadoopRDD[K, V](
try {
finished = !reader.next(key, value)
} catch {
+ case e: FileNotFoundException if ignoreMissingFiles =>
+ logWarning(s"Skipped missing file: ${split.inputSplit}", e)
+ finished = true
+ null
--- End diff --
the return value is not read by anyone
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:
https://github.com/apache/spark/pull/19868
Sure, updated.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2331/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Replace spark.sql.hive.verifyPartitionPath...
Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:
https://github.com/apache/spark/pull/19868
Sure, updated.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19868
thanks, merging to master!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19868
**[Test build #89372 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89372/testReport)** for PR 19868 at commit [`921cfc5`](https://github.com/apache/spark/commit/921cfc507236461ec11e0d94facf7447a08a662e).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r180733361
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
val matches = fs.globStatus(pathPattern)
matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
}
- // convert /demo/data/year/month/day to /demo/data/*/*/*/
+ // convert /demo/data/year/month/day to /demo/data/year/month/*/
--- End diff --
@cloud-fan
Thanks a lot for review;
Like I mentioned in description, "/demo/data/year/month/day" will not be converted to "/demo/data/*/*/*/", instead it's converted to "/demo/data/year/month/*/" as a path pattern. When this path pattern is passed to `updateExistPathSetByPathPattern`, less matched paths will be returned.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84408/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19868
**[Test build #84408 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84408/testReport)** for PR 19868 at commit [`5767660`](https://github.com/apache/spark/commit/57676609faed4512291979a8d639e3be1ec80578).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19868
**[Test build #84408 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84408/testReport)** for PR 19868 at commit [`5767660`](https://github.com/apache/spark/commit/57676609faed4512291979a8d639e3be1ec80578).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19868
**[Test build #89438 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89438/testReport)** for PR 19868 at commit [`bf4277b`](https://github.com/apache/spark/commit/bf4277bf862000bb000d0cbf11092d8565f42afb).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19868
**[Test build #89438 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89438/testReport)** for PR 19868 at commit [`bf4277b`](https://github.com/apache/spark/commit/bf4277bf862000bb000d0cbf11092d8565f42afb).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19868
**[Test build #89370 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89370/testReport)** for PR 19868 at commit [`d8d4096`](https://github.com/apache/spark/commit/d8d40966e42e826a37a8f06ce46e55cbdbb529db).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19868
**[Test build #89372 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89372/testReport)** for PR 19868 at commit [`921cfc5`](https://github.com/apache/spark/commit/921cfc507236461ec11e0d94facf7447a08a662e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2345/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19868
retest this please
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r182075379
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
@@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
}
}
+ test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
+ withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
+ sparkContext.conf.set(IGNORE_MISSING_FILES.key, "true")
+ val testData = sparkContext.parallelize(
+ (1 to 10).map(i => TestData(i, i.toString))).toDF()
+ testData.createOrReplaceTempView("testData")
+
+ val tmpDir = Files.createTempDir()
+ // create the table for test
+ sql(s"CREATE TABLE table_with_partition(key int,value string) " +
--- End diff --
we should call `withTable`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:
https://github.com/apache/spark/pull/19868
Jenkins, retest this please.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19868
@jinxing64 Let's not talk too much about the problem of `spark.sql.hive.verifyPartitionPath`. We should just introduce `spark.sql.files.ignoreMissingFiles` and say this can replace `spark.sql.hive.verifyPartitionPath`.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:
https://github.com/apache/spark/pull/19868
@cloud-fan @jiangxb1987
I updated and add a config `spark.files.ignoreMissingFiles`. It works for HadoopRDD and NewHadoopRDD in two cases:
1. "file not found" when `getPartitions`
2. "file not found" when `compute`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89372/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r182075129
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
@@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
}
}
+ test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
+ withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
+ sparkContext.conf.set(IGNORE_MISSING_FILES.key, "true")
+ val testData = sparkContext.parallelize(
+ (1 to 10).map(i => TestData(i, i.toString))).toDF()
+ testData.createOrReplaceTempView("testData")
--- End diff --
we should call `withTempView`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:
https://github.com/apache/spark/pull/19868
@cloud-fan Thanks again for review;
I updated according to your comments and please take another look.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19868
**[Test build #89397 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89397/testReport)** for PR 19868 at commit [`5893a2d`](https://github.com/apache/spark/commit/5893a2d992d6ad47e30465c0092579c006a605b2).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2379/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19868
nvm, the new test follows the existing code style in this file. Feel free to address them in followup PR.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19868
**[Test build #89371 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89371/testReport)** for PR 19868 at commit [`921cfc5`](https://github.com/apache/spark/commit/921cfc507236461ec11e0d94facf7447a08a662e).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r181607994
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
@@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
}
}
+ test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
--- End diff --
we should add some document for `spark.sql.hive.verifyPartitionPath`, say it's replaced by `spark.files.ignoreMissingFiles` and will be removed in future releases.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r180713699
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
val matches = fs.globStatus(pathPattern)
matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
}
- // convert /demo/data/year/month/day to /demo/data/*/*/*/
+ // convert /demo/data/year/month/day to /demo/data/year/month/*/
--- End diff --
This is a pretty old logic. Can you explain what's going on here and why your change works? It can help other people to understand your change quickly,
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19868#discussion_r181787994
--- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
@@ -218,6 +232,10 @@ class NewHadoopRDD[K, V](
s"Skipped the rest content in the corrupted file: ${split.serializableHadoopSplit}",
e)
finished = true
+ case e: FileNotFoundException if ignoreMissingFiles =>
+ logWarning(s"Skipped missing file: ${split.serializableHadoopSplit}", e)
+ finished = true
+ null
--- End diff --
ditto
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19868
**[Test build #89371 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89371/testReport)** for PR 19868 at commit [`921cfc5`](https://github.com/apache/spark/commit/921cfc507236461ec11e0d94facf7447a08a662e).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89397/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19868
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org