You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/02/01 19:49:38 UTC

[GitHub] [spark] MaxGekk opened a new pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

MaxGekk opened a new pull request #31423:
URL: https://github.com/apache/spark/pull/31423


   ### What changes were proposed in this pull request?
   Create new file index after partition schema inferring, and use the inferred partition schema in it.
   
   ### Why are the changes needed?
   The changes fix the bug demonstrated by the example:
   1. There are partitioned parquet files (file format doesn't matter):
   ```
   /private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-e09eae99-7ecf-4ab2-b99b-f63f8dea658d
   ├── _SUCCESS
   ├── part=-0
   │   └── part-00001-02144398-2896-4d21-9628-a8743d098cb4.c000.snappy.parquet
   └── part=AA
       └── part-00000-02144398-2896-4d21-9628-a8743d098cb4.c000.snappy.parquet
   ```
   placed to two partitions "AA" and **"-0"**.
   
   2. When reading them w/o specified schema:
   ```
   val df = spark.read.parquet(path)
   df.printSchema()
   root
    |-- id: integer (nullable = true)
    |-- part: string (nullable = true)
   ```
   the inferred type of the partition column `part` is the **string** type.
   3. The expected values in the column `part` are "AA" and "-0" but we get:
   ```
   df.show(false)
   +---+----+
   |id |part|
   +---+----+
   |0  |AA  |
   |1  |0   |
   +---+----+
   ```
   So, Spark returns **"0"** instead of **"-0"**.
   
   ### Does this PR introduce _any_ user-facing change?
   This PR can change query results.
   
   ### How was this patch tested?
   By running new test and existing test suites:
   ```
   $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *FileIndexSuite"
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31423:
URL: https://github.com/apache/spark/pull/31423#discussion_r568362026



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
##########
@@ -370,7 +370,7 @@ class PartitionedTablePerfStatsSuite
           assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount() == 0)
 
           // reads and caches all the files initially
-          assert(HiveCatalogMetrics.METRIC_FILES_DISCOVERED.getCount() == 5)
+          assert(HiveCatalogMetrics.METRIC_FILES_DISCOVERED.getCount() == 10)

Review comment:
       The second partition discovery should re-use file statuses from the cache if the cache is NoOp cache




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan closed pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #31423:
URL: https://github.com/apache/spark/pull/31423


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31423:
URL: https://github.com/apache/spark/pull/31423#discussion_r568362026



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
##########
@@ -370,7 +370,7 @@ class PartitionedTablePerfStatsSuite
           assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount() == 0)
 
           // reads and caches all the files initially
-          assert(HiveCatalogMetrics.METRIC_FILES_DISCOVERED.getCount() == 5)
+          assert(HiveCatalogMetrics.METRIC_FILES_DISCOVERED.getCount() == 10)

Review comment:
       The second partition discovery should re-use file statuses from the cache if the cache is not the NoOp cache




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-777398209


   > ... shall we collect the original string and the inferred type, then find the common type and cast the original string to the common type directly?
   
   Here is the PR https://github.com/apache/spark/pull/31549


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31423:
URL: https://github.com/apache/spark/pull/31423#discussion_r568362848



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
##########
@@ -413,9 +413,13 @@ case class DataSource(
         } else {
           val globbedPaths = checkAndGlobPathIfNecessary(
             checkEmptyGlobPath = true, checkFilesExist = checkFilesExist)
-          val index = createInMemoryFileIndex(globbedPaths)
+          val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)
+          val indexInSchemaInferring = new InMemoryFileIndex(
+            sparkSession, globbedPaths, options, userSpecifiedSchema, fileStatusCache)
           val (resultDataSchema, resultPartitionSchema) =
-            getOrInferFileFormatSchema(format, () => index)
+            getOrInferFileFormatSchema(format, () => indexInSchemaInferring)
+          val index = new InMemoryFileIndex(
+            sparkSession, globbedPaths, options, Some(resultPartitionSchema), fileStatusCache)

Review comment:
       New index re-uses the file statuses cache `fileStatusCache`, so, this should allow to avoid additional accesses to the file  system.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-772496292


   **[Test build #134824 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134824/testReport)** for PR 31423 at commit [`eeda34a`](https://github.com/apache/spark/commit/eeda34a4dca66c9ae747c0f5fa1b963f877f6c87).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771483498


   @MaxGekk can you provide more code-wise details about how this bug is triggered?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771146144


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39330/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771183445


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39330/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771239522


   **[Test build #134744 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134744/testReport)** for PR 31423 at commit [`66ee353`](https://github.com/apache/spark/commit/66ee3533f18011158cfc4e86c449d05a504810ef).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771115304


   **[Test build #134744 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134744/testReport)** for PR 31423 at commit [`66ee353`](https://github.com/apache/spark/commit/66ee3533f18011158cfc4e86c449d05a504810ef).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771156648


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39330/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31423:
URL: https://github.com/apache/spark/pull/31423#discussion_r568432026



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
##########
@@ -413,9 +413,13 @@ case class DataSource(
         } else {
           val globbedPaths = checkAndGlobPathIfNecessary(
             checkEmptyGlobPath = true, checkFilesExist = checkFilesExist)
-          val index = createInMemoryFileIndex(globbedPaths)
+          val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)
+          val indexInSchemaInferring = new InMemoryFileIndex(

Review comment:
       `indexForSchemaInference`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-772383176


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39411/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31423:
URL: https://github.com/apache/spark/pull/31423#discussion_r568432026



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
##########
@@ -413,9 +413,13 @@ case class DataSource(
         } else {
           val globbedPaths = checkAndGlobPathIfNecessary(
             checkEmptyGlobPath = true, checkFilesExist = checkFilesExist)
-          val index = createInMemoryFileIndex(globbedPaths)
+          val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)
+          val indexInSchemaInferring = new InMemoryFileIndex(

Review comment:
       `indexForSchemaInference`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771242164






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771336470


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134748/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31423:
URL: https://github.com/apache/spark/pull/31423#discussion_r568362026



##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
##########
@@ -370,7 +370,7 @@ class PartitionedTablePerfStatsSuite
           assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount() == 0)
 
           // reads and caches all the files initially
-          assert(HiveCatalogMetrics.METRIC_FILES_DISCOVERED.getCount() == 5)
+          assert(HiveCatalogMetrics.METRIC_FILES_DISCOVERED.getCount() == 10)

Review comment:
       The second partition discovery should re-use file statuses from the cache if the cache is NoOp cache

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
##########
@@ -413,9 +413,13 @@ case class DataSource(
         } else {
           val globbedPaths = checkAndGlobPathIfNecessary(
             checkEmptyGlobPath = true, checkFilesExist = checkFilesExist)
-          val index = createInMemoryFileIndex(globbedPaths)
+          val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)
+          val indexInSchemaInferring = new InMemoryFileIndex(
+            sparkSession, globbedPaths, options, userSpecifiedSchema, fileStatusCache)
           val (resultDataSchema, resultPartitionSchema) =
-            getOrInferFileFormatSchema(format, () => index)
+            getOrInferFileFormatSchema(format, () => indexInSchemaInferring)
+          val index = new InMemoryFileIndex(
+            sparkSession, globbedPaths, options, Some(resultPartitionSchema), fileStatusCache)

Review comment:
       New index re-uses the file statuses cache `fileStatusCache`, so, this should allow to avoid additional accesses to the file  system.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771115304






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-772381589


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39411/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771215978


   **[Test build #134748 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134748/testReport)** for PR 31423 at commit [`36a4a7f`](https://github.com/apache/spark/commit/36a4a7fc4d39b6efaeed7f81ba55d108cf80023c).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-772346012


   **[Test build #134824 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134824/testReport)** for PR 31423 at commit [`eeda34a`](https://github.com/apache/spark/commit/eeda34a4dca66c9ae747c0f5fa1b963f877f6c87).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771242164


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134744/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #31423:
URL: https://github.com/apache/spark/pull/31423#discussion_r569201445



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
##########
@@ -413,9 +413,13 @@ case class DataSource(
         } else {
           val globbedPaths = checkAndGlobPathIfNecessary(
             checkEmptyGlobPath = true, checkFilesExist = checkFilesExist)
-          val index = createInMemoryFileIndex(globbedPaths)
+          val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)
+          val indexInSchemaInferring = new InMemoryFileIndex(
+            sparkSession, globbedPaths, options, userSpecifiedSchema, fileStatusCache)

Review comment:
       What's diff from calling `createInMemoryFileIndex` here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-772323962


   Looks good to me


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31423:
URL: https://github.com/apache/spark/pull/31423#discussion_r569209389



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
##########
@@ -413,9 +413,13 @@ case class DataSource(
         } else {
           val globbedPaths = checkAndGlobPathIfNecessary(
             checkEmptyGlobPath = true, checkFilesExist = checkFilesExist)
-          val index = createInMemoryFileIndex(globbedPaths)
+          val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)
+          val indexInSchemaInferring = new InMemoryFileIndex(
+            sparkSession, globbedPaths, options, userSpecifiedSchema, fileStatusCache)

Review comment:
       `fileStatusCache`. I want to re-use it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-772383176


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39411/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771388471


   @cloud-fan @HyukjinKwon Please, have a look at the bug fix.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-772346012


   **[Test build #134824 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134824/testReport)** for PR 31423 at commit [`eeda34a`](https://github.com/apache/spark/commit/eeda34a4dca66c9ae747c0f5fa1b963f877f6c87).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771320577


   **[Test build #134748 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134748/testReport)** for PR 31423 at commit [`36a4a7f`](https://github.com/apache/spark/commit/36a4a7fc4d39b6efaeed7f81ba55d108cf80023c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-772378495


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39411/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771483498


   @MaxGekk can you provide more code-wise details about how this bug is triggered?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771183445






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771388471


   @cloud-fan @HyukjinKwon Please, have a look at the bug fix.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-772405194


   > can you provide more code-wise details about how this bug is triggered?
   
   @cloud-fan We have in the file system:
   ```
   $ tree /private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-f57b515d-7449-4ace-98e6-becde9490095
   /private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-f57b515d-7449-4ace-98e6-becde9490095
   ├── _SUCCESS
   ├── part=-0
   │   └── part-00001-6612e1e6-c81a-4763-81d8-ea9b446a525e.c000.snappy.parquet
   └── part=AA
       └── part-00000-6612e1e6-c81a-4763-81d8-ea9b446a525e.c000.snappy.parquet
   ```
   
   `getOrInferFileFormatSchema(format, () => index)` -> `tempFileIndex.partitionSchema` -> `cachedPartitionSpec = inferPartitioning()`:
   https://github.com/apache/spark/blob/467d7589737d2430d09f1ffbd33bf801d179f990/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L73
   -> parsePartitions():
   https://github.com/apache/spark/blob/a093d6feefb0e086d19c86ae53bf92df12ccf2fa/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L138-L142
   
   `parsePartition()` parses every partition individually and finds the "best" type for every partition value:
   - `part=AA` -> `parsePartitionColumn` -> `inferPartitionColumnValue` returns: `Literal.create("AA", StringType)`
   - `part=-0` -> `parsePartitionColumn` -> `inferPartitionColumnValue` returns: `Literal.create(0, IntegerType)`
   
   Inferring the common type:
   ```
   val resolvedPartitionValues =
           resolvePartitions(pathsWithPartitionValues, caseSensitive, zoneId)
   ```
   After inferring the string type, and casting the partition value literals, we get:
   https://github.com/apache/spark/blob/a093d6feefb0e086d19c86ae53bf92df12ccf2fa/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L196
   ```
   PartitionSpec(
       partitionColumns = StructType(StructField(StringType, ...)),
       partitions = Seq(
           UTF8String("AA") -> path.`part=AA`,
           UTF8String("0") -> path.`part=-0`
       )
   )
   ```
   Then the partitions are cached, and used in scans.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771266963


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39334/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-772529620


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134824/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771266963


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39334/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-772529620


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134824/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771215978


   **[Test build #134748 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134748/testReport)** for PR 31423 at commit [`36a4a7f`](https://github.com/apache/spark/commit/36a4a7fc4d39b6efaeed7f81ba55d108cf80023c).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771239522






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #31423: [SPARK-34314][SQL] Create new file index after partition schema inferring w/ the schema

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-772638569


   @MaxGekk shall we fix the merging of partition inference results? Instead of collecting a bunch of literals, find the common type, and cast the literals, shall we collect the original string and the inferred type, then find the common type and cast the original string to the common type directly?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771336470


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134748/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771115304


   **[Test build #134744 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134744/testReport)** for PR 31423 at commit [`66ee353`](https://github.com/apache/spark/commit/66ee3533f18011158cfc4e86c449d05a504810ef).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31423: [SPARK-34314][SQL] Rebuild the file index after partition schema inferring

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31423:
URL: https://github.com/apache/spark/pull/31423#issuecomment-771183445






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org