You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by gengliangwang <gi...@git.apache.org> on 2018/11/28 16:21:48 UTC

[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

GitHub user gengliangwang opened a pull request:

    https://github.com/apache/spark/pull/23165

    [SPARK-26188][SQL] FileIndex: don't infer data types of partition columns if user specifies schema

    ## What changes were proposed in this pull request?
    
    This PR is to fix a regression introduced in: https://github.com/apache/spark/pull/21004/files#r236998030
    
    If user specifies schema, Spark don't need to infer data type for of partition columns, otherwise the data type might not match with the one user provided.
    E.g. for partition directory `p=4d`, after data type inference  the column value will be `4.0`.
    See https://issues.apache.org/jira/browse/SPARK-26188 for more details.
    
    ## How was this patch tested?
    
    Add unit test.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gengliangwang/spark fixFileIndex

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23165.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23165
    
----
commit 2866a9e1c1a7d42e6cf53474733c6f39e812c680
Author: Gengliang Wang <ge...@...>
Date:   2018-11-28T16:11:22Z

    fix

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99460/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    thanks, merging to master/2.4!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    **[Test build #99458 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99458/testReport)** for PR 23165 at commit [`7c5bdd1`](https://github.com/apache/spark/commit/7c5bdd155bd7b3b4a4fa87a259feac229aaa16e3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23165#discussion_r237515959
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -250,7 +276,13 @@ object PartitioningUtils {
           val rawColumnValue = columnSpec.drop(equalSignIndex + 1)
           assert(rawColumnValue.nonEmpty, s"Empty partition column value in '$columnSpec'")
     
    -      val literal = inferPartitionColumnValue(rawColumnValue, typeInference, timeZone)
    +      val literal = if (userSpecifiedDataTypes.contains(columnName)) {
    +        // SPARK-26188: if user provides corresponding column schema, process the column as String
    +        //              type and cast it as user specified data type later.
    --- End diff --
    
    1. the function returns `Option[(String, Literal)]`
    2. the function `inferPartitionColumnValue` is quite complex, don't want change it or write duplicated logic.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23165#discussion_r237532982
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -147,13 +163,13 @@ object PartitioningUtils {
             columnNames.zip(literals).map { case (name, Literal(_, dataType)) =>
               // We always assume partition columns are nullable since we've no idea whether null values
               // will be appended in the future.
    -          StructField(name, dataType, nullable = true)
    +          StructField(name, userSpecifiedDataTypes.getOrElse(name, dataType), nullable = true)
             }
           }
     
           // Finally, we create `Partition`s based on paths and resolved partition values.
           val partitions = resolvedPartitionValues.zip(pathsWithPartitionValues).map {
    -        case (PartitionValues(_, literals), (path, _)) =>
    +        case (PartitionValues(columnNames, literals), (path, _)) =>
    --- End diff --
    
    unnecessary change?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99451/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    **[Test build #99452 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99452/testReport)** for PR 23165 at commit [`0a20c9a`](https://github.com/apache/spark/commit/0a20c9a60a6264c6c122f16883cb682c605bcf88).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    **[Test build #99460 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99460/testReport)** for PR 23165 at commit [`36c65e5`](https://github.com/apache/spark/commit/36c65e5a4d8b173b331aee2d5658e1b25e6ce795).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23165#discussion_r237510162
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -250,7 +276,13 @@ object PartitioningUtils {
           val rawColumnValue = columnSpec.drop(equalSignIndex + 1)
           assert(rawColumnValue.nonEmpty, s"Empty partition column value in '$columnSpec'")
     
    -      val literal = inferPartitionColumnValue(rawColumnValue, typeInference, timeZone)
    +      val literal = if (userSpecifiedDataTypes.contains(columnName)) {
    +        // SPARK-26188: if user provides corresponding column schema, process the column as String
    +        //              type and cast it as user specified data type later.
    --- End diff --
    
    can we do the cast here? It's a good practise to put related code together


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23165#discussion_r237512471
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -94,18 +94,34 @@ object PartitioningUtils {
           paths: Seq[Path],
           typeInference: Boolean,
           basePaths: Set[Path],
    +      userSpecifiedSchema: Option[StructType],
    +      caseSensitive: Boolean,
           timeZoneId: String): PartitionSpec = {
    -    parsePartitions(paths, typeInference, basePaths, DateTimeUtils.getTimeZone(timeZoneId))
    +    parsePartitions(paths, typeInference, basePaths, userSpecifiedSchema,
    +      caseSensitive, DateTimeUtils.getTimeZone(timeZoneId))
       }
     
       private[datasources] def parsePartitions(
           paths: Seq[Path],
           typeInference: Boolean,
           basePaths: Set[Path],
    +      userSpecifiedSchema: Option[StructType],
    +      caseSensitive: Boolean,
           timeZone: TimeZone): PartitionSpec = {
    +    val userSpecifiedDataTypes = if (userSpecifiedSchema.isDefined) {
    +      val nameToDataType = userSpecifiedSchema.get.fields.map(f => f.name -> f.dataType).toMap
    +      if (caseSensitive) {
    +        CaseInsensitiveMap(nameToDataType)
    --- End diff --
    
    Yes, thanks for pointing it out :)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5521/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23165#discussion_r237339038
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala ---
    @@ -126,13 +126,14 @@ abstract class PartitioningAwareFileIndex(
         val caseInsensitiveOptions = CaseInsensitiveMap(parameters)
         val timeZoneId = caseInsensitiveOptions.get(DateTimeUtils.TIMEZONE_OPTION)
           .getOrElse(sparkSession.sessionState.conf.sessionLocalTimeZone)
    -    val inferredPartitionSpec = PartitioningUtils.parsePartitions(
    -      leafDirs,
    -      typeInference = sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled,
    -      basePaths = basePaths,
    -      timeZoneId = timeZoneId)
    +
         userSpecifiedSchema match {
           case Some(userProvidedSchema) if userProvidedSchema.nonEmpty =>
    +        val inferredPartitionSpec = PartitioningUtils.parsePartitions(
    +          leafDirs,
    +          typeInference = false,
    --- End diff --
    
    can you add some comment, so that we don't make the same mistake in the future?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99458/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    @cloud-fan @mgaido91 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99385/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99452/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23165#discussion_r237513657
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -94,18 +94,34 @@ object PartitioningUtils {
           paths: Seq[Path],
           typeInference: Boolean,
           basePaths: Set[Path],
    +      userSpecifiedSchema: Option[StructType],
    +      caseSensitive: Boolean,
           timeZoneId: String): PartitionSpec = {
    -    parsePartitions(paths, typeInference, basePaths, DateTimeUtils.getTimeZone(timeZoneId))
    +    parsePartitions(paths, typeInference, basePaths, userSpecifiedSchema,
    +      caseSensitive, DateTimeUtils.getTimeZone(timeZoneId))
       }
     
       private[datasources] def parsePartitions(
           paths: Seq[Path],
           typeInference: Boolean,
           basePaths: Set[Path],
    +      userSpecifiedSchema: Option[StructType],
    +      caseSensitive: Boolean,
           timeZone: TimeZone): PartitionSpec = {
    +    val userSpecifiedDataTypes = if (userSpecifiedSchema.isDefined) {
    --- End diff --
    
    Personally I prefer to make the parameter simple and easy to understand. So that the logic of caller(outside the `PartitioningUtils`) looks cleaner.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    @mgaido91 The user specified schema might not match the full data schema. For the missing columns, we still need to infer their data types.
    I will come up with a solution soon.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23165#discussion_r237516228
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -250,7 +276,13 @@ object PartitioningUtils {
           val rawColumnValue = columnSpec.drop(equalSignIndex + 1)
           assert(rawColumnValue.nonEmpty, s"Empty partition column value in '$columnSpec'")
     
    -      val literal = inferPartitionColumnValue(rawColumnValue, typeInference, timeZone)
    +      val literal = if (userSpecifiedDataTypes.contains(columnName)) {
    +        // SPARK-26188: if user provides corresponding column schema, process the column as String
    +        //              type and cast it as user specified data type later.
    +        inferPartitionColumnValue(rawColumnValue, false, timeZone)
    --- End diff --
    
    See my reasons above.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    **[Test build #99451 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99451/testReport)** for PR 23165 at commit [`f25730e`](https://github.com/apache/spark/commit/f25730e6d73391a78c269a912a3cef49e94ed7d5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/23165


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23165#discussion_r237510705
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -94,18 +94,34 @@ object PartitioningUtils {
           paths: Seq[Path],
           typeInference: Boolean,
           basePaths: Set[Path],
    +      userSpecifiedSchema: Option[StructType],
    +      caseSensitive: Boolean,
           timeZoneId: String): PartitionSpec = {
    -    parsePartitions(paths, typeInference, basePaths, DateTimeUtils.getTimeZone(timeZoneId))
    +    parsePartitions(paths, typeInference, basePaths, userSpecifiedSchema,
    +      caseSensitive, DateTimeUtils.getTimeZone(timeZoneId))
       }
     
       private[datasources] def parsePartitions(
           paths: Seq[Path],
           typeInference: Boolean,
           basePaths: Set[Path],
    +      userSpecifiedSchema: Option[StructType],
    +      caseSensitive: Boolean,
           timeZone: TimeZone): PartitionSpec = {
    +    val userSpecifiedDataTypes = if (userSpecifiedSchema.isDefined) {
    --- End diff --
    
    can we build this at the caller side out of `PartitioningUtils`? Then we only need one extra parameter.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5522/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    **[Test build #99451 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99451/testReport)** for PR 23165 at commit [`f25730e`](https://github.com/apache/spark/commit/f25730e6d73391a78c269a912a3cef49e94ed7d5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23165#discussion_r237510736
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -250,7 +276,13 @@ object PartitioningUtils {
           val rawColumnValue = columnSpec.drop(equalSignIndex + 1)
           assert(rawColumnValue.nonEmpty, s"Empty partition column value in '$columnSpec'")
     
    -      val literal = inferPartitionColumnValue(rawColumnValue, typeInference, timeZone)
    +      val literal = if (userSpecifiedDataTypes.contains(columnName)) {
    +        // SPARK-26188: if user provides corresponding column schema, process the column as String
    +        //              type and cast it as user specified data type later.
    +        inferPartitionColumnValue(rawColumnValue, false, timeZone)
    --- End diff --
    
    can't we add the cast here?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23165#discussion_r237518419
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -250,7 +276,13 @@ object PartitioningUtils {
           val rawColumnValue = columnSpec.drop(equalSignIndex + 1)
           assert(rawColumnValue.nonEmpty, s"Empty partition column value in '$columnSpec'")
     
    -      val literal = inferPartitionColumnValue(rawColumnValue, typeInference, timeZone)
    +      val literal = if (userSpecifiedDataTypes.contains(columnName)) {
    +        // SPARK-26188: if user provides corresponding column schema, process the column as String
    +        //              type and cast it as user specified data type later.
    +        inferPartitionColumnValue(rawColumnValue, false, timeZone)
    --- End diff --
    
    can't we make it returning `Option[(String, Literal)]`? If not, what about `Literal(Cast(...).eval())`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23165#discussion_r237508120
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala ---
    @@ -126,33 +126,15 @@ abstract class PartitioningAwareFileIndex(
         val caseInsensitiveOptions = CaseInsensitiveMap(parameters)
         val timeZoneId = caseInsensitiveOptions.get(DateTimeUtils.TIMEZONE_OPTION)
           .getOrElse(sparkSession.sessionState.conf.sessionLocalTimeZone)
    -    val inferredPartitionSpec = PartitioningUtils.parsePartitions(
    +
    +    val caseSensitive = sparkSession.sqlContext.conf.caseSensitiveAnalysis
    +    PartitioningUtils.parsePartitions(
           leafDirs,
           typeInference = sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled,
           basePaths = basePaths,
    +      userSpecifiedSchema = userSpecifiedSchema,
    +      caseSensitive = caseSensitive,
           timeZoneId = timeZoneId)
    -    userSpecifiedSchema match {
    -      case Some(userProvidedSchema) if userProvidedSchema.nonEmpty =>
    -        val userPartitionSchema =
    -          combineInferredAndUserSpecifiedPartitionSchema(inferredPartitionSpec)
    --- End diff --
    
    we can remove `combineInferredAndUserSpecifiedPartitionSchema` now


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5460/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    **[Test build #99385 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99385/testReport)** for PR 23165 at commit [`2866a9e`](https://github.com/apache/spark/commit/2866a9e1c1a7d42e6cf53474733c6f39e812c680).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    **[Test build #99452 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99452/testReport)** for PR 23165 at commit [`0a20c9a`](https://github.com/apache/spark/commit/0a20c9a60a6264c6c122f16883cb682c605bcf88).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23165#discussion_r237339152
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala ---
    @@ -49,6 +50,21 @@ class FileIndexSuite extends SharedSQLContext {
         }
       }
     
    +  test("SPARK-26188: don't infer data types of partition columns if user specifies schema") {
    +    withTempDir { dir =>
    +      val partitionDirectory = new File(dir, s"a=4d")
    +      partitionDirectory.mkdir()
    +      val file = new File(partitionDirectory, "text.txt")
    +      stringToFile(file, "text")
    +      val path = new Path(dir.getCanonicalPath)
    +      val schema = StructType(Seq(StructField("a", StringType, false)))
    +      val catalog = new InMemoryFileIndex(spark, Seq(path), Map.empty, Some(schema))
    --- End diff --
    
    `catalog` ->`fileIndex`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5528/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    **[Test build #99385 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99385/testReport)** for PR 23165 at commit [`2866a9e`](https://github.com/apache/spark/commit/2866a9e1c1a7d42e6cf53474733c6f39e812c680).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    **[Test build #99458 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99458/testReport)** for PR 23165 at commit [`7c5bdd1`](https://github.com/apache/spark/commit/7c5bdd155bd7b3b4a4fa87a259feac229aaa16e3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    **[Test build #99460 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99460/testReport)** for PR 23165 at commit [`36c65e5`](https://github.com/apache/spark/commit/36c65e5a4d8b173b331aee2d5658e1b25e6ce795).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23165
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23165#discussion_r237523177
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -250,7 +276,13 @@ object PartitioningUtils {
           val rawColumnValue = columnSpec.drop(equalSignIndex + 1)
           assert(rawColumnValue.nonEmpty, s"Empty partition column value in '$columnSpec'")
     
    -      val literal = inferPartitionColumnValue(rawColumnValue, typeInference, timeZone)
    +      val literal = if (userSpecifiedDataTypes.contains(columnName)) {
    +        // SPARK-26188: if user provides corresponding column schema, process the column as String
    +        //              type and cast it as user specified data type later.
    +        inferPartitionColumnValue(rawColumnValue, false, timeZone)
    --- End diff --
    
    Thanks. I will use the Cast one.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23165#discussion_r237510234
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
    @@ -94,18 +94,34 @@ object PartitioningUtils {
           paths: Seq[Path],
           typeInference: Boolean,
           basePaths: Set[Path],
    +      userSpecifiedSchema: Option[StructType],
    +      caseSensitive: Boolean,
           timeZoneId: String): PartitionSpec = {
    -    parsePartitions(paths, typeInference, basePaths, DateTimeUtils.getTimeZone(timeZoneId))
    +    parsePartitions(paths, typeInference, basePaths, userSpecifiedSchema,
    +      caseSensitive, DateTimeUtils.getTimeZone(timeZoneId))
       }
     
       private[datasources] def parsePartitions(
           paths: Seq[Path],
           typeInference: Boolean,
           basePaths: Set[Path],
    +      userSpecifiedSchema: Option[StructType],
    +      caseSensitive: Boolean,
           timeZone: TimeZone): PartitionSpec = {
    +    val userSpecifiedDataTypes = if (userSpecifiedSchema.isDefined) {
    +      val nameToDataType = userSpecifiedSchema.get.fields.map(f => f.name -> f.dataType).toMap
    +      if (caseSensitive) {
    +        CaseInsensitiveMap(nameToDataType)
    --- End diff --
    
    isn't this if `!caseSensitive`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org