
Posted to reviews@spark.apache.org by dongjoon-hyun <gi...@git.apache.org> on 2018/01/09 18:35:45 UTC

[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/20208

    [SPARK-23007][SQL][TEST] Add schema evolution test suite for file-based data sources

    ## What changes were proposed in this pull request?
    
    A schema can evolve in several ways, and the following are already supported in file-based data sources.
    
       1. Add a column
       2. Remove a column
       3. Change a column position
       4. Change a column type
    
    This issue aims to guarantee backward-compatible schema evolution coverage for users of file-based data sources and to prevent future regressions by *adding schema evolution test suites explicitly*.
    
    Here, we consider safe evolution without data loss. For example, data type evolution should be from small types to larger types like `int`-to-`long`, not vice versa.
    
    As of today, in the master branch, file-based data sources have the following schema evolution coverage.
    
    File Format | Coverage   | Note
    ----------- | ---------- | ------------------------------------------------
    TEXT        | N/A        | Schema consists of a single string column.
    CSV         | 1, 2, 4    |
    JSON        | 1, 2, 3, 4 |
    ORC         | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage.
    PARQUET     | 1, 2, 3    |
    
    
    ## How was this patch tested?
    
    Pass Jenkins with the newly added test suites.
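
The "safe evolution without data loss" rule in the description above can be illustrated with a minimal sketch. This is not Spark's implementation: the object name `SafeWidening`, the function `isSafe`, and the narrowest-to-widest ordering of integral types are all assumptions made for illustration only.

```scala
// Hypothetical model of lossless type evolution; not taken from the PR.
object SafeWidening {
  // Integral types ordered from narrowest to widest (an assumption for illustration).
  private val integralOrder: Vector[String] = Vector("byte", "short", "int", "long")

  /** True when both types are known integral types and `to` is at least as wide as `from`. */
  def isSafe(from: String, to: String): Boolean = {
    val i = integralOrder.indexOf(from)
    val j = integralOrder.indexOf(to)
    // indexOf returns -1 for unknown types, which are treated as unsafe here.
    i >= 0 && j >= 0 && i <= j
  }
}
```

Under these assumptions, `SafeWidening.isSafe("int", "long")` holds while `SafeWidening.isSafe("long", "int")` does not, which mirrors the `int`-to-`long` example in the description.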

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-SCHEMA-EVOLUTION

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20208.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20208
    
----
commit 499801e7fdd545ac5918dd5f7a9294db2d5373be
Author: Dongjoon Hyun <do...@...>
Date:   2018-01-07T00:02:09Z

    [SPARK-23007][SQL][TEST] Add schema evolution test suite for file-based data sources

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Retest this please.


---


[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add read schema suite fo...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r201499289
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaSuite.scala ---
    @@ -0,0 +1,181 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import org.apache.spark.sql.internal.SQLConf
    +
    +/**
    + * Read schema suites have the following hierarchy and aim to guarantee users
    + * backward-compatible read-schema change coverage on file-based data sources, and
    + * to prevent future regressions.
    + *
    + *   ReadSchemaSuite
    + *     -> CSVReadSchemaSuite
    + *     -> HeaderCSVReadSchemaSuite
    + *
    + *     -> JsonReadSchemaSuite
    + *
    + *     -> OrcReadSchemaSuite
    + *     -> VectorizedOrcReadSchemaSuite
    + *
    + *     -> ParquetReadSchemaSuite
    + *     -> VectorizedParquetReadSchemaSuite
    + *     -> MergedParquetReadSchemaSuite
    + */
    +
    +/**
    + * All file-based data sources support column addition and removal at the end.
    + */
    +abstract class ReadSchemaSuite
    --- End diff --
    
    @gatorsmile, it is now named `ReadSchemaSuite`.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Yes


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Spark 2.3 has finally passed the vote. Could you review this, @gatorsmile?


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Retest this please.


---


[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r162837578
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala ---
    @@ -0,0 +1,406 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.io.File
    +
    +import org.apache.spark.sql.{QueryTest, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
    +
    +/**
    + * Schema can evolve in several ways and the following are supported in file-based data sources.
    + *
    + *   1. Add a column
    + *   2. Remove a column
    + *   3. Change a column position
    + *   4. Change a column type
    + *
    + * Here, we consider safe evolution without data loss. For example, data type evolution should be
    + * from small types to larger types like `int`-to-`long`, not vice versa.
    + *
    + * So far, file-based data sources have the following schema evolution coverage.
    + *
    + *   | File Format  | Coverage     | Note                                                   |
    + *   | ------------ | ------------ | ------------------------------------------------------ |
    + *   | TEXT         | N/A          | Schema consists of a single string column.             |
    + *   | CSV          | 1, 2, 4      |                                                        |
    + *   | JSON         | 1, 2, 3, 4   |                                                        |
    + *   | ORC          | 1, 2, 3, 4   | Native vectorized ORC reader has the widest coverage.  |
    + *   | PARQUET      | 1, 2, 3      |                                                        |
    --- End diff --
    
    Ohaaa, the schema is explicitly set here. Sorry, I missed it.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #92744 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92744/testReport)** for PR 20208 at commit [`ebd239e`](https://github.com/apache/spark/commit/ebd239eab0aa2b03b211cd470eb33d5a538f594a).


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1285/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92731/
    Test FAILed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86599/
    Test FAILed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #86144 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86144/testReport)** for PR 20208 at commit [`499801e`](https://github.com/apache/spark/commit/499801e7fdd545ac5918dd5f7a9294db2d5373be).


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    @gatorsmile . Please let me know if I need to do more.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    We are working on the Spark 2.3 release. Could you ping us after the release? 


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #87752 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87752/testReport)** for PR 20208 at commit [`6ae471c`](https://github.com/apache/spark/commit/6ae471c8ecaae3eb3888eecaac1c4e7552bedcc6).


---


[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r162835707
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala ---
    @@ -0,0 +1,406 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.io.File
    +
    +import org.apache.spark.sql.{QueryTest, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
    +
    +/**
    + * Schema can evolve in several ways and the following are supported in file-based data sources.
    + *
    + *   1. Add a column
    + *   2. Remove a column
    + *   3. Change a column position
    + *   4. Change a column type
    + *
    + * Here, we consider safe evolution without data loss. For example, data type evolution should be
    + * from small types to larger types like `int`-to-`long`, not vice versa.
    + *
    + * So far, file-based data sources have the following schema evolution coverage.
    + *
    + *   | File Format  | Coverage     | Note                                                   |
    + *   | ------------ | ------------ | ------------------------------------------------------ |
    + *   | TEXT         | N/A          | Schema consists of a single string column.             |
    + *   | CSV          | 1, 2, 4      |                                                        |
    + *   | JSON         | 1, 2, 3, 4   |                                                        |
    + *   | ORC          | 1, 2, 3, 4   | Native vectorized ORC reader has the widest coverage.  |
    + *   | PARQUET      | 1, 2, 3      |                                                        |
    --- End diff --
    
    Correct, and this is not about schema merging.
    The final correct schema is given by users (or Hive).
    In this PR, every schema is given by users, but for Hive tables, we use the Hive Metastore schema.
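
A minimal sketch of what "the schema is given by users" can look like when reading: old JSON data is written with an `int` column and then read back with a user-supplied schema that widens it to `long` and appends a new column. The object name, column names, and temporary path are hypothetical, and the sketch assumes Spark SQL is on the classpath; it is not one of the PR's test cases.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical illustration; names and paths are not taken from the PR.
object UserSchemaReadSketch {
  /** Writes two JSON rows, then reads them back with a wider user-supplied schema. */
  def readWithEvolvedSchema(spark: SparkSession): Seq[Row] = {
    import spark.implicits._

    // Temporary location; Spark creates the "data" directory on write.
    val path = Files.createTempDirectory("read-schema-sketch").resolve("data").toString

    // Old data: col1 is a string, col2 is an int.
    Seq(("a", 1), ("b", 2)).toDF("col1", "col2").write.json(path)

    // Schema supplied by the user: col2 widened to long, col3 added at the end.
    val evolved = StructType(Seq(
      StructField("col1", StringType),
      StructField("col2", LongType),
      StructField("col3", StringType)))

    // No schema inference or merging happens here; the given schema is used as-is.
    spark.read.schema(evolved).json(path).orderBy("col1").collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("read-schema-sketch")
      .getOrCreate()
    try readWithEvolvedSchema(spark).foreach(println)
    finally spark.stop()
  }
}
```

With the JSON source, the added `col3` should come back as null for the old rows, and `col2` should be readable as a long. Other formats may behave differently, as the coverage table in the Scaladoc above indicates.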


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #85983 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85983/testReport)** for PR 20208 at commit [`499801e`](https://github.com/apache/spark/commit/499801e7fdd545ac5918dd5f7a9294db2d5373be).


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86259/
    Test FAILed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86725/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #92827 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92827/testReport)** for PR 20208 at commit [`767d7ba`](https://github.com/apache/spark/commit/767d7ba9b5d5cfd659461ae8cf1e735aa0551f18).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait ReadSchemaTest extends QueryTest with SQLTestUtils with SharedSQLContext `
      * `trait AddColumnTest extends ReadSchemaTest `
      * `trait HideColumnAtTheEndTest extends ReadSchemaTest `
      * `trait HideColumnInTheMiddleTest extends ReadSchemaTest `
      * `trait ChangePositionTest extends ReadSchemaTest `
      * `trait BooleanTypeTest extends ReadSchemaTest `
      * `trait ToStringTypeTest extends ReadSchemaTest `
      * `trait IntegralTypeTest extends ReadSchemaTest `
      * `trait ToDoubleTypeTest extends ReadSchemaTest `
      * `trait ToDecimalTypeTest extends ReadSchemaTest `
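
The trait list above suggests that each format-specific suite is assembled by mixing in the capability traits it supports. The sketch below shows that composition pattern with plain Scala stand-ins. Every name ending in `Sketch` is hypothetical, the real traits extend `QueryTest` with `SQLTestUtils` and `SharedSQLContext`, and which traits each real suite mixes in is an assumption derived from the coverage table in the PR description.

```scala
import scala.collection.mutable.ListBuffer

// Stand-in for ReadSchemaTest: collects case names instead of running Spark tests.
trait ReadSchemaSketch {
  def format: String
  private val cases = ListBuffer.empty[String]
  protected def register(name: String): Unit = cases += name
  // The format prefix is applied lazily, so `format` is never read during trait initialization.
  def testNames: List[String] = cases.toList.map(name => s"$format: $name")
}

// Each capability trait contributes its cases when it is mixed in.
trait AddColumnSketch extends ReadSchemaSketch { register("add a column") }
trait ChangePositionSketch extends ReadSchemaSketch { register("change column position") }
trait IntegralTypeSketch extends ReadSchemaSketch { register("change column type (integral upcast)") }

// Assumed mix-ins: JSON covers type changes, Parquet does not (per the coverage table).
class JsonSuiteSketch extends AddColumnSketch with ChangePositionSketch with IntegralTypeSketch {
  def format: String = "json"
}

class ParquetSuiteSketch extends AddColumnSketch with ChangePositionSketch {
  def format: String = "parquet"
}
```

With this layout, dropping or adding a capability for one format is a one-line change to its `extends ... with ...` clause, and `new ParquetSuiteSketch().testNames` lists only the cases that format is expected to pass.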


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Hi, @rxin , @cloud-fan , @sameeragarwal , @HyukjinKwon .
    
    Could you give me your opinions on this PR? I know that Xiao Li is busy during this period, so I didn't ping him. For me, this PR is important. Sorry for bothering you.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #87969 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87969/testReport)** for PR 20208 at commit [`6ae471c`](https://github.com/apache/spark/commit/6ae471c8ecaae3eb3888eecaac1c4e7552bedcc6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Retest this please.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Could we write the documentation in this PR? It should document our current behaviors together with the related test cases.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #86259 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86259/testReport)** for PR 20208 at commit [`22eb772`](https://github.com/apache/spark/commit/22eb7726828ea8cf6e3b4206a6bed3cf93f77fd6).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext `
      * `trait AddColumnEvolutionTest extends SchemaEvolutionTest `
      * `trait RemoveColumnEvolutionTest extends SchemaEvolutionTest `
      * `trait ChangePositionEvolutionTest extends SchemaEvolutionTest `
      * `trait BooleanTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait IntegralTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest `


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Will do it after the 2.3 release.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #88548 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88548/testReport)** for PR 20208 at commit [`6085986`](https://github.com/apache/spark/commit/6085986a3d0c5b00c281b2543f3bfe6ed4e1813c).


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    I'll update the PR as follows.
    - Remove the `Remove a column` part from the descriptions (docs and test suite file doc) while keeping the test cases.
    - Add a clear description of the position rules for `partition columns`.
    - Mention `upcast` in the `Change a column type` part.
    
    For `docs/sql-programming-guide.md`, I'll keep it during the review period.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #86281 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86281/testReport)** for PR 20208 at commit [`22eb772`](https://github.com/apache/spark/commit/22eb7726828ea8cf6e3b4206a6bed3cf93f77fd6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext `
      * `trait AddColumnEvolutionTest extends SchemaEvolutionTest `
      * `trait RemoveColumnEvolutionTest extends SchemaEvolutionTest `
      * `trait ChangePositionEvolutionTest extends SchemaEvolutionTest `
      * `trait BooleanTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait IntegralTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest `


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Thank you for the review, @gatorsmile.
    Is there any concern that this would destabilize the 2.3 release? This is a *test case*-only PR that aims to build a clear consensus starting from Apache Spark 2.3.0. I think it's safe to include in 2.3.0.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #87906 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87906/testReport)** for PR 20208 at commit [`6ae471c`](https://github.com/apache/spark/commit/6ae471c8ecaae3eb3888eecaac1c4e7552bedcc6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/301/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Thanks! Merged to master.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86144/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3890/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90919/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #92828 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92828/testReport)** for PR 20208 at commit [`a7064ac`](https://github.com/apache/spark/commit/a7064ac0bc7b56ffffa3e322f31bda8a45bd9517).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1129/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Each test suite is composed as follows, mixing in one trait per feature that the data source supports.
    ```scala
    class CSVReadSchemaSuite
      extends ReadSchemaSuite
      with IntegralTypeTest
      with ToDoubleTypeTest
      with ToDecimalTypeTest
      with ToStringTypeTest {
    
      override val format: String = "csv"
    }
    ```
    To add a negative test case, we would need to do something like `with NoBooleanTypeTest`. What do you think about that?
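
A minimal sketch of the mixin idea in the comment above, assuming plain Scala traits rather than the real ScalaTest-based suite. Every name ending in `Sketch` is hypothetical and is not part of the PR. Each feature trait appends the cases it contributes, and a negative trait would be mixed in the same way:

```scala
// Hypothetical stand-ins for the suite's traits; not the PR's actual code.
trait ReadSchemaSketch {
  def format: String
  // Feature traits extend this list through super calls.
  def cases: List[String] = Nil
}

trait IntegralTypeSketch extends ReadSchemaSketch {
  override def cases: List[String] =
    super.cases :+ s"$format: widen integral types"
}

trait ToDoubleTypeSketch extends ReadSchemaSketch {
  override def cases: List[String] =
    super.cases :+ s"$format: read float as double"
}

// Hypothetical negative trait: asserts that a conversion is rejected.
trait NoBooleanTypeSketch extends ReadSchemaSketch {
  override def cases: List[String] =
    super.cases :+ s"$format: changing a boolean column type must fail"
}

class CsvSketch
  extends ReadSchemaSketch
  with IntegralTypeSketch
  with ToDoubleTypeSketch
  with NoBooleanTypeSketch {

  override val format: String = "csv"
}

object TraitMixinDemo {
  def main(args: Array[String]): Unit =
    new CsvSketch().cases.foreach(println)
}
```

Because Scala linearizes mixed-in traits, each `super.cases` call reaches the trait listed before it, so the collected cases appear in declaration order. Supported and unsupported features then read the same way in the class header.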


---


[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r176933506
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data source options. For examp
     when `path/to/table/gender=male` is the path of the data and
     users set `basePath` to `path/to/table/`, `gender` will be a partitioning column.
     
    +### Schema Evolution
    +
    +Users can control schema evolution in several ways. For example, new file can have additional
    +new column. All file-based data sources (`csv`, `json`, `orc`, and `parquet`) except `text`
    +data source supports this. Note that `text` data source always has a fixed single string column
    +schema.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +val df1 = Seq("a", "b").toDF("col1")
    +val df2 = df1.withColumn("col2", lit("x"))
    +
    +df1.write.save("/tmp/evolved_data/part=1")
    +df2.write.save("/tmp/evolved_data/part=2")
    +
    +spark.read.schema("col1 string, col2 string").load("/tmp/evolved_data").show
    ++----+----+----+
    +|col1|col2|part|
    ++----+----+----+
    +|   a|   x|   2|
    +|   b|   x|   2|
    +|   a|null|   1|
    +|   b|null|   1|
    ++----+----+----+
    +</div>
    +
    +</div>
    +
    +The following schema evolutions are supported in `csv`, `json`, `orc`, and `parquet` file-based
    +data sources.
    +
    +  1. Add a column
    +  2. Remove a column
    --- End diff --
    
    Right. The test case doesn't aim to cover those cases so far.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #86413 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86413/testReport)** for PR 20208 at commit [`e1d6f2a`](https://github.com/apache/spark/commit/e1d6f2a5ba0cae28b0ce4ed3612429a593828c0f).


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    @gatorsmile, yes. Hive officially supports compatible changes like the following, which means upgrading to wider types. As you can see in my PR, Apache Spark 2.3 also supports this, but the degree of support depends on the data source.
    
    ```sql
    hive> CREATE TABLE t1(a int) STORED AS ORC;
    
    hive> INSERT INTO t1 VALUES(1);
    
    hive> ALTER TABLE t1 CHANGE a a bigint;
    
    hive> SELECT * FROM t1;
    OK
    1
    Time taken: 0.137 seconds, Fetched: 1 row(s)
    ```
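
To make "upgrading to wider types" concrete, here is a small sketch of a widening check. The chains below are an illustrative assumption only. They are not the actual rule set of Hive, Spark, or any particular data source, and `WideningSketch` and `isSafeUpcast` are hypothetical names:

```scala
object WideningSketch {
  // Illustrative widening chains; real engines define their own rules.
  private val chains: Seq[Seq[String]] = Seq(
    Seq("tinyint", "smallint", "int", "bigint"),
    Seq("float", "double")
  )

  /** True when `to` is the same as, or wider than, `from` within one chain. */
  def isSafeUpcast(from: String, to: String): Boolean =
    chains.exists { chain =>
      val i = chain.indexOf(from)
      val j = chain.indexOf(to)
      i >= 0 && j >= i
    }

  def main(args: Array[String]): Unit = {
    println(isSafeUpcast("int", "bigint")) // the direction used in the Hive example
    println(isSafeUpcast("bigint", "int")) // narrowing, which could lose data
  }
}
```

The sketch deliberately models only widening within one type family. Whether a cross-family change such as `int` to `double` is accepted is left out, since that differs between systems.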


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #86599 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86599/testReport)** for PR 20208 at commit [`29c281d`](https://github.com/apache/spark/commit/29c281dbe3c6f63614d9abc286c68e283786649b).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #86599 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86599/testReport)** for PR 20208 at commit [`29c281d`](https://github.com/apache/spark/commit/29c281dbe3c6f63614d9abc286c68e283786649b).


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Sorry for the delay. I updated the PR according to the comments, @gatorsmile .
    Could you review this once more?


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3429/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/57/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #90919 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90919/testReport)** for PR 20208 at commit [`e136bc3`](https://github.com/apache/spark/commit/e136bc3ba7c2cbdf7b9d19f34cdfc2a1d0204942).


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #86144 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86144/testReport)** for PR 20208 at commit [`499801e`](https://github.com/apache/spark/commit/499801e7fdd545ac5918dd5f7a9294db2d5373be).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext `
      * `trait AddColumnEvolutionTest extends SchemaEvolutionTest `
      * `trait RemoveColumnEvolutionTest extends SchemaEvolutionTest `
      * `trait ChangePositionEvolutionTest extends SchemaEvolutionTest `
      * `trait BooleanTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait IntegralTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest `


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85865/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Retest this please.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #87292 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87292/testReport)** for PR 20208 at commit [`29c281d`](https://github.com/apache/spark/commit/29c281dbe3c6f63614d9abc286c68e283786649b).


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #86603 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86603/testReport)** for PR 20208 at commit [`29c281d`](https://github.com/apache/spark/commit/29c281dbe3c6f63614d9abc286c68e283786649b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Retest this please.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90920/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    cc @sameeragarwal for reviewing too. I vaguely remember we had a talk about this before. 


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Rebased to the master.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1729/
    Test FAILed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    retest this please


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87292/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #87779 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87779/testReport)** for PR 20208 at commit [`6ae471c`](https://github.com/apache/spark/commit/6ae471c8ecaae3eb3888eecaac1c4e7552bedcc6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3430/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #92939 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92939/testReport)** for PR 20208 at commit [`a7064ac`](https://github.com/apache/spark/commit/a7064ac0bc7b56ffffa3e322f31bda8a45bd9517).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r176931126
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data source options. For examp
     when `path/to/table/gender=male` is the path of the data and
     users set `basePath` to `path/to/table/`, `gender` will be a partitioning column.
     
    +### Schema Evolution
    --- End diff --
    
    Based on the current behavior, we do not support schema evolution. Schema evolution is a well-defined term. It sounds like this PR is trying to test the behavior when users provide a schema that does not exactly match the physical schema. This is different from the definition of schema evolution.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92744/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #90920 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90920/testReport)** for PR 20208 at commit [`ea9047a`](https://github.com/apache/spark/commit/ea9047a2f693708f1ea92d854ef4003eea572cf7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext `
      * `trait AddColumnEvolutionTest extends SchemaEvolutionTest `
      * `trait HideColumnAtTheEndEvolutionTest extends SchemaEvolutionTest `
      * `trait HideColumnInTheMiddleEvolutionTest extends SchemaEvolutionTest `
      * `trait ChangePositionEvolutionTest extends SchemaEvolutionTest `
      * `trait BooleanTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToStringTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait IntegralTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest `


---


[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r175579537
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala ---
    @@ -0,0 +1,406 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.io.File
    +
    +import org.apache.spark.sql.{QueryTest, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
    +
    +/**
    + * Schema can evolve in several ways and the followings are supported in file-based data sources.
    + *
    + *   1. Add a column
    + *   2. Remove a column
    + *   3. Change a column position
    + *   4. Change a column type
    + *
    + * Here, we consider safe evolution without data loss. For example, data type evolution should be
    + * from small types to larger types like `int`-to-`long`, not vice versa.
    + *
    + * So far, file-based data sources have schema evolution coverages like the followings.
    + *
    + *   | File Format  | Coverage     | Note                                                   |
    + *   | ------------ | ------------ | ------------------------------------------------------ |
    + *   | TEXT         | N/A          | Schema consists of a single string column.             |
    + *   | CSV          | 1, 2, 4      |                                                        |
    + *   | JSON         | 1, 2, 3, 4   |                                                        |
    + *   | ORC          | 1, 2, 3, 4   | Native vectorized ORC reader has the widest coverage.  |
    + *   | PARQUET      | 1, 2, 3      |                                                        |
    + *
    + * This aims to provide an explicit test coverage for schema evolution on file-based data sources.
    + * Since a file format has its own coverage of schema evolution, we need a test suite
    + * for each file-based data source with corresponding supported test case traits.
    + *
    + * The following is a hierarchy of test traits.
    + *
    + *   SchemaEvolutionTest
    + *     -> AddColumnEvolutionTest
    + *     -> RemoveColumnEvolutionTest
    + *     -> ChangePositionEvolutionTest
    + *     -> BooleanTypeEvolutionTest
    + *     -> IntegralTypeEvolutionTest
    + *     -> ToDoubleTypeEvolutionTest
    + *     -> ToDecimalTypeEvolutionTest
    + */
    +
    +trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext {
    +  val format: String
    +  val options: Map[String, String] = Map.empty[String, String]
    +}
    +
    +/**
    + * Add column (Case 1).
    + * This test suite assumes that the missing column should be `null`.
    + */
    +trait AddColumnEvolutionTest extends SchemaEvolutionTest {
    +  import testImplicits._
    +
    +  test("append column at the end") {
    +    withTempPath { dir =>
    +      val path = dir.getCanonicalPath
    +
    +      val df1 = Seq("a", "b").toDF("col1")
    +      val df2 = df1.withColumn("col2", lit("x"))
    +      val df3 = df2.withColumn("col3", lit("y"))
    +
    +      val dir1 = s"$path${File.separator}part=one"
    +      val dir2 = s"$path${File.separator}part=two"
    +      val dir3 = s"$path${File.separator}part=three"
    +
    +      df1.write.format(format).options(options).save(dir1)
    +      df2.write.format(format).options(options).save(dir2)
    +      df3.write.format(format).options(options).save(dir3)
    +
    +      val df = spark.read
    +        .schema(df3.schema)
    --- End diff --
    
    @gatorsmile, please see this. This is not about **schema inference**; the read schema is given explicitly via `.schema(df3.schema)`.
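    
    To make the distinction concrete, here is a minimal sketch (not code from this PR) of a read where the schema is supplied by the caller instead of being inferred from the files. The object name `ReadSchemaSketch`, the choice of the `json` format, and the temporary directory are assumptions made for illustration only.
    
    ```scala
    import java.nio.file.Files
    
    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.functions.lit
    
    // Hypothetical helper, not part of the PR.
    object ReadSchemaSketch {
      def run(spark: SparkSession): Array[Row] = {
        import spark.implicits._
    
        val base = Files.createTempDirectory("read_schema_sketch").toString
    
        // Older files have one column; newer files have an extra column.
        val df1 = Seq("a", "b").toDF("col1")
        val df2 = df1.withColumn("col2", lit("x"))
        df1.write.format("json").save(s"$base/part=one")
        df2.write.format("json").save(s"$base/part=two")
    
        // The schema comes from the caller, so nothing is inferred for col1/col2.
        // Rows from part=one are expected to carry null in col2.
        spark.read.schema(df2.schema).format("json").load(base).collect()
      }
    
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[1]").appName("read-schema-sketch").getOrCreate()
        try run(spark).foreach(println) finally spark.stop()
      }
    }
    ```
    
    The test traits in the diff follow the same pattern, but take `format` and `options` from the concrete suite instead of hard-coding them.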


---


[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add read schema suite fo...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r201513138
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaTest.scala ---
    @@ -0,0 +1,493 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.io.File
    +
    +import org.apache.spark.sql.{QueryTest, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
    +
    +/**
    + * The reader schema is said to be evolved (or projected) when it changed after the data is
    + * written by writers. The followings are supported in file-based data sources.
    + * Note that partition columns are not maintained in files. Here, `column` means non-partition
    + * column.
    + *
    + *   1. Add a column
    + *   2. Hide a column
    + *   3. Change a column position
    + *   4. Change a column type (Upcast)
    + *
    + * Here, we consider safe changes without data loss. For example, data type changes should be
    + * from small types to larger types like `int`-to-`long`, not vice versa.
    + *
    + * So far, file-based data sources have the following coverages.
    + *
    + *   | File Format  | Coverage     | Note                                                   |
    + *   | ------------ | ------------ | ------------------------------------------------------ |
    + *   | TEXT         | N/A          | Schema consists of a single string column.             |
    + *   | CSV          | 1, 2, 4      |                                                        |
    + *   | JSON         | 1, 2, 3, 4   |                                                        |
    + *   | ORC          | 1, 2, 3, 4   | Native vectorized ORC reader has the widest coverage.  |
    + *   | PARQUET      | 1, 2, 3      |                                                        |
    --- End diff --
    
    Yes, right. Since the main purpose of this PR is to prevent regressions, it contains positive test cases only. The errors differ case by case for each data source.
    
    For example, for `BooleanTypeTest`, Parquet raises a `QueryExecutionException` whose root cause is a `ClassCastException` (at the bottom of the trace below). JSON raises no exception; the test case simply fails with `Results do not match`.
    
    - Parquet
    ```
    org.apache.spark.sql.execution.QueryExecutionException: Encounter error while reading parquet files. One possible cause: Parquet column cannot be converted in the corresponding files. Details: 
    ...
    Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file file:/private/var/folders/dc/1pz9m69x14q_gw8t7m143t1c0000gn/T/spark-4b3d788b-1d7e-4ca2-9c01-88f639daf02f/part-00000-975391e5-1f1d-49f5-8e12-3213281618ed-c000.snappy.parquet
    ...
    Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableByte cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableBoolean
    ```


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Retest this please.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #86099 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86099/testReport)** for PR 20208 at commit [`499801e`](https://github.com/apache/spark/commit/499801e7fdd545ac5918dd5f7a9294db2d5373be).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext `
      * `trait AddColumnEvolutionTest extends SchemaEvolutionTest `
      * `trait RemoveColumnEvolutionTest extends SchemaEvolutionTest `
      * `trait ChangePositionEvolutionTest extends SchemaEvolutionTest `
      * `trait BooleanTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait IntegralTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest `


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Before we add the test cases: is schema evolution officially supported? Could you describe it in detail?


---


[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r176931027
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data source options. For examp
     when `path/to/table/gender=male` is the path of the data and
     users set `basePath` to `path/to/table/`, `gender` will be a partitioning column.
     
    +### Schema Evolution
    +
    +Users can control schema evolution in several ways. For example, new file can have additional
    +new column. All file-based data sources (`csv`, `json`, `orc`, and `parquet`) except `text`
    +data source supports this. Note that `text` data source always has a fixed single string column
    +schema.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +val df1 = Seq("a", "b").toDF("col1")
    +val df2 = df1.withColumn("col2", lit("x"))
    +
    +df1.write.save("/tmp/evolved_data/part=1")
    +df2.write.save("/tmp/evolved_data/part=2")
    +
    +spark.read.schema("col1 string, col2 string").load("/tmp/evolved_data").show
    ++----+----+----+
    +|col1|col2|part|
    ++----+----+----+
    +|   a|   x|   2|
    +|   b|   x|   2|
    +|   a|null|   1|
    +|   b|null|   1|
    ++----+----+----+
    +</div>
    +
    +</div>
    +
    +The following schema evolutions are supported in `csv`, `json`, `orc`, and `parquet` file-based
    +data sources.
    +
    +  1. Add a column
    +  2. Remove a column
    +  3. Change a column position
    +  4. Change a column type (`byte` -> `short` -> `int` -> `long`, `float` -> `double`)
    --- End diff --
    
    These are just upcasts.
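    
    As an illustration of what "upcast" means for the chains listed in the doc diff above, here is a small plain-Scala model. It is a sketch, not Spark's casting logic: the object `UpcastSketch`, the function `isUpcast`, and the use of type names as strings are all assumptions made for this example.
    
    ```scala
    // Hypothetical model of the widening chains quoted in the doc diff:
    //   byte -> short -> int -> long, and float -> double.
    object UpcastSketch {
      private val chains: Seq[Seq[String]] = Seq(
        Seq("byte", "short", "int", "long"),
        Seq("float", "double"))
    
      /** True when `to` appears strictly after `from` in one of the chains. */
      def isUpcast(from: String, to: String): Boolean =
        chains.exists { chain =>
          val i = chain.indexOf(from)
          val j = chain.indexOf(to)
          i >= 0 && j > i
        }
    
      def main(args: Array[String]): Unit = {
        println(isUpcast("int", "long"))  // widening along the integral chain
        println(isUpcast("long", "int"))  // narrowing, which could lose data
      }
    }
    ```
    
    This model covers only the two chains named in the doc text; the test traits in the PR (for example `ToDoubleTypeEvolutionTest` and `ToDecimalTypeEvolutionTest`) may exercise further conversions that it does not represent.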


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86414/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92939/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Of course! Those commands illustrate a general use case of schema evolution on the Hive side.
    
    This PR aims to provide *schema evolution* test coverage on the Spark side. As you can see, it touches only the `sql/core` module, not the `sql/hive` module.
    
    Apache Spark 2.3 already supports the same schema evolution in terms of **userGivenSchema**. (In the Hive case, it is the HMS schema instead of the file schema.)
    ```scala
    val df = spark.read
            .schema(userGivenSchema)
            .format(format)
            .options(options)
            .load(path)
    ```
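    
    A fuller, self-contained variant of the snippet above might look like the following sketch. It assumes the `json` format (which the coverage table lists as supporting column hiding and position changes); the object name `UserGivenSchemaSketch` and the temporary path are hypothetical.
    
    ```scala
    import java.nio.file.Files
    
    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.functions.lit
    import org.apache.spark.sql.types.{StringType, StructType}
    
    // Hypothetical helper, not part of the PR.
    object UserGivenSchemaSketch {
      def run(spark: SparkSession): Array[Row] = {
        import spark.implicits._
    
        val path = Files.createTempDirectory("user_given_schema_sketch")
          .resolve("data").toString
    
        // Files are written with three columns: col1, col2, col3.
        Seq("a", "b").toDF("col1")
          .withColumn("col2", lit("x"))
          .withColumn("col3", lit("y"))
          .write.format("json").save(path)
    
        // The reader asks for col3 first and col1 second, and leaves col2 out.
        val userGivenSchema = new StructType()
          .add("col3", StringType)
          .add("col1", StringType)
    
        spark.read.schema(userGivenSchema).format("json").load(path).collect()
      }
    
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[1]").appName("user-given-schema-sketch").getOrCreate()
        try run(spark).foreach(println) finally spark.stop()
      }
    }
    ```
    
    Only the reader-side schema changes here; the files on disk keep all three columns.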


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #86725 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86725/testReport)** for PR 20208 at commit [`29c281d`](https://github.com/apache/spark/commit/29c281dbe3c6f63614d9abc286c68e283786649b).


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #92939 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92939/testReport)** for PR 20208 at commit [`a7064ac`](https://github.com/apache/spark/commit/a7064ac0bc7b56ffffa3e322f31bda8a45bd9517).


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    @dongjoon-hyun This PR is to improve the test coverage. LGTM. 
    
    When the given schema does not match the schema of the underlying data source, the current error messages can be confusing. I think this is a common issue. Could you please submit a separate PR to improve the error handling in these cases?


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/761/
    Test PASSed.


---


[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r176930962
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data source options. For examp
     when `path/to/table/gender=male` is the path of the data and
     users set `basePath` to `path/to/table/`, `gender` will be a partitioning column.
     
    +### Schema Evolution
    +
    +Users can control schema evolution in several ways. For example, new file can have additional
    +new column. All file-based data sources (`csv`, `json`, `orc`, and `parquet`) except `text`
    +data source supports this. Note that `text` data source always has a fixed single string column
    +schema.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +val df1 = Seq("a", "b").toDF("col1")
    +val df2 = df1.withColumn("col2", lit("x"))
    +
    +df1.write.save("/tmp/evolved_data/part=1")
    +df2.write.save("/tmp/evolved_data/part=2")
    +
    +spark.read.schema("col1 string, col2 string").load("/tmp/evolved_data").show
    ++----+----+----+
    +|col1|col2|part|
    ++----+----+----+
    +|   a|   x|   2|
    +|   b|   x|   2|
    +|   a|null|   1|
    +|   b|null|   1|
    ++----+----+----+
    +</div>
    +
    +</div>
    +
    +The following schema evolutions are supported in `csv`, `json`, `orc`, and `parquet` file-based
    +data sources.
    +
    +  1. Add a column
    +  2. Remove a column
    --- End diff --
    
    In the SQL standard, removing a column removes all of its data. However, we do not support that. Users can still see the data if they later add a column with the same name as the one they removed.
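    
    The behavior described here can be sketched as follows. This is an illustration under assumptions, not code from the PR: it uses the `json` format, a temporary path, and a hypothetical object named `HideColumnSketch`.
    
    ```scala
    import java.nio.file.Files
    
    import org.apache.spark.sql.{Row, SparkSession}
    
    // Hypothetical helper, not part of the PR.
    object HideColumnSketch {
      /** Returns the rows read without col2, then the rows read with col2 again. */
      def run(spark: SparkSession): (Array[Row], Array[Row]) = {
        import spark.implicits._
    
        val path = Files.createTempDirectory("hide_column_sketch")
          .resolve("data").toString
        Seq(("1", "a"), ("2", "b")).toDF("col1", "col2")
          .write.format("json").save(path)
    
        // "Remove" col2 by leaving it out of the read schema; the files are untouched.
        val hidden = spark.read.schema("col1 string")
          .format("json").load(path).collect()
    
        // "Add" a column named col2 again: the values written earlier reappear.
        val restored = spark.read.schema("col1 string, col2 string")
          .format("json").load(path).collect()
    
        (hidden, restored)
      }
    
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[1]").appName("hide-column-sketch").getOrCreate()
        try {
          val (hidden, restored) = run(spark)
          hidden.foreach(println)
          restored.foreach(println)
        } finally spark.stop()
      }
    }
    ```
    
    Because the column is only left out of the reader's schema, nothing is deleted from the files, which is why the second read sees the old values.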


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #92828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92828/testReport)** for PR 20208 at commit [`a7064ac`](https://github.com/apache/spark/commit/a7064ac0bc7b56ffffa3e322f31bda8a45bd9517).


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/829/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Hi, @gatorsmile, @HyukjinKwon, @cloud-fan.
    Since 2.3 has been officially announced, I'm pinging you again. :)
    Please let me know if there is anything I should do here.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Retest this please.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86413/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #86725 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86725/testReport)** for PR 20208 at commit [`29c281d`](https://github.com/apache/spark/commit/29c281dbe3c6f63614d9abc286c68e283786649b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    After this PR is merged, I'm going to make a PR on the SQL part for Apache Spark 2.4.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #86413 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86413/testReport)** for PR 20208 at commit [`e1d6f2a`](https://github.com/apache/spark/commit/e1d6f2a5ba0cae28b0ce4ed3612429a593828c0f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r162786270
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala ---
    @@ -0,0 +1,436 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.io.File
    +
    +import org.apache.spark.sql.{QueryTest, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
    +
    +/**
    + * Schema can evolve in several ways and the followings are supported in file-based data sources.
    + *
    + *   1. Add a column
    + *   2. Remove a column
    + *   3. Change a column position
    + *   4. Change a column type
    + *
    + * Here, we consider safe evolution without data loss. For example, data type evolution should be
    + * from small types to larger types like `int`-to-`long`, not vice versa.
    + *
    + * So far, file-based data sources have schema evolution coverages like the followings.
    + *
    + *   | File Format  | Coverage     | Note                                                   |
    + *   | ------------ | ------------ | ------------------------------------------------------ |
    + *   | TEXT         | N/A          | Schema consists of a single string column.             |
    + *   | CSV          | 1, 2, 4      |                                                        |
    + *   | JSON         | 1, 2, 3, 4   |                                                        |
    + *   | ORC          | 1, 2, 3, 4   | Native vectorized ORC reader has the widest coverage.  |
    + *   | PARQUET      | 1, 2, 3      |                                                        |
    + *
    + * This aims to provide an explicit test coverage for schema evolution on file-based data sources.
    + * Since a file format has its own coverage of schema evolution, we need a test suite
    + * for each file-based data source with corresponding supported test case traits.
    + *
    + * The following is a hierarchy of test traits.
    + *
    + *   SchemaEvolutionTest
    + *     -> AddColumnEvolutionTest
    + *     -> RemoveColumnEvolutionTest
    + *     -> ChangePositionEvolutionTest
    + *     -> BooleanTypeEvolutionTest
    + *     -> IntegralTypeEvolutionTest
    + *     -> ToDoubleTypeEvolutionTest
    + *     -> ToDecimalTypeEvolutionTest
    + */
    +
    +trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext {
    +  val format: String
    +  val options: Map[String, String] = Map.empty[String, String]
    +}
    +
    +/**
    + * Add column.
    + * This test suite assumes that the missing column should be `null`.
    + */
    +trait AddColumnEvolutionTest extends SchemaEvolutionTest {
    +  import testImplicits._
    +
    +  test("append column at the end") {
    +    withTempDir { dir =>
    +      val path = dir.getCanonicalPath
    +
    +      val df1 = Seq("a", "b").toDF("col1")
    +      val df2 = df1.withColumn("col2", lit("x"))
    +      val df3 = df2.withColumn("col3", lit("y"))
    +
    +      val dir1 = s"$path${File.separator}part=one"
    +      val dir2 = s"$path${File.separator}part=two"
    +      val dir3 = s"$path${File.separator}part=three"
    +
    +      df1.write.mode("overwrite").format(format).options(options).save(dir1)
    +      df2.write.mode("overwrite").format(format).options(options).save(dir2)
    +      df3.write.mode("overwrite").format(format).options(options).save(dir3)
    +
    +      val df = spark.read
    +        .schema(df3.schema)
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +
    +      checkAnswer(df, Seq(
    +        Row("a", null, null, "one"),
    +        Row("b", null, null, "one"),
    +        Row("a", "x", null, "two"),
    +        Row("b", "x", null, "two"),
    +        Row("a", "x", "y", "three"),
    +        Row("b", "x", "y", "three")))
    +    }
    +  }
    +}
    +
    +/**
    + * Remove column.
    + * This test suite is identical with AddColumnEvolutionTest,
    + * but this test suite ensures that the schema and result are truncated to the given schema.
    + */
    +trait RemoveColumnEvolutionTest extends SchemaEvolutionTest {
    +  import testImplicits._
    +
    +  test("remove column at the end") {
    +    withTempDir { dir =>
    +      val path = dir.getCanonicalPath
    +
    +      val df1 = Seq(("1", "a"), ("2", "b")).toDF("col1", "col2")
    +      val df2 = df1.withColumn("col3", lit("y"))
    +
    +      val dir1 = s"$path${File.separator}part=two"
    +      val dir2 = s"$path${File.separator}part=three"
    +
    +      df1.write.mode("overwrite").format(format).options(options).save(dir1)
    +      df2.write.mode("overwrite").format(format).options(options).save(dir2)
    +
    +      val df = spark.read
    +        .schema(df1.schema)
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +
    +      checkAnswer(df, Seq(
    +        Row("1", "a", "two"),
    +        Row("2", "b", "two"),
    +        Row("1", "a", "three"),
    +        Row("2", "b", "three")))
    +    }
    +  }
    +}
    +
    +/**
    + * Change column positions.
    + * This suite assumes that all data set have the same number of columns.
    + */
    +trait ChangePositionEvolutionTest extends SchemaEvolutionTest {
    +  import testImplicits._
    +
    +  test("change column position") {
    +    withTempDir { dir =>
    +      // val path = dir.getCanonicalPath
    +      val path = "/tmp/change"
    +
    +      val df1 = Seq(("1", "a"), ("2", "b"), ("3", "c")).toDF("col1", "col2")
    +      val df2 = Seq(("d", "4"), ("e", "5"), ("f", "6")).toDF("col2", "col1")
    +      val unionDF = df1.unionByName(df2)
    +
    +      val dir1 = s"$path${File.separator}part=one"
    +      val dir2 = s"$path${File.separator}part=two"
    +
    +      df1.write.mode("overwrite").format(format).options(options).save(dir1)
    +      df2.write.mode("overwrite").format(format).options(options).save(dir2)
    +
    +      val df = spark.read
    +        .schema(unionDF.schema)
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +        .select("col1", "col2")
    +
    +      checkAnswer(df, unionDF)
    +    }
    +  }
    +}
    +
    +trait BooleanTypeEvolutionTest extends SchemaEvolutionTest {
    +  import testImplicits._
    +
    +  test("boolean to byte/short/int/long") {
    +    withTempDir { dir =>
    +      val path = dir.getCanonicalPath
    +
    +      val values = (1 to 10).map(_ % 2)
    +      val booleanDF = (1 to 10).map(_ % 2 == 1).toDF("col1")
    +      val byteDF = values.map(_.toByte).toDF("col1")
    +      val shortDF = values.map(_.toShort).toDF("col1")
    +      val intDF = values.toDF("col1")
    +      val longDF = values.map(_.toLong).toDF("col1")
    +
    +      booleanDF.write.mode("overwrite").format(format).options(options).save(path)
    +
    +      val df1 = spark.read
    +        .schema("col1 byte")
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +      checkAnswer(df1, byteDF)
    +
    +      val df2 = spark.read
    +        .schema("col1 short")
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +      checkAnswer(df2, shortDF)
    +
    +      val df3 = spark.read
    +        .schema("col1 int")
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +      checkAnswer(df3, intDF)
    +
    +      val df4 = spark.read
    +        .schema("col1 long")
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +      checkAnswer(df4, longDF)
    +    }
    +  }
    +}
    +
    +trait IntegralTypeEvolutionTest extends SchemaEvolutionTest {
    +
    +  import testImplicits._
    +
    +  test("change column type from `byte` to `short/int/long`") {
    --- End diff --
    
    Hmm, for this one: when we put the variables (byteDF, ...) outside of the `test` functions, it seems to cause SQLContext errors.
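
One plausible explanation for the errors mentioned above, assuming the shared session is only created in `beforeAll`: a `val` in a trait body is evaluated while the suite is constructed, before any session exists. The sketch below uses plain Scala with hypothetical stand-ins (`FakeSession`, `SharedSession`, `EagerSuite`, `LazySuite`) rather than Spark's own test classes, so it only illustrates the initialization order.

```scala
// Hypothetical model of a test harness; none of these names come from Spark.
object InitOrderSketch {
  final class FakeSession { def range(n: Int): Seq[Int] = 0 until n }

  trait SharedSession {
    private var _session: FakeSession = null
    // Fails loudly if the session is used before beforeAll() has run.
    def session: FakeSession = {
      if (_session == null) throw new IllegalStateException("session not started yet")
      _session
    }
    def beforeAll(): Unit = { _session = new FakeSession }
  }

  // Eager val in the class body: evaluated during construction, before beforeAll().
  class EagerSuite extends SharedSession {
    val values: Seq[Int] = session.range(3)
  }

  // lazy val (or a val inside the test body): evaluated on first use, after beforeAll().
  class LazySuite extends SharedSession {
    lazy val values: Seq[Int] = session.range(3)
  }

  def main(args: Array[String]): Unit = {
    val eagerFailed =
      try { new EagerSuite; false } catch { case _: IllegalStateException => true }
    println(s"eager construction failed: $eagerFailed") // prints "eager construction failed: true"

    val suite = new LazySuite
    suite.beforeAll()
    println(s"lazy values: ${suite.values.toList}") // prints "lazy values: List(0, 1, 2)"
  }
}
```

Keeping the DataFrames inside each `test` body, as the diff does, sidesteps the ordering question entirely; a `lazy val` would be the alternative if sharing were needed.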


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Retest this please.


---


[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add read schema suite fo...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20208


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87779/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    We do not have enough review bandwidth for test-only PRs like this one before the Spark 2.3 release.


---


[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r162781286
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala ---
    @@ -0,0 +1,436 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.io.File
    +
    +import org.apache.spark.sql.{QueryTest, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
    +
    +/**
    + * Schema can evolve in several ways and the followings are supported in file-based data sources.
    + *
    + *   1. Add a column
    + *   2. Remove a column
    + *   3. Change a column position
    + *   4. Change a column type
    + *
    + * Here, we consider safe evolution without data loss. For example, data type evolution should be
    + * from small types to larger types like `int`-to-`long`, not vice versa.
    + *
    + * So far, file-based data sources have schema evolution coverages like the followings.
    + *
    + *   | File Format  | Coverage     | Note                                                   |
    + *   | ------------ | ------------ | ------------------------------------------------------ |
    + *   | TEXT         | N/A          | Schema consists of a single string column.             |
    + *   | CSV          | 1, 2, 4      |                                                        |
    + *   | JSON         | 1, 2, 3, 4   |                                                        |
    + *   | ORC          | 1, 2, 3, 4   | Native vectorized ORC reader has the widest coverage.  |
    + *   | PARQUET      | 1, 2, 3      |                                                        |
    + *
    + * This aims to provide an explicit test coverage for schema evolution on file-based data sources.
    + * Since a file format has its own coverage of schema evolution, we need a test suite
    + * for each file-based data source with corresponding supported test case traits.
    + *
    + * The following is a hierarchy of test traits.
    + *
    + *   SchemaEvolutionTest
    + *     -> AddColumnEvolutionTest
    + *     -> RemoveColumnEvolutionTest
    + *     -> ChangePositionEvolutionTest
    + *     -> BooleanTypeEvolutionTest
    + *     -> IntegralTypeEvolutionTest
    + *     -> ToDoubleTypeEvolutionTest
    + *     -> ToDecimalTypeEvolutionTest
    + */
    +
    +trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext {
    +  val format: String
    +  val options: Map[String, String] = Map.empty[String, String]
    +}
    +
    +/**
    + * Add column.
    + * This test suite assumes that the missing column should be `null`.
    + */
    +trait AddColumnEvolutionTest extends SchemaEvolutionTest {
    +  import testImplicits._
    +
    +  test("append column at the end") {
    +    withTempDir { dir =>
    +      val path = dir.getCanonicalPath
    +
    +      val df1 = Seq("a", "b").toDF("col1")
    +      val df2 = df1.withColumn("col2", lit("x"))
    +      val df3 = df2.withColumn("col3", lit("y"))
    +
    +      val dir1 = s"$path${File.separator}part=one"
    +      val dir2 = s"$path${File.separator}part=two"
    +      val dir3 = s"$path${File.separator}part=three"
    +
    +      df1.write.mode("overwrite").format(format).options(options).save(dir1)
    +      df2.write.mode("overwrite").format(format).options(options).save(dir2)
    +      df3.write.mode("overwrite").format(format).options(options).save(dir3)
    +
    +      val df = spark.read
    +        .schema(df3.schema)
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +
    +      checkAnswer(df, Seq(
    +        Row("a", null, null, "one"),
    +        Row("b", null, null, "one"),
    +        Row("a", "x", null, "two"),
    +        Row("b", "x", null, "two"),
    +        Row("a", "x", "y", "three"),
    +        Row("b", "x", "y", "three")))
    +    }
    +  }
    +}
    +
    +/**
    + * Remove column.
    + * This test suite is identical with AddColumnEvolutionTest,
    + * but this test suite ensures that the schema and result are truncated to the given schema.
    + */
    +trait RemoveColumnEvolutionTest extends SchemaEvolutionTest {
    +  import testImplicits._
    +
    +  test("remove column at the end") {
    +    withTempDir { dir =>
    +      val path = dir.getCanonicalPath
    +
    +      val df1 = Seq(("1", "a"), ("2", "b")).toDF("col1", "col2")
    +      val df2 = df1.withColumn("col3", lit("y"))
    +
    +      val dir1 = s"$path${File.separator}part=two"
    +      val dir2 = s"$path${File.separator}part=three"
    +
    +      df1.write.mode("overwrite").format(format).options(options).save(dir1)
    +      df2.write.mode("overwrite").format(format).options(options).save(dir2)
    +
    +      val df = spark.read
    +        .schema(df1.schema)
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +
    +      checkAnswer(df, Seq(
    +        Row("1", "a", "two"),
    +        Row("2", "b", "two"),
    +        Row("1", "a", "three"),
    +        Row("2", "b", "three")))
    +    }
    +  }
    +}
    +
    +/**
    + * Change column positions.
    + * This suite assumes that all data set have the same number of columns.
    + */
    +trait ChangePositionEvolutionTest extends SchemaEvolutionTest {
    +  import testImplicits._
    +
    +  test("change column position") {
    +    withTempDir { dir =>
    +      // val path = dir.getCanonicalPath
    +      val path = "/tmp/change"
    +
    +      val df1 = Seq(("1", "a"), ("2", "b"), ("3", "c")).toDF("col1", "col2")
    +      val df2 = Seq(("d", "4"), ("e", "5"), ("f", "6")).toDF("col2", "col1")
    +      val unionDF = df1.unionByName(df2)
    +
    +      val dir1 = s"$path${File.separator}part=one"
    +      val dir2 = s"$path${File.separator}part=two"
    +
    +      df1.write.mode("overwrite").format(format).options(options).save(dir1)
    +      df2.write.mode("overwrite").format(format).options(options).save(dir2)
    +
    +      val df = spark.read
    +        .schema(unionDF.schema)
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +        .select("col1", "col2")
    +
    +      checkAnswer(df, unionDF)
    +    }
    +  }
    +}
    +
    +trait BooleanTypeEvolutionTest extends SchemaEvolutionTest {
    +  import testImplicits._
    +
    +  test("boolean to byte/short/int/long") {
    +    withTempDir { dir =>
    +      val path = dir.getCanonicalPath
    +
    +      val values = (1 to 10).map(_ % 2)
    +      val booleanDF = (1 to 10).map(_ % 2 == 1).toDF("col1")
    +      val byteDF = values.map(_.toByte).toDF("col1")
    +      val shortDF = values.map(_.toShort).toDF("col1")
    +      val intDF = values.toDF("col1")
    +      val longDF = values.map(_.toLong).toDF("col1")
    +
    +      booleanDF.write.mode("overwrite").format(format).options(options).save(path)
    +
    +      val df1 = spark.read
    +        .schema("col1 byte")
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +      checkAnswer(df1, byteDF)
    +
    +      val df2 = spark.read
    +        .schema("col1 short")
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +      checkAnswer(df2, shortDF)
    +
    +      val df3 = spark.read
    +        .schema("col1 int")
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +      checkAnswer(df3, intDF)
    +
    +      val df4 = spark.read
    +        .schema("col1 long")
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +      checkAnswer(df4, longDF)
    +    }
    +  }
    +}
    +
    +trait IntegralTypeEvolutionTest extends SchemaEvolutionTest {
    +
    +  import testImplicits._
    +
    +  test("change column type from `byte` to `short/int/long`") {
    --- End diff --
    
    nit: `` `byte` `` -> `byte`, or the opposite, for consistency with the other instances.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #86414 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86414/testReport)** for PR 20208 at commit [`29c281d`](https://github.com/apache/spark/commit/29c281dbe3c6f63614d9abc286c68e283786649b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    @gatorsmile , @HyukjinKwon .
    Could you review this again for Spark 2.4?


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Also, thank you, @HyukjinKwon .


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Rebased onto master.


---


[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r162835448
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala ---
    @@ -0,0 +1,406 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.io.File
    +
    +import org.apache.spark.sql.{QueryTest, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
    +
    +/**
    + * Schema can evolve in several ways and the followings are supported in file-based data sources.
    + *
    + *   1. Add a column
    + *   2. Remove a column
    + *   3. Change a column position
    + *   4. Change a column type
    + *
    + * Here, we consider safe evolution without data loss. For example, data type evolution should be
    + * from small types to larger types like `int`-to-`long`, not vice versa.
    + *
    + * So far, file-based data sources have schema evolution coverages like the followings.
    + *
    + *   | File Format  | Coverage     | Note                                                   |
    + *   | ------------ | ------------ | ------------------------------------------------------ |
    + *   | TEXT         | N/A          | Schema consists of a single string column.             |
    + *   | CSV          | 1, 2, 4      |                                                        |
    + *   | JSON         | 1, 2, 3, 4   |                                                        |
    + *   | ORC          | 1, 2, 3, 4   | Native vectorized ORC reader has the widest coverage.  |
    + *   | PARQUET      | 1, 2, 3      |                                                        |
    --- End diff --
    
    @dongjoon-hyun, how do we guarantee schema change in Parquet and ORC?
    
    I thought we (roughly) randomly pick a file, read its footer, and then use that schema. So I was thinking we don't properly support this. It makes sense for Parquet with `mergeSchema`, though.
    
    I think it's not guaranteed in CSV either, because we rely on the header from a single file.
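
For context on the question above: the tests in the quoted diff do not rely on inference from a single file; they pass an explicit schema through `spark.read.schema(...)`. The following is a minimal plain-Scala model of what such a read is expected to return when columns are resolved by name. It is a sketch under that assumption, not Spark's implementation, and `Record`, `project`, and `ReadSchemaSketch` are hypothetical names. Per the coverage table in the diff, CSV does not cover position changes, so the by-name behaviour modelled here would not apply to it.

```scala
// Plain-Scala model of reading stored records with a user-specified schema.
object ReadSchemaSketch {
  type Record = Map[String, Any]

  // Project one stored record onto the requested column names, resolving by name:
  // columns missing from the file become null, columns not requested are dropped.
  def project(readSchema: Seq[String])(record: Record): Seq[Any] =
    readSchema.map(name => record.getOrElse(name, null))

  def main(args: Array[String]): Unit = {
    val older: Record = Map("col1" -> "a")                               // written before col2 existed
    val newer: Record = Map("col2" -> "x", "col1" -> "b", "col3" -> "y") // reordered, extra column

    val read: Record => Seq[Any] = project(Seq("col1", "col2"))
    println(read(older)) // prints "List(a, null)"
    println(read(newer)) // prints "List(b, x)"
  }
}
```

The three cases correspond to the add-column, remove-column, and change-position traits in the suite.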


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #88548 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88548/testReport)** for PR 20208 at commit [`6085986`](https://github.com/apache/spark/commit/6085986a3d0c5b00c281b2543f3bfe6ed4e1813c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87969/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Thank you so much for the review and approval, @HyukjinKwon !


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Retest this please.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Thank you for the review, @HyukjinKwon . I'll update it accordingly.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #86259 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86259/testReport)** for PR 20208 at commit [`22eb772`](https://github.com/apache/spark/commit/22eb7726828ea8cf6e3b4206a6bed3cf93f77fd6).


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Thank you for retriggering, @HyukjinKwon !


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Hi, @gatorsmile , @cloud-fan , @sameeragarwal , @HyukjinKwon .
    The PR is ready for review again. The Spark commit log has been a little quiet since yesterday.
    Could you spare some time for this Schema Evolution suite? Thank you in advance for any advice!


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Retest this please.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92827/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91654/
    Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #87752 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87752/testReport)** for PR 20208 at commit [`6ae471c`](https://github.com/apache/spark/commit/6ae471c8ecaae3eb3888eecaac1c4e7552bedcc6).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r176933560
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data source options. For examp
     when `path/to/table/gender=male` is the path of the data and
     users set `basePath` to `path/to/table/`, `gender` will be a partitioning column.
     
    +### Schema Evolution
    +
    +Users can control schema evolution in several ways. For example, new file can have additional
    +new column. All file-based data sources (`csv`, `json`, `orc`, and `parquet`) except `text`
    +data source supports this. Note that `text` data source always has a fixed single string column
    +schema.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +val df1 = Seq("a", "b").toDF("col1")
    +val df2 = df1.withColumn("col2", lit("x"))
    +
    +df1.write.save("/tmp/evolved_data/part=1")
    +df2.write.save("/tmp/evolved_data/part=2")
    +
    +spark.read.schema("col1 string, col2 string").load("/tmp/evolved_data").show
    ++----+----+----+
    +|col1|col2|part|
    ++----+----+----+
    +|   a|   x|   2|
    +|   b|   x|   2|
    +|   a|null|   1|
    +|   b|null|   1|
    ++----+----+----+
    +</div>
    +
    +</div>
    +
    +The following schema evolutions are supported in `csv`, `json`, `orc`, and `parquet` file-based
    +data sources.
    +
    +  1. Add a column
    +  2. Remove a column
    +  3. Change a column position
    --- End diff --
    
    Correct, we need to clarify that partition columns are always at the end.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86603/


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #85865 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85865/testReport)** for PR 20208 at commit [`499801e`](https://github.com/apache/spark/commit/499801e7fdd545ac5918dd5f7a9294db2d5373be).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext `
      * `trait AddColumnEvolutionTest extends SchemaEvolutionTest `
      * `trait RemoveColumnEvolutionTest extends SchemaEvolutionTest `
      * `trait ChangePositionEvolutionTest extends SchemaEvolutionTest `
      * `trait BooleanTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait IntegralTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest `


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #90920 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90920/testReport)** for PR 20208 at commit [`ea9047a`](https://github.com/apache/spark/commit/ea9047a2f693708f1ea92d854ef4003eea572cf7).


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #90919 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90919/testReport)** for PR 20208 at commit [`e136bc3`](https://github.com/apache/spark/commit/e136bc3ba7c2cbdf7b9d19f34cdfc2a1d0204942).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext `
      * `trait AddColumnEvolutionTest extends SchemaEvolutionTest `
      * `trait HideColumnAtTheEndEvolutionTest extends SchemaEvolutionTest `
      * `trait HideColumnInTheMiddleEvolutionTest extends SchemaEvolutionTest `
      * `trait ChangePositionEvolutionTest extends SchemaEvolutionTest `
      * `trait BooleanTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToStringTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait IntegralTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest `


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #92827 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92827/testReport)** for PR 20208 at commit [`767d7ba`](https://github.com/apache/spark/commit/767d7ba9b5d5cfd659461ae8cf1e735aa0551f18).


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r176931012
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data source options. For examp
     when `path/to/table/gender=male` is the path of the data and
     users set `basePath` to `path/to/table/`, `gender` will be a partitioning column.
     
    +### Schema Evolution
    +
    +Users can control schema evolution in several ways. For example, new file can have additional
    +new column. All file-based data sources (`csv`, `json`, `orc`, and `parquet`) except `text`
    +data source supports this. Note that `text` data source always has a fixed single string column
    +schema.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +val df1 = Seq("a", "b").toDF("col1")
    +val df2 = df1.withColumn("col2", lit("x"))
    +
    +df1.write.save("/tmp/evolved_data/part=1")
    +df2.write.save("/tmp/evolved_data/part=2")
    +
    +spark.read.schema("col1 string, col2 string").load("/tmp/evolved_data").show
    ++----+----+----+
    +|col1|col2|part|
    ++----+----+----+
    +|   a|   x|   2|
    +|   b|   x|   2|
    +|   a|null|   1|
    +|   b|null|   1|
    ++----+----+----+
    +</div>
    +
    +</div>
    +
    +The following schema evolutions are supported in `csv`, `json`, `orc`, and `parquet` file-based
    +data sources.
    +
    +  1. Add a column
    +  2. Remove a column
    +  3. Change a column position
    --- End diff --
    
    Do we support it? When people issue `select * from tab`, we automatically reorder the partition columns to the end of the schema.
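
The reordering described here can be sketched as a toy model. This is not Spark's implementation: `PartitionColumnOrder` and `resultColumns` are hypothetical names, only column names are modelled, and the sketch assumes that no partition column name also appears among the data columns.

```scala
// Toy model only: illustrates that partition columns land after the data
// columns regardless of how the data columns are ordered in the read schema.
// Assumption: data and partition column names do not overlap.
object PartitionColumnOrder {
  def resultColumns(dataColumns: Seq[String], partitionColumns: Seq[String]): Seq[String] =
    dataColumns ++ partitionColumns

  def main(args: Array[String]): Unit = {
    // Same shape as the quoted docs example: col1, col2, then `part`.
    println(resultColumns(Seq("col1", "col2"), Seq("part")).mkString(", "))
    // Swapping the data columns does not move the partition column.
    println(resultColumns(Seq("col2", "col1"), Seq("part")).mkString(", "))
  }
}
```

Under this model, "change a column position" can only apply to data columns, which is the point the comment raises.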


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r176933664
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data source options. For examp
     when `path/to/table/gender=male` is the path of the data and
     users set `basePath` to `path/to/table/`, `gender` will be a partitioning column.
     
    +### Schema Evolution
    +
    +Users can control schema evolution in several ways. For example, new file can have additional
    +new column. All file-based data sources (`csv`, `json`, `orc`, and `parquet`) except `text`
    +data source supports this. Note that `text` data source always has a fixed single string column
    +schema.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +val df1 = Seq("a", "b").toDF("col1")
    +val df2 = df1.withColumn("col2", lit("x"))
    +
    +df1.write.save("/tmp/evolved_data/part=1")
    +df2.write.save("/tmp/evolved_data/part=2")
    +
    +spark.read.schema("col1 string, col2 string").load("/tmp/evolved_data").show
    ++----+----+----+
    +|col1|col2|part|
    ++----+----+----+
    +|   a|   x|   2|
    +|   b|   x|   2|
    +|   a|null|   1|
    +|   b|null|   1|
    ++----+----+----+
    +</div>
    +
    +</div>
    +
    +The following schema evolutions are supported in `csv`, `json`, `orc`, and `parquet` file-based
    +data sources.
    +
    +  1. Add a column
    +  2. Remove a column
    +  3. Change a column position
    +  4. Change a column type (`byte` -> `short` -> `int` -> `long`, `float` -> `double`)
    --- End diff --
    
    Yep. `Upcast`s are safe. This PR doesn't aim to cover or guarantee unsafe casting at this stage. Although these are straightforward `upcast`s, not all Spark file-based data sources seem to support them (based on the test cases). This PR tries to set a clear boundary and to clarify what is missing.
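
As a minimal sketch of what "safe" means here, the check below encodes only the two widening chains listed in the quoted docs line (`byte` -> `short` -> `int` -> `long`, `float` -> `double`). `SafeUpcast` and `isSafe` are hypothetical names, types are plain strings rather than Spark `DataType`s, and the sketch says nothing about which file formats actually honour a given upcast.

```scala
object SafeUpcast {
  // Widening chains taken from the docs line under review.
  private val chains: Seq[Seq[String]] = Seq(
    Seq("byte", "short", "int", "long"),
    Seq("float", "double")
  )

  /** True when `to` is the same type as `from` or comes later in the same chain. */
  def isSafe(from: String, to: String): Boolean =
    chains.exists { chain =>
      val i = chain.indexOf(from)
      val j = chain.indexOf(to)
      i >= 0 && j >= i
    }

  def main(args: Array[String]): Unit = {
    println(isSafe("int", "long")) // widening
    println(isSafe("long", "int")) // narrowing, could lose data
  }
}
```

Conversions outside these two chains are treated as unsafe by this sketch simply because the docs line does not list them.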


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Basically, these commands are not issued from Spark. We only support ALTER TABLE ADD COLUMN. This PR is just testing the capability of our automatic schema inference, right?


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #87969 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87969/testReport)** for PR 20208 at commit [`6ae471c`](https://github.com/apache/spark/commit/6ae471c8ecaae3eb3888eecaac1c4e7552bedcc6).


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88548/


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #87292 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87292/testReport)** for PR 20208 at commit [`29c281d`](https://github.com/apache/spark/commit/29c281dbe3c6f63614d9abc286c68e283786649b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/828/


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Thank you so much, @gatorsmile . Sure. I'll make a PR to improve error handling for that.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #92744 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92744/testReport)** for PR 20208 at commit [`ebd239e`](https://github.com/apache/spark/commit/ebd239eab0aa2b03b211cd470eb33d5a538f594a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext `
      * `trait AddColumnEvolutionTest extends SchemaEvolutionTest `
      * `trait HideColumnAtTheEndEvolutionTest extends SchemaEvolutionTest `
      * `trait HideColumnInTheMiddleEvolutionTest extends SchemaEvolutionTest `
      * `trait ChangePositionEvolutionTest extends SchemaEvolutionTest `
      * `trait BooleanTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToStringTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait IntegralTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest `


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/778/


---


[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r201487246
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data source options. For examp
     when `path/to/table/gender=male` is the path of the data and
     users set `basePath` to `path/to/table/`, `gender` will be a partitioning column.
     
    +### Schema Evolution
    --- End diff --
    
    Thank you for the review, @gatorsmile . I'll update it like that.
    
    For write operations, we cannot specify a schema the way the read path does with `.schema`. Spark either writes new files into the directory in addition to the existing ones or overwrites the directory.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/901/


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86099/


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #91654 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91654/testReport)** for PR 20208 at commit [`a8026b8`](https://github.com/apache/spark/commit/a8026b8bd716b667bdc5d7aca226267ce342db97).


---


[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r176933493
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data source options. For examp
     when `path/to/table/gender=male` is the path of the data and
     users set `basePath` to `path/to/table/`, `gender` will be a partitioning column.
     
    +### Schema Evolution
    --- End diff --
    
    Thank you so much for the review, @gatorsmile . I waited for this moment. :)
    I agree with all of your comments. The main reason for those limitations is that Spark file-based data sources don't have the capability to manage multi-version schemas or column default values. In fact, that is beyond the role of Spark data sources. Thus, this PR tries to add test coverage for the AS-IS capability in order to prevent future regressions and to lay a foundation to trust and build on later. I don't think this is worthy of documentation at the beginning. It's a start.
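
The lack of column default values shows up in the quoted docs example as `null` for `col2` in the older partition. A rough model of that read-side behaviour follows. It is a sketch, not Spark code: `ReadSchemaProjection` and `project` are hypothetical names, rows are plain maps of strings, and it assumes columns are resolved by name, which would not model a format that resolves columns by position.

```scala
object ReadSchemaProjection {
  type Row = Map[String, String]

  // Project a stored row onto the requested read schema (column names only).
  // A column missing from the stored row becomes None (there is no default
  // value to fall back on); a stored column that is not requested is dropped.
  def project(readSchema: Seq[String], stored: Row): Seq[Option[String]] =
    readSchema.map(name => stored.get(name))

  def main(args: Array[String]): Unit = {
    val oldRow: Row = Map("col1" -> "a")                // written before col2 existed
    val newRow: Row = Map("col1" -> "a", "col2" -> "x") // written after col2 was added
    println(project(Seq("col1", "col2"), oldRow))
    println(project(Seq("col1", "col2"), newRow))
  }
}
```

With by-name resolution, adding, hiding, and reordering columns all reduce to the same lookup, which is why a missing column can only surface as an absent value.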


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #86281 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86281/testReport)** for PR 20208 at commit [`22eb772`](https://github.com/apache/spark/commit/22eb7726828ea8cf6e3b4206a6bed3cf93f77fd6).


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #86414 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86414/testReport)** for PR 20208 at commit [`29c281d`](https://github.com/apache/spark/commit/29c281dbe3c6f63614d9abc286c68e283786649b).


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r162781551
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala ---
    @@ -0,0 +1,436 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.io.File
    +
    +import org.apache.spark.sql.{QueryTest, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
    +
    +/**
    + * Schema can evolve in several ways and the followings are supported in file-based data sources.
    + *
    + *   1. Add a column
    + *   2. Remove a column
    + *   3. Change a column position
    + *   4. Change a column type
    + *
    + * Here, we consider safe evolution without data loss. For example, data type evolution should be
    + * from small types to larger types like `int`-to-`long`, not vice versa.
    + *
    + * So far, file-based data sources have schema evolution coverages like the followings.
    + *
    + *   | File Format  | Coverage     | Note                                                   |
    + *   | ------------ | ------------ | ------------------------------------------------------ |
    + *   | TEXT         | N/A          | Schema consists of a single string column.             |
    + *   | CSV          | 1, 2, 4      |                                                        |
    + *   | JSON         | 1, 2, 3, 4   |                                                        |
    + *   | ORC          | 1, 2, 3, 4   | Native vectorized ORC reader has the widest coverage.  |
    + *   | PARQUET      | 1, 2, 3      |                                                        |
    + *
    + * This aims to provide an explicit test coverage for schema evolution on file-based data sources.
    + * Since a file format has its own coverage of schema evolution, we need a test suite
    + * for each file-based data source with corresponding supported test case traits.
    + *
    + * The following is a hierarchy of test traits.
    + *
    + *   SchemaEvolutionTest
    + *     -> AddColumnEvolutionTest
    + *     -> RemoveColumnEvolutionTest
    + *     -> ChangePositionEvolutionTest
    + *     -> BooleanTypeEvolutionTest
    + *     -> IntegralTypeEvolutionTest
    + *     -> ToDoubleTypeEvolutionTest
    + *     -> ToDecimalTypeEvolutionTest
    + */
    +
    +trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext {
    +  val format: String
    +  val options: Map[String, String] = Map.empty[String, String]
    +}
    +
    +/**
    + * Add column.
    --- End diff --
    
    Shall we leave the number given above in this comment, like `(case 1.)`?
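
To show how the case numbers could tie each trait back to the coverage table, here is a sketch with stand-in traits. None of these are the PR's traits: they carry no Spark test logic, every name ending in `Sketch` is hypothetical, and the several type-change traits in the diff are collapsed into one `ChangeTypeSketch` for brevity.

```scala
// Stand-in base trait: each mixed-in trait prepends its case number.
trait EvolutionSketch {
  def format: String
  def coveredCases: List[Int] = Nil
}
trait AddColumnSketch extends EvolutionSketch {      // case 1
  override def coveredCases: List[Int] = 1 :: super.coveredCases
}
trait RemoveColumnSketch extends EvolutionSketch {   // case 2
  override def coveredCases: List[Int] = 2 :: super.coveredCases
}
trait ChangePositionSketch extends EvolutionSketch { // case 3
  override def coveredCases: List[Int] = 3 :: super.coveredCases
}
trait ChangeTypeSketch extends EvolutionSketch {     // case 4
  override def coveredCases: List[Int] = 4 :: super.coveredCases
}

// CSV row of the coverage table: cases 1, 2, 4.
object CsvSketch extends EvolutionSketch
    with AddColumnSketch with RemoveColumnSketch with ChangeTypeSketch {
  val format = "csv"
}

// PARQUET row of the coverage table: cases 1, 2, 3.
object ParquetSketch extends EvolutionSketch
    with AddColumnSketch with RemoveColumnSketch with ChangePositionSketch {
  val format = "parquet"
}
```

A per-format suite then documents its coverage through the traits it mixes in, and `coveredCases.sorted` reads back the corresponding table row.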


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Hi, @gatorsmile , @cloud-fan , @HyukjinKwon , @viirya .
    Could you review this PR?


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #85983 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85983/testReport)** for PR 20208 at commit [`499801e`](https://github.com/apache/spark/commit/499801e7fdd545ac5918dd5f7a9294db2d5373be).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext `
      * `trait AddColumnEvolutionTest extends SchemaEvolutionTest `
      * `trait RemoveColumnEvolutionTest extends SchemaEvolutionTest `
      * `trait ChangePositionEvolutionTest extends SchemaEvolutionTest `
      * `trait BooleanTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait IntegralTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest `


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #92731 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92731/testReport)** for PR 20208 at commit [`ebd239e`](https://github.com/apache/spark/commit/ebd239eab0aa2b03b211cd470eb33d5a538f594a).


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87906/
    Test PASSed.


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/202/
    Test PASSed.


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/198/
    Test PASSed.


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Which document do you have in mind? A section in `docs/sql-programming-guide.md`?


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86281/
    Test PASSed.


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #87906 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87906/testReport)** for PR 20208 at commit [`6ae471c`](https://github.com/apache/spark/commit/6ae471c8ecaae3eb3888eecaac1c4e7552bedcc6).


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Also, ping @rxin , too.


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/56/
    Test PASSed.


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test FAILed.


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #85865 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85865/testReport)** for PR 20208 at commit [`499801e`](https://github.com/apache/spark/commit/499801e7fdd545ac5918dd5f7a9294db2d5373be).


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #87779 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87779/testReport)** for PR 20208 at commit [`6ae471c`](https://github.com/apache/spark/commit/6ae471c8ecaae3eb3888eecaac1c4e7552bedcc6).


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #86099 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86099/testReport)** for PR 20208 at commit [`499801e`](https://github.com/apache/spark/commit/499801e7fdd545ac5918dd5f7a9294db2d5373be).


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #91654 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91654/testReport)** for PR 20208 at commit [`a8026b8`](https://github.com/apache/spark/commit/a8026b8bd716b667bdc5d7aca226267ce342db97).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext `
      * `trait AddColumnEvolutionTest extends SchemaEvolutionTest `
      * `trait HideColumnAtTheEndEvolutionTest extends SchemaEvolutionTest `
      * `trait HideColumnInTheMiddleEvolutionTest extends SchemaEvolutionTest `
      * `trait ChangePositionEvolutionTest extends SchemaEvolutionTest `
      * `trait BooleanTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToStringTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait IntegralTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest `


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #86603 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86603/testReport)** for PR 20208 at commit [`29c281d`](https://github.com/apache/spark/commit/29c281dbe3c6f63614d9abc286c68e283786649b).


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/769/
    Test PASSed.


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---

[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r176834608
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data source options. For examp
     when `path/to/table/gender=male` is the path of the data and
     users set `basePath` to `path/to/table/`, `gender` will be a partitioning column.
     
    +### Schema Evolution
    --- End diff --
    
    @gatorsmile . I rebased to the master and added this.
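
    For readers of this archive, the following is a minimal sketch of the read-path behavior for case 1 (add a column), assuming a local Spark 2.3+ session and the JSON source. The object name `ReadSchemaSketch`, the scratch path, and the column names are illustrative assumptions and are not taken from the PR; the quoted `AddColumnEvolutionTest` remains the authoritative check that a column missing from the files reads back as `null`.

    ```scala
    import java.nio.file.Files
    import org.apache.spark.sql.SparkSession

    object ReadSchemaSketch {
      // Writes one-column JSON files, then reads them with a two-column reader schema.
      def run(): Seq[(String, Option[String])] = {
        val spark = SparkSession.builder()
          .master("local[1]")
          .appName("read-schema-sketch")
          .getOrCreate()
        import spark.implicits._
        try {
          // Hypothetical scratch location; `save` creates the `data` directory.
          val path = Files.createTempDirectory("read-schema").resolve("data").toString

          // Writer schema: a single string column.
          Seq("a", "b").toDF("col1").write.format("json").save(path)

          // Reader schema: the same column plus one the files never contained.
          val df = spark.read
            .schema("col1 STRING, col2 STRING")
            .format("json")
            .load(path)

          // Wrap the second column in Option so a null shows up as None.
          df.collect().toSeq.map(r => (r.getString(0), Option(r.getString(1))))
        } finally {
          spark.stop()
        }
      }

      def main(args: Array[String]): Unit = run().foreach(println)
    }
    ```

    Only the reader schema changes here; the files on disk are left as written, which is why this illustrates the read path and says nothing about writes.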


---

[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r162781308
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala ---
    @@ -0,0 +1,436 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.io.File
    +
    +import org.apache.spark.sql.{QueryTest, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
    +
    +/**
    + * Schema can evolve in several ways, and the following are supported in file-based data sources.
    + *
    + *   1. Add a column
    + *   2. Remove a column
    + *   3. Change a column position
    + *   4. Change a column type
    + *
    + * Here, we consider safe evolution without data loss. For example, data type evolution should be
    + * from small types to larger types like `int`-to-`long`, not vice versa.
    + *
    + * So far, file-based data sources have the following schema evolution coverage.
    + *
    + *   | File Format  | Coverage     | Note                                                   |
    + *   | ------------ | ------------ | ------------------------------------------------------ |
    + *   | TEXT         | N/A          | Schema consists of a single string column.             |
    + *   | CSV          | 1, 2, 4      |                                                        |
    + *   | JSON         | 1, 2, 3, 4   |                                                        |
    + *   | ORC          | 1, 2, 3, 4   | Native vectorized ORC reader has the widest coverage.  |
    + *   | PARQUET      | 1, 2, 3      |                                                        |
    + *
    + * This aims to provide explicit test coverage for schema evolution on file-based data sources.
    + * Since a file format has its own coverage of schema evolution, we need a test suite
    + * for each file-based data source with corresponding supported test case traits.
    + *
    + * The following is a hierarchy of test traits.
    + *
    + *   SchemaEvolutionTest
    + *     -> AddColumnEvolutionTest
    + *     -> RemoveColumnEvolutionTest
    + *     -> ChangePositionEvolutionTest
    + *     -> BooleanTypeEvolutionTest
    + *     -> IntegralTypeEvolutionTest
    + *     -> ToDoubleTypeEvolutionTest
    + *     -> ToDecimalTypeEvolutionTest
    + */
    +
    +trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext {
    +  val format: String
    +  val options: Map[String, String] = Map.empty[String, String]
    +}
    +
    +/**
    + * Add column.
    + * This test suite assumes that the missing column should be `null`.
    + */
    +trait AddColumnEvolutionTest extends SchemaEvolutionTest {
    +  import testImplicits._
    +
    +  test("append column at the end") {
    +    withTempDir { dir =>
    +      val path = dir.getCanonicalPath
    +
    +      val df1 = Seq("a", "b").toDF("col1")
    +      val df2 = df1.withColumn("col2", lit("x"))
    +      val df3 = df2.withColumn("col3", lit("y"))
    +
    +      val dir1 = s"$path${File.separator}part=one"
    +      val dir2 = s"$path${File.separator}part=two"
    +      val dir3 = s"$path${File.separator}part=three"
    +
    +      df1.write.mode("overwrite").format(format).options(options).save(dir1)
    +      df2.write.mode("overwrite").format(format).options(options).save(dir2)
    +      df3.write.mode("overwrite").format(format).options(options).save(dir3)
    +
    +      val df = spark.read
    +        .schema(df3.schema)
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +
    +      checkAnswer(df, Seq(
    +        Row("a", null, null, "one"),
    +        Row("b", null, null, "one"),
    +        Row("a", "x", null, "two"),
    +        Row("b", "x", null, "two"),
    +        Row("a", "x", "y", "three"),
    +        Row("b", "x", "y", "three")))
    +    }
    +  }
    +}
    +
    +/**
    + * Remove column.
    + * This test suite is identical to AddColumnEvolutionTest,
    + * but this test suite ensures that the schema and result are truncated to the given schema.
    + */
    +trait RemoveColumnEvolutionTest extends SchemaEvolutionTest {
    +  import testImplicits._
    +
    +  test("remove column at the end") {
    +    withTempDir { dir =>
    +      val path = dir.getCanonicalPath
    +
    +      val df1 = Seq(("1", "a"), ("2", "b")).toDF("col1", "col2")
    +      val df2 = df1.withColumn("col3", lit("y"))
    +
    +      val dir1 = s"$path${File.separator}part=two"
    +      val dir2 = s"$path${File.separator}part=three"
    +
    +      df1.write.mode("overwrite").format(format).options(options).save(dir1)
    +      df2.write.mode("overwrite").format(format).options(options).save(dir2)
    +
    +      val df = spark.read
    +        .schema(df1.schema)
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +
    +      checkAnswer(df, Seq(
    +        Row("1", "a", "two"),
    +        Row("2", "b", "two"),
    +        Row("1", "a", "three"),
    +        Row("2", "b", "three")))
    +    }
    +  }
    +}
    +
    +/**
    + * Change column positions.
    + * This suite assumes that all data sets have the same number of columns.
    + */
    +trait ChangePositionEvolutionTest extends SchemaEvolutionTest {
    +  import testImplicits._
    +
    +  test("change column position") {
    +    withTempDir { dir =>
    +      val path = dir.getCanonicalPath
    +
    +      val df1 = Seq(("1", "a"), ("2", "b"), ("3", "c")).toDF("col1", "col2")
    +      val df2 = Seq(("d", "4"), ("e", "5"), ("f", "6")).toDF("col2", "col1")
    +      val unionDF = df1.unionByName(df2)
    +
    +      val dir1 = s"$path${File.separator}part=one"
    +      val dir2 = s"$path${File.separator}part=two"
    +
    +      df1.write.mode("overwrite").format(format).options(options).save(dir1)
    +      df2.write.mode("overwrite").format(format).options(options).save(dir2)
    +
    +      val df = spark.read
    +        .schema(unionDF.schema)
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +        .select("col1", "col2")
    +
    +      checkAnswer(df, unionDF)
    +    }
    +  }
    +}
    +
    +trait BooleanTypeEvolutionTest extends SchemaEvolutionTest {
    +  import testImplicits._
    +
    +  test("boolean to byte/short/int/long") {
    +    withTempDir { dir =>
    +      val path = dir.getCanonicalPath
    +
    +      val values = (1 to 10).map(_ % 2)
    +      val booleanDF = (1 to 10).map(_ % 2 == 1).toDF("col1")
    +      val byteDF = values.map(_.toByte).toDF("col1")
    +      val shortDF = values.map(_.toShort).toDF("col1")
    +      val intDF = values.toDF("col1")
    +      val longDF = values.map(_.toLong).toDF("col1")
    +
    +      booleanDF.write.mode("overwrite").format(format).options(options).save(path)
    +
    +      val df1 = spark.read
    +        .schema("col1 byte")
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +      checkAnswer(df1, byteDF)
    +
    +      val df2 = spark.read
    +        .schema("col1 short")
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +      checkAnswer(df2, shortDF)
    +
    +      val df3 = spark.read
    +        .schema("col1 int")
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +      checkAnswer(df3, intDF)
    +
    +      val df4 = spark.read
    +        .schema("col1 long")
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +      checkAnswer(df4, longDF)
    +    }
    +  }
    +}
    +
    +trait IntegralTypeEvolutionTest extends SchemaEvolutionTest {
    +
    +  import testImplicits._
    +
    +  test("change column type from `byte` to `short/int/long`") {
    --- End diff --
    
    Many of the tests here seem to duplicate code. Could we do something like the following?
    
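    ```scala
    // Sketch: pair each target type name with its expected DataFrame.
    // Assumes `path` and the DataFrames are in scope where the tests are registered.
    Seq("byte" -> byteDF, "short" -> shortDF, "int" -> intDF, "long" -> longDF).foreach {
      case (t, expected) =>
        test(s"boolean to $t") {
          val df = spark.read
            .schema(s"col1 $t")
            .format(format)
            .options(options)
            .load(path)
          checkAnswer(df, expected)
        }
    }
    ```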
    
    I am fine with any idea to deal with this duplication.
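
    As a rough illustration of how the type combinations for such a loop could be enumerated, here is a small Spark-free sketch. It assumes the integral types are ordered from narrowest to widest as in the quoted test names; `WideningPairs` and `wideningPairs` are hypothetical names, not part of the PR.

    ```scala
    object WideningPairs {
      // Integral types ordered from narrowest to widest, as in the quoted tests.
      val integralTypes: Seq[String] = Seq("byte", "short", "int", "long")

      // All (from, to) pairs where `to` appears after `from`, i.e. is strictly wider.
      def wideningPairs(types: Seq[String]): Seq[(String, String)] =
        for {
          (from, i) <- types.zipWithIndex
          to <- types.drop(i + 1)
        } yield (from, to)

      def main(args: Array[String]): Unit =
        wideningPairs(integralTypes).foreach { case (from, to) =>
          println(s"change column type from `$from` to `$to`")
        }
    }
    ```

    Each pair could then name one generated test, so that only safe (small-to-large) conversions are exercised and the reverse direction is never produced.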



---

[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r201170482
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data source options. For examp
     when `path/to/table/gender=male` is the path of the data and
     users set `basePath` to `path/to/table/`, `gender` will be a partitioning column.
     
    +### Schema Evolution
    --- End diff --
    
    I still want to avoid using `schema evolution` in the doc or tests. `Schema Projection` might be better. More importantly, you have to clarify that this only covers the read path.
    
    What is the behavior in the write path when the physical and data schemas are different?
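
    To make the term "projection" concrete for the read path, here is a toy, name-based model in plain Scala. It is not Spark's implementation, and `ProjectionModel`, `Record`, and `project` are hypothetical names. It only mirrors what the quoted tests assert: a column missing from a file reads as `null`, a column absent from the reader schema is dropped, and the output follows the reader's column order. The absence of case 3 for CSV in the coverage table suggests that not every format resolves columns by name, so this model should not be read as describing all sources.

    ```scala
    object ProjectionModel {
      // One stored record, keyed by column name.
      type Record = Map[String, Any]

      // Project a stored record onto the reader's column list:
      // columns absent from the record become null, columns absent from the
      // reader schema are dropped, and output order follows the reader schema.
      def project(readerColumns: Seq[String], record: Record): Seq[Any] =
        readerColumns.map(name => record.getOrElse(name, null))

      def main(args: Array[String]): Unit = {
        val oldFile: Record = Map("col1" -> "a")
        val newFile: Record = Map("col2" -> "x", "col1" -> "a", "col3" -> "y")
        val reader = Seq("col1", "col2")
        println(project(reader, oldFile))
        println(project(reader, newFile))
      }
    }
    ```

    The stored records are never modified in this model, which matches the point that only the read path is covered.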


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---

[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add read schema suite fo...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r201499769
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaTest.scala ---
    @@ -0,0 +1,493 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.io.File
    +
    +import org.apache.spark.sql.{QueryTest, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
    +
    +/**
    + * The reader schema is said to be evolved (or projected) when it changed after the data is
    --- End diff --
    
    Here, I used `evolved` and `projected` as general verbs.


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    **[Test build #92731 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92731/testReport)** for PR 20208 at commit [`ebd239e`](https://github.com/apache/spark/commit/ebd239eab0aa2b03b211cd470eb33d5a538f594a).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext `
      * `trait AddColumnEvolutionTest extends SchemaEvolutionTest `
      * `trait HideColumnAtTheEndEvolutionTest extends SchemaEvolutionTest `
      * `trait HideColumnInTheMiddleEvolutionTest extends SchemaEvolutionTest `
      * `trait ChangePositionEvolutionTest extends SchemaEvolutionTest `
      * `trait BooleanTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToStringTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait IntegralTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest `
      * `trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest `


---

[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Retest this please.


---

[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add read schema suite fo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r201506731
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaTest.scala ---
    @@ -0,0 +1,493 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.io.File
    +
    +import org.apache.spark.sql.{QueryTest, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
    +
    +/**
    + * The reader schema is said to be evolved (or projected) when it changes after the data has
    + * been written by writers. The following changes are supported in file-based data sources.
    + * Note that partition columns are not maintained in files. Here, `column` means non-partition
    + * column.
    + *
    + *   1. Add a column
    + *   2. Hide a column
    + *   3. Change a column position
    + *   4. Change a column type (Upcast)
    + *
    + * Here, we consider safe changes without data loss. For example, data type changes should be
    + * from small types to larger types like `int`-to-`long`, not vice versa.
    + *
    + * So far, file-based data sources have the following coverage.
    + *
    + *   | File Format  | Coverage     | Note                                                   |
    + *   | ------------ | ------------ | ------------------------------------------------------ |
    + *   | TEXT         | N/A          | Schema consists of a single string column.             |
    + *   | CSV          | 1, 2, 4      |                                                        |
    + *   | JSON         | 1, 2, 3, 4   |                                                        |
    + *   | ORC          | 1, 2, 3, 4   | Native vectorized ORC reader has the widest coverage.  |
    + *   | PARQUET      | 1, 2, 3      |                                                        |
    --- End diff --
    
    Thanks for helping improve the test coverage! All the included test cases are positive. How about negative test cases? What kinds of errors did you hit?
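
    As a starting point for picking negative cases, the "safe change" rule from the doc comment above (small types to larger types such as `int`-to-`long`, not vice versa) can be modeled as a small predicate. This is a hypothetical sketch that covers the integral types only; `SafeUpcast` and `isSafe` are illustrative names, not Spark APIs, and what a given reader actually does for an unsafe pair is not stated in this thread.

```scala
// Hypothetical model of the "safe change" rule; not Spark's implementation.
object SafeUpcast {
  // Integral type names ordered from narrowest to widest.
  private val integralOrder = Seq("byte", "short", "int", "long")

  /** True when both types are integral and `to` is at least as wide as `from`. */
  def isSafe(from: String, to: String): Boolean = {
    val i = integralOrder.indexOf(from)
    val j = integralOrder.indexOf(to)
    i >= 0 && j >= 0 && i <= j
  }
}
```

    Pairs for which `isSafe` returns false, such as `long` read as `int`, are candidates for negative tests; reading a column back with its own type counts as safe in this model.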


---


[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r162781325
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala ---
    @@ -0,0 +1,436 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.io.File
    +
    +import org.apache.spark.sql.{QueryTest, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
    +
    +/**
    + * A schema can evolve in several ways, and the following are supported in file-based data sources.
    + *
    + *   1. Add a column
    + *   2. Remove a column
    + *   3. Change a column position
    + *   4. Change a column type
    + *
    + * Here, we consider safe evolution without data loss. For example, data type evolution should be
    + * from small types to larger types like `int`-to-`long`, not vice versa.
    + *
    + * So far, file-based data sources have the following schema evolution coverage.
    + *
    + *   | File Format  | Coverage     | Note                                                   |
    + *   | ------------ | ------------ | ------------------------------------------------------ |
    + *   | TEXT         | N/A          | Schema consists of a single string column.             |
    + *   | CSV          | 1, 2, 4      |                                                        |
    + *   | JSON         | 1, 2, 3, 4   |                                                        |
    + *   | ORC          | 1, 2, 3, 4   | Native vectorized ORC reader has the widest coverage.  |
    + *   | PARQUET      | 1, 2, 3      |                                                        |
    + *
    + * This aims to provide explicit test coverage for schema evolution on file-based data sources.
    + * Since each file format has its own coverage of schema evolution, we need a test suite
    + * for each file-based data source that mixes in the test case traits it supports.
    + *
    + * The following is a hierarchy of test traits.
    + *
    + *   SchemaEvolutionTest
    + *     -> AddColumnEvolutionTest
    + *     -> RemoveColumnEvolutionTest
    + *     -> ChangePositionEvolutionTest
    + *     -> BooleanTypeEvolutionTest
    + *     -> IntegralTypeEvolutionTest
    + *     -> ToDoubleTypeEvolutionTest
    + *     -> ToDecimalTypeEvolutionTest
    + */
    +
    +trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext {
    +  val format: String
    +  val options: Map[String, String] = Map.empty[String, String]
    +}
    +
    +/**
    + * Add column.
    + * This test suite assumes that the missing column should be `null`.
    + */
    +trait AddColumnEvolutionTest extends SchemaEvolutionTest {
    +  import testImplicits._
    +
    +  test("append column at the end") {
    +    withTempDir { dir =>
    +      val path = dir.getCanonicalPath
    +
    +      val df1 = Seq("a", "b").toDF("col1")
    +      val df2 = df1.withColumn("col2", lit("x"))
    +      val df3 = df2.withColumn("col3", lit("y"))
    +
    +      val dir1 = s"$path${File.separator}part=one"
    +      val dir2 = s"$path${File.separator}part=two"
    +      val dir3 = s"$path${File.separator}part=three"
    +
    +      df1.write.mode("overwrite").format(format).options(options).save(dir1)
    +      df2.write.mode("overwrite").format(format).options(options).save(dir2)
    +      df3.write.mode("overwrite").format(format).options(options).save(dir3)
    +
    +      val df = spark.read
    +        .schema(df3.schema)
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +
    +      checkAnswer(df, Seq(
    +        Row("a", null, null, "one"),
    +        Row("b", null, null, "one"),
    +        Row("a", "x", null, "two"),
    +        Row("b", "x", null, "two"),
    +        Row("a", "x", "y", "three"),
    +        Row("b", "x", "y", "three")))
    +    }
    +  }
    +}
    +
    +/**
    + * Remove column.
    + * This test suite is identical to AddColumnEvolutionTest,
    + * but this test suite ensures that the schema and result are truncated to the given schema.
    + */
    +trait RemoveColumnEvolutionTest extends SchemaEvolutionTest {
    +  import testImplicits._
    +
    +  test("remove column at the end") {
    +    withTempDir { dir =>
    +      val path = dir.getCanonicalPath
    +
    +      val df1 = Seq(("1", "a"), ("2", "b")).toDF("col1", "col2")
    +      val df2 = df1.withColumn("col3", lit("y"))
    +
    +      val dir1 = s"$path${File.separator}part=two"
    +      val dir2 = s"$path${File.separator}part=three"
    +
    +      df1.write.mode("overwrite").format(format).options(options).save(dir1)
    +      df2.write.mode("overwrite").format(format).options(options).save(dir2)
    +
    +      val df = spark.read
    +        .schema(df1.schema)
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +
    +      checkAnswer(df, Seq(
    +        Row("1", "a", "two"),
    +        Row("2", "b", "two"),
    +        Row("1", "a", "three"),
    +        Row("2", "b", "three")))
    +    }
    +  }
    +}
    +
    +/**
    + * Change column positions.
    + * This suite assumes that all data sets have the same number of columns.
    + */
    +trait ChangePositionEvolutionTest extends SchemaEvolutionTest {
    +  import testImplicits._
    +
    +  test("change column position") {
    +    withTempDir { dir =>
    +      val path = dir.getCanonicalPath
    +
    +      val df1 = Seq(("1", "a"), ("2", "b"), ("3", "c")).toDF("col1", "col2")
    +      val df2 = Seq(("d", "4"), ("e", "5"), ("f", "6")).toDF("col2", "col1")
    +      val unionDF = df1.unionByName(df2)
    +
    +      val dir1 = s"$path${File.separator}part=one"
    +      val dir2 = s"$path${File.separator}part=two"
    +
    +      df1.write.mode("overwrite").format(format).options(options).save(dir1)
    +      df2.write.mode("overwrite").format(format).options(options).save(dir2)
    +
    +      val df = spark.read
    +        .schema(unionDF.schema)
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +        .select("col1", "col2")
    +
    +      checkAnswer(df, unionDF)
    +    }
    +  }
    +}
    +
    +trait BooleanTypeEvolutionTest extends SchemaEvolutionTest {
    +  import testImplicits._
    +
    +  test("boolean to byte/short/int/long") {
    +    withTempDir { dir =>
    +      val path = dir.getCanonicalPath
    +
    +      val values = (1 to 10).map(_ % 2)
    +      val booleanDF = (1 to 10).map(_ % 2 == 1).toDF("col1")
    +      val byteDF = values.map(_.toByte).toDF("col1")
    +      val shortDF = values.map(_.toShort).toDF("col1")
    +      val intDF = values.toDF("col1")
    +      val longDF = values.map(_.toLong).toDF("col1")
    +
    +      booleanDF.write.mode("overwrite").format(format).options(options).save(path)
    +
    +      val df1 = spark.read
    +        .schema("col1 byte")
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +      checkAnswer(df1, byteDF)
    +
    +      val df2 = spark.read
    +        .schema("col1 short")
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +      checkAnswer(df2, shortDF)
    +
    +      val df3 = spark.read
    +        .schema("col1 int")
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +      checkAnswer(df3, intDF)
    +
    +      val df4 = spark.read
    +        .schema("col1 long")
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +      checkAnswer(df4, longDF)
    +    }
    +  }
    +}
    +
    +trait IntegralTypeEvolutionTest extends SchemaEvolutionTest {
    +
    +  import testImplicits._
    +
    +  test("change column type from `byte` to `short/int/long`") {
    +    withTempDir { dir =>
    --- End diff --
    
    I think we can use `withTempPath` here.
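
    For context, the difference between the two helpers is that `withTempDir` hands the test a directory that already exists, while `withTempPath` hands it a fresh path that does not exist yet. The sketch below reimplements both with the JDK only, to show the contract; `TempHelpers` is a hypothetical name and this is not the code in Spark's `SQLTestUtils`.

```scala
import java.io.File
import java.nio.file.Files

// Hypothetical, JDK-only sketch of the two helper contracts.
object TempHelpers {
  // Recursively delete a file or directory; missing paths are ignored.
  private def deleteRecursively(f: File): Unit = {
    if (f.isDirectory) {
      Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    }
    f.delete()
    ()
  }

  /** Passes an existing, empty directory to `body`, then cleans up. */
  def withTempDir(body: File => Unit): Unit = {
    val dir = Files.createTempDirectory("sketch").toFile
    try body(dir) finally deleteRecursively(dir)
  }

  /** Passes a fresh path that does not exist yet to `body`, then cleans up. */
  def withTempPath(body: File => Unit): Unit = {
    val path = Files.createTempDirectory("sketch").toFile
    path.delete() // keep the unique name, drop the directory itself
    try body(path) finally deleteRecursively(path)
  }
}
```

    Because the path does not exist, a test that saves directly to it should not need `.mode("overwrite")`, assuming the writer's default save mode only fails when the target already exists.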


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1149/


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Retest this please.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87752/


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1243/


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Also, ping @sameeragarwal.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add read schema suite for file-...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92828/


---


[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20208
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85983/


---
