You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by yhuai <gi...@git.apache.org> on 2015/02/28 06:52:44 UTC

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

GitHub user yhuai opened a pull request:

    https://github.com/apache/spark/pull/4826

    [SPARK-5950][SQL]Insert array into a metastore table saved as parquet should work when using datasource api

    This PR contains the following changes:
    1. Add a new method, `DataType.equalsIgnoreCompatibleNullability`, which is the middle ground between DataType's equality check and `DataType.equalsIgnoreNullability`. For two data types `from` and `to`, it does `equalsIgnoreNullability` as well as if the nullability of `from` is compatible with that of `to`. For example, the nullability of `ArrayType(IntegerType, containsNull = false)` is compatible with that of `ArrayType(IntegerType, containsNull = true)` (for an array without null values, we can always say it may contain null values). However,  the nullability of `ArrayType(IntegerType, containsNull = true)` is compatible with that of `ArrayType(IntegerType, containsNull = false)` (for an array that may have null values, we cannot say it does not have null values).
    2. For the `resolved` field of `InsertIntoTable`, use `equalsIgnoreCompatibleNullability` to replace the equality check of the data types.
    3. For our data source write path, when appending data, we always use the schema of existing table to write the data. This is important for parquet, since nullability direct impacts the way to encode/decode values. If we do not do this, we may see corrupted values when reading values from a set of parquet files generated with different nullability settings.
    4. When generating a new parquet table, we always set nullable/containsNull/valueContainsNull to true. So, we will not face situations that we cannot append data because containsNull/valueContainsNull in an Array/Map column of the existing table has already been set to `false`. This change makes the whole data pipeline more robust.
    5. Update the equality check of JSON relation. Since JSON does not really cares nullability,  `equalsIgnoreNullability` seems a better choice to compare schemata from to JSON tables.
    
    cc @marmbrus @liancheng

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark insertNullabilityCheck

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4826.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4826
    
----
commit 4ec17fd28a45e42db92144af9cb8a8e7e796eb40
Author: Yin Huai <yh...@databricks.com>
Date:   2015-02-27T21:20:00Z

    Failed test.

commit 8f19fe520080f064b50dc5885f221889c2612eea
Author: Yin Huai <yh...@databricks.com>
Date:   2015-02-27T21:20:57Z

    equalsIgnoreCompatibleNullability

commit 9a266114fca979c69468709ed82fbb99fe2595e6
Author: Yin Huai <yh...@databricks.com>
Date:   2015-02-27T21:26:33Z

    Make InsertIntoTable happy.

commit 0a703e751cf0ebcd481f2f7dd66cc7bdea529f04
Author: Yin Huai <yh...@databricks.com>
Date:   2015-02-27T21:38:07Z

    Test failed again since we cannot read correct content.

commit bf50d7383e499cbf1e3964a9391d4e9b56607f32
Author: Yin Huai <yh...@databricks.com>
Date:   2015-02-28T05:33:43Z

    When appending data, we use the schema of the existing table instead of the schema of the new data.

commit 8bd008b403140b430344d669727410de7b4bc235
Author: Yin Huai <yh...@databricks.com>
Date:   2015-02-28T05:34:54Z

    nullable, containsNull, and valueContainsNull will be always true for parquet data.

commit b2c06f8c4e67450650b2a58c5168eb31cd490641
Author: Yin Huai <yh...@databricks.com>
Date:   2015-02-28T05:35:30Z

    Ignore nullability in JSON relation's equality check.

commit e4f397cea7ec0dc21a714b75a7254bb275319fc2
Author: Yin Huai <yh...@databricks.com>
Date:   2015-02-28T05:35:54Z

    Unit tests.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76863576
  
      [Test build #28204 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28204/consoleFull) for   PR 4826 at commit [`80e487e`](https://github.com/apache/spark/commit/80e487ea629c56864bf023c0e7431e8bc7b9f0b1).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76572957
  
      [Test build #28137 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28137/consoleFull) for   PR 4826 at commit [`d3747d1`](https://github.com/apache/spark/commit/d3747d174ee7493ce1dc0553d35a973512680675).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/4826#issuecomment-76519402

I wouldn't like to say this. But, @liancheng, @yhuai I think you should more respect other's contribution...

You in #4782 made changes to `ParquetConversions`. That is almost as same as what I did in #4729.

Again, the point 4 in this pr:

> 4. When generating a new parquet table, we always set nullable/containsNull/valueContainsNull to true. So, we will not face situations that we cannot append data because containsNull/valueContainsNull in an Array/Map column of the existing table has already been set to false. This change makes the whole data pipeline more robust.

This is exactly what I did in #4729. I remember you said this is a bad idea and we would not use it.

Basically, I can't see why #4729 can't be modified and merged to do that. And I think it is important to respect other's contribution and the time spent on it...

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/4826


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4826#discussion_r25595479
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
    @@ -172,9 +173,14 @@ private[sql] object ParquetRelation {
           sqlContext.conf.parquetCompressionCodec.toUpperCase, CompressionCodecName.UNCOMPRESSED)
           .name())
         ParquetRelation.enableLogForwarding()
    -    ParquetTypesConverter.writeMetaData(attributes, path, conf)
    +    // This is a hack. We always set nullable/containsNull/valueContainsNull to true
    +    // for the schema of a parquet data.
    +    val schema =
    +      DataType.alwaysNullable(StructType.fromAttributes(attributes)).asInstanceOf[StructType]
    +    val newAttributes = schema.toAttributes
    +    ParquetTypesConverter.writeMetaData(newAttributes, path, conf)
         new ParquetRelation(path.toString, Some(conf), sqlContext) {
    -      override val output = attributes
    +      override val output = newAttributes
    --- End diff --
    
    Verified that when merging schemas, official Parquet implementation will handle nullability (repetition level) properly. So our change should be safe for interoperation with other systems that support Parquet schema evolving.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76655461
  
    A typo in the PR description:
    
    > However, the nullability of `ArrayType(IntegerType, containsNull = true)` is compatible with that of `ArrayType(IntegerType, containsNull = false)`
    
    Here "compatible" should be "incompatible".


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76842310
  
      [Test build #28193 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28193/consoleFull) for   PR 4826 at commit [`587d88b`](https://github.com/apache/spark/commit/587d88b4ec438a38c6e4db55c9de4c654fd78210).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76513062
  
      [Test build #28113 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28113/consoleFull) for   PR 4826 at commit [`e4f397c`](https://github.com/apache/spark/commit/e4f397cea7ec0dc21a714b75a7254bb275319fc2).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76827466
  
      [Test build #28187 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28187/consoleFull) for   PR 4826 at commit [`0cb7ea2`](https://github.com/apache/spark/commit/0cb7ea27185db716ae5deddff00649064ddde860).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76882338
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28212/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76514620
  
      [Test build #28119 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28119/consoleFull) for   PR 4826 at commit [`0eb5578`](https://github.com/apache/spark/commit/0eb5578f8fc81c9c2186ffe7ba4b538f7a40f828).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76516089
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28119/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76861923
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28199/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4826#discussion_r25588714
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/dataTypes.scala ---
    @@ -198,6 +198,57 @@ object DataType {
           case (left, right) => left == right
         }
       }
    +
    +  /**
    +   * Compares two types, ignoring compatible nullability of ArrayType, MapType, StructType.
    +   *
    +   * Compatible nullability is defined as follows:
    +   *   - If `from` and `to` are ArrayTypes, `from` has a compatible nullability with `to`
    +   *   if and only if `from.containsNull` is false, or both of `from.containsNull` and
    +   *   `to.containsNull` are true.
    +   *   - If `from` and `to` are MapTypes, `from` has a compatible nullability with `to`
    +   *   if and only if `from.valueContainsNull` is false, or both of `from.valueContainsNull` and
    +   *   `to.valueContainsNull` are true.
    +   *   - If `from` and `to` are StructTypes, `from` has a compatible nullability with `to`
    +   *   if and only if for all every pair of fields, `fromField.nullable` is false, or both
    +   *   of `fromField.nullable` and `toField.nullable` are true.
    +   */
    +  private[spark] def equalsIgnoreCompatibleNullability(from: DataType, to: DataType): Boolean = {
    +    (from, to) match {
    +      case (ArrayType(fromElement, fn), ArrayType(toElement, tn)) =>
    +        (tn || !fn) && equalsIgnoreCompatibleNullability(fromElement, toElement)
    +
    +      case (MapType(fromKey, fromValue, fn), MapType(toKey, toValue, tn)) =>
    +        (tn || !fn) &&
    +          equalsIgnoreCompatibleNullability(fromKey, toKey) &&
    +          equalsIgnoreCompatibleNullability(fromValue, toValue)
    +
    +      case (StructType(fromFields), StructType(toFields)) =>
    +        fromFields.size == toFields.size &&
    +          fromFields.zip(toFields).forall {
    +            case (fromField, toField) =>
    +              fromField.name == toField.name &&
    +                (toField.nullable || !fromField.nullable) &&
    +                equalsIgnoreCompatibleNullability(fromField.dataType, toField.dataType)
    +          }
    +
    +      case (fromDataType, toDataType) => fromDataType == toDataType
    +    }
    +  }
    +
    +  /** Sets all nullable/containsNull/valueContainsNull to true. */
    +  private[spark] def alwaysNullable(dataType: DataType): DataType = dataType match {
    +    case ArrayType(elementType, _) =>
    +      ArrayType(alwaysNullable(elementType), containsNull = true)
    +    case MapType(keyType, valueType, _) =>
    +      MapType(alwaysNullable(keyType), alwaysNullable(valueType), true)
    --- End diff --
    
    Nit: Use named parameter for `valueContainsNull`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4826#discussion_r25597141
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala ---
    @@ -131,7 +131,7 @@ private[sql] case class JSONRelation(
     
       override def equals(other: Any): Boolean = other match {
         case that: JSONRelation =>
    -      (this.path == that.path) && (this.schema == that.schema)
    +      (this.path == that.path) && (DataType.equalsIgnoreNullability(this.schema, that.schema))
    --- End diff --
    
    Should also have this in `ParquetRelation2.equals`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4826#discussion_r25639687
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala ---
    @@ -115,4 +115,84 @@ class DataTypeSuite extends FunSuite {
       checkDefaultSize(MapType(IntegerType, StringType, true), 410000)
       checkDefaultSize(MapType(IntegerType, ArrayType(DoubleType), false), 80400)
       checkDefaultSize(structType, 812)
    +
    +  def checkEqualsIgnoreCompatibleNullability(
    +      from: DataType,
    +      to: DataType,
    +      expected: Boolean): Unit = {
    +    val testName =
    +      s"equalsIgnoreCompatibleNullability: (from: ${from}, to: ${to})"
    +    test(testName) {
    +      assert(DataType.equalsIgnoreCompatibleNullability(from, to) === expected)
    +    }
    +  }
    +
    +  checkEqualsIgnoreCompatibleNullability(
    +    from = ArrayType(DoubleType, containsNull = true),
    +    to = ArrayType(DoubleType, containsNull = true),
    +    expected = true)
    +  checkEqualsIgnoreCompatibleNullability(
    +    from = ArrayType(DoubleType, containsNull = false),
    +    to = ArrayType(DoubleType, containsNull = false),
    +    expected = true)
    +  checkEqualsIgnoreCompatibleNullability(
    +    from = ArrayType(DoubleType, containsNull = false),
    +    to = ArrayType(DoubleType, containsNull = true),
    +    expected = true)
    +  checkEqualsIgnoreCompatibleNullability(
    +    from = ArrayType(DoubleType, containsNull = true),
    +    to = ArrayType(DoubleType, containsNull = false),
    +    expected = false)
    +  checkEqualsIgnoreCompatibleNullability(
    +    from = ArrayType(DoubleType, containsNull = false),
    +    to = ArrayType(StringType, containsNull = false),
    +    expected = false)
    +
    +  checkEqualsIgnoreCompatibleNullability(
    +    from = MapType(StringType, DoubleType, valueContainsNull = true),
    +    to = MapType(StringType, DoubleType, valueContainsNull = true),
    +    expected = true)
    +  checkEqualsIgnoreCompatibleNullability(
    +    from = MapType(StringType, DoubleType, valueContainsNull = false),
    +    to = MapType(StringType, DoubleType, valueContainsNull = false),
    +    expected = true)
    +  checkEqualsIgnoreCompatibleNullability(
    +    from = MapType(StringType, DoubleType, valueContainsNull = false),
    +    to = MapType(StringType, DoubleType, valueContainsNull = true),
    +    expected = true)
    +  checkEqualsIgnoreCompatibleNullability(
    +    from = MapType(StringType, DoubleType, valueContainsNull = true),
    +    to = MapType(StringType, DoubleType, valueContainsNull = false),
    +    expected = false)
    +  checkEqualsIgnoreCompatibleNullability(
    +    from = MapType(StringType, ArrayType(IntegerType, true), valueContainsNull = true),
    +    to = MapType(StringType,  ArrayType(IntegerType, false), valueContainsNull = true),
    +    expected = false)
    --- End diff --
    
    Done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76841671
  
      [Test build #28187 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28187/consoleFull) for   PR 4826 at commit [`0cb7ea2`](https://github.com/apache/spark/commit/0cb7ea27185db716ae5deddff00649064ddde860).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76842454
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28193/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76538036
  
    @yhuai @liancheng I have no much to say. As you already opened these new prs that do what I submitted before in #4729, I closed the pr
    
    Although you have explained why you did this, looks like you are not doing what you meant. You said you would not use my proposed approach, but then you create this to use it.
    
    Maybe you should read these articles for pr etiquette.
    
    http://readwrite.com/2014/07/02/github-pull-request-etiquette
    
    http://www.quora.com/GitHub/Is-it-bad-etiquette-to-change-someones-pull-request-before-committing


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76512669
  
      [Test build #28113 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28113/consoleFull) for   PR 4826 at commit [`e4f397c`](https://github.com/apache/spark/commit/e4f397cea7ec0dc21a714b75a7254bb275319fc2).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76513066
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28113/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76577757
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28137/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4826#discussion_r25561465
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/modelSaveLoad.scala ---
    @@ -110,7 +110,7 @@ private[mllib] object Loader {
           assert(loadedFields.contains(field.name), s"Unable to parse model data." +
             s"  Expected field with name ${field.name} was missing in loaded schema:" +
             s" ${loadedFields.mkString(", ")}")
    -      assert(loadedFields(field.name) == field.dataType,
    +      assert(DataType.equalsIgnoreNullability(loadedFields(field.name), field.dataType),
    --- End diff --
    
    @mengxr Is it OK to ignore the nullability of ArrayType (containsNull field) and MapType (valueContainsNull field) in this check? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76556397
  
      [Test build #28132 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28132/consoleFull) for   PR 4826 at commit [`8360817`](https://github.com/apache/spark/commit/8360817f1097c127602c681c7c6e7fb9238c4081).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76861917
  
      [Test build #28199 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28199/consoleFull) for   PR 4826 at commit [`80e487e`](https://github.com/apache/spark/commit/80e487ea629c56864bf023c0e7431e8bc7b9f0b1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class MatrixFactorizationModel(JavaModelWrapper, Saveable, JavaLoader):`
      * `class Saveable(object):`
      * `class Loader(object):`
      * `class JavaLoader(Loader):`
      * `        java_class = ".".join([java_package, cls.__name__])`
      * `                logError("User class threw exception: " + cause.getMessage, cause)`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76875544
  
      [Test build #28204 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28204/consoleFull) for   PR 4826 at commit [`80e487e`](https://github.com/apache/spark/commit/80e487ea629c56864bf023c0e7431e8bc7b9f0b1).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4826#discussion_r25639644
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/dataTypes.scala ---
    @@ -198,6 +198,43 @@ object DataType {
           case (left, right) => left == right
         }
       }
    +
    +  /**
    +   * Compares two types, ignoring compatible nullability of ArrayType, MapType, StructType.
    +   *
    +   * Compatible nullability is defined as follows:
    +   *   - If `from` and `to` are ArrayTypes, `from` has a compatible nullability with `to`
    +   *   if and only if `to.containsNull` is true, or both of `from.containsNull` and
    +   *   `to.containsNull` are false.
    +   *   - If `from` and `to` are MapTypes, `from` has a compatible nullability with `to`
    +   *   if and only if `to.valueContainsNull` is true, or both of `from.valueContainsNull` and
    +   *   `to.valueContainsNull` are false.
    +   *   - If `from` and `to` are StructTypes, `from` has a compatible nullability with `to`
    +   *   if and only if for all every pair of fields, `to.nullable` is true, or both
    +   *   of `fromField.nullable` and `toField.nullable` are false.
    +   */
    +  private[sql] def equalsIgnoreCompatibleNullability(from: DataType, to: DataType): Boolean = {
    --- End diff --
    
    We can introduce a method to the class of `DataType` based on this one later (I am not sure what will be a good name. I thought about `compatibleWith`, but I feel it is not very accurate). 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76874204
  
      [Test build #28212 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28212/consoleFull) for   PR 4826 at commit [`3b61a04`](https://github.com/apache/spark/commit/3b61a0456a3169a6fa39e5e45e0bad8e571b9a4d).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4826#discussion_r25588958
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala ---
    @@ -115,4 +115,84 @@ class DataTypeSuite extends FunSuite {
       checkDefaultSize(MapType(IntegerType, StringType, true), 410000)
       checkDefaultSize(MapType(IntegerType, ArrayType(DoubleType), false), 80400)
       checkDefaultSize(structType, 812)
    +
    +  def checkEqualsIgnoreCompatibleNullability(
    +      from: DataType,
    +      to: DataType,
    +      expected: Boolean): Unit = {
    +    val testName =
    +      s"equalsIgnoreCompatibleNullability: (from: ${from}, to: ${to})"
    +    test(testName) {
    +      assert(DataType.equalsIgnoreCompatibleNullability(from, to) === expected)
    +    }
    +  }
    +
    +  checkEqualsIgnoreCompatibleNullability(
    +    from = ArrayType(DoubleType, containsNull = true),
    +    to = ArrayType(DoubleType, containsNull = true),
    +    expected = true)
    +  checkEqualsIgnoreCompatibleNullability(
    +    from = ArrayType(DoubleType, containsNull = false),
    +    to = ArrayType(DoubleType, containsNull = false),
    +    expected = true)
    +  checkEqualsIgnoreCompatibleNullability(
    +    from = ArrayType(DoubleType, containsNull = false),
    +    to = ArrayType(DoubleType, containsNull = true),
    +    expected = true)
    +  checkEqualsIgnoreCompatibleNullability(
    +    from = ArrayType(DoubleType, containsNull = true),
    +    to = ArrayType(DoubleType, containsNull = false),
    +    expected = false)
    +  checkEqualsIgnoreCompatibleNullability(
    +    from = ArrayType(DoubleType, containsNull = false),
    +    to = ArrayType(StringType, containsNull = false),
    +    expected = false)
    +
    +  checkEqualsIgnoreCompatibleNullability(
    +    from = MapType(StringType, DoubleType, valueContainsNull = true),
    +    to = MapType(StringType, DoubleType, valueContainsNull = true),
    +    expected = true)
    +  checkEqualsIgnoreCompatibleNullability(
    +    from = MapType(StringType, DoubleType, valueContainsNull = false),
    +    to = MapType(StringType, DoubleType, valueContainsNull = false),
    +    expected = true)
    +  checkEqualsIgnoreCompatibleNullability(
    +    from = MapType(StringType, DoubleType, valueContainsNull = false),
    +    to = MapType(StringType, DoubleType, valueContainsNull = true),
    +    expected = true)
    +  checkEqualsIgnoreCompatibleNullability(
    +    from = MapType(StringType, DoubleType, valueContainsNull = true),
    +    to = MapType(StringType, DoubleType, valueContainsNull = false),
    +    expected = false)
    +  checkEqualsIgnoreCompatibleNullability(
    +    from = MapType(StringType, ArrayType(IntegerType, true), valueContainsNull = true),
    +    to = MapType(StringType,  ArrayType(IntegerType, false), valueContainsNull = true),
    +    expected = false)
    --- End diff --
    
    Would be good to add another test case to show nested case:
    
    ```scala
      checkEqualsIgnoreCompatibleNullability(
        from = MapType(StringType, ArrayType(IntegerType, false), valueContainsNull = true),
        to = MapType(StringType,  ArrayType(IntegerType, true), valueContainsNull = true),
        expected = true)
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76556398
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28132/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76863430
  
    test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4826#discussion_r25593397
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
    @@ -172,9 +173,14 @@ private[sql] object ParquetRelation {
           sqlContext.conf.parquetCompressionCodec.toUpperCase, CompressionCodecName.UNCOMPRESSED)
           .name())
         ParquetRelation.enableLogForwarding()
    -    ParquetTypesConverter.writeMetaData(attributes, path, conf)
    +    // This is a hack. We always set nullable/containsNull/valueContainsNull to true
    +    // for the schema of a parquet data.
    +    val schema =
    +      DataType.alwaysNullable(StructType.fromAttributes(attributes)).asInstanceOf[StructType]
    +    val newAttributes = schema.toAttributes
    +    ParquetTypesConverter.writeMetaData(newAttributes, path, conf)
         new ParquetRelation(path.toString, Some(conf), sqlContext) {
    -      override val output = attributes
    +      override val output = newAttributes
    --- End diff --
    
    Never mind, since we always write nullable data, it should be OK to leave `ParquetRelation.output` untouched.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76577750
  
      [Test build #28137 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28137/consoleFull) for   PR 4826 at commit [`d3747d1`](https://github.com/apache/spark/commit/d3747d174ee7493ce1dc0553d35a973512680675).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4826#discussion_r25593111
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
    @@ -172,9 +173,14 @@ private[sql] object ParquetRelation {
           sqlContext.conf.parquetCompressionCodec.toUpperCase, CompressionCodecName.UNCOMPRESSED)
           .name())
         ParquetRelation.enableLogForwarding()
    -    ParquetTypesConverter.writeMetaData(attributes, path, conf)
    +    // This is a hack. We always set nullable/containsNull/valueContainsNull to true
    +    // for the schema of a parquet data.
    +    val schema =
    +      DataType.alwaysNullable(StructType.fromAttributes(attributes)).asInstanceOf[StructType]
    +    val newAttributes = schema.toAttributes
    +    ParquetTypesConverter.writeMetaData(newAttributes, path, conf)
         new ParquetRelation(path.toString, Some(conf), sqlContext) {
    -      override val output = attributes
    +      override val output = newAttributes
    --- End diff --
    
    Should we also make data types of `ParquetRelation.output` always nullable?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76847366
  
      [Test build #28199 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28199/consoleFull) for   PR 4826 at commit [`80e487e`](https://github.com/apache/spark/commit/80e487ea629c56864bf023c0e7431e8bc7b9f0b1).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76516087
  
      [Test build #28119 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28119/consoleFull) for   PR 4826 at commit [`0eb5578`](https://github.com/apache/spark/commit/0eb5578f8fc81c9c2186ffe7ba4b538f7a40f828).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4826#discussion_r25588553
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/dataTypes.scala ---
    @@ -198,6 +198,57 @@ object DataType {
           case (left, right) => left == right
         }
       }
    +
    +  /**
    +   * Compares two types, ignoring compatible nullability of ArrayType, MapType, StructType.
    +   *
    +   * Compatible nullability is defined as follows:
    +   *   - If `from` and `to` are ArrayTypes, `from` has a compatible nullability with `to`
    +   *   if and only if `from.containsNull` is false, or both of `from.containsNull` and
    +   *   `to.containsNull` are true.
    --- End diff --
    
    Same applies to `MapType` and `StructType` below.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4826#discussion_r25561497
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
    @@ -172,9 +173,14 @@ private[sql] object ParquetRelation {
           sqlContext.conf.parquetCompressionCodec.toUpperCase, CompressionCodecName.UNCOMPRESSED)
           .name())
         ParquetRelation.enableLogForwarding()
    -    ParquetTypesConverter.writeMetaData(attributes, path, conf)
    +    // This is a hack. We always set nullable/containsNull/valueContainsNull to true
    +    // for the schema of a parquet data.
    +    val schema =
    +      DataType.alwaysNullable(StructType.fromAttributes(attributes)).asInstanceOf[StructType]
    +    val newAttributes = schema.toAttributes
    +    ParquetTypesConverter.writeMetaData(newAttributes, path, conf)
         new ParquetRelation(path.toString, Some(conf), sqlContext) {
    -      override val output = attributes
    +      override val output = newAttributes
    --- End diff --
    
    @liancheng @marmbrus I am also changing the nullability for our old parquet write path to make the behavior consistent with our new write path. Let me know if there is any potential compatibility issue and we should revert this change. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76841684
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28187/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76842451
  
      [Test build #28193 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28193/consoleFull) for   PR 4826 at commit [`587d88b`](https://github.com/apache/spark/commit/587d88b4ec438a38c6e4db55c9de4c654fd78210).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4826#discussion_r25562103
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala ---
    @@ -592,13 +595,79 @@ class MetastoreDataSourcesSuite extends QueryTest with BeforeAndAfterEach {
         }
       }
     
    +  test("Pre insert nullability check (ArrayType)") {
    +    val df1 =
    +      createDataFrame(Tuple1(Seq(Int.box(1), null.asInstanceOf[Integer])) :: Nil).toDF("a")
    +    val expectedSchema1 =
    +      StructType(
    +        StructField("a", ArrayType(IntegerType, containsNull = true), nullable = true) :: Nil)
    +    assert(df1.schema === expectedSchema1)
    +    df1.saveAsTable("arrayInParquet", "parquet", SaveMode.Overwrite)
    +
    +    val df2 =
    +      createDataFrame(Tuple1(Seq(2, 3)) :: Nil).toDF("a")
    +    val expectedSchema2 =
    +      StructType(
    +        StructField("a", ArrayType(IntegerType, containsNull = false), nullable = true) :: Nil)
    +    assert(df2.schema === expectedSchema2)
    +    df2.insertInto("arrayInParquet", overwrite = false)
    +    createDataFrame(Tuple1(Seq(4, 5)) :: Nil).toDF("a")
    +      .saveAsTable("arrayInParquet", SaveMode.Append) // This one internally calls df2.insertInto.
    +    createDataFrame(Tuple1(Seq(Int.box(6), null.asInstanceOf[Integer])) :: Nil).toDF("a")
    +      .saveAsTable("arrayInParquet", "parquet", SaveMode.Append)
    +    refreshTable("arrayInParquet")
    +
    +    checkAnswer(
    +      sql("SELECT a FROM arrayInParquet"),
    +      Row(ArrayBuffer(1, null)) ::
    +        Row(ArrayBuffer(2, 3)) ::
    +        Row(ArrayBuffer(4, 5)) ::
    +        Row(ArrayBuffer(6, null)) :: Nil)
    +
    +    sql("DROP TABLE arrayInParquet")
    +  }
    +
    +  test("Pre insert nullability check (MapType)") {
    +    val df1 =
    +      createDataFrame(Tuple1(Map(1 -> null.asInstanceOf[Integer])) :: Nil).toDF("a")
    +    val mapType1 = MapType(IntegerType, IntegerType, valueContainsNull = true)
    +    val expectedSchema1 =
    +      StructType(
    +        StructField("a", mapType1, nullable = true) :: Nil)
    +    assert(df1.schema === expectedSchema1)
    +    df1.saveAsTable("mapInParquet", "parquet", SaveMode.Overwrite)
    +
    +    val df2 =
    +      createDataFrame(Tuple1(Map(2 -> 3)) :: Nil).toDF("a")
    +    val mapType2 = MapType(IntegerType, IntegerType, valueContainsNull = false)
    +    val expectedSchema2 =
    +      StructType(
    +        StructField("a", mapType2, nullable = true) :: Nil)
    +    assert(df2.schema === expectedSchema2)
    +    df2.insertInto("mapInParquet", overwrite = false)
    +    createDataFrame(Tuple1(Map(4 -> 5)) :: Nil).toDF("a")
    +      .saveAsTable("mapInParquet", SaveMode.Append) // This one internally calls df2.insertInto.
    +    createDataFrame(Tuple1(Map(6 -> null.asInstanceOf[Integer])) :: Nil).toDF("a")
    +      .saveAsTable("mapInParquet", "parquet", SaveMode.Append)
    +    refreshTable("mapInParquet")
    +
    +    checkAnswer(
    +      sql("SELECT a FROM mapInParquet"),
    +      Row(Map(1 -> null)) ::
    +        Row(Map(2 -> 3)) ::
    +        Row(Map(4 -> 5)) ::
    +        Row(Map(6 -> null)) :: Nil)
    +
    +    sql("DROP TABLE mapInParquet")
    +  }
    +
       test("SPARK-6024 wide schema support") {
         // We will need 80 splits for this schema if the threshold is 4000.
         val schema = StructType((1 to 5000).map(i => StructField(s"c_${i}", StringType, true)))
         assert(
           schema.json.size > conf.schemaStringLengthThreshold,
           "To correctly test the fix of SPARK-6024, the value of " +
    -      s"spark.sql.sources.schemaStringLengthThreshold needs to be less than ${schema.json.size}")
    +        s"spark.sql.sources.schemaStringLengthThreshold needs to be less than ${schema.json.size}")
    --- End diff --
    
    looks like a spurious change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76547402
  
      [Test build #28132 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28132/consoleFull) for   PR 4826 at commit [`8360817`](https://github.com/apache/spark/commit/8360817f1097c127602c681c7c6e7fb9238c4081).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4826#discussion_r25597104
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
    @@ -172,9 +173,14 @@ private[sql] object ParquetRelation {
           sqlContext.conf.parquetCompressionCodec.toUpperCase, CompressionCodecName.UNCOMPRESSED)
           .name())
         ParquetRelation.enableLogForwarding()
    -    ParquetTypesConverter.writeMetaData(attributes, path, conf)
    +    // This is a hack. We always set nullable/containsNull/valueContainsNull to true
    +    // for the schema of a parquet data.
    +    val schema =
    +      DataType.alwaysNullable(StructType.fromAttributes(attributes)).asInstanceOf[StructType]
    --- End diff --
    
    Should also make `ParquetRelation2.schema` always nullable.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76882331
  
      [Test build #28212 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28212/consoleFull) for   PR 4826 at commit [`3b61a04`](https://github.com/apache/spark/commit/3b61a0456a3169a6fa39e5e45e0bad8e571b9a4d).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4826#discussion_r25588519
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/dataTypes.scala ---
    @@ -198,6 +198,57 @@ object DataType {
           case (left, right) => left == right
         }
       }
    +
    +  /**
    +   * Compares two types, ignoring compatible nullability of ArrayType, MapType, StructType.
    +   *
    +   * Compatible nullability is defined as follows:
    +   *   - If `from` and `to` are ArrayTypes, `from` has a compatible nullability with `to`
    +   *   if and only if `from.containsNull` is false, or both of `from.containsNull` and
    +   *   `to.containsNull` are true.
    --- End diff --
    
    Might be better to reword this definition to make it consistent with the code:
    
    > If `from` and `to` are `ArrayType`s, `from` has a compatible nullablity with `to` if and only if `to.containsNull` is true or both `to.containsNull` and `from.containsNull` are false.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76875551
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28204/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5950][SQL]Insert array into a metastore...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/4826#issuecomment-76715599
  
    @yhuai @marmbrus This PR basically makes both JSON and Parquet data sources always nullable. I agree that this makes most common cases more robust. Should we take a more aggressive step and just remove `nullable`/`containsNull`/`valueContainsNull` fields and assume all data sources are always nullable? Up until now, I can only see troubles rather than any benefits brought by nullability. Of course, this is out of the scope of this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org