You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/02/06 14:36:06 UTC

[GitHub] [spark] xuanyuanking opened a new pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

xuanyuanking opened a new pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478
 
 
   ### What changes were proposed in this pull request?
   This is a follow-up for #23124, add a new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` to control the behavior of removing duplicated map keys in build-in functions. With the default value `false`, Spark will throw a RuntimeException while duplicated keys are found.
   
   ### Why are the changes needed?
   Prevent silent behavior changes.
   
   
   ### Does this PR introduce any user-facing change?
   Yes, new config added and the default behavior for duplicated map keys changed to RuntimeException thrown.
   
   ### How was this patch tested?
   Modify existing UT.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586568447
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23226/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583705939
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583685459
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22818/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583050788
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117998/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376213630
 
 

 ##########
 File path: docs/sql-migration-guide.md
 ##########
 @@ -49,7 +49,7 @@ license: |
 
   - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.
 
-  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added, with the default value `false`, Spark will throw RuntimeException while duplicated keys are found. If set to `true`, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
 
 Review comment:
   Hi, @cloud-fan , @gatorsmile , @rxin , @marmbrus .
   So, `RuntimeException by default` is better in Apache Spark 3.0.0?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376163252
 
 

 ##########
 File path: python/pyspark/sql/functions.py
 ##########
 @@ -2766,13 +2766,15 @@ def map_concat(*cols):
     :param cols: list of column names (string) or list of :class:`Column` expressions
 
     >>> from pyspark.sql.functions import map_concat
+    >>> spark.conf.set("spark.sql.deduplicateMapKey.lastWinsPolicy.enabled", "true")
 
 Review comment:
   I would don't set this configuration, and change:
   
   ```diff
   - >>> df = spark.sql("SELECT map(1, 'a', 2, 'b') as map1, map(3, 'c', 1, 'd') as map2")
   + >>> df = spark.sql("SELECT map(1, 'a', 2, 'b') as map1, map(3, 'c') as map2")
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583050397
 
 
   **[Test build #117998 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117998/testReport)** for PR 27478 at commit [`09cfb67`](https://github.com/apache/spark/commit/09cfb674f835015c9de15b24ea958f5fb47fb915).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583557917
 
 
   **[Test build #118040 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118040/testReport)** for PR 27478 at commit [`fe1fb0f`](https://github.com/apache/spark/commit/fe1fb0f28bf7a448051b46517dc6ceed4fd78288).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583050774
 
 
   Merged build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586568447
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23226/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586908624
 
 
   retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-587120130
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118578/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586241007
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23181/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r379831096
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala
 ##########
 @@ -63,6 +68,11 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Seria
       keys.append(key)
       values.append(value)
     } else {
+      if (!allowDuplicatedMapKey) {
+        throw new RuntimeException(s"Duplicate map key $key was founded, please set " +
 
 Review comment:
   We shouldn't recommend users to set legacy config to fix a problem. We should ask them to check the input data to see why there are duplicated keys. If they want to remove duplicated keys, tell them that they can set the legacy config which uses last wins policy.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan closed pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586976283
 
 
   **[Test build #118578 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118578/testReport)** for PR 27478 at commit [`feb8c0a`](https://github.com/apache/spark/commit/feb8c0aa28034c166f6413bd1eec58a885848192).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376253512
 
 

 ##########
 File path: docs/sql-migration-guide.md
 ##########
 @@ -49,7 +49,7 @@ license: |
 
   - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.
 
-  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added, with the default value `false`, Spark will throw RuntimeException while duplicated keys are found. If set to `true`, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
 
 Review comment:
   Should we mention this `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` config is only useful for migration and could be removed in future version?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583050788
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117998/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586864334
 
 
   Merged build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583155463
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22770/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586587027
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118468/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583453786
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22805/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586587025
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376678023
 
 

 ##########
 File path: docs/sql-migration-guide.md
 ##########
 @@ -49,7 +49,7 @@ license: |
 
   - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.
 
-  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added, with the default value `false`, Spark will throw RuntimeException while duplicated keys are found. If set to `true`, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
 
 Review comment:
   Agree to have a guideline if we need the new rule, but `correctness bug fixes` might not the issue we discuss here? If it's a bug, why we need to have a legacy fallback config.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586976283
 
 
   **[Test build #118578 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118578/testReport)** for PR 27478 at commit [`feb8c0a`](https://github.com/apache/spark/commit/feb8c0aa28034c166f6413bd1eec58a885848192).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586854920
 
 
   **[Test build #118547 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118547/testReport)** for PR 27478 at commit [`feb8c0a`](https://github.com/apache/spark/commit/feb8c0aa28034c166f6413bd1eec58a885848192).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583155458
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583452915
 
 
   **[Test build #118040 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118040/testReport)** for PR 27478 at commit [`fe1fb0f`](https://github.com/apache/spark/commit/fe1fb0f28bf7a448051b46517dc6ceed4fd78288).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586235313
 
 
   **[Test build #118423 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118423/testReport)** for PR 27478 at commit [`f622180`](https://github.com/apache/spark/commit/f622180e40ed984d593bc95ad2a1b7ad96f2af1c).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586864352
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118537/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586235313
 
 
   **[Test build #118423 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118423/testReport)** for PR 27478 at commit [`f622180`](https://github.com/apache/spark/commit/f622180e40ed984d593bc95ad2a1b7ad96f2af1c).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r379999008
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala
 ##########
 @@ -63,6 +68,11 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Seria
       keys.append(key)
       values.append(value)
     } else {
+      if (!allowDuplicatedMapKey) {
+        throw new RuntimeException(s"Duplicate map key $key was founded, please set " +
 
 Review comment:
   Thanks, done in 905ae96.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586834168
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586970413
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118567/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376438874
 
 

 ##########
 File path: docs/sql-migration-guide.md
 ##########
 @@ -49,7 +49,7 @@ license: |
 
   - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.
 
-  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added, with the default value `false`, Spark will throw RuntimeException while duplicated keys are found. If set to `true`, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
 
 Review comment:
   +1 for the idea of highlight this config is migration only.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586859653
 
 
   Merged build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586859653
 
 
   Merged build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang edited a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
gengliangwang edited a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-585039004
 
 
   I searched "deduplicate map key" and there is no matching result.
   Since there will be runtime exception on duplicated keys, how about renaming the config to `spark.sql.allowDuplicatedMapKeys.enabled`?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586568444
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586970149
 
 
   **[Test build #118567 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118567/testReport)** for PR 27478 at commit [`feb8c0a`](https://github.com/apache/spark/commit/feb8c0aa28034c166f6413bd1eec58a885848192).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376678023
 
 

 ##########
 File path: docs/sql-migration-guide.md
 ##########
 @@ -49,7 +49,7 @@ license: |
 
   - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.
 
-  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added, with the default value `false`, Spark will throw RuntimeException while duplicated keys are found. If set to `true`, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
 
 Review comment:
   Agree to have a guideline if we need the new rule, to separate the silent result changing here is caused by the behavior change, not by the bug fixes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-582937112
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22763/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586241007
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23181/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586909169
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586973268
 
 
   retest it please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586912213
 
 
   **[Test build #118567 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118567/testReport)** for PR 27478 at commit [`feb8c0a`](https://github.com/apache/spark/commit/feb8c0aa28034c166f6413bd1eec58a885848192).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583685270
 
 
   **[Test build #118052 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118052/testReport)** for PR 27478 at commit [`66dc51f`](https://github.com/apache/spark/commit/66dc51f0bc0fb45f7c44ef72f26e1918e01d0212).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r378054659
 
 

 ##########
 File path: docs/sql-migration-guide.md
 ##########
 @@ -49,7 +49,7 @@ license: |
 
   - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.
 
-  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added, with the default value `false`, Spark will throw RuntimeException while duplicated keys are found. If set to `true`, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
 
 Review comment:
   What I have in mind is:
   - if it's a bug fix (the previous result is definitely wrong), then no config is needed. If the impact is big, we can add a legacy config which is false by default.
   - if it makes the behavior better, we should either add a config and use the old behavior by default, or fail by default and ask users to set config explicitly and pick the desired behavior.
   
   I'm trying to think more cases, will send an email to the dev list soon.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583452915
 
 
   **[Test build #118040 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118040/testReport)** for PR 27478 at commit [`fe1fb0f`](https://github.com/apache/spark/commit/fe1fb0f28bf7a448051b46517dc6ceed4fd78288).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376449589
 
 

 ##########
 File path: python/pyspark/sql/functions.py
 ##########
 @@ -2766,13 +2766,15 @@ def map_concat(*cols):
     :param cols: list of column names (string) or list of :class:`Column` expressions
 
     >>> from pyspark.sql.functions import map_concat
+    >>> spark.conf.set("spark.sql.deduplicateMapKey.lastWinsPolicy.enabled", "true")
 
 Review comment:
   Thanks, done in fe1fb0f

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586253212
 
 
   LGTM, can you update PR title as well?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586257473
 
 
   Merged build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586859630
 
 
   **[Test build #118547 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118547/testReport)** for PR 27478 at commit [`feb8c0a`](https://github.com/apache/spark/commit/feb8c0aa28034c166f6413bd1eec58a885848192).
    * This patch **fails to generate documentation**.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586853035
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-582937097
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586833788
 
 
   **[Test build #118537 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118537/testReport)** for PR 27478 at commit [`905ae96`](https://github.com/apache/spark/commit/905ae96e60f43788fded374e9e7a597c2d973f40).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586871501
 
 
   retest it please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586853035
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586909181
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23320/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586864090
 
 
   **[Test build #118537 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118537/testReport)** for PR 27478 at commit [`905ae96`](https://github.com/apache/spark/commit/905ae96e60f43788fded374e9e7a597c2d973f40).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586257479
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118423/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583153477
 
 
   Retest this please.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583209717
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118005/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586587025
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-582936565
 
 
   **[Test build #117998 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117998/testReport)** for PR 27478 at commit [`09cfb67`](https://github.com/apache/spark/commit/09cfb674f835015c9de15b24ea958f5fb47fb915).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586586867
 
 
   **[Test build #118468 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118468/testReport)** for PR 27478 at commit [`b102c36`](https://github.com/apache/spark/commit/b102c361bd800f3183db029da42a069201b9f39d).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376262398
 
 

 ##########
 File path: docs/sql-migration-guide.md
 ##########
 @@ -49,7 +49,7 @@ license: |
 
   - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.
 
-  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added, with the default value `false`, Spark will throw RuntimeException while duplicated keys are found. If set to `true`, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
 
 Review comment:
   I agree with the idea to avoid  "silent result changing". Btw, we couldn't keep the old (spark 2.4) behaviour for duplicate keys by using a legacy option? If we couldn't do because of some reasons, the proposed one (the runtime exception) looks reasonable to me, too. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586973691
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23332/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583155092
 
 
   **[Test build #118005 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118005/testReport)** for PR 27478 at commit [`09cfb67`](https://github.com/apache/spark/commit/09cfb674f835015c9de15b24ea958f5fb47fb915).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376254342
 
 

 ##########
 File path: docs/sql-migration-guide.md
 ##########
 @@ -49,7 +49,7 @@ license: |
 
   - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.
 
-  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added, with the default value `false`, Spark will throw RuntimeException while duplicated keys are found. If set to `true`, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
 
 Review comment:
   This idea sounds reasonable. If we think a behavior change is correct but for migration reason, we don't want a silent behavior change, a config like this one is an option we can use to notify users such change.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376441716
 
 

 ##########
 File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
 ##########
 @@ -651,8 +651,10 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
       Row(null)
     )
 
-    checkAnswer(df1.selectExpr("map_concat(map1, map2)"), expected1a)
-    checkAnswer(df1.select(map_concat($"map1", $"map2")), expected1a)
+    withSQLConf(SQLConf.DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY.key -> "true") {
 
 Review comment:
   Thanks, added in `ArrayBasedMapBuilderSuite`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376254710
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
 ##########
 @@ -2167,6 +2167,14 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY =
+    buildConf("spark.sql.deduplicateMapKey.lastWinsPolicy.enabled")
+      .doc("When true, use last wins policy to remove duplicated map keys in built-in functions, " +
+        "this config takes effect in below build-in functions: CreateMap, MapFromArrays, " +
+        "MapFromEntries, StringToMap, MapConcat and TransformKeys.")
 
 Review comment:
   We should mention if false then an exception will be thrown.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586834168
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376251516
 
 

 ##########
 File path: python/pyspark/sql/functions.py
 ##########
 @@ -2766,13 +2766,15 @@ def map_concat(*cols):
     :param cols: list of column names (string) or list of :class:`Column` expressions
 
     >>> from pyspark.sql.functions import map_concat
+    >>> spark.conf.set("spark.sql.deduplicateMapKey.lastWinsPolicy.enabled", "true")
 
 Review comment:
   +1

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583558598
 
 
   Merged build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586224976
 
 
   Thanks Gengliang, as the config still related to legacy behavior, I rename it to `spark.sql.legacy.allowDuplicatedMapKeys`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang edited a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
gengliangwang edited a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-585039004
 
 
   I google searched "deduplicate map key" and there is no matching result.
   Since there will be runtime exception on duplicated keys, how about renaming the config to `spark.sql.allowDuplicatedMapKeys.enabled`?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-582937097
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583453786
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22805/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586854920
 
 
   **[Test build #118547 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118547/testReport)** for PR 27478 at commit [`feb8c0a`](https://github.com/apache/spark/commit/feb8c0aa28034c166f6413bd1eec58a885848192).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586568228
 
 
   **[Test build #118468 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118468/testReport)** for PR 27478 at commit [`b102c36`](https://github.com/apache/spark/commit/b102c361bd800f3183db029da42a069201b9f39d).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586240967
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586863059
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583685454
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583558598
 
 
   Merged build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583705941
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118052/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-582937112
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22763/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376508834
 
 

 ##########
 File path: docs/sql-migration-guide.md
 ##########
 @@ -49,7 +49,7 @@ license: |
 
   - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.
 
-  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added, with the default value `false`, Spark will throw RuntimeException while duplicated keys are found. If set to `true`, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
 
 Review comment:
   I'm okay with the rule, @cloud-fan . At this time, could you add a guideline into our website? Actually, many correctness bug fixes in the migration guide have been `silent result changing` with a legacy fallback config.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583155092
 
 
   **[Test build #118005 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118005/testReport)** for PR 27478 at commit [`09cfb67`](https://github.com/apache/spark/commit/09cfb674f835015c9de15b24ea958f5fb47fb915).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583685459
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22818/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586863071
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23305/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-587010019
 
 
   thanks, merging to master/3.0!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586853038
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23301/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586864334
 
 
   Merged build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583705941
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118052/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376449824
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
 ##########
 @@ -2167,6 +2167,14 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY =
+    buildConf("spark.sql.deduplicateMapKey.lastWinsPolicy.enabled")
+      .doc("When true, use last wins policy to remove duplicated map keys in built-in functions, " +
+        "this config takes effect in below build-in functions: CreateMap, MapFromArrays, " +
+        "MapFromEntries, StringToMap, MapConcat and TransformKeys.")
 
 Review comment:
   Thanks, done in fe1fb0f.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586257479
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118423/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583209717
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118005/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586973680
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-587239996
 
 
   Thanks all for the review.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376449743
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala
 ##########
 @@ -63,6 +65,11 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Seria
       keys.append(key)
       values.append(value)
     } else {
+      if (!SQLConf.get.getConf(DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY)) {
 
 Review comment:
   Thanks, done in fe1fb0f

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-587000309
 
 
   **[Test build #118551 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118551/testReport)** for PR 27478 at commit [`feb8c0a`](https://github.com/apache/spark/commit/feb8c0aa28034c166f6413bd1eec58a885848192).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376239887
 
 

 ##########
 File path: docs/sql-migration-guide.md
 ##########
 @@ -49,7 +49,7 @@ license: |
 
   - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.
 
-  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added, with the default value `false`, Spark will throw RuntimeException while duplicated keys are found. If set to `true`, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
 
 Review comment:
   The principle is still no failure by default, but version upgrade is a different case. We should try our best to avoid "silent result changing".
   
   We may want to introduce a new category of configs, which is for migration-only, and should be removed in the next major version.
   
   Thoughts? @srowen @viirya @maropu @bart-samwel

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586834171
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23292/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376678023
 
 

 ##########
 File path: docs/sql-migration-guide.md
 ##########
 @@ -49,7 +49,7 @@ license: |
 
   - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.
 
-  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added, with the default value `false`, Spark will throw RuntimeException while duplicated keys are found. If set to `true`, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
 
 Review comment:
   Agree to have a guideline if we need the new rule, to separate the silent result changing is caused by the behavior change, not by the bug fixes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376252435
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala
 ##########
 @@ -63,6 +65,11 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Seria
       keys.append(key)
       values.append(value)
     } else {
+      if (!SQLConf.get.getConf(DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY)) {
 
 Review comment:
   can we read the config when `ArrayBasedMapBuilder` is constructed? e.g.
   ```
   private val xxx = SQLConf.get.getConf(DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY)
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586862667
 
 
   **[Test build #118551 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118551/testReport)** for PR 27478 at commit [`feb8c0a`](https://github.com/apache/spark/commit/feb8c0aa28034c166f6413bd1eec58a885848192).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r379802575
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
 ##########
 @@ -2167,6 +2167,16 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val LEGACY_ALLOW_DUPLICATED_MAP_KEY =
+    buildConf("spark.sql.legacy.allowDuplicatedMapKeys")
+      .doc("When true, use last wins policy to remove duplicated map keys in built-in functions, " +
+        "this config takes effect in below build-in functions: CreateMap, MapFromArrays, " +
+        "MapFromEntries, StringToMap, MapConcat and TransformKeys. Otherwise, if this is false, " +
+        "which is the default, Spark will throw an exception while duplicated map keys are " +
 
 Review comment:
   Thanks, done in b102c36.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586240967
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586587027
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118468/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583558604
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118040/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r380008658
 
 

 ##########
 File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
 ##########
 @@ -188,11 +188,13 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
           checkExampleSyntax(example)
           example.split("  > ").toList.foreach(_ match {
             case exampleRe(sql, output) =>
-              val df = clonedSpark.sql(sql)
-              val actual = unindentAndTrim(
-                hiveResultString(df.queryExecution.executedPlan).mkString("\n"))
-              val expected = unindentAndTrim(output)
-              assert(actual === expected)
+              withSQLConf(SQLConf.LEGACY_ALLOW_DUPLICATED_MAP_KEY.key -> "true") {
 
 Review comment:
   can you check the map-related builtin expressions? seems they use wrong examples that have duplicated keys.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r379389636
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
 ##########
 @@ -2167,6 +2167,16 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val LEGACY_ALLOW_DUPLICATED_MAP_KEY =
+    buildConf("spark.sql.legacy.allowDuplicatedMapKeys")
+      .doc("When true, use last wins policy to remove duplicated map keys in built-in functions, " +
+        "this config takes effect in below build-in functions: CreateMap, MapFromArrays, " +
+        "MapFromEntries, StringToMap, MapConcat and TransformKeys. Otherwise, if this is false, " +
+        "which is the default, Spark will throw an exception while duplicated map keys are " +
 
 Review comment:
   while -> when

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586973691
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23332/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583705939
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583685270
 
 
   **[Test build #118052 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118052/testReport)** for PR 27478 at commit [`66dc51f`](https://github.com/apache/spark/commit/66dc51f0bc0fb45f7c44ef72f26e1918e01d0212).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583209556
 
 
   **[Test build #118005 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118005/testReport)** for PR 27478 at commit [`09cfb67`](https://github.com/apache/spark/commit/09cfb674f835015c9de15b24ea958f5fb47fb915).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586970404
 
 
   Merged build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-587001486
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118551/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-587001477
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586970413
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118567/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586863059
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586864352
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118537/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376347541
 
 

 ##########
 File path: docs/sql-migration-guide.md
 ##########
 @@ -49,7 +49,7 @@ license: |
 
   - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.
 
-  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added, with the default value `false`, Spark will throw RuntimeException while duplicated keys are found. If set to `true`, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
 
 Review comment:
   The idea sounds fine.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586973336
 
 
   retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586861735
 
 
   retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583685454
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583155458
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583453774
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586909169
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376163612
 
 

 ##########
 File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
 ##########
 @@ -651,8 +651,10 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
       Row(null)
     )
 
-    checkAnswer(df1.selectExpr("map_concat(map1, map2)"), expected1a)
-    checkAnswer(df1.select(map_concat($"map1", $"map2")), expected1a)
+    withSQLConf(SQLConf.DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY.key -> "true") {
 
 Review comment:
   Can we have one test that checks the exception when this configuration is false?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-587120130
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118578/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583558604
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118040/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583705844
 
 
   **[Test build #118052 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118052/testReport)** for PR 27478 at commit [`66dc51f`](https://github.com/apache/spark/commit/66dc51f0bc0fb45f7c44ef72f26e1918e01d0212).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586853038
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23301/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583453774
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586833788
 
 
   **[Test build #118537 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118537/testReport)** for PR 27478 at commit [`905ae96`](https://github.com/apache/spark/commit/905ae96e60f43788fded374e9e7a597c2d973f40).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586257473
 
 
   Merged build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586257413
 
 
   **[Test build #118423 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118423/testReport)** for PR 27478 at commit [`f622180`](https://github.com/apache/spark/commit/f622180e40ed984d593bc95ad2a1b7ad96f2af1c).
    * This patch **fails to generate documentation**.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586568228
 
 
   **[Test build #118468 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118468/testReport)** for PR 27478 at commit [`b102c36`](https://github.com/apache/spark/commit/b102c361bd800f3183db029da42a069201b9f39d).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-582936565
 
 
   **[Test build #117998 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117998/testReport)** for PR 27478 at commit [`09cfb67`](https://github.com/apache/spark/commit/09cfb674f835015c9de15b24ea958f5fb47fb915).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586568444
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586909181
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23320/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-587119453
 
 
   **[Test build #118578 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118578/testReport)** for PR 27478 at commit [`feb8c0a`](https://github.com/apache/spark/commit/feb8c0aa28034c166f6413bd1eec58a885848192).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583155463
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22770/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-585039004
 
 
   I searched "deduplicate map key" and there is no result.
   Since there will be runtime exception on duplicated keys, how about renaming the config to `spark.sql.allowDuplicatedMapKeys.enabled`?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-587120123
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586973680
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-587001486
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118551/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586912213
 
 
   **[Test build #118567 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118567/testReport)** for PR 27478 at commit [`feb8c0a`](https://github.com/apache/spark/commit/feb8c0aa28034c166f6413bd1eec58a885848192).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586859661
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118547/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586862667
 
 
   **[Test build #118551 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118551/testReport)** for PR 27478 at commit [`feb8c0a`](https://github.com/apache/spark/commit/feb8c0aa28034c166f6413bd1eec58a885848192).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583209707
 
 
   Merged build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583209707
 
 
   Merged build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376252692
 
 

 ##########
 File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
 ##########
 @@ -651,8 +651,10 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
       Row(null)
     )
 
-    checkAnswer(df1.selectExpr("map_concat(map1, map2)"), expected1a)
-    checkAnswer(df1.select(map_concat($"map1", $"map2")), expected1a)
+    withSQLConf(SQLConf.DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY.key -> "true") {
 
 Review comment:
   +1

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586859661
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118547/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586834171
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23292/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586863071
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23305/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-587001477
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-587120123
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-583050774
 
 
   Merged build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27478: [SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#issuecomment-586970404
 
 
   Merged build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org