You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/15 01:22:01 UTC

[GitHub] [spark] maropu opened a new pull request #28830: [SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates

maropu opened a new pull request #28830:
URL: https://github.com/apache/spark/pull/28830


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   This PR intends to provide a hot-fix for a bug in `Dataset.dropDuplicates`; we must preserve the input order of `colNames` for `groupCols` because the Streaming's state store depends on the `groupCols` order.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   Bug fix.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   No.
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   Added tests in `DataFrameSuite`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643878606






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643880332


   +1 to partial revert which should be also OK with author. (I guess it was applied simply by pattern, and it wasn’t for some outstanding improvement, so no problem for author as well.)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #28830: [SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439904040



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2541,7 +2542,20 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must preserve the input order of `colNames` because of the compatibility
+    // issue (the Streaming's state store depends on the `groupCols` order).
+    val orderPreservingDistinctColNames = {
+      val nameSeen = mutable.Set[String]()

Review comment:
       Ah, I see and it looks the same: https://github.com/scala/scala/blob/2.12.x/src/library/scala/collection/SeqLike.scala#L504-L523 But, (I'm not a Scala compiler expert though),  the `distinct` scala API makes sure an input order is preserved? I read [the @HeartSaVioR comment ](https://issues.apache.org/jira/browse/SPARK-31990?focusedCommentId=17135286&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17135286) in the Jira, then I'm not sure about that. So, I wrote this method to preserve the order explicitly for safeguard.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28830: [SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439904837



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2541,7 +2542,20 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must preserve the input order of `colNames` because of the compatibility
+    // issue (the Streaming's state store depends on the `groupCols` order).
+    val orderPreservingDistinctColNames = {
+      val nameSeen = mutable.Set[String]()

Review comment:
       So consider this as manual implementation of distinct so that we don't even get affected by Scala changes.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28830: [SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643870176


   Let's use `[SS]` instead as it's specific to SS issue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643879180


   okay, I'll revert that part in this PR first.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643888818


   How we plan to consolidate both? How we will write JIRA title/description and PR title/description? Which is the type of the consolidated issue? Is the consolidated issue a blocker?
   
   Things would be simpler if we merge the partial fix as it is, and spend our efforts to discuss how to guide known issue - this is one of candidates for Spark 3.0.0. This is clearly a bugfix which is a "blocker" preventing some of end users migrate to Spark 3.0.0. Sure, this may need to be placed on migration guide or release note as well.
   
   It'd be no harm for #28707 to wait for this patch to be merged, and rebase to fix the test failure.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #28830: [SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439900371



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
##########
@@ -2548,6 +2548,21 @@ class DataFrameSuite extends QueryTest
       assert(df.schema === new StructType().add(StructField("d", DecimalType(38, 0))))
     }
   }
+
+  test("SPARK-31990: preserves the input order of colNames in dropDuplicates") {
+    val df = Seq((1, 2, 3, 4, 5), (1, 2, 3, 4, 5)).toDF("c", "e", "d", "a", "b")
+    val inputColNames = Seq("c", "b", "c", "d", "b", "c", "b")

Review comment:
       I just couldn't find a test query to reproduce the issue described in the JIRA. I think @xuanyuanking can provide it because he find this issue...




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28830: [SPARK-31990][SQL][SS] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439905543



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2541,7 +2542,20 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must preserve the input order of `colNames` because of the compatibility
+    // issue (the Streaming's state store depends on the `groupCols` order).
+    val orderPreservingDistinctColNames = {
+      val nameSeen = mutable.Set[String]()

Review comment:
       Oh I didn't see @maropu 's comment while I'm commenting. ;) Thanks for explaining.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gatorsmile commented on pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
gatorsmile commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643879059


   Yes. I prefer to reverting the original fix in 3.0.1. and then discuss how to solve/avoid the problems in a proper way. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28830: [SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439903851



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
##########
@@ -2548,6 +2548,21 @@ class DataFrameSuite extends QueryTest
       assert(df.schema === new StructType().add(StructField("d", DecimalType(38, 0))))
     }
   }
+
+  test("SPARK-31990: preserves the input order of colNames in dropDuplicates") {
+    val df = Seq((1, 2, 3, 4, 5), (1, 2, 3, 4, 5)).toDF("c", "e", "d", "a", "b")
+    val inputColNames = Seq("c", "b", "c", "d", "b", "c", "b")

Review comment:
       You may want to find the case where `toSet.toSeq` produces different order than `distinct`.
   
   And then you may need to craft a streaming query doing dropDuplicate with having grouping key just you found, and run the query with Spark 2.4.x (where SPARK-31292 is not applied), and store checkpoint, and run the query with Spark 3.x (where SPARK-31292 is applied) with restoring checkpoint.
   
   There're a couple of test suites doing that - one of example is https://github.com/apache/spark/blob/8282bbf12d4e174986a649023ce3984aae7d7755/sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsWithStateSuite.scala#L880-L952




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28830: [SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643855612


   **[Test build #124020 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124020/testReport)** for PR 28830 at commit [`3e02ad6`](https://github.com/apache/spark/commit/3e02ad602423b0ff94209f4971d2956050c3c9be).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643949392






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643944683


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124027/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643855612






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643948852


   **[Test build #124036 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124036/testReport)** for PR 28830 at commit [`7546ba4`](https://github.com/apache/spark/commit/7546ba4eebeee480d9a2ff8b948e900cd6023dfc).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #28830:
URL: https://github.com/apache/spark/pull/28830


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643945423






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643878606


   Hi, All.
   This issue is marked as a hotfix for the blocker issue, but the validation of this issue looks non-trivial. Since `toSet.toSeq` is used since Apache Spark 2.2.0 (SPARK-19497) and SPARK-31292 is just an `Improvement` with `Trivial` issue. I'd like to propose to revert SPARK-31292 from `branch-3.0` first? We will keep SPARK-31292 in `master` branch still and proceed this PR to find a better way for Apache Spark 3.1.0.
   
   I know that the reverting is not a good solution for the original author as mentioned by @HeartSaVioR in the dev mailing list, but I believe that is the proper way in this case to cut Apache Spark 3.0.1. How do you think about that?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #28830: [SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643856341


   cc: @xuanyuanking @HeartSaVioR @srowen @gatorsmile 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28830: [SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439904675



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2541,7 +2542,20 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must preserve the input order of `colNames` because of the compatibility
+    // issue (the Streaming's state store depends on the `groupCols` order).
+    val orderPreservingDistinctColNames = {
+      val nameSeen = mutable.Set[String]()

Review comment:
       I think Scala wouldn't change the implementation unless there's strong reason to do so, but in Scala 2.13 they removed the orderness guarantee on the method doc in `distinct`. Till Scala 2.12 it clearly explains the return value as preserving the "first occurrence", and the explanation no longer exists in Scala 2.13.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439956403



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2541,7 +2541,9 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must keep `toSet.toSeq` here because of the backward compatibility issue
+    // (the Streaming's state store depends on the `groupCols` order).
+    val groupCols = colNames.toSet.toSeq.flatMap { (colName: String) =>

Review comment:
       I think this is good for now.
   
   In the future, this may still be broken by Scala version upgrade, and hopefully @xuanyuanking 's unsafe row validation can detect it. Then we can change it and use a deterministic order.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28830: [SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439903851



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
##########
@@ -2548,6 +2548,21 @@ class DataFrameSuite extends QueryTest
       assert(df.schema === new StructType().add(StructField("d", DecimalType(38, 0))))
     }
   }
+
+  test("SPARK-31990: preserves the input order of colNames in dropDuplicates") {
+    val df = Seq((1, 2, 3, 4, 5), (1, 2, 3, 4, 5)).toDF("c", "e", "d", "a", "b")
+    val inputColNames = Seq("c", "b", "c", "d", "b", "c", "b")

Review comment:
       You may want to find the case where `toSet.toSeq` produces different order than `distinct`. (I guess you're struggling to find the case, then yeah better to wait for @xuanyuanking )
   
   And then you may need to craft a streaming query doing dropDuplicate with having grouping key just you found, and run the query with Spark 2.4.x (where SPARK-31292 is not applied), and store checkpoint, and run the query with Spark 3.x (where SPARK-31292 is applied) with restoring checkpoint.
   
   There're a couple of test suites doing that - one of example is https://github.com/apache/spark/blob/8282bbf12d4e174986a649023ce3984aae7d7755/sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsWithStateSuite.scala#L880-L952




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643888818


   How we plan to consolidate both? How we will write JIRA title/description and PR title/description? Which is the type of the consolidated issue? Is the consolidated issue a blocker?
   
   Things would be simpler if we merge the partial fix as it is, and spend our efforts to discuss how to guide known issues - this is one of candidates for Spark 3.0.0. This is clearly a bugfix which is a "blocker" preventing some of end users migrate to Spark 3.0.0. Sure, this may need to be placed on migration guide or release note as well.
   
   It'd be no harm for #28707 to wait for this patch to be merged, and rebase to fix the test failure.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28830: [SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439901669



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2541,7 +2542,20 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must preserve the input order of `colNames` because of the compatibility
+    // issue (the Streaming's state store depends on the `groupCols` order).
+    val orderPreservingDistinctColNames = {
+      val nameSeen = mutable.Set[String]()

Review comment:
       This looks logically the same with Scala's `distinct` implementation. Is there a change?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643916855


   ```
   How we plan to consolidate both? How we will write JIRA title/description and PR title/description? Which is the type of the consolidated issue? Is the consolidated issue a blocker?
   ```
   Here's my plan to consolidate both: https://github.com/apache/spark/pull/28707#issuecomment-643916110, this will also comment in JIRA & PR description.
   Yes, #28707 is blocking by this fix.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-644133430






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643950413


   > Btw now we know it is broken in Spark 3.0.0, and we will fix it again in Spark 3.0.1.
   
   I think we should list it as a known issue of 3.0.0, and release 3.0.1 soon.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643888818


   How we plan to consolidate both? How we will write JIRA title/description and PR title/description? Which is the type of the consolidated issue? Is the consolidated issue a blocker?
   
   Things would be simpler if we merge the partial fix as it is, and spend our efforts to discuss how to guide known issues - this is one of candidates for Spark 3.0.0. This is clearly a bugfix which is a "blocker" preventing some of end users migrate to Spark 3.0.0, worth to have its own JIRA issue, and also commit. Sure, this may need to be placed on migration guide or release note as well.
   
   It'd be no harm for #28707 to wait for this patch to be merged, and rebase to fix the test failure.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643942982


   Btw now we know it is broken in Spark 3.0.0, and we will fix it again in Spark 3.0.1. Do we have some best practice to follow on guiding such change to end users?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643944675






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439907916



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
##########
@@ -2548,6 +2548,21 @@ class DataFrameSuite extends QueryTest
       assert(df.schema === new StructType().add(StructField("d", DecimalType(38, 0))))
     }
   }
+
+  test("SPARK-31990: preserves the input order of colNames in dropDuplicates") {
+    val df = Seq((1, 2, 3, 4, 5), (1, 2, 3, 4, 5)).toDF("c", "e", "d", "a", "b")
+    val inputColNames = Seq("c", "b", "c", "d", "b", "c", "b")

Review comment:
       BTW, @HeartSaVioR . Is there a test case failure using the same Spark version checkpointing? I'm curious if this only occurs between different Spark versions.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643929047


   The last commit is to trying to preserve the previous behavior (whatever it was) since Apache Spark 2.2.0 although there is no guarantee which it safe or not. We will revisit the correct issue later after 3.0.1.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643880332


   +1 to partial revert which should be also OK with author. (I guess it was applied simply by pattern, and it wasn’t for some intended improvement, so no problem for author as well.)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28830: [SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643855929






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-644133430






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643888818


   How we plan to consolidate both? How we will write JIRA title/description and PR title/description? Which is the type of the consolidated issue? Is the consolidated issue a blocker?
   
   Things would be simple if we merge the partial fix as it is, and spend our efforts to discuss how to guide known issue - this is one of candidates for Spark 3.0.0. This is clearly a bugfix which is a "blocker" preventing some of end users migrate to Spark 3.0.0.
   
   It'd be no harm for #28707 to wait for this patch to be merged, and rebase to fix the test failure.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r440013368



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2541,7 +2541,9 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must keep `toSet.toSeq` here because of the backward compatibility issue
+    // (the Streaming's state store depends on the `groupCols` order).
+    val groupCols = colNames.toSet.toSeq.flatMap { (colName: String) =>

Review comment:
       Ah, that was fixed in a way by adding an env variable. That case also was specific to Python 2 which is deprecated now so it's rather a corner case.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439965525



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2541,7 +2541,9 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must keep `toSet.toSeq` here because of the backward compatibility issue
+    // (the Streaming's state store depends on the `groupCols` order).
+    val groupCols = colNames.toSet.toSeq.flatMap { (colName: String) =>

Review comment:
       Worth noting that we need to have "concrete" solution eventually - if columns are all having same type neither #28830 nor #24173 catch the change and the result becomes silently incorrect. I roughly remember the similar issue on pyspark, which was trying to fix the issue on order vs name, don't remember how it ended up. cc. @HyukjinKwon 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-644181830


   Thank you all. Merged to master/3.0.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643882434






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643878606


   Hi, All.
   This issue is marked as a hotfix for the blocker issue, but the validation of this issue looks non-trivial. Since `toSet.toSeq` is used since Apache Spark 2.2.0 (SPARK-19497) and SPARK-31292 is just an `Improvement` issue with `Trivial` priority. I'd like to propose to revert SPARK-31292 from `branch-3.0` first. We will keep SPARK-31292 in `master` branch still and proceed this @maropu 's PR to find a better way for Apache Spark 3.1.0.
   
   I know that the reverting is not a good solution for the original author as mentioned by @HeartSaVioR in the dev mailing list, but I believe that is the proper way in this case to cut Apache Spark 3.0.1. How do you think about that?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643927909


   I am okay to revert it for now but I couldn't fully follow why we expect an explicit order from a set. Has it been ever guaranteed somewhere? Using `distinct`, we can expect the deterministic order but we're reverting back to using a set because of the deterministic order (?).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643945431


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124020/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #28830: [SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643861808


   Thank you for making a swift fix, @maropu .
   cc @dbtsai and @holdenk 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439965525



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2541,7 +2541,9 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must keep `toSet.toSeq` here because of the backward compatibility issue
+    // (the Streaming's state store depends on the `groupCols` order).
+    val groupCols = colNames.toSet.toSeq.flatMap { (colName: String) =>

Review comment:
       Worth noting that we need to have "concrete" solution eventually - if columns are all having same type neither #28830 nor #24173 catch the change. I roughly remember the similar issue on pyspark, which was trying to fix the issue on order vs name, don't remember how it ended up. cc. @HyukjinKwon 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643948852


   **[Test build #124036 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124036/testReport)** for PR 28830 at commit [`7546ba4`](https://github.com/apache/spark/commit/7546ba4eebeee480d9a2ff8b948e900cd6023dfc).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28830: [SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439896619



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
##########
@@ -2548,6 +2548,21 @@ class DataFrameSuite extends QueryTest
       assert(df.schema === new StructType().add(StructField("d", DecimalType(38, 0))))
     }
   }
+
+  test("SPARK-31990: preserves the input order of colNames in dropDuplicates") {
+    val df = Seq((1, 2, 3, 4, 5), (1, 2, 3, 4, 5)).toDF("c", "e", "d", "a", "b")
+    val inputColNames = Seq("c", "b", "c", "d", "b", "c", "b")

Review comment:
       ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-644130432


   **[Test build #124036 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124036/testReport)** for PR 28830 at commit [`7546ba4`](https://github.com/apache/spark/commit/7546ba4eebeee480d9a2ff8b948e900cd6023dfc).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643944319






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643888818


   How we plan to consolidate both? How we will write JIRA title/description and PR title/description? Which is the type of the consolidated issue? Is the consolidated issue a blocker?
   
   Things would be simpler if we merge the partial fix as it is, and spend our efforts to discuss how to guide known issue - this is one of candidates for Spark 3.0.0. This is clearly a bugfix which is a "blocker" preventing some of end users migrate to Spark 3.0.0.
   
   It'd be no harm for #28707 to wait for this patch to be merged, and rebase to fix the test failure.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439910372



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
##########
@@ -2548,6 +2548,21 @@ class DataFrameSuite extends QueryTest
       assert(df.schema === new StructType().add(StructField("d", DecimalType(38, 0))))
     }
   }
+
+  test("SPARK-31990: preserves the input order of colNames in dropDuplicates") {
+    val df = Seq((1, 2, 3, 4, 5), (1, 2, 3, 4, 5)).toDF("c", "e", "d", "a", "b")
+    val inputColNames = Seq("c", "b", "c", "d", "b", "c", "b")

Review comment:
       Thanks for adding a new UT here. Since this issue was found when investigating the test failure in https://github.com/apache/spark/pull/28707#issuecomment-639861273,
   how about reusing the UT `default config of includeHeader doesn't break existing query from Spark 2.4` in `KafkaMicroBatchV2SourceSuite`? I think we don't need to add a new UT for this regression after #28707. That is to say after #28707 is merged, if we don't do the fix, the mentioned UT will fail.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2541,7 +2542,20 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must preserve the input order of `colNames` because of the compatibility
+    // issue (the Streaming's state store depends on the `groupCols` order).
+    val orderPreservingDistinctColNames = {
+      val nameSeen = mutable.Set[String]()

Review comment:
       How about simply revert this line to https://github.com/apache/spark/pull/28062/files#diff-7a46f10c3cedbf013cf255564d9483cdL2458, use the original implementation of `toSet`.
   Yes, the `toSet.toSeq` might incompatible during to Scala version, but I think the current fix should just keep the original order. How to detect the order changing and have solid validation should be the work of SPARK-31894 and SPARK-27237. WDYT?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643944675


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643885408


   > Thanks for the quick fix @maropu! I think maybe we can simplify the bugfix by combining it together with #28707. WDYT? I'll also reference this PR with #28707.
   
   @xuanyuanking yea, looks fine to me. Could you takes this over? Thanks, anyway!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643880347


   Yep, I think just revert that part is good enough. I will give more context and details on #28707. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643880429


   Ya. +1 for partial revert in this PR.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643945423


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439907052



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2541,7 +2542,20 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must preserve the input order of `colNames` because of the compatibility
+    // issue (the Streaming's state store depends on the `groupCols` order).
+    val orderPreservingDistinctColNames = {
+      val nameSeen = mutable.Set[String]()

Review comment:
       The reported issue claims that Scala `distinct` function was not enough. That's the reason why I asked that `What is the difference to fix Spark issue`.
   
   As @HeartSaVioR 's commented (https://github.com/apache/spark/pull/28830#discussion_r439904302), we need a different code.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-644195800


   Thanks, all!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439956403



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2541,7 +2541,9 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must keep `toSet.toSeq` here because of the backward compatibility issue
+    // (the Streaming's state store depends on the `groupCols` order).
+    val groupCols = colNames.toSet.toSeq.flatMap { (colName: String) =>

Review comment:
       I think this is good for now.
   
   In the future, this may still be broken by Scala version upgrade, and hopefully @xuanyuanking 's unsafe row validation can detect it. Then we can change it and use a deterministic order, as it will be broken anyway.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28830: [SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439904302



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
##########
@@ -2548,6 +2548,21 @@ class DataFrameSuite extends QueryTest
       assert(df.schema === new StructType().add(StructField("d", DecimalType(38, 0))))
     }
   }
+
+  test("SPARK-31990: preserves the input order of colNames in dropDuplicates") {
+    val df = Seq((1, 2, 3, 4, 5), (1, 2, 3, 4, 5)).toDF("c", "e", "d", "a", "b")
+    val inputColNames = Seq("c", "b", "c", "d", "b", "c", "b")

Review comment:
       Technically, as you're implementing the same as `distinct` does, it doesn't fix the issue for restoring from old checkpoint. Both Spark 3.x and your patch will fail. That's what I commented in JIRA - "do the right fix" vs "leave it as it is".




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439907052



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2541,7 +2542,20 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must preserve the input order of `colNames` because of the compatibility
+    // issue (the Streaming's state store depends on the `groupCols` order).
+    val orderPreservingDistinctColNames = {
+      val nameSeen = mutable.Set[String]()

Review comment:
       The reported issue claims that Scala `distinct` function was not enough. That's the reason why I asked that `Is there a change?` to fix Spark issue.
   
   As @HeartSaVioR 's commented (https://github.com/apache/spark/pull/28830#discussion_r439904302), we need a different code.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643930058


   I am okay with that; I was just wondering even the previous behaviour was deterministic or not, and the current change looked righter to me. Given that we're going to revisit anyway, LGTM from me too.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #28830: [SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439892041



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
##########
@@ -2548,6 +2548,21 @@ class DataFrameSuite extends QueryTest
       assert(df.schema === new StructType().add(StructField("d", DecimalType(38, 0))))
     }
   }
+
+  test("SPARK-31990: preserves the input order of colNames in dropDuplicates") {
+    val df = Seq((1, 2, 3, 4, 5), (1, 2, 3, 4, 5)).toDF("c", "e", "d", "a", "b")
+    val inputColNames = Seq("c", "b", "c", "d", "b", "c", "b")

Review comment:
       @xuanyuanking Do you have a test input seq to reproduce this issue? I checked if this test suite could fail w/o the hot-fix, but it passed...




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on pull request #28830: [SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643863071


   Yes, this incompatible bug is found by a WIP validation logic. I will reply
   the details and reference the PR soon.
   
   Dongjoon Hyun <no...@github.com>于2020年6月15日 周一09:54写道:
   
   > Thank you for making a swift fix, @maropu <https://github.com/maropu> .
   > cc @dbtsai <https://github.com/dbtsai> and @holdenk
   > <https://github.com/holdenk>
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/spark/pull/28830#issuecomment-643861808>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/ABE4DZNXKFRD2FJJRKBWICDRWV5MJANCNFSM4N5W4OKA>
   > .
   >
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #28830: [SPARK-31990][SS] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439907906



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2541,7 +2542,20 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must preserve the input order of `colNames` because of the compatibility
+    // issue (the Streaming's state store depends on the `groupCols` order).
+    val orderPreservingDistinctColNames = {
+      val nameSeen = mutable.Set[String]()

Review comment:
       Ah, I see. Yea, I'll update the code based on `toSeq.toSeq`.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2541,7 +2542,20 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must preserve the input order of `colNames` because of the compatibility
+    // issue (the Streaming's state store depends on the `groupCols` order).
+    val orderPreservingDistinctColNames = {
+      val nameSeen = mutable.Set[String]()

Review comment:
       Ah, I see. Yea, I'll update the code based on `toSet.toSeq`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28830: [SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643855929






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643929047






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643882321


   **[Test build #124027 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124027/testReport)** for PR 28830 at commit [`7546ba4`](https://github.com/apache/spark/commit/7546ba4eebeee480d9a2ff8b948e900cd6023dfc).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #28830: [SPARK-31990][SQL] Preserves the input order of colNames in dropDuplicates

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643861808


   Thank you for making a fix swiftly, @maropu .
   cc @dbtsai and @holdenk 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643949392






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #28830:
URL: https://github.com/apache/spark/pull/28830#discussion_r439959045



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2541,7 +2541,9 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.distinct.flatMap { (colName: String) =>
+    // SPARK-31990: We must keep `toSet.toSeq` here because of the backward compatibility issue
+    // (the Streaming's state store depends on the `groupCols` order).
+    val groupCols = colNames.toSet.toSeq.flatMap { (colName: String) =>

Review comment:
       Yep, I also mentioned this at https://github.com/apache/spark/pull/28830#discussion_r439909489, we can relay on the validation checking and integrated tests.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643882434






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643888818


   How we plan to consolidate both? How we will write JIRA title/description and PR title/description? Which is the type of the consolidated issue? Is the consolidated issue a blocker?
   
   Things would be simpler if we merge the partial revert as it is, and spend our efforts to discuss how to guide known issues - this is one of candidates for Spark 3.0.0. This is clearly a bugfix which is a "blocker" preventing some of end users migrate to Spark 3.0.0, worth to have its own JIRA issue, and also commit. Sure, this may need to be placed on migration guide or release note as well.
   
   It'd be no harm for #28707 to wait for this patch to be merged, and rebase to fix the test failure.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643946468


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon edited a comment on pull request #28830: [SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #28830:
URL: https://github.com/apache/spark/pull/28830#issuecomment-643930058


   I am okay with that; I was just wondering even the previous behaviour was deterministic or not, and SPARK-31292 looked righter to me. Given that we're going to revisit anyway, LGTM from me too.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org