Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/09/10 09:01:20 UTC

[GitHub] [spark] AngersZhuuuu opened a new pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

AngersZhuuuu opened a new pull request #33955:
URL: https://github.com/apache/spark/pull/33955


   ### What changes were proposed in this pull request?
   For the query
   ```
   select array_union(array(cast('nan' as double), cast('nan' as double)), array())
   ```
   This returns [NaN, NaN], but it should return [NaN].
   The issue is that `OpenHashSet` cannot handle `Double.NaN` and `Float.NaN` either.
   This PR adds a wrapper around `OpenHashSet` that handles `null`, `Double.NaN`, and `Float.NaN` together.
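
To make the failure mode concrete, here is a minimal Scala sketch of NaN-aware and null-aware de-duplication. It is not Spark's `SQLOpenHashSet`: the object `NaNAwareUnion`, the method `unionDistinct`, and the use of `Option[Double]` (with `None` standing in for SQL `NULL`) are assumptions made purely for illustration. It relies only on the IEEE 754 rule that `NaN == NaN` is false, which is why NaN is tracked with a separate flag instead of an equality-based lookup.

```scala
import scala.collection.mutable

// Illustrative sketch only; names here are hypothetical and not part of Spark.
object NaNAwareUnion {
  // Under IEEE 754 primitive comparison, NaN is not equal to itself.
  val nanEqualsItself: Boolean = Double.NaN == Double.NaN // false

  // Union of two sequences that keeps at most one NULL (None) and at most one NaN.
  def unionDistinct(a: Seq[Option[Double]], b: Seq[Option[Double]]): Seq[Option[Double]] = {
    val out = mutable.ArrayBuffer.empty[Option[Double]]
    val seen = mutable.HashSet.empty[Double] // ordinary (non-NaN) values seen so far
    var sawNull = false                      // tracks NULL outside the hash set
    var sawNaN = false                       // tracks NaN outside the hash set

    (a ++ b).foreach {
      case None =>
        if (!sawNull) { sawNull = true; out += None }
      case Some(d) if d.isNaN =>
        // Keep the first NaN encountered, skip every later one.
        if (!sawNaN) { sawNaN = true; out += Some(d) }
      case Some(d) =>
        // HashSet.add returns true only when the value was not present yet.
        if (seen.add(d)) out += Some(d)
    }
    out.toSeq
  }
}
```

With this sketch, `NaNAwareUnion.unionDistinct(Seq(Some(Double.NaN), Some(Double.NaN)), Seq.empty)` yields a single element, which matches the `[NaN]` result expected for the query.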
   
   
   ### Why are the changes needed?
   Bug fix: `array_union` returned duplicated `NaN` values.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. `ArrayUnion` no longer returns duplicated `NaN` values.
   
   
   ### How was this patch tested?
   Added unit tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r709284232



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
##########
@@ -3575,24 +3576,31 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
     if (TypeUtils.typeWithProperEquals(elementType)) {
       (array1, array2) =>
         val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
-        val hs = new OpenHashSet[Any]
-        var foundNullElement = false
+        val hs = new SQLOpenHashSet[Any]()
+        val isNaN = SQLOpenHashSet.isNaN(elementType)
         Seq(array1, array2).foreach { array =>
           var i = 0
           while (i < array.numElements()) {
             if (array.isNullAt(i)) {
-              if (!foundNullElement) {
+              if (!hs.containsNull) {
+                hs.addNull
                 arrayBuffer += null
-                foundNullElement = true
               }
             } else {
               val elem = array.get(i, elementType)
-              if (!hs.contains(elem)) {
-                if (arrayBuffer.size > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
-                  ArrayBinaryLike.throwUnionLengthOverflowException(arrayBuffer.size)
+              if (isNaN(elem)) {
+                if (!hs.containsNaN) {
+                  arrayBuffer += elem

Review comment:
       Thanks, @cloud-fan and @AngersZhuuuu .






[GitHub] [spark] cloud-fan commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-920111484


   I fetched the latest master and the test passed on my side.




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918797641


   **[Test build #143239 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143239/testReport)** for PR 33955 at commit [`fe407c9`](https://github.com/apache/spark/commit/fe407c9325716f6ba8fd637e05e54dd208c8ab69).




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918795424


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47737/
   




[GitHub] [spark] cloud-fan closed pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #33955:
URL: https://github.com/apache/spark/pull/33955


   




[GitHub] [spark] SparkQA removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917791275


   **[Test build #143182 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143182/testReport)** for PR 33955 at commit [`119679c`](https://github.com/apache/spark/commit/119679cfc5884928d9fa368f683689a214d01912).




[GitHub] [spark] HyukjinKwon commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-921518353


   I still see this test failure, see https://github.com/apache/spark/runs/3628995384. Shall we revert this PR?




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917969856


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47695/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917415029


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47669/
   




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918374932


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47715/
   




[GitHub] [spark] SparkQA removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917993869


   **[Test build #143199 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143199/testReport)** for PR 33955 at commit [`991fddd`](https://github.com/apache/spark/commit/991fddd22d80a9e7e946ba679c9582fc14a33ba6).




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917189917


   **[Test build #143152 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143152/testReport)** for PR 33955 at commit [`1857988`](https://github.com/apache/spark/commit/18579884948898f9a9f6e15046fd807a2d294f7e).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AngersZhuuuu commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r708815457



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
##########
@@ -3575,24 +3576,31 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
     if (TypeUtils.typeWithProperEquals(elementType)) {
       (array1, array2) =>
         val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
-        val hs = new OpenHashSet[Any]
-        var foundNullElement = false
+        val hs = new SQLOpenHashSet[Any]()
+        val isNaN = SQLOpenHashSet.isNaN(elementType)
         Seq(array1, array2).foreach { array =>
           var i = 0
           while (i < array.numElements()) {
             if (array.isNullAt(i)) {
-              if (!foundNullElement) {
+              if (!hs.containsNull) {
+                hs.addNull
                 arrayBuffer += null
-                foundNullElement = true
               }
             } else {
               val elem = array.get(i, elementType)
-              if (!hs.contains(elem)) {
-                if (arrayBuffer.size > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
-                  ArrayBinaryLike.throwUnionLengthOverflowException(arrayBuffer.size)
+              if (isNaN(elem)) {
+                if (!hs.containsNaN) {
+                  arrayBuffer += elem

Review comment:
       LGTM






[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917814704


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47688/
   




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917814737


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47688/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918221136


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143199/
   




[GitHub] [spark] HyukjinKwon commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-921519621


   Actually there are some more: https://github.com/apache/spark/runs/3619357249




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917969856


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47695/
   




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918340039


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47714/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918016405








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r708570515



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
##########
@@ -3575,24 +3576,31 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
     if (TypeUtils.typeWithProperEquals(elementType)) {
       (array1, array2) =>
         val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
-        val hs = new OpenHashSet[Any]
-        var foundNullElement = false
+        val hs = new SQLOpenHashSet[Any]()
+        val isNaN = SQLOpenHashSet.isNaN(elementType)
         Seq(array1, array2).foreach { array =>
           var i = 0
           while (i < array.numElements()) {
             if (array.isNullAt(i)) {
-              if (!foundNullElement) {
+              if (!hs.containsNull) {
+                hs.addNull
                 arrayBuffer += null
-                foundNullElement = true
               }
             } else {
               val elem = array.get(i, elementType)
-              if (!hs.contains(elem)) {
-                if (arrayBuffer.size > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
-                  ArrayBinaryLike.throwUnionLengthOverflowException(arrayBuffer.size)
+              if (isNaN(elem)) {
+                if (!hs.containsNaN) {
+                  arrayBuffer += elem

Review comment:
       BTW, there are multiple `NaN` values whose bytes differ from those of `Double.NaN`. So the new semantics add the first `NaN` value encountered to the result, right?
   
   @cloud-fan and @AngersZhuuuu, do we need to normalize the NaN value by always adding `Double.NaN` or `Float.NaN`?
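
The point about differing NaN encodings can be illustrated with a small Scala sketch. The payload `0x7ff8000000000001L`, the object `NaNNormalization`, and the helper `normalizeNaN` are hypothetical choices for illustration and are not code from this PR; only the `java.lang.Double` bit-conversion methods are standard JDK API.

```scala
// Illustrative sketch only; names here are hypothetical and not part of Spark.
object NaNNormalization {
  // A quiet NaN whose payload differs from the canonical Double.NaN.
  val oddNaN: Double = java.lang.Double.longBitsToDouble(0x7ff8000000000001L)

  // Raw bit pattern of a double, without collapsing NaNs to one encoding.
  def rawBits(d: Double): Long = java.lang.Double.doubleToRawLongBits(d)

  // Map every NaN to the canonical Double.NaN; leave all other values untouched.
  def normalizeNaN(d: Double): Double = if (d.isNaN) Double.NaN else d
}
```

If the first NaN encountered is appended as-is, the result could carry a payload such as `oddNaN`; appending `normalizeNaN(elem)` instead would always yield the bit pattern of `Double.NaN`. The same idea would apply to `Float` via `java.lang.Float.floatToRawIntBits`.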






[GitHub] [spark] cloud-fan commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r708796039



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
##########
@@ -3575,24 +3576,31 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
     if (TypeUtils.typeWithProperEquals(elementType)) {
       (array1, array2) =>
         val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
-        val hs = new OpenHashSet[Any]
-        var foundNullElement = false
+        val hs = new SQLOpenHashSet[Any]()
+        val isNaN = SQLOpenHashSet.isNaN(elementType)
         Seq(array1, array2).foreach { array =>
           var i = 0
           while (i < array.numElements()) {
             if (array.isNullAt(i)) {
-              if (!foundNullElement) {
+              if (!hs.containsNull) {
+                hs.addNull
                 arrayBuffer += null
-                foundNullElement = true
               }
             } else {
               val elem = array.get(i, elementType)
-              if (!hs.contains(elem)) {
-                if (arrayBuffer.size > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
-                  ArrayBinaryLike.throwUnionLengthOverflowException(arrayBuffer.size)
+              if (isNaN(elem)) {
+                if (!hs.containsNaN) {
+                  arrayBuffer += elem

Review comment:
       good point!






[GitHub] [spark] SparkQA removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-916998634


   **[Test build #143152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143152/testReport)** for PR 33955 at commit [`1857988`](https://github.com/apache/spark/commit/18579884948898f9a9f6e15046fd807a2d294f7e).




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917792072


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47685/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917835861


   **[Test build #143188 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143188/testReport)** for PR 33955 at commit [`8d0e4a9`](https://github.com/apache/spark/commit/8d0e4a9cbf51cebdaebd5f52303e785dd69da31b).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918828436


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47742/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918824327


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47741/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918893073


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47749/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r707139228



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
##########
@@ -3679,22 +3686,44 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
             body
           }
 
+        def withNaNCheck(body: String): String = {
+          (elementType match {
+            case DoubleType => Some(s"java.lang.Double.isNaN((double)$value)")
+            case FloatType => Some(s"java.lang.Float.isNaN((float)$value)")
+            case _ => None
+          }).map { isNaN =>
+            s"""
+               |if ($isNaN) {
+               |  if (!$hashSet.containsNaN()) {
+               |     $size++;
+               |     $hashSet.addNaN();
+               |     $builder.$$plus$$eq($value);
+               |  }
+               |} else {
+               |  $body
+               |}
+             """.stripMargin
+          }
+        }.getOrElse(body)
+
         val processArray = withArrayNullAssignment(

Review comment:
       This is probably a better code style:
   ```
   val body = ...
   val processArray = withArrayNullAssignment(
     s"""
       |$jt $value = ${genGetValue(array, i)};
       |${withNaNCheck(body)}
   ```
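
To make the suggested shape concrete, here is a minimal, self-contained sketch of a `withNaNCheck`-style helper that wraps a generated Java snippet in a NaN branch. It is an illustration only, not the code merged in the PR: the `ElemType` hierarchy and the explicit `value`/`hashSet`/`size`/`builder` name parameters are hypothetical stand-ins for Spark's `DataType` objects and the codegen variable names visible in the diff.

```scala
object NaNCheckCodegenSketch {
  // Hypothetical stand-ins for Spark's DataType objects, so the sketch compiles on its own.
  sealed trait ElemType
  case object DoubleElem extends ElemType
  case object FloatElem extends ElemType
  case object IntElem extends ElemType

  /** Wraps `body` (a Java snippet) in a NaN branch when the element type is floating point. */
  def withNaNCheck(
      elementType: ElemType,
      value: String,
      hashSet: String,
      size: String,
      builder: String)(body: String): String = {
    // Only double and float elements can hold NaN; other types get no check.
    val isNaN: Option[String] = elementType match {
      case DoubleElem => Some(s"java.lang.Double.isNaN((double)$value)")
      case FloatElem => Some(s"java.lang.Float.isNaN((float)$value)")
      case _ => None
    }
    isNaN.map { check =>
      // `$$` emits a literal `$`, so the generated Java calls `$plus$eq` on the builder.
      s"""
         |if ($check) {
         |  if (!$hashSet.containsNaN()) {
         |    $size++;
         |    $hashSet.addNaN();
         |    $builder.$$plus$$eq($value);
         |  }
         |} else {
         |  $body
         |}
       """.stripMargin
    }.getOrElse(body) // non-floating-point types keep the original body unchanged
  }

  def main(args: Array[String]): Unit = {
    val body = "hashSet.add(value);"
    println(withNaNCheck(DoubleElem, "value", "hashSet", "size", "builder")(body))
    println(withNaNCheck(IntElem, "value", "hashSet", "size", "builder")(body))
  }
}
```

With this shape, the caller builds `body` once and passes it through the helper inside the template, so the NaN branch is emitted only for element types that can contain NaN.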






[GitHub] [spark] HyukjinKwon commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-920084616


   The test added here fails:
   
   ```
   sbt.ForkMain$ForkError: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double
   	at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:112)
   	at scala.collection.mutable.ArrayBuilder$ofDouble.addOne(ArrayBuilder.scala:402)
   	at scala.collection.mutable.Growable.$plus$eq(Growable.scala:36)
   	at scala.collection.mutable.Growable.$plus$eq$(Growable.scala:36)
   	at scala.collection.mutable.ArrayBuilder.$plus$eq(ArrayBuilder.scala:23)
   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.ArrayUnion_0$(Unknown Source)
   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source)
   	at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.evaluateWithMutableProjection(ExpressionEvalHelper.scala:238)
   	at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.evaluateWithMutableProjection$(ExpressionEvalHelper.scala:232)
   	at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.evaluateWithMutableProjection(CollectionExpressionsSuite.scala:39)
   	at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.$anonfun$checkEvaluationWithMutableProjection$2(ExpressionEvalHelper.scala:222)
   	at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54)
   	at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38)
   	at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.withSQLConf(CollectionExpressionsSuite.scala:39)
   	at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.$anonfun$checkEvaluationWithMutableProjection$1(ExpressionEvalHelper.scala:221)
   	at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.$anonfun$checkEvaluationWithMutableProjection$1$adapted(ExpressionEvalHelper.scala:220)
   	at scala.collection.immutable.List.foreach(List.scala:333)
   	at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithMutableProjection(ExpressionEvalHelper.scala:220)
   	at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithMutableProjection$(ExpressionEvalHelper.scala:215)
   	at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.checkEvaluationWithMutableProjection(CollectionExpressionsSuite.scala:39)
   	at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation(ExpressionEvalHelper.scala:88)
   	at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation$(ExpressionEvalHelper.scala:82)
   ```
   
   https://github.com/apache/spark/runs/3606700233
   
   I wonder how it passed in the PR tests.
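
The top frames of the trace show a boxed `java.lang.Integer` reaching `ArrayBuilder$ofDouble` through `$plus$eq`. The sketch below reproduces that kind of failure in isolation, as an illustration of the mechanism only. It does not establish why the generated code in this test passed an `Integer`, and the object and method names are hypothetical.

```scala
import scala.collection.mutable.ArrayBuilder

object BuilderCastSketch {
  /** Adds `value` through an erased builder reference; reports whether the add succeeded. */
  def tryAdd(builder: ArrayBuilder[Any], value: Any): Boolean =
    try {
      // The bridge method unboxes `value` as a Double, which fails for other boxed types.
      builder += value
      true
    } catch {
      case _: ClassCastException => false
    }

  def main(args: Array[String]): Unit = {
    // The cast is unchecked because of erasure, much like a call from generated Java code.
    val doubles = new ArrayBuilder.ofDouble().asInstanceOf[ArrayBuilder[Any]]
    println(tryAdd(doubles, 1.0d)) // a boxed java.lang.Double is accepted
    println(tryAdd(doubles, 1))    // a boxed java.lang.Integer is rejected
  }
}
```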




[GitHub] [spark] SparkQA removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918797641


   **[Test build #143239 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143239/testReport)** for PR 33955 at commit [`fe407c9`](https://github.com/apache/spark/commit/fe407c9325716f6ba8fd637e05e54dd208c8ab69).




[GitHub] [spark] cloud-fan commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-919023006


   thanks, merging to master/3.2/3.1




[GitHub] [spark] AngersZhuuuu commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-916757280


   ping @cloud-fan 




[GitHub] [spark] cloud-fan commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r707139812



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
##########
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util
+
+import scala.reflect._
+
+import org.apache.spark.annotation.Private
+import org.apache.spark.sql.types.{DataType, DoubleType, FloatType}
+import org.apache.spark.util.collection.OpenHashSet
+
+/**
+ * A wrap of [[OpenHashSet]] that can handle null, Double.NaN and Float.NaN w.r.t. the SQL semantic.
+ */
+@Private
+class SQLOpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](

Review comment:
       Can we add a unit test (UT) suite for it?






[GitHub] [spark] SparkQA removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917988334


   **[Test build #143198 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143198/testReport)** for PR 33955 at commit [`3059ea1`](https://github.com/apache/spark/commit/3059ea1d526731c0635a66c48c9154ab259f51da).




[GitHub] [spark] AngersZhuuuu commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-919004198


   ping @cloud-fan 




[GitHub] [spark] SparkQA removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918859959


   **[Test build #143246 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143246/testReport)** for PR 33955 at commit [`4e5e085`](https://github.com/apache/spark/commit/4e5e08526ffe96eaa5add069aef9467948730755).




[GitHub] [spark] AngersZhuuuu commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r707102491



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
##########
@@ -3679,22 +3686,38 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
             body
           }
 
+        val isNaN = elementType match {

Review comment:
       Done






[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918900036


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47749/
   




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918523381


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143213/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r707093906



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
##########
@@ -3679,22 +3686,38 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
             body
           }
 
+        val isNaN = elementType match {

Review comment:
       ```
   def withNaNCheck(body: String): String = {
     (elementType match {
       case DoubleType => Some(...)
       case FloatType => Some(...)
       case _ => None
     }).map { isNaN =>
       s"""
         | if ($isNaN) ... else $body
       """
     }
   }.getOrElse(body)
   ```






[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918775974


   **[Test build #143235 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143235/testReport)** for PR 33955 at commit [`f59c0a8`](https://github.com/apache/spark/commit/f59c0a87c8792e6551f719f78b5049bfc5a2f917).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918900036


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47749/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918978532


   **[Test build #143239 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143239/testReport)** for PR 33955 at commit [`fe407c9`](https://github.com/apache/spark/commit/fe407c9325716f6ba8fd637e05e54dd208c8ab69).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] cloud-fan commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r707040485



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
##########
@@ -3575,15 +3576,15 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
     if (TypeUtils.typeWithProperEquals(elementType)) {
       (array1, array2) =>
         val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
-        val hs = new OpenHashSet[Any]
+        val hs = new SQLOpenHashSet[Any]
         var foundNullElement = false

Review comment:
       We can remove this variable (`foundNullElement`) now.






[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918798632


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47737/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918798604


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47737/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918340039


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47714/
   




[GitHub] [spark] SparkQA removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917928352


   **[Test build #143193 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143193/testReport)** for PR 33955 at commit [`89a4263`](https://github.com/apache/spark/commit/89a426374c0873c48e738d96d0f46f99b6e39f6d).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918040810








[GitHub] [spark] cloud-fan commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r707042392



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util
+
+import scala.reflect._
+
+import org.apache.spark.annotation.Private
+import org.apache.spark.util.collection.OpenHashSet
+
+/**
+ * A wrap of [[OpenHashSet]] that can handle null, Double.NaN and Float.NaN w.r.t. the SQL semantic.
+ */
+@Private
+class SQLOpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
+    initialCapacity: Int,
+    loadFactor: Double) {
+
+  def this(initialCapacity: Int) = this(initialCapacity, 0.7)
+
+  def this() = this(64)
+
+  private val hashSet = new OpenHashSet[T](initialCapacity, loadFactor)
+
+  private var containNull = false
+  private var containNaN = false
+
+  def addNull(): Unit = {
+    containNull = true
+  }
+
+  def addNaN(): Unit = {
+    containNaN = true
+  }
+
+  def add(k: T): Unit = {
+    hashSet.add(k)
+  }
+
+  def contains(k: T): Boolean = {

Review comment:
       Shall we add a `containsNaN` method? Checking for NaN by reflection is pretty slow.
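
One way to read this suggestion is that the wrapper keeps NaN membership as a plain flag with its own accessor, so `contains` never has to inspect the element type. The sketch below illustrates that idea under stated assumptions: `NaNAwareSetSketch` is a hypothetical name, and a standard `mutable.HashSet` stands in for Spark's `OpenHashSet`, so this is not the class under review.

```scala
import scala.collection.mutable

/** Simplified stand-in for the wrapper under review; not Spark's implementation. */
class NaNAwareSetSketch[T] {
  // Assumption: a standard mutable.HashSet replaces Spark's OpenHashSet here.
  private val underlying = mutable.HashSet.empty[T]
  private var hasNull = false
  private var hasNaN = false

  def addNull(): Unit = { hasNull = true }
  def addNaN(): Unit = { hasNaN = true }
  def add(k: T): Unit = { underlying += k }

  // Flag lookups: callers decide up front whether a value is null or NaN.
  def containsNull(): Boolean = hasNull
  def containsNaN(): Boolean = hasNaN
  def contains(k: T): Boolean = underlying.contains(k)
}

object NaNAwareSetSketch {
  /** Deduplicates doubles in first-seen order, collapsing all NaN values into one entry. */
  def distinctDoubles(values: Seq[Double]): Seq[Double] = {
    val seen = new NaNAwareSetSketch[Double]
    val out = Seq.newBuilder[Double]
    values.foreach { v =>
      if (v.isNaN) {
        // NaN never goes into the underlying set; only the flag records it.
        if (!seen.containsNaN()) {
          seen.addNaN()
          out += v
        }
      } else if (!seen.contains(v)) {
        seen.add(v)
        out += v
      }
    }
    out.result()
  }
}
```

The caller performs the `isNaN` test on the primitive value, which mirrors the `java.lang.Double.isNaN` check emitted on the codegen path in the diff.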






[GitHub] [spark] AngersZhuuuu commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r706276419



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util
+
+import scala.reflect._
+
+import org.apache.spark.annotation.Private
+import org.apache.spark.util.collection.OpenHashSet
+
+/**
+ * A wrap of [[OpenHashSet]] that can handle null, Double.NaN and Float.NaN.

Review comment:
       Done






[GitHub] [spark] SparkQA removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-916750680


   **[Test build #143142 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143142/testReport)** for PR 33955 at commit [`8579c97`](https://github.com/apache/spark/commit/8579c9769df6bfe4f59dda612661d402938867a3).




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918824358


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47741/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918058441


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47702/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918529528


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143212/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917791275


   **[Test build #143182 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143182/testReport)** for PR 33955 at commit [`119679c`](https://github.com/apache/spark/commit/119679cfc5884928d9fa368f683689a214d01912).




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918016407








[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918031394


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47700/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-916804017


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47646/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918953496


   **[Test build #143238 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143238/testReport)** for PR 33955 at commit [`f27c4e1`](https://github.com/apache/spark/commit/f27c4e12530e7d98eefb49bd9631f5d19785d9c2).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918798632


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47737/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918213551


   **[Test build #143198 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143198/testReport)** for PR 33955 at commit [`3059ea1`](https://github.com/apache/spark/commit/3059ea1d526731c0635a66c48c9154ab259f51da).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918216168


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143198/
   




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917417677


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47669/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918824140


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47742/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r707041873



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util
+
+import scala.reflect._
+
+import org.apache.spark.annotation.Private
+import org.apache.spark.util.collection.OpenHashSet
+
+/**
+ * A wrap of [[OpenHashSet]] that can handle null, Double.NaN and Float.NaN w.r.t. the SQL semantic.
+ */
+@Private
+class SQLOpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
+    initialCapacity: Int,
+    loadFactor: Double) {
+
+  def this(initialCapacity: Int) = this(initialCapacity, 0.7)
+
+  def this() = this(64)
+
+  private val hashSet = new OpenHashSet[T](initialCapacity, loadFactor)
+
+  private var containNull = false
+  private var containNaN = false
+
+  def addNull(): Unit = {
+    containNull = true
+  }
+
+  def addNaN(): Unit = {
+    containNaN = true
+  }
+
+  def add(k: T): Unit = {
+    hashSet.add(k)
+  }
+
+  def contains(k: T): Boolean = {
+    if (SQLOpenHashSet.isNaN(k)) {
+      containNaN
+    } else {
+      hashSet.contains(k)
+    }
+  }
+
+  def containsNull(): Boolean = containNull
+}
+
+object SQLOpenHashSet {
+  def isNaN(value: Any): Boolean = {
+    (value.isInstanceOf[java.lang.Double] &&

Review comment:
       This looks very slow. At least in codegen, we can emit `java.lang.Float.isNaN` or `java.lang.Double.isNaN` directly, based on the data type.
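
A rough sketch of the idea in this comment, under stated assumptions: `ElemType`, `NaNCheck`, `isNaNFor` and `isNaNCode` are hypothetical names used only for illustration, and `ElemType` stands in for Spark's `DataType` so the example runs without Spark. The point is to pick the NaN check once per element type instead of probing each value with `isInstanceOf`, and, for codegen, to emit the matching `isNaN` call as a Java source fragment.

```scala
// Hypothetical stand-in for Spark's DataType hierarchy (assumption for this sketch).
sealed trait ElemType
case object DoubleElem extends ElemType
case object FloatElem extends ElemType
case object LongElem extends ElemType

object NaNCheck {
  // Interpreted path: resolve the predicate once per expression, then reuse it per element.
  def isNaNFor(t: ElemType): Any => Boolean = t match {
    case DoubleElem => (v: Any) => java.lang.Double.isNaN(v.asInstanceOf[Double])
    case FloatElem => (v: Any) => java.lang.Float.isNaN(v.asInstanceOf[Float])
    case _ => (_: Any) => false // types without a NaN value never match
  }

  // Codegen path: produce the Java expression text for the given element type.
  def isNaNCode(t: ElemType, valueName: String): String = t match {
    case DoubleElem => s"java.lang.Double.isNaN((double)$valueName)"
    case FloatElem => s"java.lang.Float.isNaN((float)$valueName)"
    case _ => "false"
  }
}
```

Because the match on the element type happens once, the per-element cost on the interpreted path is a single function call, and the generated code contains only the primitive check.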






[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918003002


   **[Test build #143188 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143188/testReport)** for PR 33955 at commit [`8d0e4a9`](https://github.com/apache/spark/commit/8d0e4a9cbf51cebdaebd5f52303e785dd69da31b).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917835861


   **[Test build #143188 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143188/testReport)** for PR 33955 at commit [`8d0e4a9`](https://github.com/apache/spark/commit/8d0e4a9cbf51cebdaebd5f52303e785dd69da31b).




[GitHub] [spark] SparkQA removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917409055


   **[Test build #143165 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143165/testReport)** for PR 33955 at commit [`4e533fd`](https://github.com/apache/spark/commit/4e533fdabcae676560f9396442df4f4993cc2f67).




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917084869


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143142/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-916998634


   **[Test build #143152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143152/testReport)** for PR 33955 at commit [`1857988`](https://github.com/apache/spark/commit/18579884948898f9a9f6e15046fd807a2d294f7e).




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918783434


   **[Test build #143238 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143238/testReport)** for PR 33955 at commit [`f27c4e1`](https://github.com/apache/spark/commit/f27c4e12530e7d98eefb49bd9631f5d19785d9c2).




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918064003


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47702/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r707139812



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
##########
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util
+
+import scala.reflect._
+
+import org.apache.spark.annotation.Private
+import org.apache.spark.sql.types.{DataType, DoubleType, FloatType}
+import org.apache.spark.util.collection.OpenHashSet
+
+/**
+ * A wrap of [[OpenHashSet]] that can handle null, Double.NaN and Float.NaN w.r.t. the SQL semantic.
+ */
+@Private
+class SQLOpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](

Review comment:
       Can we add a unit test (UT) suite for it?






[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918900014


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47749/
   




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917858174


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47690/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918288358


   **[Test build #143212 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143212/testReport)** for PR 33955 at commit [`db8159e`](https://github.com/apache/spark/commit/db8159e3676ee5d137e5ebcbc94d92576f7ca0aa).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918824358


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47741/
   




[GitHub] [spark] SparkQA removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918288358


   **[Test build #143212 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143212/testReport)** for PR 33955 at commit [`db8159e`](https://github.com/apache/spark/commit/db8159e3676ee5d137e5ebcbc94d92576f7ca0aa).




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918528199


   **[Test build #143212 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143212/testReport)** for PR 33955 at commit [`db8159e`](https://github.com/apache/spark/commit/db8159e3676ee5d137e5ebcbc94d92576f7ca0aa).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918218723


   **[Test build #143199 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143199/testReport)** for PR 33955 at commit [`991fddd`](https://github.com/apache/spark/commit/991fddd22d80a9e7e946ba679c9582fc14a33ba6).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918202888


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143196/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r706177022



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util
+
+import scala.reflect._
+
+import org.apache.spark.annotation.Private
+import org.apache.spark.util.collection.OpenHashSet
+
+/**
+ * A wrap of [[OpenHashSet]] that can handle null, Double.NaN and Float.NaN.
+ */
+@Private
+class SQLOpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
+    initialCapacity: Int,
+    loadFactor: Double) {
+
+  def this(initialCapacity: Int) = this(initialCapacity, 0.7)
+
+  def this() = this(64)
+
+  private val hashSet = new OpenHashSet[T](initialCapacity, loadFactor)
+
+  private var containsNull = false
+  private var containsDoubleNaN = false
+  private var containsFloatNaN = false

Review comment:
       Maybe we should do the null/NaN check on the caller side:
   ```
   class SQLOpenHashSet ... {
     def add(k: T)
     def addNull()
     def addNaN()
   }
   
   // caller side
   if (row.isNullAt...) {
     set.addNull()
   } else {
     ...
     if (java.lang.Double.isNaN(value)) {
       set.addNaN()
     }
   }
   ```
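
   A minimal, self-contained sketch of that caller-side split, for illustration only. `NaNAwareSet` and `CallerSideDedup` are hypothetical names, `scala.collection.mutable.HashSet` stands in for `OpenHashSet`, and `Option[Double]` stands in for a nullable array element. None of this is the code in the PR.

   ```scala
   import scala.collection.mutable

   // Hypothetical stand-in for the wrapper discussed above (not the Spark class).
   // The set itself never inspects values for null/NaN; it only records flags.
   final class NaNAwareSet {
     private val values = mutable.HashSet.empty[Double] // stand-in for OpenHashSet[Double]
     private var hasNull = false
     private var hasNaN = false

     def add(v: Double): Unit = { values += v }
     def contains(v: Double): Boolean = values.contains(v)
     def addNull(): Unit = { hasNull = true }
     def addNaN(): Unit = { hasNaN = true }
     def containsNull: Boolean = hasNull
     def containsNaN: Boolean = hasNaN
   }

   object CallerSideDedup {
     // None models a SQL null element; Some(v) models a non-null double.
     def distinct(input: Seq[Option[Double]]): Seq[Option[Double]] = {
       val seen = new NaNAwareSet
       val out = mutable.ArrayBuffer.empty[Option[Double]]
       input.foreach {
         case None =>
           // Caller decides this is a null and uses the dedicated entry point.
           if (!seen.containsNull) { seen.addNull(); out += None }
         case Some(v) if java.lang.Double.isNaN(v) =>
           // Caller decides this is NaN; the value never reaches the hash set.
           if (!seen.containsNaN) { seen.addNaN(); out += Some(v) }
         case Some(v) =>
           if (!seen.contains(v)) { seen.add(v); out += Some(v) }
       }
       out.toSeq
     }
   }
   ```

   With this shape, `CallerSideDedup.distinct(Seq(Some(Double.NaN), Some(Double.NaN), None))` keeps one NaN and one null, because both checks happen before anything is handed to the underlying set.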






[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917042176


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47656/
   




[GitHub] [spark] AngersZhuuuu commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r706276531



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util
+
+import scala.reflect._
+
+import org.apache.spark.annotation.Private
+import org.apache.spark.util.collection.OpenHashSet
+
+/**
+ * A wrap of [[OpenHashSet]] that can handle null, Double.NaN and Float.NaN.
+ */
+@Private
+class SQLOpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
+    initialCapacity: Int,
+    loadFactor: Double) {
+
+  def this(initialCapacity: Int) = this(initialCapacity, 0.7)
+
+  def this() = this(64)
+
+  private val hashSet = new OpenHashSet[T](initialCapacity, loadFactor)
+
+  private var containsNull = false
+  private var containsDoubleNaN = false
+  private var containsFloatNaN = false

Review comment:
       How about the current version?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
##########
@@ -3649,61 +3643,37 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
       val ptName = CodeGenerator.primitiveTypeName(jt)
 
       nullSafeCodeGen(ctx, ev, (array1, array2) => {
-        val foundNullElement = ctx.freshName("foundNullElement")
         val nullElementIndex = ctx.freshName("nullElementIndex")
         val builder = ctx.freshName("builder")
         val array = ctx.freshName("array")
         val arrays = ctx.freshName("arrays")
         val arrayDataIdx = ctx.freshName("arrayDataIdx")
-        val openHashSet = classOf[OpenHashSet[_]].getName
+        val openHashSet = classOf[SQLOpenHashSet[_]].getName
         val classTag = s"scala.reflect.ClassTag$$.MODULE$$.$hsTypeName()"
         val hashSet = ctx.freshName("hashSet")
         val arrayBuilder = classOf[mutable.ArrayBuilder[_]].getName
         val arrayBuilderClass = s"$arrayBuilder$$of$ptName"
 
-        def withArrayNullAssignment(body: String) =
-          if (dataType.asInstanceOf[ArrayType].containsNull) {
-            s"""
-               |if ($array.isNullAt($i)) {
-               |  if (!$foundNullElement) {
-               |    $nullElementIndex = $size;
-               |    $foundNullElement = true;
-               |    $size++;
-               |    $builder.$$plus$$eq($nullValueHolder);
-               |  }
-               |} else {
-               |  $body
-               |}
-             """.stripMargin
-          } else {
-            body
-          }
-
-        val processArray = withArrayNullAssignment(
+        val processArray =
           s"""
              |$jt $value = ${genGetValue(array, i)};

Review comment:
       Done






[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917988334


   **[Test build #143198 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143198/testReport)** for PR 33955 at commit [`3059ea1`](https://github.com/apache/spark/commit/3059ea1d526731c0635a66c48c9154ab259f51da).




[GitHub] [spark] AngersZhuuuu commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r706155804



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util
+
+import scala.reflect._
+
+import org.apache.spark.annotation.Private
+import org.apache.spark.util.collection.OpenHashSet
+
+/**
+ * A wrap of [[OpenHashSet]] that can handle null, Double.NaN and Float.NaN.
+ */
+@Private
+class SQLOpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
+    initialCapacity: Int,
+    loadFactor: Double) {
+
+  def this(initialCapacity: Int) = this(initialCapacity, 0.7)
+
+  def this() = this(64)
+
+  private val hashSet = new OpenHashSet[T](initialCapacity, loadFactor)
+
+  private var containsNull = false
+  private var containsDoubleNaN = false
+  private var containsFloatNaN = false

Review comment:
       > The data added to this set will always be the same data type. I think we can just have a single `containsNaN` flag.
   
   I thought about this too, but since the set can support any element type, might it be better to keep the separate flags?
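
   For reference, a rough sketch of what the single-flag variant could look like. `SingleFlagSet` and `addIfAbsent` are hypothetical names, and `scala.collection.mutable.HashSet` stands in for `OpenHashSet`. The assumption is that each instance only ever holds one element type, so the caller supplies the matching NaN check once at construction.

   ```scala
   import scala.collection.mutable

   // Hypothetical illustration, not the class from the PR.
   // One instance holds one element type, so one NaN flag is enough.
   final class SingleFlagSet[T](isNaN: T => Boolean) {
     private val values = mutable.HashSet.empty[T]
     private var containsNaN = false

     /** Returns true if the element was not present before this call. */
     def addIfAbsent(v: T): Boolean =
       if (isNaN(v)) {
         val isNew = !containsNaN
         containsNaN = true
         isNew
       } else {
         values.add(v) // mutable.HashSet#add returns true when the element is new
       }
   }

   object SingleFlagSetExample {
     // The NaN predicate is chosen per element type at construction time.
     val doubles = new SingleFlagSet[Double](d => java.lang.Double.isNaN(d))
     val floats = new SingleFlagSet[Float](f => java.lang.Float.isNaN(f))
   }
   ```

   A non-floating-point instance would pass `_ => false`, so the flag is simply never set for those types.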






[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918374932


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47715/
   




[GitHub] [spark] SparkQA removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917812261


   **[Test build #143186 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143186/testReport)** for PR 33955 at commit [`45d1fee`](https://github.com/apache/spark/commit/45d1feebbbafb59bc5acf9524b3d4c33761060bf).




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917851170


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47690/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917792860


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143182/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918343433


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47715/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917077242


   **[Test build #143142 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143142/testReport)** for PR 33955 at commit [`8579c97`](https://github.com/apache/spark/commit/8579c9769df6bfe4f59dda612661d402938867a3).
    * This patch **fails from timeout after a configured wait of `500m`**.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `class SQLOpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](`




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918523381


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143213/
   




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918955109


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143238/
   




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918040810


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47700/
   




[GitHub] [spark] cloud-fan commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-919023223


   @AngersZhuuuu can you open a backport PR for 3.0?




[GitHub] [spark] SparkQA removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918775974


   **[Test build #143235 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143235/testReport)** for PR 33955 at commit [`f59c0a8`](https://github.com/apache/spark/commit/f59c0a87c8792e6551f719f78b5049bfc5a2f917).




[GitHub] [spark] dongjoon-hyun commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-919326129


   cc @sunchao and @viirya 




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-919071645


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143246/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917194384


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143152/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-916792864


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47646/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917417677


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47669/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917792845


   **[Test build #143182 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143182/testReport)** for PR 33955 at commit [`119679c`](https://github.com/apache/spark/commit/119679cfc5884928d9fa368f683689a214d01912).
    * This patch **fails to build**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917814737


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47688/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917973756


   **[Test build #143196 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143196/testReport)** for PR 33955 at commit [`08da413`](https://github.com/apache/spark/commit/08da4130599d35b2b4a2af1a40a00550617e447f).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917858174


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47690/
   




[GitHub] [spark] AngersZhuuuu commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917789034


   ping @cloud-fan 




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918828436


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47742/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918989931


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143239/
   




[GitHub] [spark] SparkQA removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917973756


   **[Test build #143196 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143196/testReport)** for PR 33955 at commit [`08da413`](https://github.com/apache/spark/commit/08da4130599d35b2b4a2af1a40a00550617e447f).




[GitHub] [spark] AngersZhuuuu commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r707071732



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
##########
@@ -3575,15 +3576,15 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
     if (TypeUtils.typeWithProperEquals(elementType)) {
       (array1, array2) =>
         val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
-        val hs = new OpenHashSet[Any]
+        val hs = new SQLOpenHashSet[Any]
         var foundNullElement = false

Review comment:
       Done

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util
+
+import scala.reflect._
+
+import org.apache.spark.annotation.Private
+import org.apache.spark.util.collection.OpenHashSet
+
+/**
+ * A wrap of [[OpenHashSet]] that can handle null, Double.NaN and Float.NaN w.r.t. the SQL semantic.
+ */
+@Private
+class SQLOpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
+    initialCapacity: Int,
+    loadFactor: Double) {
+
+  def this(initialCapacity: Int) = this(initialCapacity, 0.7)
+
+  def this() = this(64)
+
+  private val hashSet = new OpenHashSet[T](initialCapacity, loadFactor)
+
+  private var containNull = false
+  private var containNaN = false
+
+  def addNull(): Unit = {
+    containNull = true
+  }
+
+  def addNaN(): Unit = {
+    containNaN = true
+  }
+
+  def add(k: T): Unit = {
+    hashSet.add(k)
+  }
+
+  def contains(k: T): Boolean = {
+    if (SQLOpenHashSet.isNaN(k)) {
+      containNaN
+    } else {
+      hashSet.contains(k)
+    }
+  }
+
+  def containsNull(): Boolean = containNull
+}
+
+object SQLOpenHashSet {
+  def isNaN(value: Any): Boolean = {
+    (value.isInstanceOf[java.lang.Double] &&

Review comment:
       How about the current version?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util
+
+import scala.reflect._
+
+import org.apache.spark.annotation.Private
+import org.apache.spark.util.collection.OpenHashSet
+
+/**
+ * A wrap of [[OpenHashSet]] that can handle null, Double.NaN and Float.NaN w.r.t. the SQL semantic.
+ */
+@Private
+class SQLOpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
+    initialCapacity: Int,
+    loadFactor: Double) {
+
+  def this(initialCapacity: Int) = this(initialCapacity, 0.7)
+
+  def this() = this(64)
+
+  private val hashSet = new OpenHashSet[T](initialCapacity, loadFactor)
+
+  private var containNull = false
+  private var containNaN = false
+
+  def addNull(): Unit = {
+    containNull = true
+  }
+
+  def addNaN(): Unit = {
+    containNaN = true
+  }
+
+  def add(k: T): Unit = {
+    hashSet.add(k)
+  }
+
+  def contains(k: T): Boolean = {

Review comment:
       How about the current version?
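
The wrapper quoted above keeps NaN out of the underlying hash set and tracks it with a flag instead. A minimal standalone sketch of that idea follows. It is not the PR's code: `NaNAwareSet` and `dedup` are hypothetical names, `scala.collection.mutable.HashSet` stands in for Spark's `OpenHashSet`, and null handling is left out.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the wrapper under review. It uses
// scala.collection.mutable.HashSet rather than Spark's OpenHashSet.
final class NaNAwareSet[T] {
  private val underlying = mutable.HashSet.empty[T]
  // NaN is never stored in the underlying set; this flag records it instead.
  private var hasNaN = false

  def addNaN(): Unit = { hasNaN = true }

  def containsNaN: Boolean = hasNaN

  def add(k: T): Unit = { underlying.add(k); () }

  // NaN lookups consult the flag; everything else goes to the underlying set.
  def contains(k: T): Boolean =
    if (NaNAwareSet.isNaN(k)) hasNaN else underlying.contains(k)
}

object NaNAwareSet {
  // True only for a Double or Float NaN (boxed or not); false for null.
  def isNaN(value: Any): Boolean = value match {
    case d: Double => d.isNaN
    case f: Float => f.isNaN
    case _ => false
  }

  // Order-preserving de-duplication that keeps at most one NaN.
  def dedup(values: Seq[Any]): Seq[Any] = {
    val seen = new NaNAwareSet[Any]
    val out = mutable.ArrayBuffer.empty[Any]
    values.foreach { v =>
      if (isNaN(v)) {
        if (!seen.containsNaN) {
          seen.addNaN()
          out += v
        }
      } else if (!seen.contains(v)) {
        seen.add(v)
        out += v
      }
    }
    out.toSeq
  }
}
```

Because NaN never reaches the underlying set, the result does not depend on how that set hashes or compares NaN values.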






[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918820137


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47741/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918859959


   **[Test build #143246 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143246/testReport)** for PR 33955 at commit [`4e5e085`](https://github.com/apache/spark/commit/4e5e08526ffe96eaa5add069aef9467948730755).




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917812261


   **[Test build #143186 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143186/testReport)** for PR 33955 at commit [`45d1fee`](https://github.com/apache/spark/commit/45d1feebbbafb59bc5acf9524b3d4c33761060bf).




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918354595


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47715/
   




[GitHub] [spark] SparkQA removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918783434


   **[Test build #143238 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143238/testReport)** for PR 33955 at commit [`f27c4e1`](https://github.com/apache/spark/commit/f27c4e12530e7d98eefb49bd9631f5d19785d9c2).




[GitHub] [spark] cloud-fan commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r707145253



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
##########
@@ -3679,22 +3686,42 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
             body
           }
 
-        val processArray = withArrayNullAssignment(
+        def withNaNCheck(body: String): String = {
+          (elementType match {
+            case DoubleType => Some(s"java.lang.Double.isNaN((double)$value)")
+            case FloatType => Some(s"java.lang.Float.isNaN((float)$value)")
+            case _ => None
+          }).map { isNaN =>
+            s"""
+               |if ($isNaN) {
+               |  if (!$hashSet.containsNaN()) {
+               |     $size++;
+               |     $hashSet.addNaN();
+               |     $builder.$$plus$$eq($value);
+               |  }
+               |} else {
+               |  $body
+               |}
+             """.stripMargin
+          }
+        }.getOrElse(body)
+
+        val body =
           s"""
-             |$jt $value = ${genGetValue(array, i)};
              |if (!$hashSet.contains($hsValueCast$value)) {
              |  if (++$size > ${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}) {
              |    break;
              |  }
              |  $hashSet.add$hsPostFix($hsValueCast$value);
              |  $builder.$$plus$$eq($value);
              |}
-           """.stripMargin)
+           """.stripMargin
+        val processArray =
+          withArrayNullAssignment(s"$jt $value = ${genGetValue(array, i)};" ++ withNaNCheck(body))

Review comment:
       ```suggestion
             withArrayNullAssignment(s"$jt $value = ${genGetValue(array, i)};" + withNaNCheck(body))
   ```
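
For context on the suggestion: on two `String` operands, `++` (the collection-style append Scala provides for strings) and `+` should produce the same text, so the change looks stylistic. A small sketch, with made-up fragment contents standing in for the generated Java code:

```scala
object ConcatSketch {
  // Illustrative fragments only; the real ones come from the code generator.
  val declaration: String = "double value = array[i];"
  val body: String = "if (!seen(value)) { add(value); }"

  // Collection-style append on strings.
  val viaPlusPlus: String = declaration ++ body

  // Plain string concatenation, as proposed in the suggestion.
  val viaPlus: String = declaration + body
}
```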






[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-919071645


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143246/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917993869


   **[Test build #143199 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143199/testReport)** for PR 33955 at commit [`991fddd`](https://github.com/apache/spark/commit/991fddd22d80a9e7e946ba679c9582fc14a33ba6).




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918151004


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143193/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918942410


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143235/
   




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917792860


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143182/
   




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918221136


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143199/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-919069304


   **[Test build #143246 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143246/testReport)** for PR 33955 at commit [`4e5e085`](https://github.com/apache/spark/commit/4e5e08526ffe96eaa5add069aef9467948730755).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918192845


   **[Test build #143196 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143196/testReport)** for PR 33955 at commit [`08da413`](https://github.com/apache/spark/commit/08da4130599d35b2b4a2af1a40a00550617e447f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917451396


   **[Test build #143165 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143165/testReport)** for PR 33955 at commit [`4e533fd`](https://github.com/apache/spark/commit/4e533fdabcae676560f9396442df4f4993cc2f67).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AngersZhuuuu commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r706725649



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
##########
@@ -3550,6 +3551,10 @@ object ArrayBinaryLike {
   def throwUnionLengthOverflowException(length: Int): Unit = {
     throw QueryExecutionErrors.unionArrayWithElementsExceedLimitError(length)
   }
+
+  def isNaN(value: Any): Boolean = {
+    Double.NaN.equals(value) || Float.NaN.equals(value)

Review comment:
       Ok, done






[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917194384


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143152/
   




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917453739


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143165/
   




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917813739


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143186/
   




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918216168


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143198/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918955109


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143238/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-916750680


   **[Test build #143142 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143142/testReport)** for PR 33955 at commit [`8579c97`](https://github.com/apache/spark/commit/8579c9769df6bfe4f59dda612661d402938867a3).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917084869


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143142/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917409055


   **[Test build #143165 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143165/testReport)** for PR 33955 at commit [`4e533fd`](https://github.com/apache/spark/commit/4e533fdabcae676560f9396442df4f4993cc2f67).




[GitHub] [spark] cfmcgrady commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cfmcgrady commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r706304078



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
##########
@@ -3550,6 +3551,10 @@ object ArrayBinaryLike {
   def throwUnionLengthOverflowException(length: Int): Unit = {
     throw QueryExecutionErrors.unionArrayWithElementsExceedLimitError(length)
   }
+
+  def isNaN(value: Any): Boolean = {
+    Double.NaN.equals(value) || Float.NaN.equals(value)

Review comment:
       It seems `Double.NaN.equals(value)` does not work with Scala 2.13; we need to use `java.lang.Double.isNaN()` instead.
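
As a rough sketch of the suggestion above, a NaN check on an `Any` value can pattern match on the runtime type and delegate to `java.lang.Double.isNaN` and `java.lang.Float.isNaN`. The object name `NaNCheck` is hypothetical, and this is not necessarily the code that ended up in the PR.

```scala
object NaNCheck {
  // Hypothetical helper: returns true only for a Double or Float NaN.
  // Matching on Double/Float also matches their boxed forms when the
  // static type is Any, so no call to equals() on a primitive is needed.
  def isNaN(value: Any): Boolean = value match {
    case d: Double => java.lang.Double.isNaN(d)
    case f: Float  => java.lang.Float.isNaN(f)
    case _         => false
  }
}
```

Non-floating-point inputs, including `null`, fall through to the last case and are reported as not NaN.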






[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-916804017


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47646/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r706154056



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util
+
+import scala.reflect._
+
+import org.apache.spark.annotation.Private
+import org.apache.spark.util.collection.OpenHashSet
+
+/**
+ * A wrap of [[OpenHashSet]] that can handle null, Double.NaN and Float.NaN.

Review comment:
       ```
   A wrap of [[OpenHashSet]] that can handle null, Double.NaN and Float.NaN w.r.t. the SQL semantic.
   ```
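
To make "SQL semantic" concrete: under IEEE 754 comparison NaN is not equal to itself, while array_union is expected to treat all NaN values as one element. The sketch below shows one way a wrapper could do that by tracking NaN with a flag instead of storing it. The class `NaNAwareDoubleSet` and its `union` helper are hypothetical, null handling is left out, and a plain `mutable.HashSet` stands in for Spark's `OpenHashSet`, so this is not the PR's `SQLOpenHashSet`.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the wrapper under review.
final class NaNAwareDoubleSet {
  private val values = mutable.HashSet.empty[Double]
  private var sawNaN = false

  // NaN is recorded in a flag and never reaches the underlying set.
  def add(v: Double): Unit =
    if (java.lang.Double.isNaN(v)) sawNaN = true else values += v

  def contains(v: Double): Boolean =
    if (java.lang.Double.isNaN(v)) sawNaN else values.contains(v)
}

object NaNAwareDoubleSet {
  // Order-preserving union that keeps at most one NaN.
  def union(left: Seq[Double], right: Seq[Double]): Seq[Double] = {
    val seen = new NaNAwareDoubleSet
    val out = Seq.newBuilder[Double]
    (left ++ right).foreach { v =>
      if (!seen.contains(v)) {
        seen.add(v)
        out += v
      }
    }
    out.result()
  }
}
```

With this sketch, `NaNAwareDoubleSet.union(Seq(Double.NaN, Double.NaN), Seq.empty)` yields a single-element result, which matches the behaviour the PR description asks for.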






[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917453739


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143165/
   




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918989931


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143239/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917965039


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47695/
   




[GitHub] [spark] cloud-fan commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-921524320


   This is so weird. There is no randomness in the test. How frequently do we see the test failure?




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917813739


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143186/
   




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918065896


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47702/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918828405


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47742/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918040743


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47700/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918151004


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143193/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918003584


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47698/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918339982


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47714/
   




[GitHub] [spark] HyukjinKwon commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-920472900


   Thanks guys. This is possibly flaky. I'll keep an eye on the build.




[GitHub] [spark] AngersZhuuuu commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-920095421


   > The test added here fails:
   > 
   > ```
   > sbt.ForkMain$ForkError: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double
   > 	at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:112)
   > 	at scala.collection.mutable.ArrayBuilder$ofDouble.addOne(ArrayBuilder.scala:402)
   > 	at scala.collection.mutable.Growable.$plus$eq(Growable.scala:36)
   > 	at scala.collection.mutable.Growable.$plus$eq$(Growable.scala:36)
   > 	at scala.collection.mutable.ArrayBuilder.$plus$eq(ArrayBuilder.scala:23)
   > 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.ArrayUnion_0$(Unknown Source)
   > 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source)
   > 	at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.evaluateWithMutableProjection(ExpressionEvalHelper.scala:238)
   > 	at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.evaluateWithMutableProjection$(ExpressionEvalHelper.scala:232)
   > 	at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.evaluateWithMutableProjection(CollectionExpressionsSuite.scala:39)
   > 	at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.$anonfun$checkEvaluationWithMutableProjection$2(ExpressionEvalHelper.scala:222)
   > 	at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54)
   > 	at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38)
   > 	at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.withSQLConf(CollectionExpressionsSuite.scala:39)
   > 	at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.$anonfun$checkEvaluationWithMutableProjection$1(ExpressionEvalHelper.scala:221)
   > 	at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.$anonfun$checkEvaluationWithMutableProjection$1$adapted(ExpressionEvalHelper.scala:220)
   > 	at scala.collection.immutable.List.foreach(List.scala:333)
   > 	at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithMutableProjection(ExpressionEvalHelper.scala:220)
   > 	at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithMutableProjection$(ExpressionEvalHelper.scala:215)
   > 	at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.checkEvaluationWithMutableProjection(CollectionExpressionsSuite.scala:39)
   > 	at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation(ExpressionEvalHelper.scala:88)
   > 	at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation$(ExpressionEvalHelper.scala:82)
   > ```
   > 
   > https://github.com/apache/spark/runs/3606700233
   > 
   > I wonder how it passed in the PR tests.
   
   Let me check.
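
For context on the top frames of the quoted trace: `BoxesRunTime.unboxToDouble` throws this `ClassCastException` whenever a boxed `java.lang.Integer` reaches a code path that unboxes to `Double`, for example when appending to a `Double` array builder. The sketch below reproduces only that mechanism. The object `UnboxDemo` and its method are hypothetical, and this is not a diagnosis of what the generated `ArrayUnion` code actually did.

```scala
import scala.collection.mutable.ArrayBuilder

object UnboxDemo {
  // Hypothetical helper: the value arrives as Any (boxed) and is unboxed
  // to Double before being appended. If the box holds an Integer, the
  // unboxing cast fails at runtime with ClassCastException.
  def appendUnchecked(builder: ArrayBuilder[Double], value: Any): Unit =
    builder += value.asInstanceOf[Double]
}
```

Passing `1.0` succeeds, while passing the integer literal `1` fails in the unboxing step, because the compiler cannot check a cast from `Any`.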




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917928352


   **[Test build #143193 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143193/testReport)** for PR 33955 at commit [`89a4263`](https://github.com/apache/spark/commit/89a426374c0873c48e738d96d0f46f99b6e39f6d).




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917813715


   **[Test build #143186 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143186/testReport)** for PR 33955 at commit [`45d1fee`](https://github.com/apache/spark/commit/45d1feebbbafb59bc5acf9524b3d4c33761060bf).
    * This patch **fails to build**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918522029


   **[Test build #143213 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143213/testReport)** for PR 33955 at commit [`eb1f028`](https://github.com/apache/spark/commit/eb1f02819db9861604945a522dfbb85daa6cca43).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918294963


   **[Test build #143213 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143213/testReport)** for PR 33955 at commit [`eb1f028`](https://github.com/apache/spark/commit/eb1f02819db9861604945a522dfbb85daa6cca43).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918202888


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143196/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r706154695



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util
+
+import scala.reflect._
+
+import org.apache.spark.annotation.Private
+import org.apache.spark.util.collection.OpenHashSet
+
+/**
+ * A wrap of [[OpenHashSet]] that can handle null, Double.NaN and Float.NaN.
+ */
+@Private
+class SQLOpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
+    initialCapacity: Int,
+    loadFactor: Double) {
+
+  def this(initialCapacity: Int) = this(initialCapacity, 0.7)
+
+  def this() = this(64)
+
+  private val hashSet = new OpenHashSet[T](initialCapacity, loadFactor)
+
+  private var containsNull = false
+  private var containsDoubleNaN = false
+  private var containsFloatNaN = false

Review comment:
       The data added to this set will always be of the same data type, so I think a single `containsNaN` flag is enough.
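
A minimal sketch of the single-flag idea, not the PR's actual class: `NaNAwareSet` is a hypothetical name, and `scala.collection.mutable.HashSet` stands in for Spark's `OpenHashSet` so that it runs without Spark on the classpath. Because one set instance only ever holds doubles or only ever holds floats, one `hasNaN` flag can serve both cases.

```scala
import scala.collection.mutable

object NaNFlagSketch {
  // Hypothetical stand-in for the PR's SQLOpenHashSet. A mutable.HashSet replaces
  // Spark's OpenHashSet so the sketch runs without Spark on the classpath.
  class NaNAwareSet[T] {
    private val underlying = mutable.HashSet.empty[T]
    private var hasNull = false
    // One flag only: a given set holds doubles or floats, never both.
    private var hasNaN = false

    def addNull(): Unit = { hasNull = true }
    def addNaN(): Unit = { hasNaN = true }
    def add(k: T): Unit = { underlying += k }

    def containsNull: Boolean = hasNull
    def containsNaN: Boolean = hasNaN
    def contains(k: T): Boolean = underlying.contains(k)
  }
}
```

In this sketch the caller decides whether a value is NaN and calls `addNaN()` instead of `add`, so NaN never reaches the underlying set.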






[GitHub] [spark] cloud-fan commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r706158509



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
##########
@@ -3649,61 +3643,37 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
       val ptName = CodeGenerator.primitiveTypeName(jt)
 
       nullSafeCodeGen(ctx, ev, (array1, array2) => {
-        val foundNullElement = ctx.freshName("foundNullElement")
         val nullElementIndex = ctx.freshName("nullElementIndex")
         val builder = ctx.freshName("builder")
         val array = ctx.freshName("array")
         val arrays = ctx.freshName("arrays")
         val arrayDataIdx = ctx.freshName("arrayDataIdx")
-        val openHashSet = classOf[OpenHashSet[_]].getName
+        val openHashSet = classOf[SQLOpenHashSet[_]].getName
         val classTag = s"scala.reflect.ClassTag$$.MODULE$$.$hsTypeName()"
         val hashSet = ctx.freshName("hashSet")
         val arrayBuilder = classOf[mutable.ArrayBuilder[_]].getName
         val arrayBuilderClass = s"$arrayBuilder$$of$ptName"
 
-        def withArrayNullAssignment(body: String) =
-          if (dataType.asInstanceOf[ArrayType].containsNull) {
-            s"""
-               |if ($array.isNullAt($i)) {
-               |  if (!$foundNullElement) {
-               |    $nullElementIndex = $size;
-               |    $foundNullElement = true;
-               |    $size++;
-               |    $builder.$$plus$$eq($nullValueHolder);
-               |  }
-               |} else {
-               |  $body
-               |}
-             """.stripMargin
-          } else {
-            body
-          }
-
-        val processArray = withArrayNullAssignment(
+        val processArray =
           s"""
              |$jt $value = ${genGetValue(array, i)};

Review comment:
       The value can be a primitive type, so we should make sure the element is not null before calling `SQLOpenHashSet.add/contains`. We should still follow the previous code style.
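
To illustrate the ordering being asked for, here is a rough sketch in plain Scala rather than generated Java. `ArrayUnionSketch` and `unionDistinct` are hypothetical names, boxed `java.lang.Double` elements stand in for nullable array slots, and a `mutable.HashSet` stands in for the hash set used by the expression. The null check comes first, then the NaN check, and only ordinary values reach the set.

```scala
import scala.collection.mutable

object ArrayUnionSketch {
  // Union of two nullable double arrays that keeps only the first null and the first NaN.
  def unionDistinct(
      a: Array[java.lang.Double],
      b: Array[java.lang.Double]): Vector[java.lang.Double] = {
    val seen = mutable.HashSet.empty[Double]
    var seenNull = false
    var seenNaN = false
    val out = Vector.newBuilder[java.lang.Double]
    for (arr <- Seq(a, b); boxed <- arr) {
      if (boxed == null) {
        // Null is handled before the value is ever unboxed.
        if (!seenNull) { seenNull = true; out += boxed }
      } else {
        val v: Double = boxed.doubleValue()
        if (v.isNaN) {
          // NaN is tracked by a flag instead of being added to the set.
          if (!seenNaN) { seenNaN = true; out += boxed }
        } else if (seen.add(v)) {
          // add returns true only when the value was not already present.
          out += boxed
        }
      }
    }
    out.result()
  }
}
```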






[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917027368


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47656/
   




[GitHub] [spark] cfmcgrady commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cfmcgrady commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r706325696



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
##########
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util
+
+import scala.reflect._
+
+import org.apache.spark.annotation.Private
+import org.apache.spark.util.collection.OpenHashSet
+
+/**
+ * A wrap of [[OpenHashSet]] that can handle null, Double.NaN and Float.NaN w.r.t. the SQL semantic.
+ */
+@Private
+class SQLOpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
+    initialCapacity: Int,
+    loadFactor: Double) {
+
+  def this(initialCapacity: Int) = this(initialCapacity, 0.7)
+
+  def this() = this(64)
+
+  private val hashSet = new OpenHashSet[T](initialCapacity, loadFactor)
+
+  private var containNull = false
+  private var containNaN = false
+
+  def addNull(): Unit = {
+    containNull = true
+  }
+
+  def addNaN(): Unit = {
+    containNaN = true
+  }
+
+  def add(k: T): Unit = {
+    hashSet.add(k)
+  }
+
+  def contains(k: T): Boolean = {
+    if (Double.NaN.equals(k)) {

Review comment:
       ditto.






[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917042176


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47656/
   




[GitHub] [spark] AngersZhuuuu commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-920100427


   @HyukjinKwon This commit passes the check: ac8bce83e7abb01fcea9e53a67a695e31aef7b6a https://github.com/apache/spark/pull/34006/commits




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918938930


   **[Test build #143235 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143235/testReport)** for PR 33955 at commit [`f59c0a8`](https://github.com/apache/spark/commit/f59c0a87c8792e6551f719f78b5049bfc5a2f917).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `public class NettyLogger `
     * `public final class AlwaysFalse extends Filter `
     * `public final class AlwaysTrue extends Filter `
     * `public final class And extends BinaryFilter `
     * `abstract class BinaryComparison extends Filter `
     * `abstract class BinaryFilter extends Filter `
     * `public final class EqualNullSafe extends BinaryComparison `
     * `public final class EqualTo extends BinaryComparison `
     * `public abstract class Filter implements Expression `
     * `public final class GreaterThan extends BinaryComparison `
     * `public final class GreaterThanOrEqual extends BinaryComparison `
     * `public final class In extends Filter `
     * `public final class IsNotNull extends Filter `
     * `public final class IsNull extends Filter `
     * `public final class LessThan extends BinaryComparison `
     * `public final class LessThanOrEqual extends BinaryComparison `
     * `public final class Not extends Filter `
     * `public final class Or extends BinaryFilter `
     * `public final class StringContains extends StringPredicate `
     * `public final class StringEndsWith extends StringPredicate `
     * `abstract class StringPredicate extends Filter `
     * `public final class StringStartsWith extends StringPredicate `
     * `case class OptimizeSkewedJoin(`
     * `case class SkewJoinAwareCost(`
     * `case class SimpleCostEvaluator(forceOptimizeSkewedJoin: Boolean) extends CostEvaluator `
     * `case class EnsureRequirements(`




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918942410


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143235/
   




[GitHub] [spark] AmplabJenkins commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918529528


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143212/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918332492


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47714/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917958250


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47695/
   




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917792054


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47685/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-917792072


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47685/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #33955:
URL: https://github.com/apache/spark/pull/33955#discussion_r707137974



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
##########
@@ -3679,22 +3686,44 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
             body
           }
 
+        def withNaNCheck(body: String): String = {
+          (elementType match {
+            case DoubleType => Some(s"java.lang.Double.isNaN((double)$value)")
+            case FloatType => Some(s"java.lang.Float.isNaN((float)$value)")
+            case _ => None
+          }).map { isNaN =>
+            s"""
+               |if ($isNaN) {
+               |  if (!$hashSet.containsNaN()) {
+               |     $size++;
+               |     $hashSet.addNaN();
+               |     $builder.$$plus$$eq($value);
+               |  }
+               |} else {
+               |  $body
+               |}
+             """.stripMargin
+          }
+        }.getOrElse(body)
+
         val processArray = withArrayNullAssignment(
           s"""
              |$jt $value = ${genGetValue(array, i)};

Review comment:
       Now that it's one line, we don't need to use the multi-line syntax here.
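
For illustration only, a small sketch of the two forms. `InterpolationSketch` and its parameter names are hypothetical, and the getter string is a placeholder for whatever `genGetValue` would produce. Once the template is a single statement, a plain `s"..."` string yields the same code without the margin characters and `stripMargin`.

```scala
object InterpolationSketch {
  // Multi-line form: worthwhile when the generated snippet spans several statements.
  def multiLine(jt: String, value: String, getter: String): String =
    s"""
       |$jt $value = $getter;
     """.stripMargin

  // Single-line form: the same statement with no margin handling.
  def singleLine(jt: String, value: String, getter: String): String =
    s"$jt $value = $getter;"
}
```

Apart from surrounding whitespace, both produce the same statement.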






[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918294963


   **[Test build #143213 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143213/testReport)** for PR 33955 at commit [`eb1f028`](https://github.com/apache/spark/commit/eb1f02819db9861604945a522dfbb85daa6cca43).




[GitHub] [spark] SparkQA commented on pull request #33955: [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #33955:
URL: https://github.com/apache/spark/pull/33955#issuecomment-918139222


   **[Test build #143193 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143193/testReport)** for PR 33955 at commit [`89a4263`](https://github.com/apache/spark/commit/89a426374c0873c48e738d96d0f46f99b6e39f6d).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

