You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "nchammas (via GitHub)" <gi...@apache.org> on 2024/02/06 02:56:42 UTC

[PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

nchammas opened a new pull request, #45036:
URL: https://github.com/apache/spark/pull/45036

   ### What changes were proposed in this pull request?
   
   Change `OpenHashSet` to use object equality instead of cooperative equality when looking up keys.
   
   ### Why are the changes needed?
   
   In certain cases where a) both 0.0 and -0.0 are provided as keys to the set and b) they happen to hash to the same bucket, one of the values will be dropped because the lookup indicates the value is already in the set. This leads to the bug described in SPARK-45599 and summarized in [this comment][1].
   
   [1]: https://issues.apache.org/jira/browse/SPARK-45599?focusedCommentId=17806954&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17806954
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it resolves the bug described in SPARK-45599, which affects `percentile()` and perhaps other functions that rely on `OpenHashSet` under the hood.
   
   Shifting from `==` to `equals` also changes how `NaN` is stored in the set, though the user impact of that should only be to save some memory since `NaN` will now only get one entry in the set.
   
   ### How was this patch tested?
   
   New and existing unit tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on PR #45036:
URL: https://github.com/apache/spark/pull/45036#issuecomment-1932381997

   > I can hold cutting the tag of RC1 for this.
   
   Thanks @HeartSaVioR. BTW I don't think this should be a blocker of 3.5.1 as this is not a regression, but if we can find a quick fix for this issue it would be good to include it in the release.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1482085088


##########
core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala:
##########
@@ -269,4 +269,35 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers {
       assert(pos1 == pos2)
     }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {

Review Comment:
   > This is a bit tricky and it's better if we can find a reference system that defines this semantic.
   
   ```scala
   scala> import java.util.HashSet
   import java.util.HashSet
   
   scala> val h = new HashSet[Double]()
   val h: java.util.HashSet[Double] = []
   
   scala> h.add(0.0)
   val res0: Boolean = true
   
   scala> h.add(-0.0)
   val res1: Boolean = true
   
   scala> h.size()
   val res2: Int = 2
   ```
   
   The doc for [HashSet.add][1] states:
   
   > More formally, adds the specified element e to this set if this set contains no element e2 such that Objects.equals(e, e2). If this set already contains the element, the call leaves the set unchanged and returns false.
   
   In other words, `java.util.HashSet` uses `equals` and not `==`, and therefore it considers `0.0` and `-0.0` distinct elements.
   
   So this PR brings `OpenHashSet` more in line with the semantics of `java.util.HashSet`.
   
   [1]: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/HashSet.html#add(E)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1490967547


##########
core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:
##########
@@ -110,6 +110,18 @@ class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
     this
   }
 
+  /**
+   * Check if a key exists at the provided position using object equality rather than
+   * cooperative equality. Otherwise, hash sets will mishandle values for which `==`
+   * and `equals` return different results, like 0.0/-0.0 and NaN/NaN.

Review Comment:
   I was told that in scala `==` is the same as `equals`, but `eq` is a different operator. I need to refresh my knowledge now :)



##########
core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:
##########
@@ -110,6 +110,18 @@ class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
     this
   }
 
+  /**
+   * Check if a key exists at the provided position using object equality rather than
+   * cooperative equality. Otherwise, hash sets will mishandle values for which `==`
+   * and `equals` return different results, like 0.0/-0.0 and NaN/NaN.
+   *
+   * See: https://issues.apache.org/jira/browse/SPARK-45599
+   */
+  @annotation.nowarn("cat=other-non-cooperative-equals")
+  private def keyExistsAtPos(k: T, pos: Int) =
+    // _data(pos) == k

Review Comment:
   we should remove it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on PR #45036:
URL: https://github.com/apache/spark/pull/45036#issuecomment-1947835261

   If we go this direction and change `OpenHashSet` do we still need `SQLOpenHashSet`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "revans2 (via GitHub)" <gi...@apache.org>.
revans2 commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1482265980


##########
core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala:
##########
@@ -269,4 +269,42 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers {
       assert(pos1 == pos2)
     }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {
+    // Therefore, 0.0 and -0.0 should get separate entries in the hash set.
+    //
+    // Exactly these elements provided in roughly this order will trigger the following scenario:
+    // When probing the bitset in `getPos(-0.0)`, the loop will happen upon the entry for 0.0.
+    // In the old logic pre-SPARK-45599, the loop will find that the bit is set and, because
+    // -0.0 == 0.0, it will think that's the position of -0.0. But in reality this is the position
+    // of 0.0. So -0.0 and 0.0 will be stored at different positions, but `getPos()` will return
+    // the same position for them. This can cause users of OpenHashSet, like OpenHashMap, to
+    // return the wrong value for a key based on whether or not this bitset lookup collision
+    // happens.

Review Comment:
   I found it because I was essentially running fuzz testing comparing https://github.com/NVIDIA/spark-rapids to the gold standard Apache Spark. Most failures that we find end up being issues with the RAPIDs Accelerator for Apache Spark, but every so often I find something that ends up being wrong with Spark, like in this case.  So yes it was just pure luck that we found it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1482324222


##########
core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala:
##########
@@ -269,4 +269,35 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers {
       assert(pos1 == pos2)
     }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {

Review Comment:
   Spark's `OpenHashSet` does not have to match `java.util.HashSet`. What matters is the SQL semantic. Can you highlight which functions/operators are using this `OpenHashSet` and what is the impact of this change to the SQL semantic?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1491975003


##########
core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala:
##########
@@ -249,4 +249,34 @@ class OpenHashMapSuite extends SparkFunSuite with Matchers {
     map(null) = null
     assert(map.get(null) === Some(null))
   }
+
+  test("SPARK-45599: 0.0 and -0.0 should count distinctly") {
+    // Exactly these elements provided in roughly this order trigger a condition where lookups of
+    // 0.0 and -0.0 in the bitset happen to collide, causing their counts to be merged incorrectly
+    // and inconsistently if `==` is used to check for key equality.

Review Comment:
   I tweaked the test name. Is that what you had in mind?
   
   This comment explains why we need exactly the following elements to trigger the 0.0/-0.0 miscount. It doesn't always happen (which is part of what kept this bug hidden for so long).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1482123330


##########
core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala:
##########
@@ -269,4 +269,35 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers {
       assert(pos1 == pos2)
     }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {

Review Comment:
   Consider another interesting case where `java.util.HashSet` and `OpenHashSet` differ:
   
   ```scala
   scala> val h = new HashSet[Double]()
   val h: java.util.HashSet[Double] = []
   
   scala> h.add(Double.NaN)
   val res9: Boolean = true
   
   scala> h.add(Double.NaN)
   val res10: Boolean = false
   
   scala> h.size()
   val res11: Int = 1
   ```
   
   On `master`, `OpenHashSet` does IMO the wrong thing:
   
   ```scala
   val set = new OpenHashSet[Double]()
   set.add(Double.NaN)
   set.add(Double.NaN)
   set.size  // returns 2
   ```
   
   This could possibly lead to a bug like the one reported in SPARK-45599 but in reverse, where a new NaN row is added rather than dropped. I will see if I can construct such a scenario as a demonstration. But regardless, I think this behavior is incorrect by itself.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1480904118


##########
core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala:
##########
@@ -269,4 +269,35 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers {
       assert(pos1 == pos2)
     }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {

Review Comment:
   This is a bit tricky and it's better if we can find a reference system that defines this semantic. In Spark, `0.0 == -0.0`, and in GROUP BY, 0.0 and -0.0 are considered to be in the same group and normalized to 0.0.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1491931994


##########
core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala:
##########
@@ -249,4 +249,34 @@ class OpenHashMapSuite extends SparkFunSuite with Matchers {
     map(null) = null
     assert(map.get(null) === Some(null))
   }
+
+  test("SPARK-45599: 0.0 and -0.0 should count distinctly") {
+    // Exactly these elements provided in roughly this order trigger a condition where lookups of
+    // 0.0 and -0.0 in the bitset happen to collide, causing their counts to be merged incorrectly
+    // and inconsistently if `==` is used to check for key equality.

Review Comment:
   shall we mention the NaN behavior as well? All NaN values are all the same.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1490973900


##########
core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala:
##########
@@ -269,4 +269,43 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers {
       assert(pos1 == pos2)
     }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {
+    // Therefore, 0.0 and -0.0 should get separate entries in the hash set.
+    //
+    // Exactly these elements provided in roughly this order will trigger the following scenario:
+    // When probing the bitset in `getPos(-0.0)`, the loop will happen upon the entry for 0.0.
+    // In the old logic pre-SPARK-45599, the loop will find that the bit is set and, because
+    // -0.0 == 0.0, it will think that's the position of -0.0. But in reality this is the position
+    // of 0.0. So -0.0 and 0.0 will be stored at different positions, but `getPos()` will return
+    // the same position for them. This can cause users of OpenHashSet, like OpenHashMap, to
+    // return the wrong value for a key based on whether or not this bitset lookup collision
+    // happens.
+    val spark45599Repro = Seq(
+      Double.NaN,
+      2.0,
+      168.0,
+      Double.NaN,
+      Double.NaN,
+      -0.0,
+      153.0,
+      0.0
+    )
+    val set = new OpenHashSet[Double]()
+    spark45599Repro.foreach(set.add)
+    assert(set.size == 6)
+    val zeroPos = set.getPos(0.0)
+    val negZeroPos = set.getPos(-0.0)
+    assert(zeroPos != negZeroPos)
+  }
+
+  test("SPARK-45599: NaN and NaN are the same but not equal") {
+    // Any mathematical comparison to NaN will return false, but when we place it in
+    // a hash set we want the lookup to work like a "normal" value.
+    val set = new OpenHashSet[Double]()
+    set.add(Double.NaN)
+    set.add(Double.NaN)
+    assert(set.contains(Double.NaN))
+    assert(set.size == 1)

Review Comment:
   do we actually waste space in `OpenHashSet` to store all the NaN values?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1491176098


##########
core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala:
##########
@@ -249,4 +249,32 @@ class OpenHashMapSuite extends SparkFunSuite with Matchers {
     map(null) = null
     assert(map.get(null) === Some(null))
   }
+
+  test("SPARK-45599: 0.0 and -0.0 should count distinctly") {
+    // Exactly these elements provided in roughly this order trigger a condition where lookups of
+    // 0.0 and -0.0 in the bitset happen to collide, causing their counts to be merged incorrectly
+    // and inconsistently if `==` is used to check for key equality.
+    val spark45599Repro = Seq(
+      Double.NaN,
+      2.0,
+      168.0,
+      Double.NaN,
+      Double.NaN,
+      -0.0,
+      153.0,
+      0.0
+    )
+
+    val map1 = new OpenHashMap[Double, Int]()
+    spark45599Repro.foreach(map1.changeValue(_, 1, {_ + 1}))
+    assert(map1(0.0) == 1)
+    assert(map1(-0.0) == 1)
+
+    val map2 = new OpenHashMap[Double, Int]()
+    // Simply changing the order in which the elements are added to the map should not change the
+    // counts for 0.0 and -0.0.
+    spark45599Repro.reverse.foreach(map2.changeValue(_, 1, {_ + 1}))
+    assert(map2(0.0) == 1)
+    assert(map2(-0.0) == 1)

Review Comment:
   Added just below.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on PR #45036:
URL: https://github.com/apache/spark/pull/45036#issuecomment-1967915028

   Backport to 3.5: #45296


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45036:
URL: https://github.com/apache/spark/pull/45036#issuecomment-1967032385

   Hi, @cloud-fan and @nchammas .
   
   This broke Apache Spark branch-3.5.
   
   ![Screenshot 2024-02-27 at 08 38 03](https://github.com/apache/spark/assets/9700541/55560e72-aa6a-4aeb-8f23-66cb6ffd327c)
   
   ```
   [error] /home/runner/work/spark/spark/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:136:4: Invalid message filter:
   [error] Unknown category: `other-non-cooperative-equals`
   [error]   @annotation.nowarn("cat=other-non-cooperative-equals")
   [error]    ^
   [error] one error found
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1482089432


##########
core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala:
##########
@@ -269,4 +269,35 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers {
       assert(pos1 == pos2)
     }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {

Review Comment:
   >  In Spark, 0.0 == -0.0, and in GROUP BY, 0.0 and -0.0 are considered to be in the same group and normalized to 0.0.
   
   This PR does not change this behavior. I noticed, however, that we do not have any tests currently to check that -0.0 is normalized and grouped as you describe, so I went ahead and added such a test in 2bfc60548db2c41f4c64b63d40a2652cb22732ab.
   
   Does this address your concern? Or are you suggesting that we should normalize -0.0 to 0.0 across the board?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on PR #45036:
URL: https://github.com/apache/spark/pull/45036#issuecomment-1947785597

   @revans2 - Just to confirm, have you rerun your tests against this branch (the ones that exposed this bug) and do they pass now?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45036:
URL: https://github.com/apache/spark/pull/45036#issuecomment-1967045615

   Sorry but I reverted this from `branch-3.5` to unblock the community PRs.
   - https://github.com/apache/spark/commit/cbf25fb633f4bf2f83a6f6e39aafaa80bf47e160
   
   Please make a backporting PR to branch-3.5 again after fixing the above annotation compilation. Thank you in advance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #45036:
URL: https://github.com/apache/spark/pull/45036#issuecomment-1963523699

   thanks, merging to master/3.5!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on PR #45036:
URL: https://github.com/apache/spark/pull/45036#issuecomment-1928693126

   `core/testOnly org.apache.spark.util.collection.*` passes on my machine, but I haven't run the full test suite yet. I will monitor the full test results here on GitHub.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1481726694


##########
core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala:
##########
@@ -269,4 +269,35 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers {
       assert(pos1 == pos2)
     }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {

Review Comment:
   Yes, probably make sense if we just fix this particular issue as `OpenHashSet` is used at many other places.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on PR #45036:
URL: https://github.com/apache/spark/pull/45036#issuecomment-1930043679

   Build looks good.
   
   @revans2 and @peter-toth - What do you think? (Tagging you since I noticed you are watching the linked JIRA ticket.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan closed pull request #45036: [SPARK-45599][CORE] Use object equality in OpenHashSet
URL: https://github.com/apache/spark/pull/45036


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on PR #45036:
URL: https://github.com/apache/spark/pull/45036#issuecomment-1967413553

   @dongjoon-hyun - I think this warning category was added in Scala 2.13 and then backported to Scala 2.12. Trying to confirm right now over on https://github.com/scala/scala/pull/8120 and https://github.com/scala/scala/pull/10446.
   
   I see that Spark 3.5 supports both Scala 2.12 and 2.13, so I will figure out some method that is compatible across both versions and open a PR against `branch-3.5`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1491125038


##########
core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala:
##########
@@ -269,4 +269,43 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers {
       assert(pos1 == pos2)
     }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {
+    // Therefore, 0.0 and -0.0 should get separate entries in the hash set.
+    //
+    // Exactly these elements provided in roughly this order will trigger the following scenario:
+    // When probing the bitset in `getPos(-0.0)`, the loop will happen upon the entry for 0.0.
+    // In the old logic pre-SPARK-45599, the loop will find that the bit is set and, because
+    // -0.0 == 0.0, it will think that's the position of -0.0. But in reality this is the position
+    // of 0.0. So -0.0 and 0.0 will be stored at different positions, but `getPos()` will return
+    // the same position for them. This can cause users of OpenHashSet, like OpenHashMap, to
+    // return the wrong value for a key based on whether or not this bitset lookup collision
+    // happens.
+    val spark45599Repro = Seq(
+      Double.NaN,
+      2.0,
+      168.0,
+      Double.NaN,
+      Double.NaN,
+      -0.0,
+      153.0,
+      0.0
+    )
+    val set = new OpenHashSet[Double]()
+    spark45599Repro.foreach(set.add)
+    assert(set.size == 6)
+    val zeroPos = set.getPos(0.0)
+    val negZeroPos = set.getPos(-0.0)
+    assert(zeroPos != negZeroPos)
+  }
+
+  test("SPARK-45599: NaN and NaN are the same but not equal") {
+    // Any mathematical comparison to NaN will return false, but when we place it in
+    // a hash set we want the lookup to work like a "normal" value.
+    val set = new OpenHashSet[Double]()
+    set.add(Double.NaN)
+    set.add(Double.NaN)
+    assert(set.contains(Double.NaN))
+    assert(set.size == 1)

Review Comment:
   Yes sir. On `master`, this is the actual behavior of `OpenHashSet`:
   
   ```scala
   // ...OpenHashSet$mcD$sp@21b327e6 did not contain NaN
   assert(set.contains(Double.NaN))
   
   // ...OpenHashSet$mcD$sp@1f09db1e had size 2 instead of expected size 1
   assert(set.size == 1)
   ```
   
   Every NaN will get its own entry in `OpenHashSet` on `master`. So if we add 1,000,000 NaNs to the set, NaN will have 1,000,000 entries in there. And `.contains()` will _still_ return `false`. :D



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on PR #45036:
URL: https://github.com/apache/spark/pull/45036#issuecomment-1947853367

   > `SQLOpenHashSet` also handles null differently. Not sure if `OpenHashSet` already covers it.
   
   Yes, you are right. Probably the `NaN` special handling can be remvoved though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #45036:
URL: https://github.com/apache/spark/pull/45036#issuecomment-1947841579

   `SQLOpenHashSet` also handles null differently. Not sure if `OpenHashSet` already covers it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1486895251


##########
core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala:
##########
@@ -269,4 +269,35 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers {
       assert(pos1 == pos2)
     }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {

Review Comment:
   > Spark's `OpenHashSet` does not have to match `java.util.HashSet`. What matters is the SQL semantic.
   
   Whether or not `OpenHashSet` matches `java.util.HashSet`, I want to emphasize for the record that `OpenHashSet` mishandles 0.0/-0.0 and NaN. Its behavior is simply _incorrect_. [These][test1] [tests][test2] fail on `master` in ways that can only be described as bugs, regardless of whatever SQL semantics we want to preserve.
   
   [This comment][why] explains the root cause. Basically, it is a mistake to combine hash code-based lookups with cooperative equality, at least in the way we are doing it in `OpenHashSet`.
   
   [test1]: https://github.com/apache/spark/pull/45036/files#diff-09400ec633b1f1322c5f7b39dc4e87073b0b0435b60b9cff93388053be5083b6R274-R278
   [test2]: https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R308-R309
   [why]: https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R277-R283
   
   But I understand what you are saying. Fixing bugs in `OpenHashSet` doesn't help us if it also breaks users' SQL.
   
   > Can you highlight which functions/operators are using this `OpenHashSet` 
   
   I've updated the PR description with a summary of what uses `OpenHashSet`.
   
   As a side note, I believe that if we accept the change proposed here, we should be able to eliminate `SQLOpenHashSet`. `SQLOpenHashSet` was created specifically to work around the bugs in `OpenHashSet` that we are addressing in this PR. See #33955 and #33993.
   
   > and what is the impact of this change to the SQL semantic?
   
   I've updated the PR description with a diff of what tests pass or fail on `master` vs. this branch. Please take a look and let me know if you think we need any more tests. I know we are touching a sensitive code path and I appreciate the need for caution.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1482237064


##########
core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala:
##########
@@ -249,4 +249,34 @@ class OpenHashMapSuite extends SparkFunSuite with Matchers {
     map(null) = null
     assert(map.get(null) === Some(null))
   }
+
+  test("SPARK-45599: 0.0 and -0.0 should count distinctly") {
+    // Exactly these elements provided in roughly this order trigger a condition where lookups of
+    // 0.0 and -0.0 in the bitset happen to collide, causing their counts to be merged incorrectly
+    // and inconsistently if `==` is used to check for key equality.
+    val spark45599Repro = Seq(
+      Double.NaN,
+      2.0,
+      168.0,
+      Double.NaN,
+      Double.NaN,
+      -0.0,
+      153.0,
+      0.0
+    )
+
+    val map1 = new OpenHashMap[Double, Int]()
+    spark45599Repro.foreach(map1.changeValue(_, 1, {_ + 1}))
+    map1.iterator.foreach(println)
+    assert(map1(0.0) == 1)
+    assert(map1(-0.0) == 1)
+
+    val map2 = new OpenHashMap[Double, Int]()
+    // Simply changing the order in which the elements are added to the map should not change the
+    // counts for 0.0 and -0.0.
+    spark45599Repro.reverse.foreach(map2.changeValue(_, 1, {_ + 1}))
+    map2.iterator.foreach(println)
+    assert(map2(0.0) == 1)
+    assert(map2(-0.0) == 1)

Review Comment:
   This is another expression of the same bug that this PR addresses. If you run this test on `master`, you will see that the counts for 0.0 and -0.0 depend on the order in which the elements from `spark45599Repro` are added to the map.



##########
core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala:
##########
@@ -269,4 +269,42 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers {
       assert(pos1 == pos2)
     }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {
+    // Therefore, 0.0 and -0.0 should get separate entries in the hash set.
+    //
+    // Exactly these elements provided in roughly this order will trigger the following scenario:
+    // When probing the bitset in `getPos(-0.0)`, the loop will happen upon the entry for 0.0.
+    // In the old logic pre-SPARK-45599, the loop will find that the bit is set and, because
+    // -0.0 == 0.0, it will think that's the position of -0.0. But in reality this is the position
+    // of 0.0. So -0.0 and 0.0 will be stored at different positions, but `getPos()` will return
+    // the same position for them. This can cause users of OpenHashSet, like OpenHashMap, to
+    // return the wrong value for a key based on whether or not this bitset lookup collision
+    // happens.

Review Comment:
   It really is amazing that @revans2 found this bug, because it depends on the set being a specific size and the 0.0 and -0.0 being inserted and then looked up in just the right order so that they happen to collide in the bitset.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on PR #45036:
URL: https://github.com/apache/spark/pull/45036#issuecomment-1930364725

   Yes, I was watching this ticket because it is a correctness issue.
   The fix looks good to me.
   
   cc @cloud-fan and @HeartSaVioR as it would be great to include the fix in 3.5.1.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1482125925


##########
core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala:
##########
@@ -269,4 +269,35 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers {
       assert(pos1 == pos2)
     }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {

Review Comment:
   Note also that the docstring for `OpenHashSet` seems to imply that it is meant to be a faster but semantically equivalent alternative to `java.util.HashSet`:
   
   https://github.com/apache/spark/blob/0d5c2ce427e06adfebf47bc446ff6646513ae750/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala#L31-L32
   
   If that's true, then we should perhaps add property based tests to ensure alignment between the two implementations, but I'll leave that as a potential future improvement.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "HeartSaVioR (via GitHub)" <gi...@apache.org>.
HeartSaVioR commented on PR #45036:
URL: https://github.com/apache/spark/pull/45036#issuecomment-1931270883

   I have no expertise on this area so will leave the decision on merging the change to which version(s), and whether we want to safeguard or not, to introduce an escape-hatch on behavioral change. I can hold cutting the tag of RC1 for this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1482123330


##########
core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala:
##########
@@ -269,4 +269,35 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers {
       assert(pos1 == pos2)
     }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {

Review Comment:
   Consider another interesting case where `java.util.HashSet` and `OpenHashSet` differ:
   
   ```scala
   scala> val h = new HashSet[Double]()
   val h: java.util.HashSet[Double] = []
   
   scala> h.add(Double.NaN)
   val res9: Boolean = true
   
   scala> h.add(Double.NaN)
   val res10: Boolean = false
   
   scala> h.contains(Double.NaN)
   val res11: Boolean = true
   
   scala> h.size()
   val res12: Int = 1
   ```
   
   On `master`, `OpenHashSet` does something obviously wrong:
   
   ```scala
   val set = new OpenHashSet[Double]()
   set.add(Double.NaN)
   set.add(Double.NaN)
   set.size  // returns 2
   set.contains(Double.NaN)  // returns false
   ```
   
   This could possibly lead to a bug like the one reported in SPARK-45599 but in reverse, where a new NaN row is added rather than dropped. I will see if I can construct such a scenario as a demonstration. But regardless, I think this behavior is incorrect by itself.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1490972974


##########
core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala:
##########
@@ -249,4 +249,32 @@ class OpenHashMapSuite extends SparkFunSuite with Matchers {
     map(null) = null
     assert(map.get(null) === Some(null))
   }
+
+  test("SPARK-45599: 0.0 and -0.0 should count distinctly") {
+    // Exactly these elements provided in roughly this order trigger a condition where lookups of
+    // 0.0 and -0.0 in the bitset happen to collide, causing their counts to be merged incorrectly
+    // and inconsistently if `==` is used to check for key equality.
+    val spark45599Repro = Seq(
+      Double.NaN,
+      2.0,
+      168.0,
+      Double.NaN,
+      Double.NaN,
+      -0.0,
+      153.0,
+      0.0
+    )
+
+    val map1 = new OpenHashMap[Double, Int]()
+    spark45599Repro.foreach(map1.changeValue(_, 1, {_ + 1}))
+    assert(map1(0.0) == 1)
+    assert(map1(-0.0) == 1)
+
+    val map2 = new OpenHashMap[Double, Int]()
+    // Simply changing the order in which the elements are added to the map should not change the
+    // counts for 0.0 and -0.0.
+    spark45599Repro.reverse.foreach(map2.changeValue(_, 1, {_ + 1}))
+    assert(map2(0.0) == 1)
+    assert(map2(-0.0) == 1)

Review Comment:
   shall we test NaN  as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1491158277


##########
core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:
##########
@@ -110,6 +110,18 @@ class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
     this
   }
 
+  /**
+   * Check if a key exists at the provided position using object equality rather than
+   * cooperative equality. Otherwise, hash sets will mishandle values for which `==`
+   * and `equals` return different results, like 0.0/-0.0 and NaN/NaN.

Review Comment:
   Yes, the differences are subtle:
   
   ```scala
   scala> 0.0 == -0.0
   val res0: Boolean = true
   
   scala> 0.0 equals -0.0
   val res1: Boolean = false
   
   scala> Double.NaN == Double.NaN
   val res2: Boolean = false
   
   scala> Double.NaN equals Double.NaN
   val res3: Boolean = true
   ```
   
   There is a long discussion on the Scala forums from 2017 about this difference and some of the problems it causes:
   
   [Can we get rid of cooperative equality?](https://contributors.scala-lang.org/t/can-we-get-rid-of-cooperative-equality/1131)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on PR #45036:
URL: https://github.com/apache/spark/pull/45036#issuecomment-1959816906

   Is there anything more we're waiting on here? (Perhaps just an extra confirmation from @revans2 that the original [fuzz test he ran][1] now passes?)
   
   [1]: https://github.com/apache/spark/pull/45036#discussion_r1482265980


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org