You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "comphead (via GitHub)" <gi...@apache.org> on 2023/03/02 04:45:50 UTC

[GitHub] [arrow-datafusion] comphead opened a new pull request, #5446: Fix is_distinct from for float NaN values

comphead opened a new pull request, #5446:
URL: https://github.com/apache/arrow-datafusion/pull/5446

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #5392.
   
   # Rationale for this change
   
   To fix wrong behavior when comparing NaN. Details in #5392 
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are these changes tested?
   Yes
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   # Are there any user-facing changes?
   No
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on pull request #5446: Fix is_distinct from for float NaN values

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on PR #5446:
URL: https://github.com/apache/arrow-datafusion/pull/5446#issuecomment-1452579260

   > This looks good to me. Thank you @comphead -- I have some suggestions but I don't think they are required to merge this PR.
   > 
   > I wonder if it is time to start porting more of these kernels into arrow 🤔
   
   Right, I think https://github.com/apache/arrow-rs/issues/960 is good place to do it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #5446: Fix is_distinct from for float NaN values

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #5446:
URL: https://github.com/apache/arrow-datafusion/pull/5446


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on pull request #5446: Fix is_distinct from for float NaN values

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on PR #5446:
URL: https://github.com/apache/arrow-datafusion/pull/5446#issuecomment-1451293076

   cc @viirya 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #5446: Fix is_distinct from for float NaN values

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5446:
URL: https://github.com/apache/arrow-datafusion/pull/5446#issuecomment-1453457787

   Thanks again @comphead 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on a diff in pull request #5446: Fix is_distinct from for float NaN values

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #5446:
URL: https://github.com/apache/arrow-datafusion/pull/5446#discussion_r1123742386


##########
datafusion/core/tests/sqllogictests/test_files/select.slt:
##########
@@ -80,3 +80,29 @@ select '1' from foo order by column1;
 # foo distinct order by
 statement error DataFusion error: Error during planning: For SELECT DISTINCT, ORDER BY expressions column1 must appear in select list
 select distinct '1' from foo order by column1;
+
+# distincts for float nan

Review Comment:
   Ive duplicated in both `select.slt` and `pg_compat_simple.slt` Reason for that is PG doesn't support double datatype



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ursabot commented on pull request #5446: Fix is_distinct from for float NaN values

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #5446:
URL: https://github.com/apache/arrow-datafusion/pull/5446#issuecomment-1453467053

   Benchmark runs are scheduled for baseline = 61fc51446cb06bc6c8de69d50c9e5f79dede08fb and contender = 8de7ea45e96bfc1931ae51acb5a3abed7511dc7b. 8de7ea45e96bfc1931ae51acb5a3abed7511dc7b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/c6c58fc8d857432eb57daa23924e8220...df2b14278ae942d8b79f317b5e7b148a/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/1c7717b001544c08a515f2455601ddc9...a24cf90805ef44ac999105c994f4dbbc/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/1762566ec8854078bfcbe742c4f116df...ce8485960d3148688244b2009c76ab17/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/978138e916234650ab2f069662e60eff...945e1340df034dc287aca16d317c804c/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on pull request #5446: Fix is_distinct from for float NaN values

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on PR #5446:
URL: https://github.com/apache/arrow-datafusion/pull/5446#issuecomment-1451293498

   @alamb @tustvold please check this whenever you have time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on a diff in pull request #5446: Fix is_distinct from for float NaN values

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #5446:
URL: https://github.com/apache/arrow-datafusion/pull/5446#discussion_r1123740919


##########
datafusion/physical-expr/src/expressions/binary/kernels_arrow.rs:
##########
@@ -87,25 +78,120 @@ pub(crate) fn is_not_distinct_from<T>(
 where
     T: ArrowNumericType,
 {
-    let left_data = left.data();
-    let right_data = right.data();
-    let array_len = left_data.len().min(right_data.len());
+    distinct(
+        left,
+        right,
+        |left_value, right_value, left_isnull, right_isnull| {
+            !(left_isnull != right_isnull || left_value != right_value)
+        },
+    )
+}
 
+fn distinct<
+    T,
+    F: FnMut(
+        <T as ArrowPrimitiveType>::Native,
+        <T as ArrowPrimitiveType>::Native,
+        bool,
+        bool,
+    ) -> bool,
+>(
+    left: &PrimitiveArray<T>,
+    right: &PrimitiveArray<T>,
+    mut op: F,
+) -> Result<BooleanArray>
+where
+    T: ArrowNumericType,
+{
     let left_values = left.values();
     let right_values = right.values();
+    let left_data = left.data();
+    let right_data = right.data();
 
+    let array_len = left_data.len().min(right_data.len());
     let distinct = arrow_buffer::MutableBuffer::collect_bool(array_len, |i| {
-        !(left_data.is_null(i) != right_data.is_null(i)
-            || left_values[i] != right_values[i])
+        op(
+            left_values[i],
+            right_values[i],
+            left_data.is_null(i),
+            right_data.is_null(i),
+        )
     });
-
     let array_data = ArrayData::builder(arrow_schema::DataType::Boolean)
         .len(array_len)
         .add_buffer(distinct.into());
 
     Ok(BooleanArray::from(unsafe { array_data.build_unchecked() }))
 }
 
+pub(crate) fn is_distinct_from_f32(
+    left: &Float32Array,
+    right: &Float32Array,
+) -> Result<BooleanArray> {
+    distinct(
+        left,
+        right,
+        |left_value, right_value, left_isnull, right_isnull| {
+            left_isnull != right_isnull
+                || left_value.is_nan() != right_value.is_nan()
+                || (!left_value.is_nan()
+                    && !right_value.is_nan()
+                    && left_value != right_value)
+        },
+    )
+}
+
+pub(crate) fn is_not_distinct_from_f32(
+    left: &Float32Array,
+    right: &Float32Array,
+) -> Result<BooleanArray> {
+    distinct(
+        left,
+        right,
+        |left_value, right_value, left_isnull, right_isnull| {
+            !(left_isnull != right_isnull
+                || left_value.is_nan() != right_value.is_nan()
+                || (!left_value.is_nan()
+                    && !right_value.is_nan()
+                    && left_value != right_value))
+        },

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5446: Fix is_distinct from for float NaN values

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5446:
URL: https://github.com/apache/arrow-datafusion/pull/5446#discussion_r1123643959


##########
datafusion/physical-expr/src/expressions/binary/kernels_arrow.rs:
##########
@@ -87,25 +78,120 @@ pub(crate) fn is_not_distinct_from<T>(
 where
     T: ArrowNumericType,
 {
-    let left_data = left.data();
-    let right_data = right.data();
-    let array_len = left_data.len().min(right_data.len());
+    distinct(
+        left,
+        right,
+        |left_value, right_value, left_isnull, right_isnull| {
+            !(left_isnull != right_isnull || left_value != right_value)
+        },
+    )
+}
 
+fn distinct<
+    T,
+    F: FnMut(
+        <T as ArrowPrimitiveType>::Native,
+        <T as ArrowPrimitiveType>::Native,
+        bool,
+        bool,
+    ) -> bool,
+>(
+    left: &PrimitiveArray<T>,
+    right: &PrimitiveArray<T>,
+    mut op: F,
+) -> Result<BooleanArray>
+where
+    T: ArrowNumericType,
+{
     let left_values = left.values();
     let right_values = right.values();
+    let left_data = left.data();
+    let right_data = right.data();
 
+    let array_len = left_data.len().min(right_data.len());
     let distinct = arrow_buffer::MutableBuffer::collect_bool(array_len, |i| {
-        !(left_data.is_null(i) != right_data.is_null(i)
-            || left_values[i] != right_values[i])
+        op(
+            left_values[i],
+            right_values[i],
+            left_data.is_null(i),
+            right_data.is_null(i),
+        )
     });
-
     let array_data = ArrayData::builder(arrow_schema::DataType::Boolean)
         .len(array_len)
         .add_buffer(distinct.into());
 
     Ok(BooleanArray::from(unsafe { array_data.build_unchecked() }))
 }
 
+pub(crate) fn is_distinct_from_f32(
+    left: &Float32Array,
+    right: &Float32Array,
+) -> Result<BooleanArray> {
+    distinct(
+        left,
+        right,
+        |left_value, right_value, left_isnull, right_isnull| {
+            left_isnull != right_isnull
+                || left_value.is_nan() != right_value.is_nan()
+                || (!left_value.is_nan()
+                    && !right_value.is_nan()
+                    && left_value != right_value)
+        },
+    )
+}
+
+pub(crate) fn is_not_distinct_from_f32(
+    left: &Float32Array,
+    right: &Float32Array,
+) -> Result<BooleanArray> {
+    distinct(
+        left,
+        right,
+        |left_value, right_value, left_isnull, right_isnull| {
+            !(left_isnull != right_isnull
+                || left_value.is_nan() != right_value.is_nan()
+                || (!left_value.is_nan()
+                    && !right_value.is_nan()
+                    && left_value != right_value))
+        },

Review Comment:
   It took me quite some time to see that the only difference here is that there is a `!` in front of the entire expression. Would it be possible to factor out the duplication so the difference between the kernels are clearer?



##########
datafusion/core/tests/sqllogictests/test_files/select.slt:
##########
@@ -80,3 +80,29 @@ select '1' from foo order by column1;
 # foo distinct order by
 statement error DataFusion error: Error during planning: For SELECT DISTINCT, ORDER BY expressions column1 must appear in select list
 select distinct '1' from foo order by column1;
+
+# distincts for float nan

Review Comment:
   This might be a good test to add to https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/tests/sqllogictests/test_files/pg_compat so they are automatically compared with postgres as part of CI (kudos to @melgenek  for help there)
   
   https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/sqllogictests/test_files/pg_compat/pg_compat_simple.slt perhaps?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5446: Fix is_distinct from for float NaN values

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5446:
URL: https://github.com/apache/arrow-datafusion/pull/5446#discussion_r1124385056


##########
datafusion/physical-expr/src/expressions/binary/kernels_arrow.rs:
##########
@@ -32,6 +32,14 @@ use std::sync::Arc;
 // Simple (low performance) kernels until optimized kernels are added to arrow
 // See https://github.com/apache/arrow-rs/issues/960
 
+macro_rules! distinct_float {

Review Comment:
   thank you -- this is great



##########
datafusion/core/tests/sqllogictests/test_files/pg_compat/pg_compat_simple.slt:
##########
@@ -572,6 +572,24 @@ e 4 97 -13181 2047637360 6176835796788944083 158 53000 2042457019 97260165026400
 e 5 -86 32514 -467659022 -8012578250188146150 254 2684 2861911482 2126626171973341689 0.12559289 0.014793053078 gxfHWUF8XgY2KdFxigxvNEXe2V2XMl
 e 5 64 -26526 1689098844 8950618259486183091 224 45253 662099130 16127995415060805595 0.2897315 0.575945048386 56MZa5O1hVtX4c5sbnCfxuX5kDChqI
 
+# distinct_from logic for floats
+query BBBBBBBBBBB
+select 
+    'nan'::float is distinct from 'nan'::float v7,
+    'nan'::float is not distinct from 'nan'::float v8,
+    'nan'::float is not distinct from null v9,
+    'nan'::float is distinct from null v10,
+    null is distinct from 'nan'::float v11,
+    null is not distinct from 'nan'::float v12,
+    1::float is distinct from 2::float v13,
+    'nan'::float is distinct from 1::float v14,
+    'nan'::float is not distinct from 1::float v15,
+    1::float is not distinct from null v16,
+    1::float is distinct from null v17
+;
+----
+false true false true true false true true false false true

Review Comment:
   ❤️ 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org