Posted to issues@flink.apache.org by "Caizhi Weng (Jira)" <ji...@apache.org> on 2022/05/16 03:44:00 UTC
[jira] [Created] (FLINK-27627) Incorrect result when order by (string, double) pair with NaN values

Caizhi Weng created FLINK-27627:
-----------------------------------

             Summary: Incorrect result when order by (string, double) pair with NaN values
                 Key: FLINK-27627
                 URL: https://issues.apache.org/jira/browse/FLINK-27627
             Project: Flink
          Issue Type: Bug
          Components: Table SQL / Runtime
    Affects Versions: 1.15.0
            Reporter: Caizhi Weng


Use the following test data and SQL to reproduce this issue.

gao.csv:
{code}
1.0,2.0,aaaaaaaaaaaaaaa
0.0,0.0,aaaaaaaaaaaaaaa
1.0,1.0,aaaaaaaaaaaaaaa
0.0,0.0,aaaaaaaaaaaaaaa
1.0,0.0,aaaaaaaaaaaaaaa
0.0,0.0,aaaaaaaaaaaaaaa
-1.0,0.0,aaaaaaaaaaaaaaa
1.0,-1.0,aaaaaaaaaaaaaaa
1.0,-2.0,aaaaaaaaaaaaaaa
{code}

Flink SQL:
{code}
Flink SQL> create table T ( a double, b double, c string ) WITH ( 'connector' = 'filesystem', 'path' = '/tmp/gao.csv', 'format' = 'csv' );
[INFO] Execute statement succeed.

Flink SQL> create table S ( a string, b double ) WITH ( 'connector' = 'filesystem', 'path' = '/tmp/gao2.csv', 'format' = 'csv' );
[INFO] Execute statement succeed.

Flink SQL> insert into S select c, a / b from T;
[INFO] Submitting SQL update statement to the cluster...
[INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: 8c98f5bb99c2dcd28f13def916e2178a


Flink SQL> select * from S order by a, b;
+-----------------+-----------+
|               a |         b |
+-----------------+-----------+
| aaaaaaaaaaaaaaa |       0.5 |
| aaaaaaaaaaaaaaa |       NaN |
| aaaaaaaaaaaaaaa |       1.0 |
| aaaaaaaaaaaaaaa |       NaN |
| aaaaaaaaaaaaaaa |  Infinity |
| aaaaaaaaaaaaaaa |       NaN |
| aaaaaaaaaaaaaaa | -Infinity |
| aaaaaaaaaaaaaaa |      -1.0 |
| aaaaaaaaaaaaaaa |      -0.5 |
+-----------------+-----------+
9 rows in set

Flink SQL> select * from S order by b;
+-----------------+-----------+
|               a |         b |
+-----------------+-----------+
| aaaaaaaaaaaaaaa | -Infinity |
| aaaaaaaaaaaaaaa |      -1.0 |
| aaaaaaaaaaaaaaa |      -0.5 |
| aaaaaaaaaaaaaaa |       0.5 |
| aaaaaaaaaaaaaaa |       1.0 |
| aaaaaaaaaaaaaaa |  Infinity |
| aaaaaaaaaaaaaaa |       NaN |
| aaaaaaaaaaaaaaa |       NaN |
| aaaaaaaaaaaaaaa |       NaN |
+-----------------+-----------+
9 rows in set
{code}

As shown above, ordering by a (string, double) pair yields an incorrect result, while ordering by the double column alone yields the correct result.

This is because {{BinaryIndexedSortable}} uses two comparators: the normalized key comparator, which directly compares memory segments, and the record comparator, which compares actual column values. If the length of the sort keys is not fixed (for example, if the sort keys contain strings), the normalized key comparator cannot fully determine the order and falls back to the record comparator.

As we can see in {{GenerateUtils#generateCompare}}, the record comparator compares double values directly with {{<}} and {{>}}. However, for {{Double.NaN}}, every comparison operator except {{!=}} returns false, which causes this issue.
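
To illustrate the NaN behavior described above, here is a minimal standalone sketch. {{naiveCompare}} is a hypothetical stand-in that assumes the generated comparator is built only from {{<}} and {{>}} and returns 0 otherwise; it is not the actual generated code. Under that assumption NaN compares as "equal" to every value, so the comparator no longer defines a consistent order, whereas {{Double.compare}} places NaN after every other value.

```java
public class NaNCompareDemo {
    // Hypothetical stand-in for a comparator built only from < and >.
    // This is an assumption for illustration, not Flink's generated code.
    static int naiveCompare(double x, double y) {
        if (x < y) {
            return -1;
        }
        if (x > y) {
            return 1;
        }
        // Reached for equal values, but also whenever x or y is NaN,
        // because both comparisons above are false for NaN.
        return 0;
    }

    public static void main(String[] args) {
        double nan = Double.NaN;

        // Every ordering comparison with NaN is false; only != is true.
        System.out.println(nan < 1.0);   // false
        System.out.println(nan > 1.0);   // false
        System.out.println(nan == nan);  // false
        System.out.println(nan != nan);  // true

        // NaN looks "equal" to both 1.0 and 2.0, yet 1.0 and 2.0 differ,
        // so the comparator is not a consistent ordering.
        System.out.println(naiveCompare(nan, 1.0)); // 0
        System.out.println(naiveCompare(nan, 2.0)); // 0
        System.out.println(naiveCompare(1.0, 2.0)); // -1

        // Double.compare defines a total order with NaN as the largest value.
        System.out.println(Double.compare(nan, Double.POSITIVE_INFINITY) > 0); // true
    }
}
```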

Note that we cannot simply change {{GenerateUtils#generateCompare}}. This is because, in SQL, comparisons involving {{NaN}} should also return false, except for {{<>}}. It is the sorting operator that requires a specific order. That is to say, the current implementation of {{GenerateUtils#generateCompare}} is correct for comparing, but not for sorting. Maybe we should generate a special comparator for all sorting operators?
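
As a rough sketch of what a sorting-specific comparator could look like (one possible direction, not a proposed patch), the example below orders hypothetical (string, double) rows with {{Double.compare}}, which defines a total order where NaN sorts after {{Infinity}}. The {{Row}} class, {{SORT_ORDER}} and {{sortedB}} are illustrative names only and do not exist in Flink.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortComparatorSketch {
    // Hypothetical row type mirroring table S (a string, b double).
    static final class Row {
        final String a;
        final double b;

        Row(String a, double b) {
            this.a = a;
            this.b = b;
        }
    }

    // Sorting comparator: compare the string first, then the double using
    // Double.compare, which is a total order (NaN is greater than +Infinity).
    static final Comparator<Row> SORT_ORDER =
            Comparator.comparing((Row r) -> r.a)
                    .thenComparing((x, y) -> Double.compare(x.b, y.b));

    // Sorts rows that share the same string key and returns the b column.
    static double[] sortedB(double[] values) {
        List<Row> rows = new ArrayList<>();
        for (double v : values) {
            rows.add(new Row("aaaaaaaaaaaaaaa", v));
        }
        rows.sort(SORT_ORDER);
        double[] out = new double[rows.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = rows.get(i).b;
        }
        return out;
    }

    public static void main(String[] args) {
        // Same values as a / b computed from gao.csv, in input order.
        double[] b = {
            0.5, Double.NaN, 1.0, Double.NaN, Double.POSITIVE_INFINITY,
            Double.NaN, Double.NEGATIVE_INFINITY, -1.0, -0.5
        };
        // prints [-Infinity, -1.0, -0.5, 0.5, 1.0, Infinity, NaN, NaN, NaN]
        System.out.println(Arrays.toString(sortedB(b)));
    }
}
```

The resulting order matches the output of {{select * from S order by b}} shown earlier, while the comparison semantics used for SQL predicates would stay untouched.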


