Posted to issues@spark.apache.org by "David Vogelbacher (Jira)" <ji...@apache.org> on 2022/07/26 22:39:00 UTC

[jira] [Updated] (SPARK-39885) Behavior differs between array_overlap and array_contains for negative 0.0

     [ https://issues.apache.org/jira/browse/SPARK-39885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Vogelbacher updated SPARK-39885:
--------------------------------------
    Description: 
{{array_contains([0.0], -0.0)}} returns true, while {{arrays_overlap([0.0], [-0.0])}} returns false. I think we generally want to treat -0.0 and 0.0 as equal (see https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtil.scala#L28)
However, {{Double::equals}} does not: it compares {{doubleToLongBits}} representations, so it treats -0.0 and 0.0 as distinct. Therefore, we should either have [TypeUtils#typeWithProperEquals|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala] return false for double, or wrap the values in our own equals method that handles this case.
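
For context, the discrepancy is visible in plain Java, independent of Spark (a minimal illustration):

{code:java}
public class NegativeZeroEquality {
    public static void main(String[] args) {
        System.out.println(0.0 == -0.0);                                      // true: primitive == treats them as equal
        System.out.println(Double.valueOf(0.0).equals(Double.valueOf(-0.0))); // false: Double#equals compares doubleToLongBits
        System.out.println(Double.compare(0.0, -0.0));                        // 1: Double#compare also orders -0.0 before 0.0
    }
}
{code}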

Java code snippets showing the issue:

{code:java}
// Column "doubleCol" is an array<double> containing -0.0.
Dataset<Row> dataset = sparkSession.createDataFrame(
        List.of(RowFactory.create(List.of(-0.0))),
        DataTypes.createStructType(ImmutableList.of(DataTypes.createStructField(
                "doubleCol", DataTypes.createArrayType(DataTypes.DoubleType), false))));
// arrays_overlap([0.0], [-0.0]) does not report an overlap.
Dataset<Row> df = dataset.withColumn(
        "overlaps", functions.arrays_overlap(functions.array(functions.lit(+0.0)), dataset.col("doubleCol")));
List<Row> result = df.collectAsList(); // [[WrappedArray(-0.0),false]]
{code}

{code:java}
// Column "doubleCol" is a scalar double holding -0.0.
Dataset<Row> dataset = sparkSession.createDataFrame(
        List.of(RowFactory.create(-0.0)),
        DataTypes.createStructType(
                ImmutableList.of(DataTypes.createStructField("doubleCol", DataTypes.DoubleType, false))));
// array_contains([0.0], -0.0) does report a match.
Dataset<Row> df = dataset.withColumn(
        "contains", functions.array_contains(functions.array(functions.lit(+0.0)), dataset.col("doubleCol")));
List<Row> result = df.collectAsList(); // [[-0.0,true]]
{code}
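
A minimal sketch of the second option, assuming a hypothetical helper (the name and placement are illustrative, not existing Spark code): an equality check consistent with the comparison in {{SQLOrderingUtil}}, so that -0.0 equals 0.0 and NaN equals itself:

{code:java}
public final class SqlDoubleEquality {
    private SqlDoubleEquality() {}

    /**
     * Hypothetical helper: equality consistent with an SQLOrderingUtil-style comparison.
     * Primitive == already treats -0.0 and 0.0 as equal; the Double.compare fallback
     * additionally makes NaN equal to itself.
     */
    public static boolean sqlEquals(double x, double y) {
        return x == y || Double.compare(x, y) == 0;
    }

    public static void main(String[] args) {
        System.out.println(sqlEquals(0.0, -0.0));              // true
        System.out.println(sqlEquals(Double.NaN, Double.NaN)); // true
        System.out.println(sqlEquals(1.0, 2.0));               // false
    }
}
{code}

The other option from the description, returning false for double in {{TypeUtils#typeWithProperEquals}}, would instead keep these functions off code paths that rely on boxed {{Double#equals}}.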



> Behavior differs between array_overlap and array_contains for negative 0.0
> --------------------------------------------------------------------------
>
>                 Key: SPARK-39885
>                 URL: https://issues.apache.org/jira/browse/SPARK-39885
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.2
>            Reporter: David Vogelbacher
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org