Posted to issues@spark.apache.org by "David Vogelbacher (Jira)" <ji...@apache.org> on 2022/07/26 22:39:00 UTC
[jira] [Updated] (SPARK-39885) Behavior differs between array_overlap and array_contains for negative 0.0
[ https://issues.apache.org/jira/browse/SPARK-39885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Vogelbacher updated SPARK-39885:
--------------------------------------
Description:
{{array_contains([0.0], -0.0)}} returns true, but {{arrays_overlap([0.0], [-0.0])}} returns false. I think we generally want to treat -0.0 and 0.0 as equal (see https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtil.scala#L28).
However, the {{Double::equals}} method doesn't: it compares bit patterns via {{Double.doubleToLongBits}}, which preserves the sign bit and so distinguishes -0.0 from 0.0. Therefore, we should either mark {{DoubleType}} as false in [TypeUtils#typeWithProperEquals|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala], or we should wrap it with our own equals method that handles this case.
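The {{Double::equals}} behavior behind the discrepancy can be reproduced in plain Java, independent of Spark (the class name below is mine, for illustration only):

{code:java}
public class NegativeZeroDemo {
    public static void main(String[] args) {
        // IEEE 754 primitive comparison treats the two zeros as equal.
        System.out.println(0.0 == -0.0);                       // true
        // Double.equals compares via doubleToLongBits, which keeps the
        // sign bit, so the boxed comparison distinguishes the two zeros.
        System.out.println(Double.valueOf(0.0).equals(-0.0));  // false
    }
}
{code}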
Java code snippets showing the issue:
{code:java}
Dataset<Row> dataset = sparkSession.createDataFrame(
    List.of(RowFactory.create(List.of(-0.0))),
    DataTypes.createStructType(ImmutableList.of(DataTypes.createStructField(
        "doubleCol", DataTypes.createArrayType(DataTypes.DoubleType), false))));
Dataset<Row> df = dataset.withColumn(
    "overlaps",
    functions.arrays_overlap(functions.array(functions.lit(+0.0)), dataset.col("doubleCol")));
List<Row> result = df.collectAsList(); // [[WrappedArray(-0.0),false]]
{code}
{code:java}
Dataset<Row> dataset = sparkSession.createDataFrame(
    List.of(RowFactory.create(-0.0)),
    DataTypes.createStructType(
        ImmutableList.of(DataTypes.createStructField("doubleCol", DataTypes.DoubleType, false))));
Dataset<Row> df = dataset.withColumn(
    "contains",
    functions.array_contains(functions.array(functions.lit(+0.0)), dataset.col("doubleCol")));
List<Row> result = df.collectAsList(); // [[-0.0,true]]
{code}
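One way the "wrap it with our own equals method" option could look — a hypothetical sketch, not existing Spark code — is to normalize -0.0 to 0.0 before delegating to the boxed comparison:

{code:java}
public class SqlDoubleEquals {
    // Hypothetical helper: adding +0.0 normalizes -0.0 to 0.0 before the
    // bitwise comparison inside Double.equals, while NaN + 0.0 stays NaN
    // (and Double.equals already treats NaN as equal to NaN).
    static boolean sqlDoubleEquals(double a, double b) {
        return Double.valueOf(a + 0.0d).equals(b + 0.0d);
    }

    public static void main(String[] args) {
        System.out.println(sqlDoubleEquals(-0.0, 0.0)); // true
        System.out.println(sqlDoubleEquals(1.0, 2.0));  // false
    }
}
{code}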
was:
{{array_contains([0.0], -0.0)}} will return true. {{array_overlaps([0.0], [-0.0])}} will return false. I think we generally want to treat -0.0 and 0.0 as the same (see https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtil.scala#L28)
However, the {{Double::equals}} method doesn't. Therefore, we should either mark double as false in {{TypeUtils#typeWithProperEquals}}, or we should wrap it with our own equals method that handles this case.
Java code snippets showing the issue:
{code:java}
dataset = sparkSession.createDataFrame(
List.of(RowFactory.create(List.of(-0.0))),
DataTypes.createStructType(ImmutableList.of(DataTypes.createStructField(
"doubleCol", DataTypes.createArrayType(DataTypes.DoubleType), false))));
Dataset<Row> df = dataset.withColumn(
"overlaps", functions.arrays_overlap(functions.array(functions.lit(+0.0)), dataset.col("doubleCol")));
List<Row> result = df.collectAsList(); // [[WrappedArray(-0.0),false]]
{code}
{code:java}
dataset = sparkSession.createDataFrame(
List.of(RowFactory.create(-0.0)),
DataTypes.createStructType(
ImmutableList.of(DataTypes.createStructField("doubleCol", DataTypes.DoubleType, false))));
Dataset<Row> df = dataset.withColumn(
"overlaps", functions.array_contains(functions.array(functions.lit(+0.0)), dataset.col("doubleCol")));
List<Row> result = df.collectAsList(); // [[-0.0,true]]
{code}
> Behavior differs between array_overlap and array_contains for negative 0.0
> --------------------------------------------------------------------------
>
> Key: SPARK-39885
> URL: https://issues.apache.org/jira/browse/SPARK-39885
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.2
> Reporter: David Vogelbacher
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)