Posted to issues@spark.apache.org by "Xiao Li (JIRA)" <ji...@apache.org> on 2017/01/24 19:02:29 UTC

[jira] [Updated] (SPARK-17913) Filter/join expressions can return incorrect results when comparing strings to longs

     [ https://issues.apache.org/jira/browse/SPARK-17913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-17913:
----------------------------
    Assignee: Wenchen Fan

> Filter/join expressions can return incorrect results when comparing strings to longs
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-17913
>                 URL: https://issues.apache.org/jira/browse/SPARK-17913
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Ming Beckwith
>            Assignee: Wenchen Fan
>              Labels: release_notes
>             Fix For: 2.2.0
>
>
> Reproducer:
> {code}
>   import org.apache.spark.SparkContext
>   import org.apache.spark.sql.SQLContext
>
>   case class E(subject: Long, predicate: String, objectNode: String)
>   def test(sc: SparkContext) = {
>     val sqlContext: SQLContext = new SQLContext(sc)
>     import sqlContext.implicits._
>     val broken = List(
>       (19157170390056969L, "right", 19157170390056969L),
>       (19157170390056973L, "wrong", 19157170390056971L),
>       (19157190254313477L, "wrong", 19157190254313475L),
>       (19157180859056133L, "wrong", 19157180859056131L),
>       (19157170390056969L, "number", 161),
>       (19157170390056971L, "string", "a string"),
>       (19157190254313475L, "string", "another string"),
>       (19157180859056131L, "number", 191)
>     )
>     val brokenDF = sc.parallelize(broken).map(b => E(b._1, b._2, b._3.toString)).toDF()
>     val brokenFilter = brokenDF.filter($"subject" === $"objectNode")
>     val fixed = brokenDF.filter(brokenDF("subject").cast("string") === brokenDF("objectNode"))
>     // show() and explain() print their output themselves and return Unit,
>     // so wrapping them in println would only add a stray "()" line.
>     println("***** incorrect filter results *****")
>     brokenFilter.show()
>     println("***** correct filter results *****")
>     fixed.show()
>     println("***** both sides cast to double *****")
>     brokenFilter.explain()
>   }
> {code}
> Broken filter returns:
> {code}
> +-----------------+---------+-----------------+
> |          subject|predicate|       objectNode|
> +-----------------+---------+-----------------+
> |19157170390056969|    right|19157170390056969|
> |19157170390056973|    wrong|19157170390056971|
> |19157190254313477|    wrong|19157190254313475|
> |19157180859056133|    wrong|19157180859056131|
> +-----------------+---------+-----------------+
> {code}
> The physical plan shows that both sides of the expression are cast to Double before evaluation. Comparing a number to a numeric string therefore appears to work in many cases, but a Double has only a 53-bit significand, so Long values above 2^53 that lie close together can round to the same Double and compare as equal, which produces incorrect results.
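> The collision can be reproduced without Spark. The following is a minimal sketch added for illustration, not part of the original reproducer; the object DoublePrecisionDemo and its helper collidesAsDouble are hypothetical names. The values in the report are larger than 2^54, where adjacent Doubles are 4 apart, so 19157170390056971 and 19157170390056973 both round to 19157170390056972.
> {code}
> object DoublePrecisionDemo {
>   // True when two distinct Long values become equal after conversion to Double.
>   def collidesAsDouble(a: Long, b: Long): Boolean =
>     a != b && a.toDouble == b.toDouble
>
>   def main(args: Array[String]): Unit = {
>     val subject = 19157170390056973L
>     val objectNode = "19157170390056971"
>     // Mirrors the plan: both sides are converted to Double before comparing.
>     println(subject.toDouble == objectNode.toDouble) // prints true
>     // Comparing as strings keeps every digit.
>     println(subject.toString == objectNode)          // prints false
>   }
> }
> {code}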
> {code}
> == Physical Plan ==
> Filter (cast(subject#0L as double) = cast(objectNode#2 as double))
> {code}
> After casting the left side to a string, the filter returns the expected result:
> {code}
> +-----------------+---------+-----------------+
> |          subject|predicate|       objectNode|
> +-----------------+---------+-----------------+
> |19157170390056969|    right|19157170390056969|
> +-----------------+---------+-----------------+
> {code}
> The expected behavior in this case is probably to pick one side's type and cast the other side to it (comparing string to string or long to long) rather than casting both sides to a data type with less precision.
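> The two alternatives can be sketched in plain Scala. This is only an illustration of the idea under stated assumptions, not a description of how Spark resolves the issue; the object TypedCompareSketch and both helpers are hypothetical names.
> {code}
> import scala.util.Try
>
> object TypedCompareSketch {
>   // String to string: render the Long as text and compare the text.
>   def equalAsString(subject: Long, objectNode: String): Boolean =
>     subject.toString == objectNode
>
>   // Long to long: parse the text; text that is not a valid Long never matches.
>   def equalAsLong(subject: Long, objectNode: String): Boolean =
>     Try(objectNode.trim.toLong).toOption.exists(_ == subject)
> }
> {code}
> The two choices are not interchangeable: an input such as "0123" equals 123 when compared as a long but not when compared as a string, so either choice fixes the precision loss while giving different answers on such inputs.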



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
