Posted to issues@spark.apache.org by "David Vogelbacher (JIRA)" <ji...@apache.org> on 2018/07/26 00:48:00 UTC

[jira] [Commented] (SPARK-12911) Caching a dataframe causes array comparisons to fail (in filter / where) after 1.6

    [ https://issues.apache.org/jira/browse/SPARK-12911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16556470#comment-16556470 ] 

David Vogelbacher commented on SPARK-12911:
-------------------------------------------

Hey [~hyukjin.kwon] [~sdicocco] [~a1ray], I just reproduced this on master. I executed the following in the spark-shell:
{noformat}
scala> import org.apache.spark.sql.functions
import org.apache.spark.sql.functions

scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol")
df: org.apache.spark.sql.DataFrame = [arrayCol: array<string>]

scala> df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), functions.lit("b")))).show()
+--------+
|arrayCol|
+--------+
|  [a, b]|
+--------+


scala> df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), functions.lit("b")))).show()
+--------+
|arrayCol|
+--------+
+--------+
{noformat}

This seems to be the same issue? The null-safe equality matches the row before caching, but the same filter returns no rows after {{cache()}}.
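
In case it is useful, here is a possible workaround sketch (my assumption, not a verified fix): comparing inside a UDF should sidestep the cached representation, since the UDF receives a deserialized Scala Seq rather than the internal array format:
{code}
import org.apache.spark.sql.functions.udf

// Hypothetical workaround: compare the array column inside a UDF, which sees
// a deserialized Seq[String] instead of the cached internal array format.
val target = Seq("a", "b")
val matchesTarget = udf((arr: Seq[String]) => arr == target)
df.cache().filter(matchesTarget(df.col("arrayCol"))).show()
{code}
With the reproduction above, this should print the {{[a, b]}} row whether or not the dataframe is cached.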

> Caching a dataframe causes array comparisons to fail (in filter / where) after 1.6
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-12911
>                 URL: https://issues.apache.org/jira/browse/SPARK-12911
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>         Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.6.0
>            Reporter: Jesse English
>            Priority: Major
>
> When performing a *where* operation on a dataframe that tests for equality against an array-typed column, no valid comparisons are made after 1.6 if the dataframe has been cached. If it has not been cached, the results are as expected.
> This appears to be related to the underlying unsafe array data types.
> {code:title=test.scala|borderStyle=solid}
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.{Row, SQLContext}
> import org.apache.spark.sql.functions.{array, lit}
> import org.apache.spark.sql.types._
>
> test("test array comparison") {
>     val vectors: Vector[Row] =  Vector(
>       Row.fromTuple("id_1" -> Array(0L, 2L)),
>       Row.fromTuple("id_2" -> Array(0L, 5L)),
>       Row.fromTuple("id_3" -> Array(0L, 9L)),
>       Row.fromTuple("id_4" -> Array(1L, 0L)),
>       Row.fromTuple("id_5" -> Array(1L, 8L)),
>       Row.fromTuple("id_6" -> Array(2L, 4L)),
>       Row.fromTuple("id_7" -> Array(5L, 6L)),
>       Row.fromTuple("id_8" -> Array(6L, 2L)),
>       Row.fromTuple("id_9" -> Array(7L, 0L))
>     )
>     // sc is the SparkContext assumed to be provided by the surrounding test harness
>     val data: RDD[Row] = sc.parallelize(vectors, 3)
>     val schema = StructType(
>       StructField("id", StringType, false) ::
>         StructField("point", DataTypes.createArrayType(LongType, false), false) ::
>         Nil
>     )
>     val sqlContext = new SQLContext(sc)
>     val dataframe = sqlContext.createDataFrame(data, schema)
>     val targetPoint: Array[Long] = Array(0L, 9L)
>     // Caching is the trigger for the error (without caching there is no error)
>     dataframe.cache()
>     // The next line is where it fails with
>     //   java.util.NoSuchElementException: next on empty iterator
>     // even though we know there is a valid match
>     val targetRow = dataframe.where(dataframe("point") === array(targetPoint.map(value => lit(value)): _*)).first()
>     assert(targetRow != null)
>   }
> {code}
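>
> As a general aside on why array equality needs care here: plain JVM arrays compare by reference rather than by contents, which is one classic way an array comparison can silently match nothing:
> {code}
> // Plain Scala, no Spark required:
> Array(0L, 9L) == Array(0L, 9L)             // false: reference equality
> Array(0L, 9L).sameElements(Array(0L, 9L))  // true: element-wise comparison
> Seq(0L, 9L) == Seq(0L, 9L)                 // true: structural equality
> {code}
> The failure above appears to be in Spark's handling of the cached (unsafe) array representation rather than in Scala itself, but the test's expectation is the structural-equality behaviour shown on the last line.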



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org