Posted to user@spark.apache.org by Dean Arnold <re...@gmail.com> on 2019/09/12 18:41:59 UTC

Inconsistent dataset behavior between file and in-memory versions

I have some code that recovers a complex structured row from a dataset.
The row contains several array fields (mostly ArrayType(IntegerType)),
which are populated with Array[java.lang.Integer], as that seems to be
the only representation the Spark row serializer will accept for them.
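
For context, the dataset is built roughly like this (heavily simplified;
the column names and the two-row sample are made up, and the real schema
has many more fields):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// One array column stands in for the several in the real schema.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("vals", ArrayType(IntegerType), nullable = true)
))

// The array fields are populated with Array[java.lang.Integer], as above.
val rows = Seq(
  Row(1, Array[java.lang.Integer](1, 2, 3)),
  Row(2, null)  // this row gets a null array field
)

val ds = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)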

If the dataset is written out to a file (parquet in this case) and then
read back in from the file, Row.getList() (from either Scala or Java)
works fine and I get a List. But if I feed the freshly built dataset
straight into another dataset iterator, Row.getList() throws an
exception:

java.lang.ClassCastException: [Ljava.lang.Integer; cannot be cast to
scala.collection.Seq
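
Roughly, the two code paths look like this (simplified from the real
code, so take the exact construction with a grain of salt; the paths and
column index are just for illustration):

// Path 1: round-trip through parquet -- getList() works and returns a java.util.List
ds.write.mode("overwrite").parquet("/tmp/array_repro")
spark.read.parquet("/tmp/array_repro").collect().foreach { row =>
  if (!row.isNullAt(1)) println(row.getList[Integer](1))   // fine
}

// Path 2: consume the freshly built dataset directly -- getList() blows up
ds.collect().foreach { row =>
  if (!row.isNullAt(1)) println(row.getList[Integer](1))
  // java.lang.ClassCastException: [Ljava.lang.Integer; cannot be cast to scala.collection.Seq
}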

On top of that mess, the array fields that were assigned a null show up
in the in-memory dataset as non-null empty arrays, yet when the dataset
is written out to a file and read back, the same fields come back as
actual nulls.
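
In terms of the sketch above, the null handling difference looks roughly
like this:

// In-memory: the field that was assigned null reports as non-null (an empty array)
ds.collect().foreach { row =>
  println(s"id=${row.getInt(0)} nullArray=${row.isNullAt(1)}")
}

// After the parquet round-trip the same field comes back as an actual null
spark.read.parquet("/tmp/array_repro").collect().foreach { row =>
  println(s"id=${row.getInt(0)} nullArray=${row.isNullAt(1)}")
}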

Why isn't the behavior consistent? And why isn't there a Row.getArray()?
Will any of this nonsense be fixed in 3.0?
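
For what it's worth, the workaround I'm experimenting with is to skip
getList() entirely and match on the raw value, roughly like this (the
helper is just a sketch of mine, not anything from the Spark API):

import org.apache.spark.sql.Row

// Defensive accessor that tolerates both representations of an array column.
def intArray(row: Row, i: Int): Option[Seq[java.lang.Integer]] =
  if (row.isNullAt(i)) None
  else row.get(i) match {
    case s: Seq[_]                   => Some(s.map(_.asInstanceOf[java.lang.Integer]))
    case a: Array[java.lang.Integer] => Some(a.toSeq)
    case other =>
      throw new IllegalStateException(s"unexpected array representation: ${other.getClass}")
  }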
