Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2016/04/13 03:10:25 UTC

[jira] [Created] (SPARK-14584) Improve recognition of non-nullability in Dataset transformations

Josh Rosen created SPARK-14584:
----------------------------------

             Summary: Improve recognition of non-nullability in Dataset transformations
                 Key: SPARK-14584
                 URL: https://issues.apache.org/jira/browse/SPARK-14584
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Josh Rosen


There are many cases where we can statically know that a field will never be null. For instance, a case class field with a primitive type can never be null. However, there are currently several places in the Dataset API where we fail to recognize this non-nullability. For example:

{code}
// assumes the Dataset implicits are in scope, e.g. import sqlContext.implicits._
case class MyCaseClass(foo: Int)
sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
{code}

claims that the {{foo}} field is nullable even though this is impossible.
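
For reference, the printed schema in this case looks roughly like the following (output approximated from {{printSchema}}'s usual layout); it should instead report {{nullable = false}}:

{code}
root
 |-- foo: integer (nullable = true)
{code}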

I believe that this is due to the way that we reason about nullability when constructing serializer expressions in ExpressionEncoders. The following assertion will currently fail if added to ExpressionEncoder:

{code}
  require(schema.size == serializer.size,
    s"Schema has ${schema.size} fields but serializer has ${serializer.size}")
  schema.fields.zip(serializer).foreach { case (field, fieldSerializer) =>
    require(field.dataType == fieldSerializer.dataType, s"Field ${field.name}'s data type is " +
      s"${field.dataType} in the schema but ${fieldSerializer.dataType} in its serializer")
    require(field.nullable == fieldSerializer.nullable, s"Field ${field.name}'s nullability is " +
      s"${field.nullable} in the schema but ${fieldSerializer.nullable} in its serializer")
  }
{code}

Most often, the schema claims that a field is non-nullable while the encoder allows for nullability, but occasionally we see a mismatch in the datatypes due to disagreements over the nullability of nested structs' fields (or fields of structs in arrays).

I think the problem is a mismatch in how the two sides reason about nullability. When we reason about nullability in a struct's schema, we treat each field's nullability as independent of the nullability of the struct itself. In the serializer expressions, by contrast, we consider a field-extraction expression to be nullable whenever the input object itself can be null.
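
To make that concrete, here is a hypothetical nested example (the class names are illustrative, not taken from the codebase):

{code}
case class Inner(x: Int)
case class Outer(inner: Inner)

// Schema-side reasoning: Inner.x is a primitive Int, so the schema marks
// the nested field inner.x as non-nullable, independent of whether the
// inner struct itself is nullable.
//
// Serializer-side reasoning: the expression that extracts x from an Outer
// instance is treated as nullable because `inner` itself could be null at
// runtime, so the two sides disagree about the nested field's nullability.
{code}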

I'm not sure what the simplest way to fix this is. One proposal, sketched below, would be to leave the serializers unchanged and have ObjectOperator derive its output attributes from an explicitly-passed schema rather than from the serializers' attributes. However, I worry that this might introduce bugs if the serializer and schema disagree.
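
As a rough sketch of that proposal (the helper below is hypothetical, not the actual ObjectOperator API), the output attributes could be derived directly from the schema:

{code}
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: build output attributes from an explicitly-passed
// schema rather than from the serializer expressions' reported nullability.
def outputAttributes(schema: StructType): Seq[Attribute] =
  schema.fields.toSeq.map { f =>
    AttributeReference(f.name, f.dataType, f.nullable)()
  }
{code}

The caveat above still applies: if the serializer and schema disagree, the operator's attributes would no longer match what the serializer actually produces.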



