Posted to dev@spark.apache.org by Alessandro Baretta <al...@gmail.com> on 2014/12/11 03:19:03 UTC

SparkSQL not honoring schema

Hello,

I defined a SchemaRDD by applying a hand-crafted StructType to an RDD. Some
of the Rows in the RDD are malformed--that is, they do not conform to the
schema defined by the StructType. When running a select statement on this
SchemaRDD, I would expect SparkSQL to either reject the malformed rows or
fail. Instead, it returns whatever data it finds, even if malformed. Is
this the desired behavior? Is there no method in SparkSQL to check for
validity with respect to the schema?
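
For concreteness, a minimal sketch of the kind of setup I mean (the
field names and values here are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._

val sc = new SparkContext(
  new SparkConf().setMaster("local").setAppName("schema-check"))
val sqlContext = new SQLContext(sc)

// Hand-crafted schema: (name: String, age: Int).
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false)))

// The second Row is malformed: "age" holds a String, not an Int.
val rowRDD = sc.parallelize(Seq(
  Row("alice", 30),
  Row("bob", "not-an-int")))

// applySchema accepts this without complaint.
val schemaRDD = sqlContext.applySchema(rowRDD, schema)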

Thanks.

Alex

Re: SparkSQL not honoring schema

Posted by Alessandro Baretta <al...@gmail.com>.
Hey Michael,

Thanks for the clarification. I was actually assuming the query would fail.
Ok, so this means I will have to do the validation in an RDD transformation
feeding into the SchemaRDD.
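
Presumably something along these lines (a rough sketch; conformsTo is a
hypothetical helper that only covers the primitive types shown):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql._

// Hypothetical helper: true iff the Row's runtime types match the schema.
// A real version would also cover nested StructTypes, arrays, maps, etc.
def conformsTo(row: Row, schema: StructType): Boolean =
  row.length == schema.fields.length &&
    row.toSeq.zip(schema.fields).forall {
      case (null, field)      => field.nullable
      case (_: Int, field)    => field.dataType == IntegerType
      case (_: String, field) => field.dataType == StringType
      case _                  => false
    }

// Filter out malformed Rows before applying the schema.
def validated(rowRDD: RDD[Row], schema: StructType): RDD[Row] =
  rowRDD.filter(conformsTo(_, schema))

// val schemaRDD = sqlContext.applySchema(validated(rowRDD, schema), schema)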

On Wed, Dec 10, 2014 at 6:27 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> [quoted text elided; Michael's reply appears in full below]

Re: SparkSQL not honoring schema

Posted by Michael Armbrust <mi...@databricks.com>.
As the Scaladoc for applySchema says, "It is important to make sure that
the structure of every [[Row]] of the provided RDD matches the provided
schema. Otherwise, there will be runtime exceptions."  We don't check, as
doing runtime reflection on all of the data would be very expensive.  You
will only get errors if you try to manipulate the data, but otherwise it
will pass it through.
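
For example (illustrative only; suppose schemaRDD and sqlContext are as
in the sketch in the first message, so the schema declares an IntegerType
"age" column but one Row carries a String there):

schemaRDD.registerTempTable("people")

// A pass-through projection just carries the malformed value along:
sqlContext.sql("SELECT age FROM people").collect()

// An expression that actually evaluates the column as an Int fails at
// runtime, e.g. with a ClassCastException:
sqlContext.sql("SELECT age + 1 FROM people").collect()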

I have written some debugging code (a developer API, not guaranteed to be
stable) that you can use, though.

import org.apache.spark.sql.execution.debug._
schemaRDD.typeCheck()

On Wed, Dec 10, 2014 at 6:19 PM, Alessandro Baretta <al...@gmail.com>
wrote:

> [quoted original message elided]