Posted to user@spark.apache.org by Andy Grove <an...@agildata.com> on 2016/05/17 18:48:45 UTC

Inferring schema from GenericRowWithSchema

Hi,

I have a requirement to create types dynamically in Spark and then
instantiate those types from Spark SQL via a UDF.

I tried doing the following:

import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types._

val addressType = StructType(List(
  new StructField("state", DataTypes.StringType),
  new StructField("zipcode", DataTypes.IntegerType)
))

sqlContext.udf.register("Address", (args: Seq[Any]) =>
  new GenericRowWithSchema(args.toArray, addressType))

sqlContext.sql("SELECT Address('NY', 12345)").show(10)

This seems reasonable to me, but it fails with:

Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema is not supported
    at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:755)
    at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:685)
    at org.apache.spark.sql.UDFRegistration.register(UDFRegistration.scala:130)

It looks like it would be simple to update ScalaReflection to infer the
schema from a GenericRowWithSchema, but before filing a JIRA and submitting
a patch I wanted to check whether there is already a way to achieve this.

Thanks,

Andy.

Re: Inferring schema from GenericRowWithSchema

Posted by Andy Grove <an...@agildata.com>.
Hmm. I see. Yes, I guess that won't work then.

I don't understand what you are proposing about UDFRegistration, though. I
only see the typed register methods that take functions of various arities
(1 to 22).
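
For what it's worth, the typed path does work when the return type is an
ordinary case class known at compile time; a minimal sketch of the contrast
(untested, against the Spark 1.6-era API):

case class Address(state: String, zipcode: Int)

// The schema is inferred from the TypeTag of Address at registration
// time, which is exactly what a schema built at runtime inside a
// GenericRowWithSchema cannot provide.
sqlContext.udf.register("Address",
  (state: String, zipcode: Int) => Address(state, zipcode))

sqlContext.sql("SELECT Address('NY', 12345)").show()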

On Tue, May 17, 2016 at 1:00 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> I don't think that you will be able to do that.  ScalaReflection is based
> on the TypeTag of the object, and thus the schema of any particular object
> won't be available to it.
>
> Instead I think you want to use the register functions in UDFRegistration
> that take a schema. Does that make sense?

Re: Inferring schema from GenericRowWithSchema

Posted by Michael Armbrust <mi...@databricks.com>.
I don't think that you will be able to do that.  ScalaReflection is based
on the TypeTag of the object, and thus the schema of any particular object
won't be available to it.

Instead I think you want to use the register functions in UDFRegistration
that take a schema. Does that make sense?
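
For example, something along these lines with the Java-style overloads,
which take the return DataType explicitly (a rough sketch, untested):

import org.apache.spark.sql.Row
import org.apache.spark.sql.api.java.UDF2
import org.apache.spark.sql.types._

val addressType = StructType(List(
  StructField("state", StringType),
  StructField("zipcode", IntegerType)
))

// Because the return schema is passed in explicitly, no TypeTag-based
// inference is involved, and the struct type can be built at runtime.
sqlContext.udf.register("Address",
  new UDF2[String, Int, Row] {
    override def call(state: String, zipcode: Int): Row = Row(state, zipcode)
  },
  addressType)

sqlContext.sql("SELECT Address('NY', 12345)").show()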
