Posted to dev@spark.apache.org by Sam Goodwin <sa...@gmail.com> on 2016/11/02 04:15:41 UTC

Re: getting encoder implicits to be more accurate

You don't need compile-time macros for this; you can do it quite easily
using shapeless. I've been playing with a project which borrows ideas from
spray-json and spray-json-shapeless to implement Row marshalling for
arbitrary case classes. It's checked and generated at compile time,
supports arbitrary/nested case classes, and allows custom types. It is also
entirely pluggable, meaning you can bypass the default implementations and
provide your own, just like any type class.

https://github.com/upio/spark-sql-formats
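
For readers unfamiliar with the pattern, here is a minimal sketch of the kind of
pluggable type class being described; the names (RowCodec and its instance) are
illustrative and are not taken from the linked library.

    import org.apache.spark.sql.Row

    // A hypothetical row-marshalling type class: default instances can be derived
    // (e.g. with shapeless), and a user-supplied implicit in scope overrides them.
    trait RowCodec[T] {
      def write(value: T): Row
      def read(row: Row): T
    }

    object RowCodec {
      // hand-written default instance for a simple pair, purely for illustration
      implicit val intStringCodec: RowCodec[(Int, String)] = new RowCodec[(Int, String)] {
        def write(value: (Int, String)): Row = Row(value._1, value._2)
        def read(row: Row): (Int, String)    = (row.getInt(0), row.getString(1))
      }
    }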


*From:* Michael Armbrust <mi...@databricks.com>
*Date:* October 26, 2016 at 12:50:23 PM PDT
*To:* Koert Kuipers <ko...@tresata.com>
*Cc:* Ryan Blue <rb...@netflix.com>, "dev@spark.apache.org" <
dev@spark.apache.org>
*Subject:* *Re: getting encoder implicits to be more accurate*

Sorry, I realize that set is only one example here, but I don't think that
making the type of the implicit more narrow to include only ProductN or
something eliminates the issue.  Even with that change, we will fail to
generate an encoder with the same error if you, for example, have a field
of your case class that is an unsupported type.



Short of changing this to compile-time macros, I think we are stuck with
this class of errors at runtime.  The simplest solution seems to be to
expand the set of things we can handle as much as possible and allow users
to turn on a Kryo fallback for expression encoders.  I'd be hesitant to
make this the default, though, as behavior would change with each release
that adds support for more types.  I would be very supportive of making
this fallback a built-in option.
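
For reference, a per-type opt-in already exists in the public API via
Encoders.kryo; what is being discussed here is making that fallback automatic.
A small sketch (the case class is made up for illustration):

    import org.apache.spark.sql.{Encoder, Encoders}

    case class WithSet(id: Int, tags: Set[String])   // hypothetical type with an unsupported field

    // Explicit, per-type opt-in to Kryo serialization; the data is stored as a
    // single binary column rather than as individual fields.
    implicit val withSetEncoder: Encoder[WithSet] = Encoders.kryo[WithSet]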



On Wed, Oct 26, 2016 at 11:47 AM, Koert Kuipers <ko...@tresata.com> wrote:

Yup, it doesn't really solve the underlying issue.

We fixed it internally by having our own typeclass that produces encoders
and that does check the contents of the products, but we did this by simply
supporting Tuple1 - Tuple22 and Option explicitly, and not supporting
Product, since we don't have a need for case classes.

If case classes extended ProductN (which I think they will in Scala 2.12?)
then we could drop Product and support Product1 - Product22 and Option
explicitly while checking the classes they contain. That would be the
cleanest.
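
A minimal sketch of that in-house approach, assuming a made-up name
(SafeEncoder) and showing only Int, String and Tuple2; the real version would
cover Tuple1 through Tuple22 and Option:

    import org.apache.spark.sql.{Encoder, Encoders}

    trait SafeEncoder[T] { def encoder: Encoder[T] }

    object SafeEncoder {
      implicit val forInt: SafeEncoder[Int] =
        new SafeEncoder[Int] { val encoder = Encoders.scalaInt }

      implicit val forString: SafeEncoder[String] =
        new SafeEncoder[String] { val encoder = Encoders.STRING }

      // A Tuple2 instance only exists when instances exist for both element types,
      // so unsupported contents fail at compile time instead of at runtime.
      implicit def forTuple2[A, B](implicit a: SafeEncoder[A], b: SafeEncoder[B]): SafeEncoder[(A, B)] =
        new SafeEncoder[(A, B)] { val encoder = Encoders.tuple(a.encoder, b.encoder) }
    }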



On Wed, Oct 26, 2016 at 2:33 PM, Ryan Blue <rb...@netflix.com> wrote:

Isn't the problem that Option is a Product and the class it contains isn't
checked? Adding support for Set fixes the example, but the problem would
happen with any class there isn't an encoder for, right?



On Wed, Oct 26, 2016 at 11:18 AM, Michael Armbrust <mi...@databricks.com>
wrote:

Hmm, that is unfortunate.  Maybe the best solution is to add support for
sets?  I don't think that would be super hard.



On Tue, Oct 25, 2016 at 8:52 PM, Koert Kuipers <ko...@tresata.com> wrote:

I am trying to use encoders as a typeclass where, if it fails to find an
ExpressionEncoder, it falls back to KryoEncoder.

The issue seems to be that ExpressionEncoder claims a little more than it
can handle here:

  implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] =
    Encoders.product[T]

This "claims" to handle, for example, Option[Set[Int]], but it really cannot
handle Set, so it leads to a runtime exception.

Would it be useful to make this a little more specific? I guess the
challenge is going to be case classes, which unfortunately don't extend
Product1, Product2, etc.
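
To make the failure mode concrete, a sketch of what this looks like in a
Spark 2.x session (spark is assumed to be a SparkSession; the error text is
paraphrased):

    import spark.implicits._

    // Compiles fine, because newProductEncoder accepts Option[Set[Int]]
    // (Option is a Product), but encoder generation fails when toDS() runs:
    val ds = Seq(Option(Set(1, 2, 3))).toDS()
    // java.lang.UnsupportedOperationException:
    //   No Encoder found for scala.collection.immutable.Set[Int]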







--

Ryan Blue

Software Engineer

Netflix

Re: getting encoder implicits to be more accurate

Posted by Koert Kuipers <ko...@tresata.com>.
Sorry, this message of mine was confusing. I was frustrated about how hard it
is to use the Encoder machinery myself directly on Row objects; this is
unrelated to the question of whether a shapeless-based approach like Sam
suggests would be a better way to do encoders in general.

On Mon, Nov 14, 2016 at 3:03 PM, Koert Kuipers <ko...@tresata.com> wrote:

> that makes sense. we have something like that inhouse as well, but not as
> nice and not using shapeless (we simply relied on sbt-boilerplate to handle
> all tuples and do not support case classes).
>
> however the frustrating part is that spark sql already has this more or
> less. look for example at ExpressionEncoder.fromRow and
> ExpressionEncoder.toRow. but these methods use InternalRow while the rows
> exposed to me as a user are not that.
>
> at this point i am more tempted to simply open up InternalRow at a few
> places strategically than to maintain another inhouse row marshalling
> class. once i have InternalRows looks of good stuff is available to me to
> use.
>
>
>
>
> On Wed, Nov 2, 2016 at 12:15 AM, Sam Goodwin <sa...@gmail.com>
> wrote:
>
>> You don't need compiler time macros for this, you can do it quite easily
>> using shapeless. I've been playing with a project which borrows ideas from
>> spray-json and spray-json-shapeless to implement Row marshalling for
>> arbitrary case classes. It's checked and generated at compile time,
>> supports arbitrary/nested case classes, and allows custom types. It is also
>> entirely pluggable meaning you can bypass the default implementations and
>> provide your own, just like any type class.
>>
>> https://github.com/upio/spark-sql-formats
>>
>>
>> *From:* Michael Armbrust <mi...@databricks.com>
>> *Date:* October 26, 2016 at 12:50:23 PM PDT
>> *To:* Koert Kuipers <ko...@tresata.com>
>> *Cc:* Ryan Blue <rb...@netflix.com>, "dev@spark.apache.org" <
>> dev@spark.apache.org>
>> *Subject:* *Re: getting encoder implicits to be more accurate*
>>
>> Sorry, I realize that set is only one example here, but I don't think
>> that making the type of the implicit more narrow to include only ProductN
>> or something eliminates the issue.  Even with that change, we will fail to
>> generate an encoder with the same error if you, for example, have a field
>> of your case class that is an unsupported type.
>>
>>
>>
>> Short of changing this to compile-time macros, I think we are stuck with
>> this class of errors at runtime.  The simplest solution seems to be to
>> expand the set of thing we can handle as much as possible and allow users
>> to turn on a kryo fallback for expression encoders.  I'd be hesitant to
>> make this the default though, as behavior would change with each release
>> that adds support for more types.  I would be very supportive of making
>> this fallback a built-in option though.
>>
>>
>>
>> On Wed, Oct 26, 2016 at 11:47 AM, Koert Kuipers <ko...@tresata.com>
>> wrote:
>>
>> yup, it doesnt really solve the underlying issue.
>>
>> we fixed it internally by having our own typeclass that produces encoders
>> and that does check the contents of the products, but we did this by simply
>> supporting Tuple1 - Tuple22 and Option explicitly, and not supporting
>> Product, since we dont have a need for case classes
>>
>> if case classes extended ProductN (which they will i think in scala
>> 2.12?) then we could drop Product and support Product1 - Product22 and
>> Option explicitly while checking the classes they contain. that would be
>> the cleanest.
>>
>>
>>
>> On Wed, Oct 26, 2016 at 2:33 PM, Ryan Blue <rb...@netflix.com> wrote:
>>
>> Isn't the problem that Option is a Product and the class it contains
>> isn't checked? Adding support for Set fixes the example, but the problem
>> would happen with any class there isn't an encoder for, right?
>>
>>
>>
>> On Wed, Oct 26, 2016 at 11:18 AM, Michael Armbrust <
>> michael@databricks.com> wrote:
>>
>> Hmm, that is unfortunate.  Maybe the best solution is to add support for
>> sets?  I don't think that would be super hard.
>>
>>
>>
>> On Tue, Oct 25, 2016 at 8:52 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> i am trying to use encoders as a typeclass where if it fails to find an
>> ExpressionEncoder it falls back to KryoEncoder.
>>
>> the issue seems to be that ExpressionEncoder claims a little more than it
>> can handle here:
>>   implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] =
>> Encoders.product[T]
>>
>> this "claims" to handle for example Option[Set[Int]], but it really
>> cannot handle Set so it leads to a runtime exception.
>>
>> would it be useful to make this a little more specific? i guess the
>> challenge is going to be case classes which unfortunately dont extend
>> Product1, Product2, etc.
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>>
>>
>>
>

Re: getting encoder implicits to be more accurate

Posted by Koert Kuipers <ko...@tresata.com>.
Sorry, the last line should be:
once I have InternalRows, lots of good stuff is available to me to use.

On Mon, Nov 14, 2016 at 3:03 PM, Koert Kuipers <ko...@tresata.com> wrote:

> that makes sense. we have something like that inhouse as well, but not as
> nice and not using shapeless (we simply relied on sbt-boilerplate to handle
> all tuples and do not support case classes).
>
> however the frustrating part is that spark sql already has this more or
> less. look for example at ExpressionEncoder.fromRow and
> ExpressionEncoder.toRow. but these methods use InternalRow while the rows
> exposed to me as a user are not that.
>
> at this point i am more tempted to simply open up InternalRow at a few
> places strategically than to maintain another inhouse row marshalling
> class. once i have InternalRows looks of good stuff is available to me to
> use.
>
>
>
>
> On Wed, Nov 2, 2016 at 12:15 AM, Sam Goodwin <sa...@gmail.com>
> wrote:
>
>> You don't need compiler time macros for this, you can do it quite easily
>> using shapeless. I've been playing with a project which borrows ideas from
>> spray-json and spray-json-shapeless to implement Row marshalling for
>> arbitrary case classes. It's checked and generated at compile time,
>> supports arbitrary/nested case classes, and allows custom types. It is also
>> entirely pluggable meaning you can bypass the default implementations and
>> provide your own, just like any type class.
>>
>> https://github.com/upio/spark-sql-formats
>>
>>
>> *From:* Michael Armbrust <mi...@databricks.com>
>> *Date:* October 26, 2016 at 12:50:23 PM PDT
>> *To:* Koert Kuipers <ko...@tresata.com>
>> *Cc:* Ryan Blue <rb...@netflix.com>, "dev@spark.apache.org" <
>> dev@spark.apache.org>
>> *Subject:* *Re: getting encoder implicits to be more accurate*
>>
>> Sorry, I realize that set is only one example here, but I don't think
>> that making the type of the implicit more narrow to include only ProductN
>> or something eliminates the issue.  Even with that change, we will fail to
>> generate an encoder with the same error if you, for example, have a field
>> of your case class that is an unsupported type.
>>
>>
>>
>> Short of changing this to compile-time macros, I think we are stuck with
>> this class of errors at runtime.  The simplest solution seems to be to
>> expand the set of thing we can handle as much as possible and allow users
>> to turn on a kryo fallback for expression encoders.  I'd be hesitant to
>> make this the default though, as behavior would change with each release
>> that adds support for more types.  I would be very supportive of making
>> this fallback a built-in option though.
>>
>>
>>
>> On Wed, Oct 26, 2016 at 11:47 AM, Koert Kuipers <ko...@tresata.com>
>> wrote:
>>
>> yup, it doesnt really solve the underlying issue.
>>
>> we fixed it internally by having our own typeclass that produces encoders
>> and that does check the contents of the products, but we did this by simply
>> supporting Tuple1 - Tuple22 and Option explicitly, and not supporting
>> Product, since we dont have a need for case classes
>>
>> if case classes extended ProductN (which they will i think in scala
>> 2.12?) then we could drop Product and support Product1 - Product22 and
>> Option explicitly while checking the classes they contain. that would be
>> the cleanest.
>>
>>
>>
>> On Wed, Oct 26, 2016 at 2:33 PM, Ryan Blue <rb...@netflix.com> wrote:
>>
>> Isn't the problem that Option is a Product and the class it contains
>> isn't checked? Adding support for Set fixes the example, but the problem
>> would happen with any class there isn't an encoder for, right?
>>
>>
>>
>> On Wed, Oct 26, 2016 at 11:18 AM, Michael Armbrust <
>> michael@databricks.com> wrote:
>>
>> Hmm, that is unfortunate.  Maybe the best solution is to add support for
>> sets?  I don't think that would be super hard.
>>
>>
>>
>> On Tue, Oct 25, 2016 at 8:52 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> i am trying to use encoders as a typeclass where if it fails to find an
>> ExpressionEncoder it falls back to KryoEncoder.
>>
>> the issue seems to be that ExpressionEncoder claims a little more than it
>> can handle here:
>>   implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] =
>> Encoders.product[T]
>>
>> this "claims" to handle for example Option[Set[Int]], but it really
>> cannot handle Set so it leads to a runtime exception.
>>
>> would it be useful to make this a little more specific? i guess the
>> challenge is going to be case classes which unfortunately dont extend
>> Product1, Product2, etc.
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>>
>>
>>
>

Re: getting encoder implicits to be more accurate

Posted by Koert Kuipers <ko...@tresata.com>.
That makes sense. We have something like that in-house as well, but not as
nice and not using shapeless (we simply relied on sbt-boilerplate to handle
all tuples and do not support case classes).

However, the frustrating part is that Spark SQL already has this more or
less. Look for example at ExpressionEncoder.fromRow and
ExpressionEncoder.toRow. But these methods use InternalRow, while the rows
exposed to me as a user are not that.

At this point I am more tempted to simply open up InternalRow at a few
places strategically than to maintain another in-house row marshalling
class. Once I have InternalRows, lots of good stuff is available to me to
use.
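
For context, the methods referred to above look roughly like this in the
Spark 2.x line (internal API, so names and signatures may change); the friction
is that both sides speak InternalRow rather than the public Row:

    import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

    case class Person(name: String, age: Int)

    val enc = ExpressionEncoder[Person]().resolveAndBind()
    val internal = enc.toRow(Person("ann", 35))   // Person -> InternalRow (not the public Row)
    val back     = enc.fromRow(internal)          // InternalRow -> Person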




On Wed, Nov 2, 2016 at 12:15 AM, Sam Goodwin <sa...@gmail.com>
wrote:

> You don't need compiler time macros for this, you can do it quite easily
> using shapeless. I've been playing with a project which borrows ideas from
> spray-json and spray-json-shapeless to implement Row marshalling for
> arbitrary case classes. It's checked and generated at compile time,
> supports arbitrary/nested case classes, and allows custom types. It is also
> entirely pluggable meaning you can bypass the default implementations and
> provide your own, just like any type class.
>
> https://github.com/upio/spark-sql-formats
>
>
> *From:* Michael Armbrust <mi...@databricks.com>
> *Date:* October 26, 2016 at 12:50:23 PM PDT
> *To:* Koert Kuipers <ko...@tresata.com>
> *Cc:* Ryan Blue <rb...@netflix.com>, "dev@spark.apache.org" <
> dev@spark.apache.org>
> *Subject:* *Re: getting encoder implicits to be more accurate*
>
> Sorry, I realize that set is only one example here, but I don't think that
> making the type of the implicit more narrow to include only ProductN or
> something eliminates the issue.  Even with that change, we will fail to
> generate an encoder with the same error if you, for example, have a field
> of your case class that is an unsupported type.
>
>
>
> Short of changing this to compile-time macros, I think we are stuck with
> this class of errors at runtime.  The simplest solution seems to be to
> expand the set of thing we can handle as much as possible and allow users
> to turn on a kryo fallback for expression encoders.  I'd be hesitant to
> make this the default though, as behavior would change with each release
> that adds support for more types.  I would be very supportive of making
> this fallback a built-in option though.
>
>
>
> On Wed, Oct 26, 2016 at 11:47 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
> yup, it doesnt really solve the underlying issue.
>
> we fixed it internally by having our own typeclass that produces encoders
> and that does check the contents of the products, but we did this by simply
> supporting Tuple1 - Tuple22 and Option explicitly, and not supporting
> Product, since we dont have a need for case classes
>
> if case classes extended ProductN (which they will i think in scala 2.12?)
> then we could drop Product and support Product1 - Product22 and Option
> explicitly while checking the classes they contain. that would be the
> cleanest.
>
>
>
> On Wed, Oct 26, 2016 at 2:33 PM, Ryan Blue <rb...@netflix.com> wrote:
>
> Isn't the problem that Option is a Product and the class it contains isn't
> checked? Adding support for Set fixes the example, but the problem would
> happen with any class there isn't an encoder for, right?
>
>
>
> On Wed, Oct 26, 2016 at 11:18 AM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
> Hmm, that is unfortunate.  Maybe the best solution is to add support for
> sets?  I don't think that would be super hard.
>
>
>
> On Tue, Oct 25, 2016 at 8:52 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> i am trying to use encoders as a typeclass where if it fails to find an
> ExpressionEncoder it falls back to KryoEncoder.
>
> the issue seems to be that ExpressionEncoder claims a little more than it
> can handle here:
>   implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] =
> Encoders.product[T]
>
> this "claims" to handle for example Option[Set[Int]], but it really cannot
> handle Set so it leads to a runtime exception.
>
> would it be useful to make this a little more specific? i guess the
> challenge is going to be case classes which unfortunately dont extend
> Product1, Product2, etc.
>
>
>
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>

Re: getting encoder implicits to be more accurate

Posted by Michael Armbrust <mi...@databricks.com>.
I would definitely like to open up APIs for people to write their own
encoders.  The challenge thus far has been that Encoders use internal APIs
that have not been stable for translating the data into the tungsten
format.  We also make use of the analyzer to figure out the mapping from
columns to fields (also not a stable API).  This is the only "magic" that is
happening.

If someone wants to propose a stable / fast API here it would be great to
start the discussion.  It's an often-requested feature.
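
Purely as a strawman to seed that discussion (this is not an existing or
proposed Spark API), such a surface might avoid InternalRow and the analyzer
entirely by speaking the public Row type:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.StructType

    // Hypothetical stable encoder surface: schema plus Row-level conversions.
    trait PublicEncoder[T] extends Serializable {
      def schema: StructType
      def toRow(value: T): Row
      def fromRow(row: Row): T
    }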

On Mon, Nov 14, 2016 at 1:32 PM, Sam Goodwin <sa...@gmail.com>
wrote:

> I wouldn't recommend using a Tuple as you end up with column names like
> "_1" and "_2", but it will still work :)
>
> ExpressionEncoder can do the same thing but it doesn't support custom
> types, and as far as I can tell, does not support custom implementations.
> I.e. is it possible for me to write my own Encoder logic and completely
> bypass ExpressionEncoder? The trait definition has no useful methods so it
> doesn't seem straight-forward. If the Encoder trait was opened up so
> people could provide their own implementations then I don't see this as an
> issue anymore. It would allow for external Encoder libraries like mine
> while not neglecting Java (non-scala) developers. Is there "magic" happening
> behind the scenes stopping us from doing this?
>
> On Mon, 14 Nov 2016 at 12:31 Koert Kuipers <ko...@tresata.com> wrote:
>
>> just taking it for a quick spin it looks great, with correct support for
>> nested rows and using option for nullability.
>>
>> scala> val format = implicitly[RowFormat[(String, Seq[(String,
>> Option[Int])])]]
>> format: com.github.upio.spark.sql.RowFormat[(String, Seq[(String,
>> Option[Int])])] = com.github.upio.spark.sql.
>> FamilyFormats$$anon$3@2c0961e2
>>
>> scala> format.schema
>> res12: org.apache.spark.sql.types.StructType = StructType(StructField(_1,StringType,false),
>> StructField(_2,ArrayType(StructType(StructField(_1,StringType,false),
>> StructField(_2,IntegerType,true)),true),false))
>>
>> scala> val x = format.read(Row("a", Seq(Row("a", 5))))
>> x: (String, Seq[(String, Option[Int])]) = (a,List((a,Some(5))))
>>
>> scala> format.write(x)
>> res13: org.apache.spark.sql.Row = [a,List([a,5])]
>>
>>
>>
>> On Mon, Nov 14, 2016 at 3:10 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> agreed on your point that this can be done without macros
>>
>> On Wed, Nov 2, 2016 at 12:15 AM, Sam Goodwin <sa...@gmail.com>
>> wrote:
>>
>> You don't need compiler time macros for this, you can do it quite easily
>> using shapeless. I've been playing with a project which borrows ideas from
>> spray-json and spray-json-shapeless to implement Row marshalling for
>> arbitrary case classes. It's checked and generated at compile time,
>> supports arbitrary/nested case classes, and allows custom types. It is also
>> entirely pluggable meaning you can bypass the default implementations and
>> provide your own, just like any type class.
>>
>> https://github.com/upio/spark-sql-formats
>>
>>
>> *From:* Michael Armbrust <mi...@databricks.com>
>> *Date:* October 26, 2016 at 12:50:23 PM PDT
>> *To:* Koert Kuipers <ko...@tresata.com>
>> *Cc:* Ryan Blue <rb...@netflix.com>, "dev@spark.apache.org" <
>> dev@spark.apache.org>
>> *Subject:* *Re: getting encoder implicits to be more accurate*
>>
>> Sorry, I realize that set is only one example here, but I don't think
>> that making the type of the implicit more narrow to include only ProductN
>> or something eliminates the issue.  Even with that change, we will fail to
>> generate an encoder with the same error if you, for example, have a field
>> of your case class that is an unsupported type.
>>
>>
>>
>> Short of changing this to compile-time macros, I think we are stuck with
>> this class of errors at runtime.  The simplest solution seems to be to
>> expand the set of thing we can handle as much as possible and allow users
>> to turn on a kryo fallback for expression encoders.  I'd be hesitant to
>> make this the default though, as behavior would change with each release
>> that adds support for more types.  I would be very supportive of making
>> this fallback a built-in option though.
>>
>>
>>
>> On Wed, Oct 26, 2016 at 11:47 AM, Koert Kuipers <ko...@tresata.com>
>> wrote:
>>
>> yup, it doesnt really solve the underlying issue.
>>
>> we fixed it internally by having our own typeclass that produces encoders
>> and that does check the contents of the products, but we did this by simply
>> supporting Tuple1 - Tuple22 and Option explicitly, and not supporting
>> Product, since we dont have a need for case classes
>>
>> if case classes extended ProductN (which they will i think in scala
>> 2.12?) then we could drop Product and support Product1 - Product22 and
>> Option explicitly while checking the classes they contain. that would be
>> the cleanest.
>>
>>
>>
>> On Wed, Oct 26, 2016 at 2:33 PM, Ryan Blue <rb...@netflix.com> wrote:
>>
>> Isn't the problem that Option is a Product and the class it contains
>> isn't checked? Adding support for Set fixes the example, but the problem
>> would happen with any class there isn't an encoder for, right?
>>
>>
>>
>> On Wed, Oct 26, 2016 at 11:18 AM, Michael Armbrust <
>> michael@databricks.com> wrote:
>>
>> Hmm, that is unfortunate.  Maybe the best solution is to add support for
>> sets?  I don't think that would be super hard.
>>
>>
>>
>> On Tue, Oct 25, 2016 at 8:52 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> i am trying to use encoders as a typeclass where if it fails to find an
>> ExpressionEncoder it falls back to KryoEncoder.
>>
>> the issue seems to be that ExpressionEncoder claims a little more than it
>> can handle here:
>>   implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] =
>> Encoders.product[T]
>>
>> this "claims" to handle for example Option[Set[Int]], but it really
>> cannot handle Set so it leads to a runtime exception.
>>
>> would it be useful to make this a little more specific? i guess the
>> challenge is going to be case classes which unfortunately dont extend
>> Product1, Product2, etc.
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>>
>>
>>
>>
>>

Re: getting encoder implicits to be more accurate

Posted by Sam Goodwin <sa...@gmail.com>.
I wouldn't recommend using a Tuple as you end up with column names like
"_1" and "_2", but it will still work :)

ExpressionEncoder can do the same thing, but it doesn't support custom
types, and as far as I can tell, does not support custom implementations.
I.e. is it possible for me to write my own Encoder logic and completely
bypass ExpressionEncoder? The trait definition has no useful methods, so it
doesn't seem straightforward. If the Encoder trait was opened up so people
could provide their own implementations, then I don't see this as an issue
anymore. It would allow for external Encoder libraries like mine while not
neglecting Java (non-Scala) developers. Is there "magic" happening behind
the scenes stopping us from doing this?
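
For reference, the public trait in Spark 2.x exposes only metadata, which is
exactly the limitation described above; there is no method a third-party
implementation could hook its serialization into:

    import scala.reflect.ClassTag
    import org.apache.spark.sql.types.StructType

    trait Encoder[T] extends Serializable {
      def schema: StructType      // schema of the encoded form
      def clsTag: ClassTag[T]     // ClassTag for constructing T at runtime
    }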

On Mon, 14 Nov 2016 at 12:31 Koert Kuipers <ko...@tresata.com> wrote:

> just taking it for a quick spin it looks great, with correct support for
> nested rows and using option for nullability.
>
> scala> val format = implicitly[RowFormat[(String, Seq[(String,
> Option[Int])])]]
> format: com.github.upio.spark.sql.RowFormat[(String, Seq[(String,
> Option[Int])])] = com.github.upio.spark.sql.FamilyFormats$$anon$3@2c0961e2
>
> scala> format.schema
> res12: org.apache.spark.sql.types.StructType =
> StructType(StructField(_1,StringType,false),
> StructField(_2,ArrayType(StructType(StructField(_1,StringType,false),
> StructField(_2,IntegerType,true)),true),false))
>
> scala> val x = format.read(Row("a", Seq(Row("a", 5))))
> x: (String, Seq[(String, Option[Int])]) = (a,List((a,Some(5))))
>
> scala> format.write(x)
> res13: org.apache.spark.sql.Row = [a,List([a,5])]
>
>
>
> On Mon, Nov 14, 2016 at 3:10 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> agreed on your point that this can be done without macros
>
> On Wed, Nov 2, 2016 at 12:15 AM, Sam Goodwin <sa...@gmail.com>
> wrote:
>
> You don't need compiler time macros for this, you can do it quite easily
> using shapeless. I've been playing with a project which borrows ideas from
> spray-json and spray-json-shapeless to implement Row marshalling for
> arbitrary case classes. It's checked and generated at compile time,
> supports arbitrary/nested case classes, and allows custom types. It is also
> entirely pluggable meaning you can bypass the default implementations and
> provide your own, just like any type class.
>
> https://github.com/upio/spark-sql-formats
>
>
> *From:* Michael Armbrust <mi...@databricks.com>
> *Date:* October 26, 2016 at 12:50:23 PM PDT
> *To:* Koert Kuipers <ko...@tresata.com>
> *Cc:* Ryan Blue <rb...@netflix.com>, "dev@spark.apache.org" <
> dev@spark.apache.org>
> *Subject:* *Re: getting encoder implicits to be more accurate*
>
> Sorry, I realize that set is only one example here, but I don't think that
> making the type of the implicit more narrow to include only ProductN or
> something eliminates the issue.  Even with that change, we will fail to
> generate an encoder with the same error if you, for example, have a field
> of your case class that is an unsupported type.
>
>
>
> Short of changing this to compile-time macros, I think we are stuck with
> this class of errors at runtime.  The simplest solution seems to be to
> expand the set of thing we can handle as much as possible and allow users
> to turn on a kryo fallback for expression encoders.  I'd be hesitant to
> make this the default though, as behavior would change with each release
> that adds support for more types.  I would be very supportive of making
> this fallback a built-in option though.
>
>
>
> On Wed, Oct 26, 2016 at 11:47 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
> yup, it doesnt really solve the underlying issue.
>
> we fixed it internally by having our own typeclass that produces encoders
> and that does check the contents of the products, but we did this by simply
> supporting Tuple1 - Tuple22 and Option explicitly, and not supporting
> Product, since we dont have a need for case classes
>
> if case classes extended ProductN (which they will i think in scala 2.12?)
> then we could drop Product and support Product1 - Product22 and Option
> explicitly while checking the classes they contain. that would be the
> cleanest.
>
>
>
> On Wed, Oct 26, 2016 at 2:33 PM, Ryan Blue <rb...@netflix.com> wrote:
>
> Isn't the problem that Option is a Product and the class it contains isn't
> checked? Adding support for Set fixes the example, but the problem would
> happen with any class there isn't an encoder for, right?
>
>
>
> On Wed, Oct 26, 2016 at 11:18 AM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
> Hmm, that is unfortunate.  Maybe the best solution is to add support for
> sets?  I don't think that would be super hard.
>
>
>
> On Tue, Oct 25, 2016 at 8:52 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> i am trying to use encoders as a typeclass where if it fails to find an
> ExpressionEncoder it falls back to KryoEncoder.
>
> the issue seems to be that ExpressionEncoder claims a little more than it
> can handle here:
>   implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] =
> Encoders.product[T]
>
> this "claims" to handle for example Option[Set[Int]], but it really cannot
> handle Set so it leads to a runtime exception.
>
> would it be useful to make this a little more specific? i guess the
> challenge is going to be case classes which unfortunately dont extend
> Product1, Product2, etc.
>
>
>
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
>
>

Re: getting encoder implicits to be more accurate

Posted by Koert Kuipers <ko...@tresata.com>.
Just taking it for a quick spin, it looks great, with correct support for
nested rows and using Option for nullability.

scala> val format = implicitly[RowFormat[(String, Seq[(String, Option[Int])])]]
format: com.github.upio.spark.sql.RowFormat[(String, Seq[(String, Option[Int])])] = com.github.upio.spark.sql.FamilyFormats$$anon$3@2c0961e2

scala> format.schema
res12: org.apache.spark.sql.types.StructType = StructType(StructField(_1,StringType,false), StructField(_2,ArrayType(StructType(StructField(_1,StringType,false), StructField(_2,IntegerType,true)),true),false))

scala> val x = format.read(Row("a", Seq(Row("a", 5))))
x: (String, Seq[(String, Option[Int])]) = (a,List((a,Some(5))))

scala> format.write(x)
res13: org.apache.spark.sql.Row = [a,List([a,5])]
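
One plausible way to glue such a format onto a DataFrame (a hypothetical helper,
not taken from the library's documentation):

    import org.apache.spark.sql.DataFrame
    import com.github.upio.spark.sql.RowFormat

    // Decode every row of a collected DataFrame through an implicit RowFormat.
    def readAll[T](df: DataFrame)(implicit fmt: RowFormat[T]): Seq[T] =
      df.collect().toSeq.map(fmt.read)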



On Mon, Nov 14, 2016 at 3:10 PM, Koert Kuipers <ko...@tresata.com> wrote:

> agreed on your point that this can be done without macros
>
> On Wed, Nov 2, 2016 at 12:15 AM, Sam Goodwin <sa...@gmail.com>
> wrote:
>
>> You don't need compiler time macros for this, you can do it quite easily
>> using shapeless. I've been playing with a project which borrows ideas from
>> spray-json and spray-json-shapeless to implement Row marshalling for
>> arbitrary case classes. It's checked and generated at compile time,
>> supports arbitrary/nested case classes, and allows custom types. It is also
>> entirely pluggable meaning you can bypass the default implementations and
>> provide your own, just like any type class.
>>
>> https://github.com/upio/spark-sql-formats
>>
>>
>> *From:* Michael Armbrust <mi...@databricks.com>
>> *Date:* October 26, 2016 at 12:50:23 PM PDT
>> *To:* Koert Kuipers <ko...@tresata.com>
>> *Cc:* Ryan Blue <rb...@netflix.com>, "dev@spark.apache.org" <
>> dev@spark.apache.org>
>> *Subject:* *Re: getting encoder implicits to be more accurate*
>>
>> Sorry, I realize that set is only one example here, but I don't think
>> that making the type of the implicit more narrow to include only ProductN
>> or something eliminates the issue.  Even with that change, we will fail to
>> generate an encoder with the same error if you, for example, have a field
>> of your case class that is an unsupported type.
>>
>>
>>
>> Short of changing this to compile-time macros, I think we are stuck with
>> this class of errors at runtime.  The simplest solution seems to be to
>> expand the set of thing we can handle as much as possible and allow users
>> to turn on a kryo fallback for expression encoders.  I'd be hesitant to
>> make this the default though, as behavior would change with each release
>> that adds support for more types.  I would be very supportive of making
>> this fallback a built-in option though.
>>
>>
>>
>> On Wed, Oct 26, 2016 at 11:47 AM, Koert Kuipers <ko...@tresata.com>
>> wrote:
>>
>> yup, it doesnt really solve the underlying issue.
>>
>> we fixed it internally by having our own typeclass that produces encoders
>> and that does check the contents of the products, but we did this by simply
>> supporting Tuple1 - Tuple22 and Option explicitly, and not supporting
>> Product, since we dont have a need for case classes
>>
>> if case classes extended ProductN (which they will i think in scala
>> 2.12?) then we could drop Product and support Product1 - Product22 and
>> Option explicitly while checking the classes they contain. that would be
>> the cleanest.
>>
>>
>>
>> On Wed, Oct 26, 2016 at 2:33 PM, Ryan Blue <rb...@netflix.com> wrote:
>>
>> Isn't the problem that Option is a Product and the class it contains
>> isn't checked? Adding support for Set fixes the example, but the problem
>> would happen with any class there isn't an encoder for, right?
>>
>>
>>
>> On Wed, Oct 26, 2016 at 11:18 AM, Michael Armbrust <
>> michael@databricks.com> wrote:
>>
>> Hmm, that is unfortunate.  Maybe the best solution is to add support for
>> sets?  I don't think that would be super hard.
>>
>>
>>
>> On Tue, Oct 25, 2016 at 8:52 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> i am trying to use encoders as a typeclass where if it fails to find an
>> ExpressionEncoder it falls back to KryoEncoder.
>>
>> the issue seems to be that ExpressionEncoder claims a little more than it
>> can handle here:
>>   implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] =
>> Encoders.product[T]
>>
>> this "claims" to handle for example Option[Set[Int]], but it really
>> cannot handle Set so it leads to a runtime exception.
>>
>> would it be useful to make this a little more specific? i guess the
>> challenge is going to be case classes which unfortunately dont extend
>> Product1, Product2, etc.
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>>
>>
>>
>

Re: getting encoder implicits to be more accurate

Posted by Koert Kuipers <ko...@tresata.com>.
Agreed on your point that this can be done without macros.

On Wed, Nov 2, 2016 at 12:15 AM, Sam Goodwin <sa...@gmail.com>
wrote:

> You don't need compiler time macros for this, you can do it quite easily
> using shapeless. I've been playing with a project which borrows ideas from
> spray-json and spray-json-shapeless to implement Row marshalling for
> arbitrary case classes. It's checked and generated at compile time,
> supports arbitrary/nested case classes, and allows custom types. It is also
> entirely pluggable meaning you can bypass the default implementations and
> provide your own, just like any type class.
>
> https://github.com/upio/spark-sql-formats
>
>
> *From:* Michael Armbrust <mi...@databricks.com>
> *Date:* October 26, 2016 at 12:50:23 PM PDT
> *To:* Koert Kuipers <ko...@tresata.com>
> *Cc:* Ryan Blue <rb...@netflix.com>, "dev@spark.apache.org" <
> dev@spark.apache.org>
> *Subject:* *Re: getting encoder implicits to be more accurate*
>
> Sorry, I realize that set is only one example here, but I don't think that
> making the type of the implicit more narrow to include only ProductN or
> something eliminates the issue.  Even with that change, we will fail to
> generate an encoder with the same error if you, for example, have a field
> of your case class that is an unsupported type.
>
>
>
> Short of changing this to compile-time macros, I think we are stuck with
> this class of errors at runtime.  The simplest solution seems to be to
> expand the set of thing we can handle as much as possible and allow users
> to turn on a kryo fallback for expression encoders.  I'd be hesitant to
> make this the default though, as behavior would change with each release
> that adds support for more types.  I would be very supportive of making
> this fallback a built-in option though.
>
>
>
> On Wed, Oct 26, 2016 at 11:47 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
> yup, it doesnt really solve the underlying issue.
>
> we fixed it internally by having our own typeclass that produces encoders
> and that does check the contents of the products, but we did this by simply
> supporting Tuple1 - Tuple22 and Option explicitly, and not supporting
> Product, since we dont have a need for case classes
>
> if case classes extended ProductN (which they will i think in scala 2.12?)
> then we could drop Product and support Product1 - Product22 and Option
> explicitly while checking the classes they contain. that would be the
> cleanest.
>
>
>
> On Wed, Oct 26, 2016 at 2:33 PM, Ryan Blue <rb...@netflix.com> wrote:
>
> Isn't the problem that Option is a Product and the class it contains isn't
> checked? Adding support for Set fixes the example, but the problem would
> happen with any class there isn't an encoder for, right?
>
>
>
> On Wed, Oct 26, 2016 at 11:18 AM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
> Hmm, that is unfortunate.  Maybe the best solution is to add support for
> sets?  I don't think that would be super hard.
>
>
>
> On Tue, Oct 25, 2016 at 8:52 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> i am trying to use encoders as a typeclass where if it fails to find an
> ExpressionEncoder it falls back to KryoEncoder.
>
> the issue seems to be that ExpressionEncoder claims a little more than it
> can handle here:
>   implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] =
> Encoders.product[T]
>
> this "claims" to handle for example Option[Set[Int]], but it really cannot
> handle Set so it leads to a runtime exception.
>
> would it be useful to make this a little more specific? i guess the
> challenge is going to be case classes which unfortunately dont extend
> Product1, Product2, etc.
>
>
>
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>