Posted to user@spark.apache.org by Don Drake <do...@gmail.com> on 2017/02/01 04:12:10 UTC

Parameterized types and Datasets - Spark 2.1.0

I have a set of CSV files that I need to perform ETL on, with the plan to
re-use a lot of code across the files via a parent abstract class.

I tried creating the following simple abstract class, parameterized by the
case class that represents the schema being read in.

This won't compile; it just complains about not being able to find an
encoder, but I'm importing the implicits and don't see why this error occurs.


scala> import spark.implicits._
import spark.implicits._

scala>

scala> case class RawTemp(f1: String, f2: String, temp: Long, created_at:
java.sql.Timestamp, data_filename: String)
defined class RawTemp

scala>

scala> abstract class RawTable[A](inDir: String) {
     |     def load() = {
     |         spark.read
     |             .option("header", "true")
     |             .option("mode", "FAILFAST")
     |             .option("escape", "\"")
     |             .option("nullValue", "")
     |             .option("indferSchema", "true")
     |             .csv(inDir)
     |             .as[A]
     |     }
     | }
<console>:27: error: Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes) are
supported by importing spark.implicits._  Support for serializing other
types will be added in future releases.
                   .as[A]

scala> class TempTable extends RawTable[RawTemp]("/user/drake/t.csv")
<console>:13: error: not found: type RawTable
       class TempTable extends RawTable[RawTemp]("/user/drake/t.csv")
                      ^

What's odd is that this output looks okay:

scala> val RTEncoder = Encoders.product[RawTemp]
RTEncoder: org.apache.spark.sql.Encoder[RawTemp] = class[f1[0]: string,
f2[0]: string, temp[0]: bigint, created_at[0]: timestamp, data_filename[0]:
string]

scala> RTEncoder.schema
res4: org.apache.spark.sql.types.StructType =
StructType(StructField(f1,StringType,true),
StructField(f2,StringType,true), StructField(temp,LongType,false),
StructField(created_at,TimestampType,true),
StructField(data_filename,StringType,true))

scala> RTEncoder.clsTag
res5: scala.reflect.ClassTag[RawTemp] = RawTemp

Any ideas?

-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake
800-733-2143

Re: Parameterized types and Datasets - Spark 2.1.0

Posted by Don Drake <do...@gmail.com>.
I imported that as my first command in my previous email.  I'm using a
spark-shell.

scala> import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Encoder

scala>


Any comments regarding importing implicits in an application?

Thanks.

-Don

On Wed, Feb 1, 2017 at 6:10 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> This is the error, you are missing an import:
>
> <console>:13: error: not found: type Encoder
>        abstract class RawTable[A : Encoder](inDir: String) {
>
> Works for me in a REPL.
> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/204687029790319/2840265927289860/latest.html>
>
> On Wed, Feb 1, 2017 at 3:34 PM, Don Drake <do...@gmail.com> wrote:
>
>> Thanks for the reply.   I did give that syntax a try [A : Encoder]
>> yesterday, but I kept getting this exception in a spark-shell and Zeppelin
>> browser.
>>
>> scala> import org.apache.spark.sql.Encoder
>> import org.apache.spark.sql.Encoder
>>
>> scala>
>>
>> scala> case class RawTemp(f1: String, f2: String, temp: Long, created_at:
>> java.sql.Timestamp, data_filename: String)
>> defined class RawTemp
>>
>> scala>
>>
>> scala> import spark.implicits._
>> import spark.implicits._
>>
>> scala>
>>
>> scala> abstract class RawTable[A : Encoder](inDir: String) {
>>      |     import spark.implicits._
>>      |     def load() = {
>>      |         import spark.implicits._
>>      |         spark.read
>>      |             .option("header", "true")
>>      |             .option("mode", "FAILFAST")
>>      |             .option("escape", "\"")
>>      |             .option("nullValue", "")
>>      |             .option("indferSchema", "true")
>>      |             .csv(inDir)
>>      |             .as[A]
>>      |     }
>>      | }
>> <console>:13: error: not found: type Encoder
>>        abstract class RawTable[A : Encoder](inDir: String) {
>>                                    ^
>> <console>:24: error: Unable to find encoder for type stored in a
>> Dataset.  Primitive types (Int, String, etc) and Product types (case
>> classes) are supported by importing spark.implicits._  Support for
>> serializing other types will be added in future releases.
>>                    .as[A]
>>
>>
>> I gave it a try today in a Scala application and it seems to work.  Is
>> this a known issue in a spark-shell?
>>
>> In my Scala application, this is being defined in a separate file, etc.
>> without direct access to a Spark session.
>>
>> I had to add the following code snippet so the import spark.implicits._
>> would take effect:
>>
>> // ugly hack to get around Encoder can't be found compile time errors
>>
>> private object myImplicits extends SQLImplicits {
>>
>>   protected override def _sqlContext: SQLContext =
>> MySparkSingleton.getCurrentSession().sqlContext
>>
>> }
>>
>> import myImplicits._
>>
>> I found that in about the hundredth SO post I searched for this problem.
>> Is this the best way to let implicits do its thing?
>>
>> Thanks.
>>
>> -Don
>>
>>
>>
>> On Wed, Feb 1, 2017 at 3:16 PM, Michael Armbrust <mi...@databricks.com>
>> wrote:
>>
>>> You need to enforce that an Encoder is available for the type A using a context
>>> bound <http://docs.scala-lang.org/tutorials/FAQ/context-bounds>.
>>>
>>> import org.apache.spark.sql.Encoder
>>> abstract class RawTable[A : Encoder](inDir: String) {
>>>   ...
>>> }
>>>
>>> On Tue, Jan 31, 2017 at 8:12 PM, Don Drake <do...@gmail.com> wrote:
>>>
>>>> I have a set of CSV that I need to perform ETL on, with the plan to
>>>> re-use a lot of code between each file in a parent abstract class.
>>>>
>>>> I tried creating the following simple abstract class that will have a
>>>> parameterized type of a case class that represents the schema being read in.
>>>>
>>>> This won't compile, it just complains about not being able to find an
>>>> encoder, but I'm importing the implicits and don't believe this error.
>>>>
>>>>
>>>> scala> import spark.implicits._
>>>> import spark.implicits._
>>>>
>>>> scala>
>>>>
>>>> scala> case class RawTemp(f1: String, f2: String, temp: Long,
>>>> created_at: java.sql.Timestamp, data_filename: String)
>>>> defined class RawTemp
>>>>
>>>> scala>
>>>>
>>>> scala> abstract class RawTable[A](inDir: String) {
>>>>      |     def load() = {
>>>>      |         spark.read
>>>>      |             .option("header", "true")
>>>>      |             .option("mode", "FAILFAST")
>>>>      |             .option("escape", "\"")
>>>>      |             .option("nullValue", "")
>>>>      |             .option("indferSchema", "true")
>>>>      |             .csv(inDir)
>>>>      |             .as[A]
>>>>      |     }
>>>>      | }
>>>> <console>:27: error: Unable to find encoder for type stored in a
>>>> Dataset.  Primitive types (Int, String, etc) and Product types (case
>>>> classes) are supported by importing spark.implicits._  Support for
>>>> serializing other types will be added in future releases.
>>>>                    .as[A]
>>>>
>>>> scala> class TempTable extends RawTable[RawTemp]("/user/drake/t.csv")
>>>> <console>:13: error: not found: type RawTable
>>>>        class TempTable extends RawTable[RawTemp]("/user/drake/t.csv")
>>>>                       ^
>>>>
>>>> What's odd is that this output looks okay:
>>>>
>>>> scala> val RTEncoder = Encoders.product[RawTemp]
>>>> RTEncoder: org.apache.spark.sql.Encoder[RawTemp] = class[f1[0]:
>>>> string, f2[0]: string, temp[0]: bigint, created_at[0]: timestamp,
>>>> data_filename[0]: string]
>>>>
>>>> scala> RTEncoder.schema
>>>> res4: org.apache.spark.sql.types.StructType =
>>>> StructType(StructField(f1,StringType,true),
>>>> StructField(f2,StringType,true), StructField(temp,LongType,false),
>>>> StructField(created_at,TimestampType,true),
>>>> StructField(data_filename,StringType,true))
>>>>
>>>> scala> RTEncoder.clsTag
>>>> res5: scala.reflect.ClassTag[RawTemp] = RawTemp
>>>>
>>>> Any ideas?
>>>>
>>>> --
>>>> Donald Drake
>>>> Drake Consulting
>>>> http://www.drakeconsulting.com/
>>>> https://twitter.com/dondrake <http://www.MailLaunder.com/>
>>>> 800-733-2143 <(800)%20733-2143>
>>>>
>>>
>>>
>>
>>
>> --
>> Donald Drake
>> Drake Consulting
>> http://www.drakeconsulting.com/
>> https://twitter.com/dondrake <http://www.MailLaunder.com/>
>> 800-733-2143 <(800)%20733-2143>
>>
>
>


-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake
800-733-2143

Re: Parameterized types and Datasets - Spark 2.1.0

Posted by Michael Armbrust <mi...@databricks.com>.
This is the error; you are missing an import:

<console>:13: error: not found: type Encoder
       abstract class RawTable[A : Encoder](inDir: String) {

Works for me in a REPL.
<https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/204687029790319/2840265927289860/latest.html>
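
Putting the pieces of this thread together, a definition along these lines
compiles (a sketch only, combining the context bound with the code from the
original post; spark is the SparkSession the shell provides, and importing
spark.implicits._ is what supplies the Encoder for the concrete RawTemp):

import org.apache.spark.sql.Encoder
import spark.implicits._

case class RawTemp(f1: String, f2: String, temp: Long,
                   created_at: java.sql.Timestamp, data_filename: String)

abstract class RawTable[A : Encoder](inDir: String) {
  def load() = {
    spark.read
      .option("header", "true")
      .option("mode", "FAILFAST")
      .option("escape", "\"")
      .option("nullValue", "")
      .option("inferSchema", "true")
      .csv(inDir)
      .as[A]          // resolved by the Encoder[A] from the context bound
  }
}

// Encoder[RawTemp] is derived via spark.implicits._ when the subclass is defined
class TempTable extends RawTable[RawTemp]("/user/drake/t.csv")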

On Wed, Feb 1, 2017 at 3:34 PM, Don Drake <do...@gmail.com> wrote:

> Thanks for the reply.   I did give that syntax a try [A : Encoder]
> yesterday, but I kept getting this exception in a spark-shell and Zeppelin
> browser.
>
> scala> import org.apache.spark.sql.Encoder
> import org.apache.spark.sql.Encoder
>
> scala>
>
> scala> case class RawTemp(f1: String, f2: String, temp: Long, created_at:
> java.sql.Timestamp, data_filename: String)
> defined class RawTemp
>
> scala>
>
> scala> import spark.implicits._
> import spark.implicits._
>
> scala>
>
> scala> abstract class RawTable[A : Encoder](inDir: String) {
>      |     import spark.implicits._
>      |     def load() = {
>      |         import spark.implicits._
>      |         spark.read
>      |             .option("header", "true")
>      |             .option("mode", "FAILFAST")
>      |             .option("escape", "\"")
>      |             .option("nullValue", "")
>      |             .option("indferSchema", "true")
>      |             .csv(inDir)
>      |             .as[A]
>      |     }
>      | }
> <console>:13: error: not found: type Encoder
>        abstract class RawTable[A : Encoder](inDir: String) {
>                                    ^
> <console>:24: error: Unable to find encoder for type stored in a Dataset.
> Primitive types (Int, String, etc) and Product types (case classes) are
> supported by importing spark.implicits._  Support for serializing other
> types will be added in future releases.
>                    .as[A]
>
>
> I gave it a try today in a Scala application and it seems to work.  Is
> this a known issue in a spark-shell?
>
> In my Scala application, this is being defined in a separate file, etc.
> without direct access to a Spark session.
>
> I had to add the following code snippet so the import spark.implicits._
> would take effect:
>
> // ugly hack to get around Encoder can't be found compile time errors
>
> private object myImplicits extends SQLImplicits {
>
>   protected override def _sqlContext: SQLContext = MySparkSingleton.
> getCurrentSession().sqlContext
>
> }
>
> import myImplicits._
>
> I found that in about the hundredth SO post I searched for this problem.
> Is this the best way to let implicits do its thing?
>
> Thanks.
>
> -Don
>
>
>
> On Wed, Feb 1, 2017 at 3:16 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> You need to enforce that an Encoder is available for the type A using a context
>> bound <http://docs.scala-lang.org/tutorials/FAQ/context-bounds>.
>>
>> import org.apache.spark.sql.Encoder
>> abstract class RawTable[A : Encoder](inDir: String) {
>>   ...
>> }
>>
>> On Tue, Jan 31, 2017 at 8:12 PM, Don Drake <do...@gmail.com> wrote:
>>
>>> I have a set of CSV that I need to perform ETL on, with the plan to
>>> re-use a lot of code between each file in a parent abstract class.
>>>
>>> I tried creating the following simple abstract class that will have a
>>> parameterized type of a case class that represents the schema being read in.
>>>
>>> This won't compile, it just complains about not being able to find an
>>> encoder, but I'm importing the implicits and don't believe this error.
>>>
>>>
>>> scala> import spark.implicits._
>>> import spark.implicits._
>>>
>>> scala>
>>>
>>> scala> case class RawTemp(f1: String, f2: String, temp: Long,
>>> created_at: java.sql.Timestamp, data_filename: String)
>>> defined class RawTemp
>>>
>>> scala>
>>>
>>> scala> abstract class RawTable[A](inDir: String) {
>>>      |     def load() = {
>>>      |         spark.read
>>>      |             .option("header", "true")
>>>      |             .option("mode", "FAILFAST")
>>>      |             .option("escape", "\"")
>>>      |             .option("nullValue", "")
>>>      |             .option("indferSchema", "true")
>>>      |             .csv(inDir)
>>>      |             .as[A]
>>>      |     }
>>>      | }
>>> <console>:27: error: Unable to find encoder for type stored in a
>>> Dataset.  Primitive types (Int, String, etc) and Product types (case
>>> classes) are supported by importing spark.implicits._  Support for
>>> serializing other types will be added in future releases.
>>>                    .as[A]
>>>
>>> scala> class TempTable extends RawTable[RawTemp]("/user/drake/t.csv")
>>> <console>:13: error: not found: type RawTable
>>>        class TempTable extends RawTable[RawTemp]("/user/drake/t.csv")
>>>                       ^
>>>
>>> What's odd is that this output looks okay:
>>>
>>> scala> val RTEncoder = Encoders.product[RawTemp]
>>> RTEncoder: org.apache.spark.sql.Encoder[RawTemp] = class[f1[0]: string,
>>> f2[0]: string, temp[0]: bigint, created_at[0]: timestamp, data_filename[0]:
>>> string]
>>>
>>> scala> RTEncoder.schema
>>> res4: org.apache.spark.sql.types.StructType =
>>> StructType(StructField(f1,StringType,true),
>>> StructField(f2,StringType,true), StructField(temp,LongType,false),
>>> StructField(created_at,TimestampType,true),
>>> StructField(data_filename,StringType,true))
>>>
>>> scala> RTEncoder.clsTag
>>> res5: scala.reflect.ClassTag[RawTemp] = RawTemp
>>>
>>> Any ideas?
>>>
>>> --
>>> Donald Drake
>>> Drake Consulting
>>> http://www.drakeconsulting.com/
>>> https://twitter.com/dondrake <http://www.MailLaunder.com/>
>>> 800-733-2143 <(800)%20733-2143>
>>>
>>
>>
>
>
> --
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> https://twitter.com/dondrake <http://www.MailLaunder.com/>
> 800-733-2143 <(800)%20733-2143>
>

Re: Parameterized types and Datasets - Spark 2.1.0

Posted by Don Drake <do...@gmail.com>.
Thanks for the reply. I did give that syntax, [A : Encoder], a try
yesterday, but I kept getting this error in both a spark-shell and a
Zeppelin notebook.

scala> import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Encoder

scala>

scala> case class RawTemp(f1: String, f2: String, temp: Long, created_at:
java.sql.Timestamp, data_filename: String)
defined class RawTemp

scala>

scala> import spark.implicits._
import spark.implicits._

scala>

scala> abstract class RawTable[A : Encoder](inDir: String) {
     |     import spark.implicits._
     |     def load() = {
     |         import spark.implicits._
     |         spark.read
     |             .option("header", "true")
     |             .option("mode", "FAILFAST")
     |             .option("escape", "\"")
     |             .option("nullValue", "")
     |             .option("indferSchema", "true")
     |             .csv(inDir)
     |             .as[A]
     |     }
     | }
<console>:13: error: not found: type Encoder
       abstract class RawTable[A : Encoder](inDir: String) {
                                   ^
<console>:24: error: Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes) are
supported by importing spark.implicits._  Support for serializing other
types will be added in future releases.
                   .as[A]


I gave it a try today in a Scala application and it seems to work.  Is this
a known issue in a spark-shell?

In my Scala application, this class is defined in a separate file, without
direct access to a Spark session.

I had to add the following code snippet so the import spark.implicits._
would take effect:

// ugly hack to get around "Encoder can't be found" compile-time errors

private object myImplicits extends SQLImplicits {

  protected override def _sqlContext: SQLContext =
MySparkSingleton.getCurrentSession().sqlContext

}

import myImplicits._

I found that in about the hundredth SO post I read while searching for this
problem. Is this the best way to let the implicits do their thing?
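
For reference, here is a rough sketch of how these pieces can fit together
in a compiled application, outside the REPL. It is only an illustration:
MySparkSingleton is the hypothetical session holder referenced in the
snippet above, and its shape here is an assumption, not a recommendation.

import org.apache.spark.sql.{Dataset, Encoder, SparkSession, SQLContext, SQLImplicits}

// Hypothetical helper that returns the active session (assumed; mirrors the
// MySparkSingleton referenced above).
object MySparkSingleton {
  def getCurrentSession(): SparkSession = SparkSession.builder().getOrCreate()
}

// Implicits tied to whatever session is current, usable from files that have
// no SparkSession in scope at compile time.
object myImplicits extends SQLImplicits {
  protected override def _sqlContext: SQLContext =
    MySparkSingleton.getCurrentSession().sqlContext
}

abstract class RawTable[A : Encoder](inDir: String) {
  import myImplicits._   // brings SQLImplicits into scope, as in the workaround above

  def load(): Dataset[A] =
    MySparkSingleton.getCurrentSession().read
      .option("header", "true")
      .option("mode", "FAILFAST")
      .option("inferSchema", "true")
      .csv(inDir)
      .as[A]             // satisfied directly by the [A : Encoder] context bound
}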

Thanks.

-Don



On Wed, Feb 1, 2017 at 3:16 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> You need to enforce that an Encoder is available for the type A using a context
> bound <http://docs.scala-lang.org/tutorials/FAQ/context-bounds>.
>
> import org.apache.spark.sql.Encoder
> abstract class RawTable[A : Encoder](inDir: String) {
>   ...
> }
>
> On Tue, Jan 31, 2017 at 8:12 PM, Don Drake <do...@gmail.com> wrote:
>
>> I have a set of CSV that I need to perform ETL on, with the plan to
>> re-use a lot of code between each file in a parent abstract class.
>>
>> I tried creating the following simple abstract class that will have a
>> parameterized type of a case class that represents the schema being read in.
>>
>> This won't compile, it just complains about not being able to find an
>> encoder, but I'm importing the implicits and don't believe this error.
>>
>>
>> scala> import spark.implicits._
>> import spark.implicits._
>>
>> scala>
>>
>> scala> case class RawTemp(f1: String, f2: String, temp: Long, created_at:
>> java.sql.Timestamp, data_filename: String)
>> defined class RawTemp
>>
>> scala>
>>
>> scala> abstract class RawTable[A](inDir: String) {
>>      |     def load() = {
>>      |         spark.read
>>      |             .option("header", "true")
>>      |             .option("mode", "FAILFAST")
>>      |             .option("escape", "\"")
>>      |             .option("nullValue", "")
>>      |             .option("indferSchema", "true")
>>      |             .csv(inDir)
>>      |             .as[A]
>>      |     }
>>      | }
>> <console>:27: error: Unable to find encoder for type stored in a
>> Dataset.  Primitive types (Int, String, etc) and Product types (case
>> classes) are supported by importing spark.implicits._  Support for
>> serializing other types will be added in future releases.
>>                    .as[A]
>>
>> scala> class TempTable extends RawTable[RawTemp]("/user/drake/t.csv")
>> <console>:13: error: not found: type RawTable
>>        class TempTable extends RawTable[RawTemp]("/user/drake/t.csv")
>>                       ^
>>
>> What's odd is that this output looks okay:
>>
>> scala> val RTEncoder = Encoders.product[RawTemp]
>> RTEncoder: org.apache.spark.sql.Encoder[RawTemp] = class[f1[0]: string,
>> f2[0]: string, temp[0]: bigint, created_at[0]: timestamp, data_filename[0]:
>> string]
>>
>> scala> RTEncoder.schema
>> res4: org.apache.spark.sql.types.StructType =
>> StructType(StructField(f1,StringType,true),
>> StructField(f2,StringType,true), StructField(temp,LongType,false),
>> StructField(created_at,TimestampType,true),
>> StructField(data_filename,StringType,true))
>>
>> scala> RTEncoder.clsTag
>> res5: scala.reflect.ClassTag[RawTemp] = RawTemp
>>
>> Any ideas?
>>
>> --
>> Donald Drake
>> Drake Consulting
>> http://www.drakeconsulting.com/
>> https://twitter.com/dondrake <http://www.MailLaunder.com/>
>> 800-733-2143 <(800)%20733-2143>
>>
>
>


-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake
800-733-2143

Re: Parameterized types and Datasets - Spark 2.1.0

Posted by Michael Armbrust <mi...@databricks.com>.
You need to enforce that an Encoder is available for the type A using a context
bound <http://docs.scala-lang.org/tutorials/FAQ/context-bounds>.

import org.apache.spark.sql.Encoder
abstract class RawTable[A : Encoder](inDir: String) {
  ...
}
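
The context bound is just shorthand for an implicit constructor parameter,
so the class above is roughly equivalent to the following desugared sketch
(illustrative only; load takes the session explicitly here to keep the
example self-contained):

import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

// Desugared form: an Encoder[A] must be supplied, implicitly, wherever a
// concrete subclass such as RawTable[RawTemp] is constructed.
abstract class RawTable[A](inDir: String)(implicit enc: Encoder[A]) {
  def load(spark: SparkSession): Dataset[A] =
    spark.read
      .option("header", "true")
      .csv(inDir)
      .as[A]        // uses the implicit enc captured by the constructor
}

At the point where a concrete subclass is defined, importing
spark.implicits._ (or otherwise having an Encoder for the case class in
scope) is what satisfies that implicit parameter.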

On Tue, Jan 31, 2017 at 8:12 PM, Don Drake <do...@gmail.com> wrote:

> I have a set of CSV that I need to perform ETL on, with the plan to re-use
> a lot of code between each file in a parent abstract class.
>
> I tried creating the following simple abstract class that will have a
> parameterized type of a case class that represents the schema being read in.
>
> This won't compile, it just complains about not being able to find an
> encoder, but I'm importing the implicits and don't believe this error.
>
>
> scala> import spark.implicits._
> import spark.implicits._
>
> scala>
>
> scala> case class RawTemp(f1: String, f2: String, temp: Long, created_at:
> java.sql.Timestamp, data_filename: String)
> defined class RawTemp
>
> scala>
>
> scala> abstract class RawTable[A](inDir: String) {
>      |     def load() = {
>      |         spark.read
>      |             .option("header", "true")
>      |             .option("mode", "FAILFAST")
>      |             .option("escape", "\"")
>      |             .option("nullValue", "")
>      |             .option("indferSchema", "true")
>      |             .csv(inDir)
>      |             .as[A]
>      |     }
>      | }
> <console>:27: error: Unable to find encoder for type stored in a Dataset.
> Primitive types (Int, String, etc) and Product types (case classes) are
> supported by importing spark.implicits._  Support for serializing other
> types will be added in future releases.
>                    .as[A]
>
> scala> class TempTable extends RawTable[RawTemp]("/user/drake/t.csv")
> <console>:13: error: not found: type RawTable
>        class TempTable extends RawTable[RawTemp]("/user/drake/t.csv")
>                       ^
>
> What's odd is that this output looks okay:
>
> scala> val RTEncoder = Encoders.product[RawTemp]
> RTEncoder: org.apache.spark.sql.Encoder[RawTemp] = class[f1[0]: string,
> f2[0]: string, temp[0]: bigint, created_at[0]: timestamp, data_filename[0]:
> string]
>
> scala> RTEncoder.schema
> res4: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true),
> StructField(f2,StringType,true), StructField(temp,LongType,false),
> StructField(created_at,TimestampType,true), StructField(data_filename,
> StringType,true))
>
> scala> RTEncoder.clsTag
> res5: scala.reflect.ClassTag[RawTemp] = RawTemp
>
> Any ideas?
>
> --
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> https://twitter.com/dondrake <http://www.MailLaunder.com/>
> 800-733-2143 <(800)%20733-2143>
>