Posted to user@spark.apache.org by Keith Chapman <ke...@gmail.com> on 2017/03/22 23:18:21 UTC

Having issues reading a csv file into a DataSet using Spark 2.1

Hi,

I'm trying to read a CSV file into a Dataset but keep running into compilation
issues. I'm using Spark 2.1, and the following is a small program that
exhibits the issue I'm having. I've searched around but haven't found a solution
that works; I've added "import sqlContext.implicits._" as suggested, but no
luck. What am I missing? I'd appreciate some advice.

import org.apache.spark.sql.functions._
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.{Encoder,Encoders}

object DatasetTest{

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("DatasetTest")
    val sc = new SparkContext(sparkConf)
    case class Foo(text: String)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    val ds : org.apache.spark.sql.Dataset[Foo] =
      sqlContext.read.csv(args(1)).as[Foo]
    ds.show
  }
}

Compiling the above program gives the error below. I'd expect it to work since
it's a simple case class; changing it to as[String] works, but I would like to
get the case class version to work.

[error] /home/keith/dataset/DataSetTest.scala:13: Unable to find encoder
for type stored in a Dataset.  Primitive types (Int, String, etc) and
Product types (case classes) are supported by importing spark.implicits._
Support for serializing other types will be added in future releases.
[error]     val ds : org.apache.spark.sql.Dataset[Foo] =
sqlContext.read.csv(args(1)).as[Foo]


Regards,
Keith.

Re: Having issues reading a csv file into a DataSet using Spark 2.1

Posted by Diego Fanesi <di...@gmail.com>.
that variable "x" would be a DataFrame which is an alias of Dataset in the
last versions. you can do your map operation by doing x.map(case
Row(f1:String, f2:Int, ....) => [your code]). f1 and f2 stands for the
columns of your dataset with the type. in the code you can use f1 and f2 as
variables to make your map function.
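
For example, something along these lines. This is just an untested sketch; the
column types and the file path are assumptions for a two-column CSV holding a
string and an integer, not your actual data:

import org.apache.spark.sql.{Row, SparkSession}

object RowMapExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("RowMapExample")
      .getOrCreate()
    // Needed for the Encoder of the mapped result
    import spark.implicits._

    // inferSchema so the second column comes back as an Int instead of a String
    val x = spark.read
      .option("inferSchema", "true")
      .csv("/home/user/data.csv")

    // Pattern-match each Row into typed fields; the result is a Dataset[(String, Int)]
    val mapped = x.map { case Row(f1: String, f2: Int) => (f1, f2 * 2) }

    mapped.show()
    spark.stop()
  }
}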

Re: Having issues reading a csv file into a DataSet using Spark 2.1

Posted by Keith Chapman <ke...@gmail.com>.
Thanks for the advice Diego, that was very helpful. How could I read the CSV as
a Dataset, though? I need to do a map operation over the dataset; I just coded
up an example to illustrate the issue.

Re: Having issues reading a csv file into a DataSet using Spark 2.1

Posted by Diego Fanesi <di...@gmail.com>.
You are using Spark as a library, but it is much more than that. The book
"Learning Spark" is very well done and it helped me a lot when starting with
Spark. Maybe you should start from there.

These are the issues in your code:

Basically, you generally don't execute Spark code like that. You could, but it
is not officially supported and many functions don't work that way. You should
start your local cluster made of a master and a single worker, then build a jar
with your code and use spark-submit to send it to the cluster.

You generally never use args because Spark is a multi-process, multi-threaded
application, so args will not be available everywhere.

All contexts have been merged into a single context in recent versions of
Spark, so you will need to do something like this:

import org.apache.spark.sql.{DataFrame, SparkSession}

object DatasetTest {

  // One unified entry point in Spark 2.x; it replaces SparkContext/SQLContext setup
  val spark: SparkSession = SparkSession
    .builder()
    .master("local[8]")
    .appName("Spark basic example")
    .getOrCreate()

  import spark.implicits._

  def main(args: Array[String]): Unit = {

    // load returns a DataFrame (Dataset[Row]) with default column names _c0, _c1, ...
    val x: DataFrame = spark.read.format("csv").load("/home/user/data.csv")

    x.show()
  }
}
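
And if you specifically want a Dataset[Foo] like in your original snippet, here
is a rough, untested sketch of one way it might look. I'm assuming the case
class is moved out of main to the top level (which is usually what the "Unable
to find encoder" message points at) and that the CSV has a single column that
should map to the text field:

import org.apache.spark.sql.{Dataset, SparkSession}

// Defined at top level, not inside main, so spark.implicits._ can derive an Encoder
case class Foo(text: String)

object DatasetFooTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("DatasetFooTest")
      .getOrCreate()
    import spark.implicits._

    // read.csv names the column _c0 by default, so rename it to match Foo's field
    val ds: Dataset[Foo] = spark.read
      .csv("/home/user/data.csv")
      .toDF("text")
      .as[Foo]

    // A typed map over the Dataset, e.g. upper-casing every line
    val upper: Dataset[Foo] = ds.map(f => Foo(f.text.toUpperCase))

    upper.show()
    spark.stop()
  }
}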


hope this helps.

Diego
