Posted to user@spark.apache.org by ca...@free.fr on 2022/02/08 10:16:22 UTC

question on the different way of RDD to dataframe

Hello

I am converting some Python code to Scala.
This works in Python:

>>> rdd = sc.parallelize([('apple',1),('orange',2)])
>>> rdd.toDF(['fruit','num']).show()
+------+---+
| fruit|num|
+------+---+
| apple|  1|
|orange|  2|
+------+---+

And in Scala:
scala> rdd.toDF("fruit","num").show()
+------+---+
| fruit|num|
+------+---+
| apple|  1|
|orange|  2|
+------+---+

But I have seen a lot of code that uses a case class for the conversion.

scala> case class Fruit(fruit:String,num:Int)
defined class Fruit

scala> rdd.map{case (x,y) => Fruit(x,y) }.toDF().show()
+------+---+
| fruit|num|
+------+---+
| apple|  1|
|orange|  2|
+------+---+


Do you know why one would use a case class here?

thanks.


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

Re: question on the different way of RDD to dataframe

Posted by frakass <ca...@free.fr>.

I think it's better as:

df1.map { case(w,x,y,z) => columns(w,x,y,z) }
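
One caveat with a pattern like this, as far as I can tell: df1 is a DataFrame, so each element passed to map is a Row rather than a tuple, and a plain tuple pattern would not match it. Below is a minimal sketch of the same idea using the Row extractor. It assumes a local SparkSession and a small in-memory stand-in for the CSV; the names Quote, RowPatternDemo and prices, and the sample values, are made up for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical stand-in for the "columns" case class in this thread.
case class Quote(KEY: String, TICKER: String, TIMEISSUED: String, PRICE: Double)

object RowPatternDemo {
  def prices(): Seq[Double] = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-pattern-demo")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for a header-less CSV read: every column is a string,
    // and there are more columns than the four we care about.
    val df1 = Seq(
      ("k1", "IBM", "2022-02-08 10:00:00", "101.5", "unused"),
      ("k2", "MSFT", "2022-02-08 10:00:01", "300.25", "unused")
    ).toDF("_c0", "_c1", "_c2", "_c3", "_c4")

    // Match on Row, bind the first four fields as Strings,
    // and ignore any remaining fields with _*.
    val df2 = df1.map {
      case Row(w: String, x: String, y: String, z: String, _*) =>
        Quote(w, x, y, z.toDouble)
    }

    val result = df2.map(_.PRICE).collect().toSeq
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = println(prices())
}
```

A row that does not fit the pattern (for example one with a null in the first four fields) would raise a MatchError, so real data may need an extra case or a filter first.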

Thanks


On 2022/2/9 12:46, Mich Talebzadeh wrote:
> scala> val df2 = df1.map(p => columns(p(0).toString,p(1).toString, 
> p(2).toString,p(3).toString.toDouble)) // map those columns

Re: question on the different way of RDD to dataframe

Posted by Mich Talebzadeh <mi...@gmail.com>.

As Sean mentioned, a Scala case class is a handy way of representing objects
with names and types. For example, suppose you are reading a CSV file with
spaced column names like "counter party" and you want more compact column
names like counterparty:


scala> val location="hdfs://rhes75:9000/tmp/crap.csv"

location: String = hdfs://rhes75:9000/tmp/crap.csv

scala> val df1 = spark.read.option("header", false).csv(location)  // don't read the header

df1: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 34 more fields]  // column headers are represented as _c0, _c1 etc

scala> case class columns(KEY: String, TICKER: String, TIMEISSUED: String, PRICE: Double)  // create name and type for _c0, _c1 and so forth

defined class columns

scala> val df2 = df1.map(p => columns(p(0).toString,p(1).toString, p(2).toString,p(3).toString.toDouble)) // map those columns

df2: org.apache.spark.sql.Dataset[columns] = [KEY: string, TICKER: string ... 2 more fields]

scala> df2.printSchema

root
 |-- KEY: string (nullable = true)
 |-- TICKER: string (nullable = true)
 |-- TIMEISSUED: string (nullable = true)
 |-- PRICE: double (nullable = false)

HTH

On Tue, 8 Feb 2022 at 14:32, Sean Owen <sr...@gmail.com> wrote:

> It's just a possibly tidier way to represent objects with named, typed
> fields, in order to specify a DataFrame's contents.

Re: question on the different way of RDD to dataframe

Posted by frakass <ca...@free.fr>.

I know that by using a case class I can control the data types strictly.

scala> val rdd = sc.parallelize(List(("apple",1),("orange",2)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:23

scala> rdd.toDF.printSchema
root
  |-- _1: string (nullable = true)
  |-- _2: integer (nullable = false)


I can set the second column to another type, such as Double, by redefining
the case class as Fruit(fruit: String, num: Double):

scala> rdd.map{ case (x,y) => Fruit(x,y) }.toDF.printSchema
root
  |-- fruit: string (nullable = true)
  |-- num: double (nullable = false)
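
A minimal sketch of this type-control point, assuming a local SparkSession. The names FruitD, TypeControlDemo and numTypes are made up for illustration. It declares num as Double in the case class and, for comparison, casts the column after a plain toDF:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DoubleType

// Hypothetical case class whose second field is declared as Double.
case class FruitD(fruit: String, num: Double)

object TypeControlDemo {
  // Returns the type of the "num" column for each of the two routes.
  def numTypes(): (String, String) = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("type-control-demo")
      .getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(Seq(("apple", 1), ("orange", 2)))

    // Route 1: the case class fixes the column type; the Int is converted explicitly.
    val viaCaseClass = rdd.map { case (x, y) => FruitD(x, y.toDouble) }.toDF()

    // Route 2: name the tuple columns, then cast the column afterwards.
    val viaCast = rdd.toDF("fruit", "num")
      .withColumn("num", $"num".cast(DoubleType))

    val result = (
      viaCaseClass.schema("num").dataType.simpleString,
      viaCast.schema("num").dataType.simpleString
    )
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = println(numTypes())
}
```

Both routes should report double for num; the case class keeps the type next to the field name, while the cast leaves it at the call site.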



Thank you.



On 2022/2/8 10:32, Sean Owen wrote:
> It's just a possibly tidier way to represent objects with named, typed 
> fields, in order to specify a DataFrame's contents.

Re: question on the different way of RDD to dataframe

Posted by Sean Owen <sr...@gmail.com>.

It's just a possibly tidier way to represent objects with named, typed
fields, in order to specify a DataFrame's contents.
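
To make this concrete, here is a minimal sketch, assuming a local SparkSession. The object name CaseClassDemo and the method totalNum are made up for illustration. It builds the same data both ways and then reads the num field by name through the typed Dataset:

```scala
import org.apache.spark.sql.SparkSession

// Field names and types are declared once; Spark derives the schema from them.
case class Fruit(fruit: String, num: Int)

object CaseClassDemo {
  def totalNum(): Int = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("case-class-demo")
      .getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(Seq(("apple", 1), ("orange", 2)))

    // Tuple route: column names are passed as strings at the call site.
    val df = rdd.toDF("fruit", "num")

    // Case-class route: column names and types come from Fruit itself.
    val ds = rdd.map { case (f, n) => Fruit(f, n) }.toDS()

    // Both routes end up with the same column names.
    assert(df.columns.sameElements(ds.columns))

    // On the typed Dataset, field access is checked by the compiler:
    // a misspelled field name fails at compile time, not at runtime.
    val total = ds.map(_.num).collect().sum
    spark.stop()
    total
  }

  def main(args: Array[String]): Unit = println(totalNum())
}
```

For two columns the tuple route is shorter; the case class tends to pay off as the number of fields grows or when the same record type is reused across several transformations.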
