Posted to user@spark.apache.org by ca...@free.fr on 2022/02/08 10:16:22 UTC
question on the different way of RDD to dataframe
Hello
I am converting some Python code to Scala.
This works in Python:
>>> rdd = sc.parallelize([('apple',1),('orange',2)])
>>> rdd.toDF(['fruit','num']).show()
+------+---+
| fruit|num|
+------+---+
| apple| 1|
|orange| 2|
+------+---+
And in Scala:
scala> rdd.toDF("fruit","num").show()
+------+---+
| fruit|num|
+------+---+
| apple| 1|
|orange| 2|
+------+---+
But I have seen a lot of code that uses a case class for the conversion.
scala> case class Fruit(fruit:String,num:Int)
defined class Fruit
scala> rdd.map{case (x,y) => Fruit(x,y) }.toDF().show()
+------+---+
| fruit|num|
+------+---+
| apple| 1|
|orange| 2|
+------+---+
Do you know why one would use a "case class" here?
thanks.
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org
Re: question on the different way of RDD to dataframe
Posted by frakass <ca...@free.fr>.
I think it's better as:
df1.map { case(w,x,y,z) => columns(w,x,y,z) }
Thanks
On 2022/2/9 12:46, Mich Talebzadeh wrote:
> scala> val df2 = df1.map(p => columns(p(0).toString,p(1).toString,
> p(2).toString,p(3).toString.toDouble)) // map those columns
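A note on the pattern-matching form suggested in this message: a DataFrame is a Dataset[Row], so a tuple pattern such as case (w,x,y,z) does not apply to its elements. A minimal sketch of the same idea using the Row extractor follows. It assumes exactly four columns have been selected and that all of them are non-null strings (a null or a different column count would fail the match at run time). The names Quote, RowPatternSketch and toQuotes and the sample row are hypothetical, not from the thread.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical stand-in for the thread's `columns` case class.
case class Quote(KEY: String, TICKER: String, TIMEISSUED: String, PRICE: Double)

object RowPatternSketch {
  // Matches each Row positionally; assumes four non-null string columns.
  def toQuotes(spark: SparkSession, df: DataFrame): Dataset[Quote] = {
    import spark.implicits._ // provides the Encoder for Quote
    df.map { case Row(k: String, t: String, ts: String, p: String) =>
      Quote(k, t, ts, p.toDouble)
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-pattern-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample row, named like the columns of a headerless CSV read.
    val raw = Seq(("k1", "IBM", "2022-02-09T12:00:00", "123.45"))
      .toDF("_c0", "_c1", "_c2", "_c3")

    toQuotes(spark, raw).printSchema()
    spark.stop()
  }
}
```

For a wider DataFrame such as the 36-column one in the quoted message, the four columns of interest would first need to be picked out, for example with select("_c0", "_c1", "_c2", "_c3").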
Re: question on the different way of RDD to dataframe
Posted by Mich Talebzadeh <mi...@gmail.com>.
As Sean mentioned, a Scala case class is a handy way of representing objects
with names and types. For example, suppose you are reading a CSV file whose
column names contain spaces, like "counter party", and you want more
compact column names like counterparty:
scala> val location="hdfs://rhes75:9000/tmp/crap.csv"
location: String = hdfs://rhes75:9000/tmp/crap.csv
scala> val df1 = spark.read.option("header", false).csv(location) // don't read the header
df1: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 34 more fields]
// column headers are represented as _c0, _c1, etc.
scala> case class columns(KEY: String, TICKER: String, TIMEISSUED: String, PRICE: Double) // create name and type for _c0, _c1 and so forth
defined class columns
scala> val df2 = df1.map(p => columns(p(0).toString, p(1).toString, p(2).toString, p(3).toString.toDouble)) // map those columns
df2: org.apache.spark.sql.Dataset[columns] = [KEY: string, TICKER: string ... 2 more fields]
scala> df2.printSchema
root
|-- KEY: string (nullable = true)
|-- TICKER: string (nullable = true)
|-- TIMEISSUED: string (nullable = true)
|-- PRICE: double (nullable = false)
HTH
view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
On Tue, 8 Feb 2022 at 14:32, Sean Owen <sr...@gmail.com> wrote:
> It's just a possibly tidier way to represent objects with named, typed
> fields, in order to specify a DataFrame's contents.
Re: question on the different way of RDD to dataframe
Posted by frakass <ca...@free.fr>.
I know that by using a case class I can control the data types strictly.
scala> val rdd = sc.parallelize(List(("apple",1),("orange",2)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0]
at parallelize at <console>:23
scala> rdd.toDF.printSchema
root
|-- _1: string (nullable = true)
|-- _2: integer (nullable = false)
With a case class I can give the second column another type, such as Double:
scala> rdd.map{ case (x,y) => Fruit(x,y) }.toDF.printSchema
root
|-- fruit: string (nullable = true)
|-- num: double (nullable = false)
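The double column in the output above only appears if the case class declares num as Double; with the earlier definition Fruit(fruit: String, num: Int) the column would stay an integer. A minimal self-contained sketch, assuming a local SparkSession; FruitD, CaseClassSchemaSketch and schemas are hypothetical names, chosen so as not to clash with the Fruit defined earlier in the thread:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

// Hypothetical variant of the thread's Fruit, with num declared as Double.
case class FruitD(fruit: String, num: Double)

object CaseClassSchemaSketch {
  // Returns (schema from the plain tuple RDD, schema via the case class).
  def schemas(spark: SparkSession): (StructType, StructType) = {
    import spark.implicits._
    val rdd = spark.sparkContext.parallelize(Seq(("apple", 1), ("orange", 2)))

    // Column types are inferred from the tuple: (String, Int).
    val fromTuple = rdd.toDF("fruit", "num").schema

    // Column names and types come from the case class fields.
    val fromCaseClass =
      rdd.map { case (x, y) => FruitD(x, y.toDouble) }.toDF().schema

    (fromTuple, fromCaseClass)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("case-class-schema-sketch")
      .getOrCreate()

    val (tupleSchema, caseClassSchema) = schemas(spark)
    println(tupleSchema.treeString)
    println(caseClassSchema.treeString)
    spark.stop()
  }
}
```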
Thank you.
On 2022/2/8 10:32, Sean Owen wrote:
> It's just a possibly tidier way to represent objects with named, typed
> fields, in order to specify a DataFrame's contents.
Re: question on the different way of RDD to dataframe
Posted by Sean Owen <sr...@gmail.com>.
It's just a possibly tidier way to represent objects with named, typed
fields, in order to specify a DataFrame's contents.
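To make "named, typed fields" concrete, here is a minimal sketch, assuming a local SparkSession. Converting with as[...] yields a Dataset whose elements are case class instances, so a field access such as _.num is checked by the compiler instead of being looked up from a column-name string at run time. FruitRec, TypedFieldsSketch and toTyped are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type mirroring the thread's Fruit(fruit, num).
case class FruitRec(fruit: String, num: Int)

object TypedFieldsSketch {
  // Builds the thread's two-row example and views it as Dataset[FruitRec].
  def toTyped(spark: SparkSession): Dataset[FruitRec] = {
    import spark.implicits._
    val rdd = spark.sparkContext.parallelize(Seq(("apple", 1), ("orange", 2)))
    // Column names must match the case class field names for as[...] to resolve.
    rdd.toDF("fruit", "num").as[FruitRec]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("typed-fields-sketch")
      .getOrCreate()
    import spark.implicits._

    val ds = toTyped(spark)
    // r.num is an Int and r.fruit a String; a misspelled field would not compile.
    val total = ds.map(r => r.num).collect().sum
    println(total)
    spark.stop()
  }
}
```

With the untyped toDF("fruit", "num") form, the same column would be referenced by name, for example df.select("num"), and a typo in the name would only surface when the query is analyzed.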