Posted to user@spark.apache.org by Hao Wang <bi...@gmail.com> on 2015/10/15 19:04:08 UTC
Complex transformation on a dataframe column
Hi,
I have searched around but could not find a satisfactory answer to this question: what is the best way to do a complex transformation on a DataFrame column?
For example, I have a DataFrame with the following schema and a function that has fairly complex logic to format addresses. I would like to apply the function to each address and store the output as an additional column in the DataFrame. What is the best way to do it? Use DataFrame.map? Define a UDF? A code example would be appreciated.
Input dataframe:
root
|-- ID: string (nullable = true)
|-- Name: string (nullable = true)
|-- PhoneNumber: string (nullable = true)
|-- Address: string (nullable = true)
Output dataframe:
root
|-- ID: string (nullable = true)
|-- Name: string (nullable = true)
|-- PhoneNumber: string (nullable = true)
|-- Address: string (nullable = true)
|-- FormattedAddress: string (nullable = true)
The function for formatting addresses:
def formatAddress(address: String): String
Best regards,
Hao Wang
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org
Re: Complex transformation on a dataframe column
Posted by Raghavendra Pandey <ra...@gmail.com>.
Here is a quick code sample I came up with:
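// Assumes a Spark 1.x spark-shell session, where sc and sqlContext already exist
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._  // brings toDF() into scope
case class Input(ID: String, Name: String, PhoneNumber: String, Address: String)
// One sample row with a colon-delimited address
val df = sc.parallelize(Seq(
  Input("1", "raghav", "0123456789", "houseNo:StreetNo:City:State:Zip"))).toDF()
// UDF that replaces the ":" separators with "-"
val formatAddress = udf { (s: String) => s.split(":").mkString("-") }
// Append the UDF result as a new column
val outputDF = df.withColumn("FormattedAddress", formatAddress(df("Address")))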
-Raghav
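
Since the original question already has a method with the signature formatAddress(address: String): String, the same UDF approach can wrap that method directly instead of an inline lambda. The following is a minimal sketch, not the poster's actual code: the body of formatAddress is a hypothetical stand-in (only its signature comes from the thread), and the names AddressFormatting, safeFormat, formatAddressUdf and withFormattedAddress are made up for illustration. Because the Address column is nullable, the sketch guards against null before calling the formatter.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

object AddressFormatting {
  // Hypothetical stand-in for the complex formatting logic;
  // only the signature is taken from the original question.
  def formatAddress(address: String): String =
    address.split(":").map(_.trim).mkString(", ")

  // Address is nullable, so pass nulls through instead of
  // letting the formatter throw a NullPointerException.
  def safeFormat(address: String): String =
    if (address == null) null else formatAddress(address)

  // Lift the plain Scala method into a UDF usable in column expressions.
  val formatAddressUdf = udf(safeFormat _)

  // Returns a copy of df with the extra FormattedAddress column.
  def withFormattedAddress(df: DataFrame): DataFrame =
    df.withColumn("FormattedAddress", formatAddressUdf(df("Address")))
}
```

Calling AddressFormatting.withFormattedAddress(df) on a DataFrame with the input schema should yield the output schema from the question. A UDF keeps the result a DataFrame; in Spark 1.x, DataFrame.map returns an RDD, so that route would likely require rebuilding the schema afterwards.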