You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Hao Wang <bi...@gmail.com> on 2015/10/15 19:04:08 UTC

Complex transformation on a dataframe column

Hi,

I have searched around but could not find a satisfying answer to this question: what is the best way to do a complex transformation on a dataframe column?

For example, I have a dataframe with the following schema and a function that has pretty complex logic to format addresses. I would like to use the function to format each address and store the output as an additional column in the dataframe. What is the best way to do it? Use Dataframe.map? Define a UDF? Some code example would be appreciated.

Input dataframe:
root
 |-- ID: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- PhoneNumber: string (nullable = true)
 |-- Address: string (nullable = true)

Output dataframe:
root
 |-- ID: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- PhoneNumber: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- FormattedAddress: string (nullable = true)

The function for format addresses:
def formatAddress(address: String): String


Best regards,
Hao Wang

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: Complex transformation on a dataframe column

Posted by Raghavendra Pandey <ra...@gmail.com>.

Here is a quick code sample I can come up with :

case class Input(ID:String, Name:String, PhoneNumber:String, Address:
String)
val df = sc.parallelize(Seq(Input("1", "raghav", "0123456789",
"houseNo:StreetNo:City:State:Zip"))).toDF()
val formatAddress = udf { (s: String) => s.split(":").mkString("-")}
val outputDF = df.withColumn("FormattedAddress",
formatAddress(df("Address")))


-Raghav

On Thu, Oct 15, 2015 at 10:34 PM, Hao Wang <bi...@gmail.com> wrote:

> Hi,
>
> I have searched around but could not find a satisfying answer to this
> question: what is the best way to do a complex transformation on a
> dataframe column?
>
> For example, I have a dataframe with the following schema and a function
> that has pretty complex logic to format addresses. I would like to use the
> function to format each address and store the output as an additional
> column in the dataframe. What is the best way to do it? Use Dataframe.map?
> Define a UDF? Some code example would be appreciated.
>
> Input dataframe:
> root
>  |-- ID: string (nullable = true)
>  |-- Name: string (nullable = true)
>  |-- PhoneNumber: string (nullable = true)
>  |-- Address: string (nullable = true)
>
> Output dataframe:
> root
>  |-- ID: string (nullable = true)
>  |-- Name: string (nullable = true)
>  |-- PhoneNumber: string (nullable = true)
>  |-- Address: string (nullable = true)
>  |-- FormattedAddress: string (nullable = true)
>
> The function for format addresses:
> def formatAddress(address: String): String
>
>
> Best regards,
> Hao Wang
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>