Posted to user@spark.apache.org by Zeming Yu <ze...@gmail.com> on 2017/05/07 01:49:31 UTC

take the difference between two columns of a dataframe in pyspark

Say I have the following dataframe with two numeric columns, A and B. What's
the best way to add a column showing the difference between the two columns?

+-----------------+----------+
|                A|         B|
+-----------------+----------+
|786.3199999999999|    786.12|
|           786.12|    786.12|
|           786.42|    786.12|
|           786.72|    786.12|
|           786.92|    786.12|
|           786.92|    786.12|
|           786.72|    786.12|
|           786.72|    786.12|
|           827.72|    786.02|
|           827.72|    786.02|
+-----------------+----------+


I could probably figure out how to do this via a UDF, but are UDFs generally slower?


Thanks!

Re: take the difference between two columns of a dataframe in pyspark

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

Convert the dataframe to a temporary table and write a SQL query; that will also work.
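
For reference, the temporary-table suggestion above could be sketched roughly as follows. This is a minimal sketch, not code from the thread: the function name add_diff_sql and the view name "prices" are made up for illustration, and `spark` is assumed to be an existing SparkSession.

```python
def add_diff_sql(spark, df, view_name="prices"):
    """Return df plus a `diff` column computed in SQL.

    `spark` is assumed to be an active SparkSession and `df` a DataFrame
    with numeric columns A and B. The view name is hypothetical.
    """
    # Register the DataFrame under a name that SQL queries can refer to.
    df.createOrReplaceTempView(view_name)
    # Keep every existing column and append the difference as `diff`.
    return spark.sql(f"SELECT *, A - B AS diff FROM {view_name}")
```

Calling add_diff_sql(spark, df).show() on the dataframe from the question should print A, B and the new diff column. The subtraction is done by Spark's SQL engine, so no UDF is needed on this route either.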


Regards,
Gourav

Re: take the difference between two columns of a dataframe in pyspark

Posted by Zeming Yu <ze...@gmail.com>.
OK. I've worked it out.

from pyspark.sql.functions import col
df.withColumn('diff', col('A') - col('B'))
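
On the UDF question, which the thread leaves open: a built-in column expression like the one above is generally the faster option, because a Python UDF has to ship every row to a Python worker and back. The sketch below puts the two side by side. The function names add_diff and add_diff_udf are made up for illustration, and the UDF version is included only for comparison.

```python
def add_diff(df, left="A", right="B", out="diff"):
    """Add `out` = left - right using a built-in column expression."""
    # df[left] and df[right] are Column objects; subtracting them builds a
    # Spark expression that is evaluated in the JVM, not row by row in Python.
    return df.withColumn(out, df[left] - df[right])


def add_diff_udf(df, left="A", right="B", out="diff"):
    """Same result through a Python UDF, shown only for comparison."""
    # Imported here so that add_diff above has no import requirements.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # The lambda runs in a Python worker, so every row is serialized to
    # Python and back; None is passed through to mirror SQL null handling.
    subtract = udf(
        lambda a, b: None if a is None or b is None else a - b,
        DoubleType(),
    )
    return df.withColumn(out, subtract(df[left], df[right]))
```

With the dataframe from the question, add_diff(df).show() should display A, B and the new diff column. Since A and B are doubles, small floating-point artifacts like the 786.3199999999999 already visible in column A can show up in the result as well.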
