Posted to user@spark.apache.org by Zeming Yu <ze...@gmail.com> on 2017/05/07 01:49:31 UTC
take the difference between two columns of a dataframe in pyspark
Say I have the following dataframe with two numeric columns, A and B. What's
the best way to add a column showing the difference between the two columns?
+-----------------+----------+
|                A|         B|
+-----------------+----------+
|786.3199999999999|    786.12|
|           786.12|    786.12|
|           786.42|    786.12|
|           786.72|    786.12|
|           786.92|    786.12|
|           786.92|    786.12|
|           786.72|    786.12|
|           786.72|    786.12|
|           827.72|    786.02|
|           827.72|    786.02|
+-----------------+----------+
I could probably figure out how to do this via a UDF, but are UDFs generally slower?
Thanks!
Re: take the difference between two columns of a dataframe in pyspark
Posted by Gourav Sengupta <go...@gmail.com>.
Hi,
Convert it to a temporary table and write a SQL query; that will also work.
Regards,
Gourav
Re: take the difference between two columns of a dataframe in pyspark
Posted by Zeming Yu <ze...@gmail.com>.
OK. I've worked it out.
from pyspark.sql.functions import col
df.withColumn('diff', col('A') - col('B'))