Posted to user@spark.apache.org by "Lalwani, Jayesh" <jl...@amazon.com.INVALID> on 2022/03/16 12:49:27 UTC
Re: Reply: Re: Reply: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame

No, you don't need 30 dataframes and self joins. Convert the list of column names to a list of aggregate expressions (one functions.corr call per column), and then pass that list to the agg function.
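A minimal sketch of this suggestion, not a definitive implementation. The object and method names (GroupedCorrelation, corrColumns, correlateByGroup), the corr_ alias prefix, and the sample values in main are assumptions made up for illustration; only groupBy, agg, and functions.corr come from the Spark API. The data columns are picked by the "datacol" name prefix, which assumes at least one such column exists.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.corr

object GroupedCorrelation {
  // One corr(...) aggregate per data column, each against the same target column.
  // The "corr_" alias prefix is an arbitrary naming choice for this sketch.
  def corrColumns(dataCols: Seq[String], target: String): Seq[Column] =
    dataCols.map(c => corr(c, target).alias(s"corr_$c"))

  // Single groupBy with all aggregates at once: no per-column data frames, no joins.
  // Assumes at least one column name starts with `prefix` (aggs.head fails otherwise).
  def correlateByGroup(df: DataFrame, groupCol: String, target: String, prefix: String): DataFrame = {
    val dataCols = df.columns.filter(_.startsWith(prefix)).toSeq
    val aggs = corrColumns(dataCols, target)
    // agg takes (Column, Column*), so split the list into head and tail.
    df.groupBy(groupCol).agg(aggs.head, aggs.tail: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("grouped-corr").getOrCreate()
    import spark.implicits._

    // Hypothetical sample values, not the data from the thread.
    val df = Seq(
      ("00001", 1.0, 2.0, 3.0, 2.0),
      ("00001", 2.0, 3.0, 4.0, 4.0),
      ("00001", 3.0, 5.0, 1.0, 5.0),
      ("00002", 4.0, 2.0, 1.0, 7.0),
      ("00002", 8.0, 9.0, 3.0, 2.0),
      ("00002", 6.0, 4.0, 2.0, 3.0)
    ).toDF("groupid", "datacol1", "datacol2", "datacol3", "corr_col")

    correlateByGroup(df, "groupid", "corr_col", "datacol").show()
    spark.stop()
  }
}
```

The result has one row per groupid and one correlation column per data column. If corr_col is constant within a group, the correlation for that group is undefined, so it is worth checking the output for NaN or null values.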


From: "ckgppl_yan@sina.cn" <ck...@sina.cn>
Reply-To: "ckgppl_yan@sina.cn" <ck...@sina.cn>
Date: Wednesday, March 16, 2022 at 8:16 AM
To: Enrico Minack <in...@enrico.minack.dev>, Sean Owen <sr...@gmail.com>
Cc: user <us...@spark.apache.org>
Subject: Reply: Re: Reply: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame



Thanks, Enrico.
I just found that I need to group the data frame and then calculate the correlation, so I will get a list of data frames, not columns.
So I used the following solution:
1.       Use the following code to create a reassignable data frame variable df_all, using the first datacol column to calculate the correlation:  df.groupBy("groupid").agg(functions.corr("datacol1", "corr_col"))
2.       Iterate over all remaining datacol columns, creating a temporary data frame in each iteration. In each iteration, join df_all with the temporary data frame on the groupid column, then drop the duplicated groupid column.
3.       After the iteration, I get a data frame that contains all the correlation data.
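The three steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code actually used: the names JoinedCorrelation, corrFor, and correlateByJoin and the corr_ alias prefix are hypothetical, and a foldLeft stands in for the reassigned df_all variable. Joining with Seq("groupid") as the join column keeps a single groupid column, so the separate drop step is not needed in this variant.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.corr

object JoinedCorrelation {
  // Steps 1 and 2 (temp data frame): per-group correlation of one data column with corr_col.
  def corrFor(df: DataFrame, dataCol: String): DataFrame =
    df.groupBy("groupid").agg(corr(dataCol, "corr_col").alias(s"corr_$dataCol"))

  // Steps 2 and 3: start from the first column's result and join in one
  // temporary data frame per remaining column. Assumes dataCols is non-empty.
  def correlateByJoin(df: DataFrame, dataCols: Seq[String]): DataFrame = {
    val first = corrFor(df, dataCols.head)
    dataCols.tail.foldLeft(first) { (dfAll, c) =>
      // Joining on Seq("groupid") yields a single groupid column in the output.
      dfAll.join(corrFor(df, c), Seq("groupid"))
    }
  }
}
```

This runs one aggregation and one join per data column, which is the overhead that passing all corr expressions to a single agg call avoids.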


I need to verify the data to make sure it is valid.


Liang
----- Original Message -----
From: Enrico Minack <in...@enrico.minack.dev>
To: ckgppl_yan@sina.cn, Sean Owen <sr...@gmail.com>
Cc: user <us...@spark.apache.org>
Subject: Re: Reply: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 19:53

If you have a list of Columns called `columns`, you can pass them to the `agg` method as:

  agg(columns.head, columns.tail: _*)

Enrico


On 16.03.22 at 08:02, ckgppl_yan@sina.cn wrote:
Thanks, Sean. I modified the code and have generated a list of columns.
I am now working on converting the list of columns to a new data frame. It seems that there is no direct API to do this.

----- Original Message -----
From: Sean Owen <sr...@gmail.com>
To: ckgppl_yan@sina.cn
Cc: user <us...@spark.apache.org>
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 11:55

Are you just trying to avoid writing the function call 30 times? Just put this in a loop over all the columns instead, which adds a new corr col every time to a list.
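A short sketch of such a loop, assuming 30 columns named datacol1 through datacol30 and a target column named corr_col. The object name CorrLoop and the corr_ alias prefix are hypothetical. Building the Column expressions does not touch any data; nothing is computed until the list is passed to agg.

```scala
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.corr

object CorrLoop {
  // Loop over the data column names, appending one corr column per iteration.
  def build(dataCols: Seq[String]): List[Column] = {
    val corrCols = ListBuffer.empty[Column]
    for (c <- dataCols) {
      corrCols += corr(c, "corr_col").alias(s"corr_$c")
    }
    corrCols.toList
  }

  // Assumed naming scheme: datacol1 ... datacol30.
  val thirtyCols: List[Column] = build((1 to 30).map(i => s"datacol$i"))
}
```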
On Tue, Mar 15, 2022, 10:30 PM <ck...@sina.cn> wrote:
Hi all,


I am stuck on a correlation calculation problem. I have a dataframe like the one below:
groupid  datacol1  datacol2  datacol3  datacol*  corr_col
00001    1         2         3         4         5
00001    2         3         4         6         5
00002    4         2         1         7         5
00002    8         9         3         2         5
00003    7         1         2         3         5
00003    3         5         3         1         5

I want to calculate the correlation between each datacol column and the corr_col column, per groupid.
So I used the following Spark Scala API code:

  df.groupBy("groupid").agg(functions.corr("datacol1", "corr_col"), functions.corr("datacol2", "corr_col"), functions.corr("datacol3", "corr_col"), functions.corr("datacol*", "corr_col"))

This is very tedious to write. If I have 30 datacol columns, I need to type functions.corr 30 times to calculate the correlations.

I have searched, and it seems that functions.corr doesn't accept a List/Array parameter, and df.agg doesn't accept a function as a parameter.
Is there any Spark Scala API code that can do this job efficiently?

Thanks

Liang