Posted to user@spark.apache.org by ck...@sina.cn on 2022/03/16 13:38:00 UTC

Re: Re: Re: calculate correlation_between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame

Thanks, Jayesh and all. I finally got the correlation data frame using agg with a list of functions. I think the list of functions which generate a Column should have a more detailed description.
Liang
----- Original Message -----
From: "Lalwani, Jayesh" <jl...@amazon.com>
To: "ckgppl_yan@sina.cn" <ck...@sina.cn>, Enrico Minack <in...@enrico.minack.dev>, Sean Owen <sr...@gmail.com>
Cc: user <us...@spark.apache.org>
Subject: Re: Re: Re: calculate correlation_between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame
Date: 2022-03-16 20:49


No, you don't need 30 dataframes and self joins. Convert a list of columns to a list of functions, and then pass the list of functions to the agg function.
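A minimal sketch of that approach in Spark Scala (the column names datacol1..datacol3 and corr_col are assumed from the example further down the thread; in practice the list could be built from df.columns):

  import org.apache.spark.sql.Column
  import org.apache.spark.sql.functions.corr

  // assumed data column names; they could also be derived from df.columns
  // by filtering out "groupid" and "corr_col"
  val dataCols = Seq("datacol1", "datacol2", "datacol3")

  // map the list of column names to a list of corr(...) aggregate expressions
  val corrCols: Seq[Column] = dataCols.map(c => corr(c, "corr_col"))

  // a single groupBy/agg computes all correlations in one pass, no joins needed
  val result = df.groupBy("groupid").agg(corrCols.head, corrCols.tail: _*)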
 
 

From: "ckgppl_yan@sina.cn" <ck...@sina.cn>

Reply-To: "ckgppl_yan@sina.cn" <ck...@sina.cn>

Date: Wednesday, March 16, 2022 at 8:16 AM

To: Enrico Minack <in...@enrico.minack.dev>, Sean Owen <sr...@gmail.com>

Cc: user <us...@spark.apache.org>

Subject: [EXTERNAL] 回复:Re:
回复:Re: calculate correlation between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame


 






CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless
 you can confirm the sender and know the content is safe.





 


Thanks, Enrico.

I just found that I need to group the data frame and then calculate the correlations, so I get a list of dataframes, not columns.

So I used the following solution (sketched in code below):

1. Use the following code to create a mutable data frame df_all, using the first datacol to calculate the correlation: df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"))
2. Iterate over the remaining datacol columns, creating a temp data frame in each iteration. In each iteration, join df_all with the temp data frame on the groupid column, then drop the duplicated groupid column.
3. After the iteration, I get the dataframe which contains all the correlation data.
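A rough sketch of this join-based approach (column names assumed as in the example below; the single agg call discussed above avoids these joins):

  import org.apache.spark.sql.functions.corr

  // assumed data column names
  val dataCols = Seq("datacol1", "datacol2", "datacol3")

  // step 1: correlation of the first data column, grouped by groupid
  var df_all = df.groupBy("groupid").agg(corr(dataCols.head, "corr_col"))

  // step 2: one aggregation per remaining column, joined back on groupid
  for (c <- dataCols.tail) {
    val temp = df.groupBy("groupid").agg(corr(c, "corr_col"))
    df_all = df_all.join(temp, Seq("groupid"))  // joining by column name keeps a single groupid column
  }

  // step 3: df_all now contains one correlation column per data column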









I need to verify the data to make sure it is valid.

Liang



----- Original Message -----
From: Enrico Minack <in...@enrico.minack.dev>
To: ckgppl_yan@sina.cn, Sean Owen <sr...@gmail.com>
Cc: user <us...@spark.apache.org>
Subject: Re: Re: calculate correlation between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame
Date: 2022-03-16 19:53
 

If you have a list of Columns called `columns`, you can pass them to the `agg` method as:

  agg(columns.head, columns.tail: _*)

Enrico


 


 


On 16.03.22 at 08:02, ckgppl_yan@sina.cn wrote:



Thanks, Sean. I modified the code and have generated a list of columns.

I am working on converting the list of columns into a new data frame. It seems that there is no direct API to do this.


 



----- Original Message -----
From: Sean Owen <sr...@gmail.com>
To: ckgppl_yan@sina.cn
Cc: user <us...@spark.apache.org>
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 11:55

 


Are you just trying to avoid writing the function call 30 times? Just put this in a loop over all the columns instead, adding a new corr column to a list each time.
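A short sketch of that loop (column names assumed; the resulting list can then be passed to agg as shown earlier in the thread):

  import scala.collection.mutable.ListBuffer
  import org.apache.spark.sql.Column
  import org.apache.spark.sql.functions.corr

  val corrCols = ListBuffer[Column]()
  for (c <- Seq("datacol1", "datacol2", "datacol3"))  // assumed column names
    corrCols += corr(c, "corr_col")                   // add one corr column per data column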


On Tue, Mar 15, 2022, 10:30 PM <ck...@sina.cn> wrote:



Hi all,


 



I am stuck on a correlation calculation problem. I have a dataframe like the one below:





groupid | datacol1 | datacol2 | datacol3 | datacol* | corr_col
00001   | 1        | 2        | 3        | 4        | 5
00001   | 2        | 3        | 4        | 6        | 5
00002   | 4        | 2        | 1        | 7        | 5
00002   | 8        | 9        | 3        | 2        | 5
00003   | 7        | 1        | 2        | 3        | 5
00003   | 3        | 5        | 3        | 1        | 5

I want to calculate the correlation between each datacol column and the corr_col column for each groupid.

So I used the following Spark Scala API code:

df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"),functions.corr("datacol2","corr_col"),functions.corr("datacol3","corr_col"),functions.corr("datacol*","corr_col"))

This is very inefficient. If I have 30 datacol columns, I need to write functions.corr 30 times to calculate the correlations.

I have searched; it seems that functions.corr doesn't accept a List/Array parameter, and df.agg doesn't accept a function as a parameter.

Is there any Spark Scala API code that can do this job efficiently?

Thanks

Liang







 


