Posted to user@spark.apache.org by ck...@sina.cn on 2022/03/16 13:38:00 UTC

Re: Re: Re: calculate correlation_between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame

Thanks, Jayesh and all. I finally got the correlation data frame using agg with a list of functions. I think the list of functions which generate a Column should have a more detailed description.
Liang
----- Original Message -----
From: "Lalwani, Jayesh" <jl...@amazon.com>
To: "ckgppl_yan@sina.cn" <ck...@sina.cn>, Enrico Minack <in...@enrico.minack.dev>, Sean Owen <sr...@gmail.com>
Cc: user <us...@spark.apache.org>
Subject: Re: Re: Re: calculate correlation_between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame
Date: 2022-03-16 20:49


No, you don't need 30 dataframes and self joins. Convert a list of columns to a list of functions, and then pass the list of functions to the agg function.
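A minimal sketch of that approach in Spark Scala (the column names datacol1..datacol3 and corr_col are assumed from the example further down the thread; in practice the list could be built from df.columns):

  import org.apache.spark.sql.Column
  import org.apache.spark.sql.functions.corr

  // assumed data column names; they could also be derived from df.columns
  // by filtering out "groupid" and "corr_col"
  val dataCols = Seq("datacol1", "datacol2", "datacol3")

  // map the list of column names to a list of corr(...) aggregate expressions
  val corrCols: Seq[Column] = dataCols.map(c => corr(c, "corr_col"))

  // a single groupBy/agg computes all correlations in one pass, no joins needed
  val result = df.groupBy("groupid").agg(corrCols.head, corrCols.tail: _*)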
 
 

From: "ckgppl_yan@sina.cn" <ck...@sina.cn>

Reply-To: "ckgppl_yan@sina.cn" <ck...@sina.cn>

Date: Wednesday, March 16, 2022 at 8:16 AM

To: Enrico Minack <in...@enrico.minack.dev>, Sean Owen <sr...@gmail.com>

Cc: user <us...@spark.apache.org>

Subject: [EXTERNAL] 回复:Re:
回复:Re: calculate correlation between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame


 






CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless
 you can confirm the sender and know the content is safe.





 


Thanks, Enrico.

I just found that I need to group the data frame and then calculate the correlations, so I get a list of dataframes, not columns.

So I used the following solution (sketched in code below):

1. Use the following code to create a mutable data frame df_all, using the first datacol to calculate the correlation: df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"))
2. Iterate over the remaining datacol columns, creating a temp data frame in each iteration. In each iteration, join df_all with the temp data frame on the groupid column, then drop the duplicated groupid column.
3. After the iteration, I get the dataframe which contains all the correlation data.
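A rough sketch of this join-based approach (column names assumed as in the example below; the single agg call discussed above avoids these joins):

  import org.apache.spark.sql.functions.corr

  // assumed data column names
  val dataCols = Seq("datacol1", "datacol2", "datacol3")

  // step 1: correlation of the first data column, grouped by groupid
  var df_all = df.groupBy("groupid").agg(corr(dataCols.head, "corr_col"))

  // step 2: one aggregation per remaining column, joined back on groupid
  for (c <- dataCols.tail) {
    val temp = df.groupBy("groupid").agg(corr(c, "corr_col"))
    df_all = df_all.join(temp, Seq("groupid"))  // joining by column name keeps a single groupid column
  }

  // step 3: df_all now contains one correlation column per data column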









I need to verify the data to make sure it is valid.

Liang



----- Original Message -----
From: Enrico Minack <in...@enrico.minack.dev>
To: ckgppl_yan@sina.cn, Sean Owen <sr...@gmail.com>
Cc: user <us...@spark.apache.org>
Subject: Re: Re: calculate correlation between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame
Date: 2022-03-16 19:53
 

If you have a list of Columns called `columns`, you can pass them to the `agg` method as:

  agg(columns.head, columns.tail: _*)

Enrico


 


 


On 16.03.22 at 08:02, ckgppl_yan@sina.cn wrote:



Thanks, Sean. I modified the code and have generated a list of columns.

I am working on converting the list of columns into a new data frame. It seems that there is no direct API to do this.


 



----- Original Message -----
From: Sean Owen <sr...@gmail.com>
To: ckgppl_yan@sina.cn
Cc: user <us...@spark.apache.org>
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 11:55

 


Are you just trying to avoid writing the function call 30 times? Just put this in a loop over all the columns instead, adding a new corr column to a list each time.
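A short sketch of that loop (column names assumed; the resulting list can then be passed to agg as shown earlier in the thread):

  import scala.collection.mutable.ListBuffer
  import org.apache.spark.sql.Column
  import org.apache.spark.sql.functions.corr

  val corrCols = ListBuffer[Column]()
  for (c <- Seq("datacol1", "datacol2", "datacol3"))  // assumed column names
    corrCols += corr(c, "corr_col")                   // add one corr column per data column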


On Tue, Mar 15, 2022, 10:30 PM <ck...@sina.cn> wrote:



Hi all,


 



I am stuck on a correlation calculation problem. I have a dataframe like the one below:





groupid | datacol1 | datacol2 | datacol3 | datacol* | corr_col
00001   | 1        | 2        | 3        | 4        | 5
00001   | 2        | 3        | 4        | 6        | 5
00002   | 4        | 2        | 1        | 7        | 5
00002   | 8        | 9        | 3        | 2        | 5
00003   | 7        | 1        | 2        | 3        | 5
00003   | 3        | 5        | 3        | 1        | 5

I want to calculate the correlation between each datacol column and the corr_col column for each groupid.

So I used the following Spark Scala API code:

df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"),functions.corr("datacol2","corr_col"),functions.corr("datacol3","corr_col"),functions.corr("datacol*","corr_col"))

This is very inefficient. If I have 30 datacol columns, I need to write functions.corr 30 times to calculate the correlations.

I have searched; it seems that functions.corr doesn't accept a List/Array parameter, and df.agg doesn't accept a function as a parameter.

Is there any Spark Scala API code that can do this job efficiently?

Thanks

Liang







 


