Posted to user@spark.apache.org by ck...@sina.cn on 2022/03/16 07:02:43 UTC

Reply: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame

Thanks, Sean. I modified the code and have generated a list of columns. I am now working on converting the list of columns into a new data frame. It seems that there is no direct API to do this.
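For what it's worth, a list of Columns can usually be turned into a new DataFrame with the varargs overload of select (a minimal sketch, not from the thread, assuming `columns` is a Seq[Column] over an existing DataFrame `df`); for grouped aggregates like corr, agg has the same varargs shape, as Enrico's reply below shows.

   // Sketch only: project a list of Column expressions into a new
   // DataFrame via the varargs overload of select.
   val newDf = df.select(columns: _*)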
----- Original Message -----
From: Sean Owen <sr...@gmail.com>
To: ckgppl_yan@sina.cn
Cc: user <us...@spark.apache.org>
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 11:55

Are you just trying to avoid writing the function call 30 times? Just put this in a loop over all the columns, adding a new corr column to a list each time.
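For illustration, a minimal sketch of that loop (not from the thread; it assumes the data columns share the "datacol" name prefix and that `df` is the DataFrame in question):

   import org.apache.spark.sql.Column
   import org.apache.spark.sql.functions.corr

   // Build one corr(...) expression per data column instead of
   // writing the call out 30 times.
   val dataCols = df.columns.filter(_.startsWith("datacol"))
   val corrCols: Seq[Column] =
     dataCols.map(c => corr(c, "corr_col").alias(s"corr_$c")).toSeq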

On Tue, Mar 15, 2022, 10:30 PM  <ck...@sina.cn> wrote:
Hi all,
I am stuck at a correlation calculation problem. I have a dataframe like below:

   groupid   datacol1   datacol2   datacol3   datacol*   corr_col
   00001     1          2          3          4          5
   00001     2          3          4          6          5
   00002     4          2          1          7          5
   00002     8          9          3          2          5
   00003     7          1          2          3          5
   00003     3          5          3          1          5

I want to calculate the correlation between all datacol columns and the corr_col column for each groupid, so I used the following Spark Scala API code:

   df.groupBy("groupid").agg(
     functions.corr("datacol1", "corr_col"),
     functions.corr("datacol2", "corr_col"),
     functions.corr("datacol3", "corr_col"),
     functions.corr("datacol*", "corr_col"))
This is very inefficient: if I have 30 data columns, I need to write functions.corr 30 times. From what I have found, functions.corr doesn't accept a List/Array parameter, and df.agg doesn't accept a function as a parameter. Is there any Spark Scala API that can do this job more efficiently?
Thanks
Liang

Re: Reply: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame

Posted by Enrico Minack <in...@enrico.minack.dev>.
If you have a list of Columns called `columns`, you can pass them to the 
`agg` method as:

   agg(columns.head, columns.tail: _*)

Enrico
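Putting Sean's loop and Enrico's varargs call together, an end-to-end sketch (assuming the `corrCols` list built in the earlier sketch; agg requires at least one explicit Column argument, hence the head/tail split):

   // Aggregate per group, passing the whole list of corr columns at once.
   val result = df.groupBy("groupid").agg(corrCols.head, corrCols.tail: _*)
   result.show()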


On 16.03.22 at 08:02, ckgppl_yan@sina.cn wrote:
> Thanks, Sean. I modified the code and have generated a list of columns.
> I am now working on converting the list of columns into a new data
> frame. It seems that there is no direct API to do this.
>
> ----- Original Message -----
> From: Sean Owen <sr...@gmail.com>
> To: ckgppl_yan@sina.cn
> Cc: user <us...@spark.apache.org>
> Subject: Re: calculate correlation between multiple columns and one specific
> column after groupby the spark data frame
> Date: 2022-03-16 11:55
>
> Are you just trying to avoid writing the function call 30 times? Just
> put this in a loop over all the columns, adding a new corr column to a
> list each time.
>
> On Tue, Mar 15, 2022, 10:30 PM <ck...@sina.cn> wrote:
>
>     Hi all,
>
>     I am stuck at  a correlation calculation problem. I have a
>     dataframe like below:
>
>     groupid 	datacol1 	datacol2 	datacol3 	datacol* 	corr_col
>     00001 	1 	2 	3 	4 	5
>     00001 	2 	3 	4 	6 	5
>     00002 	4 	2 	1 	7 	5
>     00002 	8 	9 	3 	2 	5
>     00003 	7 	1 	2 	3 	5
>     00003 	3 	5 	3 	1 	5
>
>     I want to calculate the correlation between all datacol columns
>     and corr_col column by each groupid.
>     So I used the following Spark Scala API code:
>
>     df.groupBy("groupid").agg(
>       functions.corr("datacol1", "corr_col"),
>       functions.corr("datacol2", "corr_col"),
>       functions.corr("datacol3", "corr_col"),
>       functions.corr("datacol*", "corr_col"))
>
>     This is very inefficient: if I have 30 data columns, I need to
>     write functions.corr 30 times.
>
>     From what I have found, functions.corr doesn't accept a
>     List/Array parameter, and df.agg doesn't accept a function as a
>     parameter.
>
>     Is there any Spark Scala API that can do this job more efficiently?
>
>     Thanks
>
>     Liang
>