Posted to user@spark.apache.org by ck...@sina.cn on 2022/03/16 07:02:43 UTC

Reply: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame

Thanks, Sean. I modified the code and have generated a list of columns. I am now working on converting the list of columns into a new data frame. It seems that there is no direct API to do this.
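For what it's worth, a list of Columns can usually be turned into a new DataFrame with the varargs overload of select (a minimal sketch, not from the thread, assuming `columns` is a Seq[Column] over an existing DataFrame `df`); for grouped aggregates like corr, agg has the same varargs shape, as Enrico's reply below shows.

   // Sketch only: project a list of Column expressions into a new
   // DataFrame via the varargs overload of select.
   val newDf = df.select(columns: _*)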
----- Original Message -----
From: Sean Owen <sr...@gmail.com>
To: ckgppl_yan@sina.cn
Cc: user <us...@spark.apache.org>
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 11:55

Are you just trying to avoid writing the function call 30 times? Just put this in a loop over all the columns, adding a new corr column to a list each time.
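For illustration, a minimal sketch of that loop (not from the thread; it assumes the data columns share the "datacol" name prefix and that `df` is the DataFrame in question):

   import org.apache.spark.sql.Column
   import org.apache.spark.sql.functions.corr

   // Build one corr(...) expression per data column instead of
   // writing the call out 30 times.
   val dataCols = df.columns.filter(_.startsWith("datacol"))
   val corrCols: Seq[Column] =
     dataCols.map(c => corr(c, "corr_col").alias(s"corr_$c")).toSeq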

On Tue, Mar 15, 2022, 10:30 PM  <ck...@sina.cn> wrote:
Hi all,
I am stuck at a correlation calculation problem. I have a dataframe like below:

   groupid   datacol1   datacol2   datacol3   datacol*   corr_col
   00001     1          2          3          4          5
   00001     2          3          4          6          5
   00002     4          2          1          7          5
   00002     8          9          3          2          5
   00003     7          1          2          3          5
   00003     3          5          3          1          5

I want to calculate the correlation between all datacol columns and the corr_col column for each groupid, so I used the following Spark Scala API code:

   df.groupBy("groupid").agg(
     functions.corr("datacol1", "corr_col"),
     functions.corr("datacol2", "corr_col"),
     functions.corr("datacol3", "corr_col"),
     functions.corr("datacol*", "corr_col"))
This is very inefficient: if I have 30 data columns, I need to write functions.corr 30 times. From what I have found, functions.corr doesn't accept a List/Array parameter, and df.agg doesn't accept a function as a parameter. Is there any Spark Scala API that can do this job more efficiently?
Thanks
Liang

Re: Reply: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame

Posted by Enrico Minack <in...@enrico.minack.dev>.
If you have a list of Columns called `columns`, you can pass them to the 
`agg` method as:

   agg(columns.head, columns.tail: _*)

Enrico
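Putting Sean's loop and Enrico's varargs call together, an end-to-end sketch (assuming the `corrCols` list built in the earlier sketch; agg requires at least one explicit Column argument, hence the head/tail split):

   // Aggregate per group, passing the whole list of corr columns at once.
   val result = df.groupBy("groupid").agg(corrCols.head, corrCols.tail: _*)
   result.show()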


On 16.03.22 at 08:02, ckgppl_yan@sina.cn wrote:
> Thanks, Sean. I modified the code and have generated a list of columns.
> I am now working on converting the list of columns into a new data
> frame. It seems that there is no direct API to do this.
>
> ----- Original Message -----
> From: Sean Owen <sr...@gmail.com>
> To: ckgppl_yan@sina.cn
> Cc: user <us...@spark.apache.org>
> Subject: Re: calculate correlation between multiple columns and one specific
> column after groupby the spark data frame
> Date: 2022-03-16 11:55
>
> Are you just trying to avoid writing the function call 30 times? Just
> put this in a loop over all the columns, adding a new corr column to a
> list each time.
>
> On Tue, Mar 15, 2022, 10:30 PM <ck...@sina.cn> wrote:
>
>     Hi all,
>
>     I am stuck at  a correlation calculation problem. I have a
>     dataframe like below:
>
>     groupid 	datacol1 	datacol2 	datacol3 	datacol* 	corr_col
>     00001 	1 	2 	3 	4 	5
>     00001 	2 	3 	4 	6 	5
>     00002 	4 	2 	1 	7 	5
>     00002 	8 	9 	3 	2 	5
>     00003 	7 	1 	2 	3 	5
>     00003 	3 	5 	3 	1 	5
>
>     I want to calculate the correlation between all datacol columns
>     and corr_col column by each groupid.
>     So I used the following Spark Scala API code:
>
>     df.groupBy("groupid").agg(
>       functions.corr("datacol1", "corr_col"),
>       functions.corr("datacol2", "corr_col"),
>       functions.corr("datacol3", "corr_col"),
>       functions.corr("datacol*", "corr_col"))
>
>     This is very inefficient: if I have 30 data columns, I need to
>     write functions.corr 30 times.
>
>     From what I have found, functions.corr doesn't accept a
>     List/Array parameter, and df.agg doesn't accept a function as a
>     parameter.
>
>     Is there any Spark Scala API that can do this job more efficiently?
>
>     Thanks
>
>     Liang
>