Posted to user@spark.apache.org by Ankur Jain <an...@yash.com> on 2016/05/17 14:09:25 UTC

dataframe stat corr for multiple columns

Hello Team,

In my current use case, I am loading data from CSV using spark-csv and trying to correlate all variables.

As of now, if we want to correlate two columns in a DataFrame, df.stat.corr works great, but it does not help when we want to correlate many columns at once.
In R, we can use corrplot and correlate all numeric columns in a single line of code. Can you guide me on how to achieve the same with a DataFrame or SQL?
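
For reference, one stopgap that stays entirely within the DataFrame API is to call df.stat.corr once per pair of columns and collect the results. The sketch below is not from the thread. The helper name pairwise_corr is made up for illustration, and it assumes every listed column is numeric and non-constant (the diagonal is filled with 1.0 rather than computed). Each call launches its own Spark job, so this is only reasonable for a modest number of columns.

```python
from itertools import combinations


def pairwise_corr(df, columns):
    """Build a symmetric correlation table using df.stat.corr per pair.

    `df` is expected to be a Spark DataFrame; only `df.stat.corr(a, b)`
    (Pearson correlation of two columns) is used.
    """
    # Assumption: a column correlates perfectly with itself, so the
    # diagonal is set to 1.0 without running a job.
    table = {c: {c: 1.0} for c in columns}
    for a, b in combinations(columns, 2):
        r = df.stat.corr(a, b)  # one Spark job per pair
        table[a][b] = r
        table[b][a] = r
    return table
```

For n columns this runs n * (n - 1) / 2 separate jobs.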

There seems to be a way in spark-mllib:
http://spark.apache.org/docs/latest/mllib-statistics.html

But it seems that it doesn't take a DataFrame as input...

Regards,
Ankur
Information transmitted by this e-mail is proprietary to YASH Technologies and/ or its Customers and is intended for use only by the individual or entity to which it is addressed, and may contain information that is privileged, confidential or exempt from disclosure under applicable law. If you are not the intended recipient or it appears that this mail has been forwarded to you without proper authority, you are notified that any use or dissemination of this information in any manner is strictly prohibited. In such cases, please notify us immediately at info@yash.com and delete this mail from your records.

Re: dataframe stat corr for multiple columns

Posted by Sun Rui <su...@163.com>.
There is an existing JIRA issue for this: https://issues.apache.org/jira/browse/SPARK-11057
There is also a PR for it. Maybe we should help review it and merge it with a higher priority.
> On May 20, 2016, at 00:09, Xiangrui Meng <me...@databricks.com> wrote:
> 
> This is nice to have. Please create a JIRA for it. Right now, you can merge all columns into a vector column using RFormula or VectorAssembler, then convert it into an RDD and call corr from MLlib.


Re: dataframe stat corr for multiple columns

Posted by Xiangrui Meng <me...@databricks.com>.
This is nice to have. Please create a JIRA for it. Right now, you can merge
all columns into a vector column using RFormula or VectorAssembler, then
convert it into an RDD and call corr from MLlib.
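
The workaround described above might look roughly like the following in PySpark. This is a sketch, not code from the thread. The function names correlation_matrix and label_matrix and the output column name "features" are made up for illustration, and it assumes the selected columns are numeric and contain no nulls. Rows are converted with toArray() so that the RDD holds plain NumPy arrays, which Statistics.corr accepts regardless of which vector class VectorAssembler emits in a given Spark version.

```python
def correlation_matrix(df, columns, method="pearson"):
    """Assemble `columns` into one vector column, then call MLlib's corr.

    Returns a square NumPy array whose row/column order follows `columns`.
    """
    # Imported inside the function so this module can be loaded on a
    # machine without Spark installed.
    from pyspark.ml.feature import VectorAssembler
    from pyspark.mllib.stat import Statistics

    assembler = VectorAssembler(inputCols=columns, outputCol="features")
    vectors = (
        assembler.transform(df)
        .select("features")
        .rdd.map(lambda row: row[0].toArray())
    )
    return Statistics.corr(vectors, method=method)


def label_matrix(columns, matrix):
    """Turn a square matrix into a dict of dicts keyed by column name."""
    return {
        a: {b: float(matrix[i][j]) for j, b in enumerate(columns)}
        for i, a in enumerate(columns)
    }
```

With a DataFrame df loaded through spark-csv and a list cols of its numeric column names, label_matrix(cols, correlation_matrix(df, cols)) would give a table that can be indexed by column name. Passing method="spearman" should switch to rank correlation.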
