Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2015/04/26 09:02:38 UTC
[jira] [Created] (SPARK-7151) Correlation methods for DataFrame
Joseph K. Bradley created SPARK-7151:
----------------------------------------
Summary: Correlation methods for DataFrame
Key: SPARK-7151
URL: https://issues.apache.org/jira/browse/SPARK-7151
Project: Spark
Issue Type: New Feature
Components: ML, SQL
Reporter: Joseph K. Bradley
Priority: Minor
We should support computing correlations between columns in DataFrames with a simple API.
This could be a DataFrame feature:
{code}
myDataFrame.corr("col1", "col2")
// or
myDataFrame.corr("col1", "col2", "pearson") // specify correlation type
{code}
Or it could be an MLlib feature:
{code}
Statistics.corr(myDataFrame("col1"), myDataFrame("col2"))
// or
Statistics.corr(myDataFrame, "col1", "col2")
{code}
(The first Statistics.corr option is more flexible, but it could cause trouble if a user tries to pass in two DataFrame columns that cannot be zipped together.)
Note: R follows the latter setup. I'm OK with either.
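For reference, the quantity that the "pearson" option would return for two columns can be sketched in plain Scala with no Spark dependency. The helper name pearson and the use of Seq[Double] in place of DataFrame columns are assumptions made for illustration; this is not the proposed API or its implementation.

```scala
// Hypothetical helper: Pearson correlation of two equal-length numeric columns.
// Seq[Double] stands in for DataFrame columns; this is not the proposed Spark API.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  // zip would silently truncate mismatched columns, so check lengths up front.
  require(xs.length == ys.length, "columns must have the same number of rows")
  require(xs.nonEmpty, "columns must not be empty")
  val n = xs.length
  val meanX = xs.sum / n
  val meanY = ys.sum / n
  // Sum of centered products and sums of centered squares.
  val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
  val varY = ys.map(y => (y - meanY) * (y - meanY)).sum
  // Evaluates to NaN when either column is constant (zero variance).
  cov / math.sqrt(varX * varY)
}
```

The explicit length check mirrors the concern above about columns that cannot be zipped: two columns can only be correlated when their rows line up one to one.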
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)