Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2015/04/26 09:02:38 UTC

[jira] [Created] (SPARK-7151) Correlation methods for DataFrame

Joseph K. Bradley created SPARK-7151:
----------------------------------------

             Summary: Correlation methods for DataFrame
                 Key: SPARK-7151
                 URL: https://issues.apache.org/jira/browse/SPARK-7151
             Project: Spark
          Issue Type: New Feature
          Components: ML, SQL
            Reporter: Joseph K. Bradley
            Priority: Minor


We should support computing correlations between columns in DataFrames with a simple API.

This could be a DataFrame feature:
{code}
myDataFrame.corr("col1", "col2")
// or
myDataFrame.corr("col1", "col2", "pearson") // specify correlation type
{code}
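
For reference, here is a minimal plain-Scala sketch of the value a "pearson" correlation between two columns would compute. It does not use Spark at all: the object `CorrSketch` and the function `pearson` are hypothetical names for illustration, and two in-memory `Seq[Double]` values stand in for DataFrame columns.

```scala
object CorrSketch {
  // Hypothetical helper, not a Spark API: Pearson correlation of two
  // row-aligned numeric columns held in memory.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.length == ys.length, "columns must have the same number of rows")
    require(xs.length > 1, "need at least two rows")
    val n = xs.length
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    // Sum of products of deviations from the means.
    val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    // Sums of squared deviations for each column.
    val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
    val varY = ys.map(y => (y - meanY) * (y - meanY)).sum
    // Yields NaN when either column is constant (zero variance).
    cov / math.sqrt(varX * varY)
  }

  def main(args: Array[String]): Unit = {
    val col1 = Seq(1.0, 2.0, 3.0, 4.0)
    val col2 = Seq(2.0, 4.0, 6.0, 8.0)
    println(pearson(col1, col2))
  }
}
```

A distributed implementation would presumably compute the same sums in a single pass over the rows rather than materializing the columns.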

Or it could be an MLlib feature:
{code}
Statistics.corr(myDataFrame("col1"), myDataFrame("col2"))
// or
Statistics.corr(myDataFrame, "col1", "col2")
{code}
(The first Statistics.corr option is more flexible, but it could cause trouble if a user passes in two DataFrame columns that cannot be zipped together.)
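
To illustrate the zipping concern, the sketch below uses plain Scala collections as a stand-in for columns. It is only an analogy, not Spark code: `ZipSketch` and `zipColumns` are hypothetical names, and the assumption is that a column-pair API would need some similar check that the two inputs can be paired row by row.

```scala
object ZipSketch {
  // Hypothetical helper: pair two columns row by row, refusing inputs
  // whose row counts differ instead of silently truncating.
  def zipColumns(xs: Seq[Double], ys: Seq[Double]): Either[String, Seq[(Double, Double)]] =
    if (xs.length != ys.length)
      Left(s"cannot zip columns of ${xs.length} and ${ys.length} rows")
    else
      Right(xs.zip(ys))

  def main(args: Array[String]): Unit = {
    println(zipColumns(Seq(1.0, 2.0), Seq(3.0, 4.0)))
    println(zipColumns(Seq(1.0, 2.0, 3.0), Seq(4.0)))
  }
}
```

The variant that takes a single DataFrame plus two column names avoids this check entirely, since both columns come from the same rows.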

Note: R follows the latter setup.  I'm OK with either.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
