Posted to reviews@spark.apache.org by sun-rui <gi...@git.apache.org> on 2015/09/22 14:27:36 UTC

[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

GitHub user sun-rui opened a pull request:

    https://github.com/apache/spark/pull/8869

    [SPARK-10752][SPARKR] Implement corr() and cov in DataFrameStatFunctions.

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sun-rui/spark SPARK-10752

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8869.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8869
    
----
commit c54002c295278fb2b7c80df11c0d1305adb3aa9e
Author: Sun Rui <ru...@intel.com>
Date:   2015-09-22T12:21:25Z

    [SPARK-10752][SPARKR] Implement corr() and cov in DataFrameStatFunctions.

commit 038be09ee04de625bb10d1d4a13f495dcc774ac3
Author: Sun Rui <ru...@intel.com>
Date:   2015-09-22T12:28:25Z

    Remove crosstab() from DataFrame.R.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-145261694
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43212/
    Test PASSed.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-146080183
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43320/
    Test PASSed.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by sun-rui <gi...@git.apache.org>.

Github user sun-rui commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8869#discussion_r41087117
  
    --- Diff: R/pkg/R/DataFrameStatFunctions.R ---
    @@ -0,0 +1,102 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# DataFrameStatFunctions.R - Statistic functions for DataFrames.
    +
    +setOldClass("jobj")
    +
    +#' crosstab
    +#'
    +#' Computes a pair-wise frequency table of the given columns. Also known as a contingency
    +#' table. The number of distinct values for each column should be less than 1e4. At most 1e6
    +#' non-zero pair frequencies will be returned.
    +#'
    +#' @param col1 name of the first column. Distinct items will make the first item of each row.
    +#' @param col2 name of the second column. Distinct items will make the column names of the output.
    +#' @return a local R data.frame representing the contingency table. The first column of each row
    +#'         will be the distinct values of `col1` and the column names will be the distinct values
    +#'         of `col2`. The name of the first column will be `$col1_$col2`. Pairs that have no
    +#'         occurrences will have zero as their counts.
    +#'
    +#' @rdname statfunctions
    +#' @name crosstab
    +#' @export
    +#' @examples
    +#' \dontrun{
    +#' df <- jsonFile(sqlCtx, "/path/to/file.json")
    +#' ct <- crosstab(df, "title", "gender")
    +#' }
    +setMethod("crosstab",
    +          signature(x = "DataFrame", col1 = "character", col2 = "character"),
    +          function(x, col1, col2) {
    +            statFunctions <- callJMethod(x@sdf, "stat")
    +            sct <- callJMethod(statFunctions, "crosstab", col1, col2)
    +            collect(dataFrame(sct))
    +          })
    +
    +#' cov
    +#'
    +#' Calculate the sample covariance of two numerical columns of a DataFrame.
    +#'
    +#' @param x A SparkSQL DataFrame
    +#' @param col1 the name of the first column
    +#' @param col2 the name of the second column
    +#' @return the covariance of the two columns.
    +#'
    +#' @rdname statfunctions
    +#' @name cov
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' df <- jsonFile(sqlCtx, "/path/to/file.json")
    +#' cov <- cov(df, "title", "gender")
    +#' }
    +setMethod("cov",
    +          signature(x = "DataFrame", col1 = "character", col2 = "character"),
    --- End diff --
    
    yeah, I agree.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-146078664
  
      [Test build #43320 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43320/consoleFull) for   PR 8869 at commit [`e73c8f3`](https://github.com/apache/spark/commit/e73c8f3a01a68a4ea839aa3925581e67744cfddd).



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-145249449
  
      [Test build #43212 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43212/consoleFull) for   PR 8869 at commit [`302af26`](https://github.com/apache/spark/commit/302af267195b1bcb7e3171f26afae29993025de5).



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-146072678
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8869#discussion_r41348453

    --- Diff: R/pkg/R/stats.R ---
    @@ -0,0 +1,102 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# DataFrameStatFunctions.R - Statistic functions for DataFrames.
    +
    +setOldClass("jobj")
    +
    +#' crosstab
    +#'
    +#' Computes a pair-wise frequency table of the given columns. Also known as a contingency
    +#' table. The number of distinct values for each column should be less than 1e4. At most 1e6
    +#' non-zero pair frequencies will be returned.
    +#'
    +#' @param col1 name of the first column. Distinct items will make the first item of each row.
    +#' @param col2 name of the second column. Distinct items will make the column names of the output.
    +#' @return a local R data.frame representing the contingency table. The first column of each row
    +#'         will be the distinct values of `col1` and the column names will be the distinct values
    +#'         of `col2`. The name of the first column will be `$col1_$col2`. Pairs that have no
    +#'         occurrences will have zero as their counts.
    +#'
    +#' @rdname statfunctions
    +#' @name crosstab
    +#' @export
    +#' @examples
    +#' \dontrun{
    +#' df <- jsonFile(sqlCtx, "/path/to/file.json")
    --- End diff --
    
    perhaps a good time to update `sqlCtx` to `sqlContext`


[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-146078080
  
    Merged build started.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-146072671
  
      [Test build #43316 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43316/console) for   PR 8869 at commit [`b05c443`](https://github.com/apache/spark/commit/b05c44386a7bbc14bb05f5eb11844fda2bc84623).
     * This patch **fails R style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-146080129
  
      [Test build #43320 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43320/console) for   PR 8869 at commit [`e73c8f3`](https://github.com/apache/spark/commit/e73c8f3a01a68a4ea839aa3925581e67744cfddd).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-142464859
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8869#discussion_r41348491
  
    --- Diff: R/pkg/R/stats.R ---
    @@ -0,0 +1,102 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# DataFrameStatFunctions.R - Statistic functions for DataFrames.
    +
    +setOldClass("jobj")
    +
    +#' crosstab
    +#'
    +#' Computes a pair-wise frequency table of the given columns. Also known as a contingency
    +#' table. The number of distinct values for each column should be less than 1e4. At most 1e6
    +#' non-zero pair frequencies will be returned.
    +#'
    +#' @param col1 name of the first column. Distinct items will make the first item of each row.
    +#' @param col2 name of the second column. Distinct items will make the column names of the output.
    +#' @return a local R data.frame representing the contingency table. The first column of each row
    +#'         will be the distinct values of `col1` and the column names will be the distinct values
    +#'         of `col2`. The name of the first column will be `$col1_$col2`. Pairs that have no
    +#'         occurrences will have zero as their counts.
    +#'
    +#' @rdname statfunctions
    +#' @name crosstab
    +#' @export
    +#' @examples
    +#' \dontrun{
    +#' df <- jsonFile(sqlCtx, "/path/to/file.json")
    +#' ct <- crosstab(df, "title", "gender")
    +#' }
    +setMethod("crosstab",
    +          signature(x = "DataFrame", col1 = "character", col2 = "character"),
    +          function(x, col1, col2) {
    +            statFunctions <- callJMethod(x@sdf, "stat")
    +            sct <- callJMethod(statFunctions, "crosstab", col1, col2)
    +            collect(dataFrame(sct))
    +          })
    +
    +#' cov
    +#'
    +#' Calculate the sample covariance of two numerical columns of a DataFrame.
    +#'
    +#' @param x A SparkSQL DataFrame
    +#' @param col1 the name of the first column
    +#' @param col2 the name of the second column
    +#' @return the covariance of the two columns.
    +#'
    +#' @rdname statfunctions
    +#' @name cov
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' df <- jsonFile(sqlCtx, "/path/to/file.json")
    +#' cov <- cov(df, "title", "gender")
    +#' }
    +setMethod("cov",
    +          signature(x = "DataFrame", col1 = "character", col2 = "character"),
    +          function(x, col1, col2) {
    +            statFunctions <- callJMethod(x@sdf, "stat")
    +            callJMethod(statFunctions, "cov", col1, col2)
    +          })
    +
    +#' corr
    +#'
    +#' Calculates the correlation of two columns of a DataFrame.
    +#' Currently only supports the Pearson Correlation Coefficient.
    +#' For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.
    +#' 
    +#' @param x A SparkSQL DataFrame
    +#' @param col1 the name of the first column
    +#' @param col2 the name of the second column
    +#' @param method Optional. A character specifying the method for calculating the correlation.
    +#'               only "pearson" is allowed now.
    +#' @return The Pearson Correlation Coefficient as a Double.
    +#'
    +#' @rdname statfunctions
    +#' @name corr
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' df <- jsonFile(sqlCtx, "/path/to/file.json")
    +#' corr <- corr(df, "title", "gender")
    +#' corr <- corr(df, "title", "gender", "pearson")
    --- End diff --
    
    would it be better to say
    `corr <- corr(df, "title", "gender", method = "pearson")`
    ?
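
The difference between the positional and named forms can be tried locally. The following is a minimal sketch in base R, assuming a hypothetical stand-in `corr_sketch()` that works on plain numeric vectors rather than a SparkR DataFrame; it is not the function from the diff and only mirrors its documented restriction to "pearson".

```r
# Hypothetical stand-in for the corr() method under review.
# It operates on local numeric vectors, not on a SparkR DataFrame.
corr_sketch <- function(x, y, method = "pearson") {
  # Mirror the documented restriction: only "pearson" is accepted.
  stopifnot(identical(method, "pearson"))
  stats::cor(x, y, method = method)
}

# Made-up illustrative data.
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 9)

# Both call styles return the same value; the named form states
# what the fourth argument means without consulting the docs.
by_position <- corr_sketch(x, y, "pearson")
by_name <- corr_sketch(x, y, method = "pearson")
stopifnot(isTRUE(all.equal(by_position, by_name)))
```

Naming the argument also keeps the call valid if further optional parameters are ever added ahead of `method`.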



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-146258846
  
    Merging into master, thanks!



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/8869



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by sun-rui <gi...@git.apache.org>.

Github user sun-rui commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-146048835
  
    rebased to master



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-142274109
  
    Merged build started.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-145261693
  
    Build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-146049566
  
      [Test build #43311 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43311/console) for   PR 8869 at commit [`ac2fd32`](https://github.com/apache/spark/commit/ac2fd32a660a14c2b20a82bf7207c4805f46a9be).
     * This patch **fails R style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-145942409
  
    LGTM



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-146072680
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43316/
    Test FAILed.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by shivaram <gi...@git.apache.org>.

Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-145932636
  
    @davies any other comments?
    @sun-rui Could you bring this up to date with the master branch?



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by shivaram <gi...@git.apache.org>.

Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8869#discussion_r40490731
  
    --- Diff: R/pkg/R/DataFrameStatFunctions.R ---
    @@ -0,0 +1,102 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# DataFrameStatFunctions.R - Statistic functions for DataFrames.
    +
    +setOldClass("jobj")
    +
    +#' crosstab
    +#'
    +#' Computes a pair-wise frequency table of the given columns. Also known as a contingency
    +#' table. The number of distinct values for each column should be less than 1e4. At most 1e6
    +#' non-zero pair frequencies will be returned.
    +#'
    +#' @param col1 name of the first column. Distinct items will make the first item of each row.
    +#' @param col2 name of the second column. Distinct items will make the column names of the output.
    +#' @return a local R data.frame representing the contingency table. The first column of each row
    +#'         will be the distinct values of `col1` and the column names will be the distinct values
    +#'         of `col2`. The name of the first column will be `$col1_$col2`. Pairs that have no
    +#'         occurrences will have zero as their counts.
    +#'
    +#' @rdname statfunctions
    +#' @name crosstab
    +#' @export
    +#' @examples
    +#' \dontrun{
    +#' df <- jsonFile(sqlCtx, "/path/to/file.json")
    +#' ct <- crosstab(df, "title", "gender")
    +#' }
    +setMethod("crosstab",
    +          signature(x = "DataFrame", col1 = "character", col2 = "character"),
    +          function(x, col1, col2) {
    +            statFunctions <- callJMethod(x@sdf, "stat")
    +            sct <- callJMethod(statFunctions, "crosstab", col1, col2)
    +            collect(dataFrame(sct))
    +          })
    +
    +#' cov
    +#'
    +#' Calculate the sample covariance of two numerical columns of a DataFrame.
    +#'
    +#' @param x A SparkSQL DataFrame
    +#' @param col1 the name of the first column
    +#' @param col2 the name of the second column
    +#' @return the covariance of the two columns.
    +#'
    +#' @rdname statfunctions
    +#' @name cov
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' df <- jsonFile(sqlCtx, "/path/to/file.json")
    +#' cov <- cov(df, "title", "gender")
    +#' }
    +setMethod("cov",
    +          signature(x = "DataFrame", col1 = "character", col2 = "character"),
    --- End diff --
    
    It would be cool if we also had versions that take Columns instead of just strings.
    @rxin Is there any reason all the stat functions take only string column names in Scala?



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-146080181
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-146071562
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-142466128
  
      [Test build #42875 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42875/consoleFull) for   PR 8869 at commit [`d35c3f5`](https://github.com/apache/spark/commit/d35c3f56be1785cd5e3217bf0f53f7ba42504b7c).



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-142464871
  
    Merged build started.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-146071597
  
    Merged build started.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-146048920
  
    Merged build started.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by NarineK <gi...@git.apache.org>.

Github user NarineK commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8869#discussion_r40581218
  
    --- Diff: R/pkg/R/DataFrameStatFunctions.R ---
    @@ -0,0 +1,102 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# DataFrameStatFunctions.R - Statistic functions for DataFrames.
    +
    +setOldClass("jobj")
    +
    +#' crosstab
    +#'
    +#' Computes a pair-wise frequency table of the given columns. Also known as a contingency
    +#' table. The number of distinct values for each column should be less than 1e4. At most 1e6
    +#' non-zero pair frequencies will be returned.
    +#'
    +#' @param col1 name of the first column. Distinct items will make the first item of each row.
    +#' @param col2 name of the second column. Distinct items will make the column names of the output.
    +#' @return a local R data.frame representing the contingency table. The first column of each row
    +#'         will be the distinct values of `col1` and the column names will be the distinct values
    +#'         of `col2`. The name of the first column will be `$col1_$col2`. Pairs that have no
    +#'         occurrences will have zero as their counts.
    +#'
    +#' @rdname statfunctions
    +#' @name crosstab
    +#' @export
    +#' @examples
    +#' \dontrun{
    +#' df <- jsonFile(sqlCtx, "/path/to/file.json")
    +#' ct <- crosstab(df, "title", "gender")
    +#' }
    +setMethod("crosstab",
    +          signature(x = "DataFrame", col1 = "character", col2 = "character"),
    +          function(x, col1, col2) {
    +            statFunctions <- callJMethod(x@sdf, "stat")
    +            sct <- callJMethod(statFunctions, "crosstab", col1, col2)
    +            collect(dataFrame(sct))
    +          })
    +
    +#' cov
    +#'
    +#' Calculate the sample covariance of two numerical columns of a DataFrame.
    +#'
    +#' @param x A SparkSQL DataFrame
    +#' @param col1 the name of the first column
    +#' @param col2 the name of the second column
    +#' @return the covariance of the two columns.
    +#'
    +#' @rdname statfunctions
    +#' @name cov
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' df <- jsonFile(sqlCtx, "/path/to/file.json")
    +#' cov <- cov(df, "title", "gender")
    +#' }
    +setMethod("cov",
    +          signature(x = "DataFrame", col1 = "character", col2 = "character"),
    --- End diff --
    
    Hi there, 
    I have some points about correlation and covariance.
    1. R names the function 'cor', not 'corr', so if we want the same syntax as R, we should use 'cor'.
    2. The actual signature of cor (cov has a similar one) is: cor(x, y = NULL, use = "everything",
        method = c("pearson", "kendall", "spearman"))
    where x is a data frame and y can be another data frame, a vector, or a matrix.
    In R I can get something like this:
    cor(longley)
                 GNP.deflator       GNP   Unemployed .... 
    GNP.deflator    1.0000000 0.9915892
    GNP             0.9915892 1.0000000
    Unemployed      0.6206334 0.6042609
    Armed.Forces    0.4647442 0.4464368
    Population      0.9791634 0.9910901
    Year            0.9911492 0.9952735
    Employed        0.9708985 0.9835516
    
    I wonder if we can get this in SparkR too.
    I see at least two options here:
    1. we make K calls to the DataFrame API, one for each column pair, or
    2. we extend the Scala DataFrame API so that it also accepts a list of columns.
    I can help with this if you think it makes sense and we want to add it.
    
    Thanks,
    Narine
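
The first option in the comment above (one call per column pair) could be sketched roughly as follows. This is only an illustration, not code from the pull request: `pairwise_cor_matrix`, `pair_cor`, and `local_pair_cor` are hypothetical names, and the stand-in runs on the local `longley` data set with base R rather than on a SparkR DataFrame. In SparkR, `pair_cor` would presumably wrap whatever two-column correlation call the patch ends up exposing.

```r
# Hypothetical helper: assemble a correlation matrix from two-column calls.
# `pair_cor(a, b)` is an assumed callback that returns the correlation of
# the columns named `a` and `b`.
pairwise_cor_matrix <- function(cols, pair_cor) {
  k <- length(cols)
  m <- diag(1, k)                      # a column correlates perfectly with itself
  dimnames(m) <- list(cols, cols)
  if (k < 2) return(m)
  for (i in seq_len(k - 1)) {
    for (j in (i + 1):k) {
      # Correlation is symmetric, so each unordered pair is computed once
      # and written to both halves of the matrix.
      m[i, j] <- m[j, i] <- pair_cor(cols[i], cols[j])
    }
  }
  m
}

# Local stand-in for the per-pair call, using base R on a local data.frame.
local_pair_cor <- function(a, b) stats::cor(longley[[a]], longley[[b]])

m <- pairwise_cor_matrix(names(longley), local_pair_cor)
```

With K columns this makes K * (K - 1) / 2 per-pair calls, each of which would be a separate job on a distributed DataFrame, which is the cost the second option (passing a list of columns to the Scala API) would avoid.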



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8869#discussion_r41348531
  
    --- Diff: R/pkg/R/stats.R ---
    @@ -0,0 +1,102 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# DataFrameStatFunctions.R - Statistic functions for DataFrames.
    --- End diff --
    
    stats.R



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-146078070
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-142276456
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42832/
    Test FAILed.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-142276452
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-146072487
  
      [Test build #43316 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43316/consoleFull) for   PR 8869 at commit [`b05c443`](https://github.com/apache/spark/commit/b05c44386a7bbc14bb05f5eb11844fda2bc84623).



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-146049134
  
      [Test build #43311 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43311/consoleFull) for   PR 8869 at commit [`ac2fd32`](https://github.com/apache/spark/commit/ac2fd32a660a14c2b20a82bf7207c4805f46a9be).



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-142467524
  
      [Test build #42875 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42875/console) for   PR 8869 at commit [`d35c3f5`](https://github.com/apache/spark/commit/d35c3f56be1785cd5e3217bf0f53f7ba42504b7c).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-146048909
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-142275868
  
      [Test build #42832 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42832/consoleFull) for   PR 8869 at commit [`038be09`](https://github.com/apache/spark/commit/038be09ee04de625bb10d1d4a13f495dcc774ac3).



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-142276448
  
      [Test build #42832 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42832/console) for   PR 8869 at commit [`038be09`](https://github.com/apache/spark/commit/038be09ee04de625bb10d1d4a13f495dcc774ac3).
     * This patch **fails R style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8869#discussion_r40624158
  
    --- Diff: R/pkg/R/DataFrameStatFunctions.R ---
    @@ -0,0 +1,102 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# DataFrameStatFunctions.R - Statistic functions for DataFrames.
    +
    +setOldClass("jobj")
    +
    +#' crosstab
    +#'
    +#' Computes a pair-wise frequency table of the given columns. Also known as a contingency
    +#' table. The number of distinct values for each column should be less than 1e4. At most 1e6
    +#' non-zero pair frequencies will be returned.
    +#'
    +#' @param col1 name of the first column. Distinct items will make the first item of each row.
    +#' @param col2 name of the second column. Distinct items will make the column names of the output.
    +#' @return a local R data.frame representing the contingency table. The first column of each row
    +#'         will be the distinct values of `col1` and the column names will be the distinct values
    +#'         of `col2`. The name of the first column will be `$col1_$col2`. Pairs that have no
    +#'         occurrences will have zero as their counts.
    +#'
    +#' @rdname statfunctions
    +#' @name crosstab
    +#' @export
    +#' @examples
    +#' \dontrun{
    +#' df <- jsonFile(sqlCtx, "/path/to/file.json")
    +#' ct <- crosstab(df, "title", "gender")
    +#' }
    +setMethod("crosstab",
    +          signature(x = "DataFrame", col1 = "character", col2 = "character"),
    +          function(x, col1, col2) {
    +            statFunctions <- callJMethod(x@sdf, "stat")
    +            sct <- callJMethod(statFunctions, "crosstab", col1, col2)
    +            collect(dataFrame(sct))
    +          })
    +
    +#' cov
    +#'
    +#' Calculate the sample covariance of two numerical columns of a DataFrame.
    +#'
    +#' @param x A SparkSQL DataFrame
    +#' @param col1 the name of the first column
    +#' @param col2 the name of the second column
    +#' @return the covariance of the two columns.
    +#'
    +#' @rdname statfunctions
    +#' @name cov
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' df <- jsonFile(sqlCtx, "/path/to/file.json")
    +#' cov <- cov(df, "title", "gender")
    +#' }
    +setMethod("cov",
    +          signature(x = "DataFrame", col1 = "character", col2 = "character"),
    --- End diff --
    
    Link on the function name: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/cor.html



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8869#discussion_r40968690
  
    --- Diff: R/pkg/R/DataFrameStatFunctions.R ---
    @@ -0,0 +1,102 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# DataFrameStatFunctions.R - Statistic functions for DataFrames.
    +
    +setOldClass("jobj")
    +
    +#' crosstab
    +#'
    +#' Computes a pair-wise frequency table of the given columns. Also known as a contingency
    +#' table. The number of distinct values for each column should be less than 1e4. At most 1e6
    +#' non-zero pair frequencies will be returned.
    +#'
    +#' @param col1 name of the first column. Distinct items will make the first item of each row.
    +#' @param col2 name of the second column. Distinct items will make the column names of the output.
    +#' @return a local R data.frame representing the contingency table. The first column of each row
    +#'         will be the distinct values of `col1` and the column names will be the distinct values
    +#'         of `col2`. The name of the first column will be `$col1_$col2`. Pairs that have no
    +#'         occurrences will have zero as their counts.
    +#'
    +#' @rdname statfunctions
    +#' @name crosstab
    +#' @export
    +#' @examples
    +#' \dontrun{
    +#' df <- jsonFile(sqlCtx, "/path/to/file.json")
    +#' ct <- crosstab(df, "title", "gender")
    +#' }
    +setMethod("crosstab",
    +          signature(x = "DataFrame", col1 = "character", col2 = "character"),
    +          function(x, col1, col2) {
    +            statFunctions <- callJMethod(x@sdf, "stat")
    +            sct <- callJMethod(statFunctions, "crosstab", col1, col2)
    +            collect(dataFrame(sct))
    +          })
    +
    +#' cov
    +#'
    +#' Calculate the sample covariance of two numerical columns of a DataFrame.
    +#'
    +#' @param x A SparkSQL DataFrame
    +#' @param col1 the name of the first column
    +#' @param col2 the name of the second column
    +#' @return the covariance of the two columns.
    +#'
    +#' @rdname statfunctions
    +#' @name cov
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' df <- jsonFile(sqlCtx, "/path/to/file.json")
    +#' cov <- cov(df, "title", "gender")
    +#' }
    +setMethod("cov",
    +          signature(x = "DataFrame", col1 = "character", col2 = "character"),
    --- End diff --
    
    It would be great if we could have the same signature as the R API. Given that a Spark DataFrame is quite different from an R data.frame, this will be hard; maybe we could support only a small subset of what the R API can do. If the two can't be compatible, it's clearer to use a different name than to confuse users.
    
    Does this sound reasonable?



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-142274093
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-145261664
  
      [Test build #43212 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43212/console) for   PR 8869 at commit [`302af26`](https://github.com/apache/spark/commit/302af267195b1bcb7e3171f26afae29993025de5).
     * This patch **passes all tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-146049577
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43311/



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-145249335
  
     Build triggered.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-142467567
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-145249343
  
    Build started.



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by sun-rui <gi...@git.apache.org>.

Github user sun-rui commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8869#discussion_r40639291
  
    --- Diff: R/pkg/R/DataFrameStatFunctions.R ---
    @@ -0,0 +1,102 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# DataFrameStatFunctions.R - Statistic functions for DataFrames.
    +
    +setOldClass("jobj")
    +
    +#' crosstab
    +#'
    +#' Computes a pair-wise frequency table of the given columns. Also known as a contingency
    +#' table. The number of distinct values for each column should be less than 1e4. At most 1e6
    +#' non-zero pair frequencies will be returned.
    +#'
    +#' @param col1 name of the first column. Distinct items will make the first item of each row.
    +#' @param col2 name of the second column. Distinct items will make the column names of the output.
    +#' @return a local R data.frame representing the contingency table. The first column of each row
    +#'         will be the distinct values of `col1` and the column names will be the distinct values
    +#'         of `col2`. The name of the first column will be `$col1_$col2`. Pairs that have no
    +#'         occurrences will have zero as their counts.
    +#'
    +#' @rdname statfunctions
    +#' @name crosstab
    +#' @export
    +#' @examples
    +#' \dontrun{
    +#' df <- jsonFile(sqlCtx, "/path/to/file.json")
    +#' ct <- crosstab(df, "title", "gender")
    +#' }
    +setMethod("crosstab",
    +          signature(x = "DataFrame", col1 = "character", col2 = "character"),
    +          function(x, col1, col2) {
    +            statFunctions <- callJMethod(x@sdf, "stat")
    +            sct <- callJMethod(statFunctions, "crosstab", col1, col2)
    +            collect(dataFrame(sct))
    +          })
    +
    +#' cov
    +#'
    +#' Calculate the sample covariance of two numerical columns of a DataFrame.
    +#'
    +#' @param x A SparkSQL DataFrame
    +#' @param col1 the name of the first column
    +#' @param col2 the name of the second column
    +#' @return the covariance of the two columns.
    +#'
    +#' @rdname statfunctions
    +#' @name cov
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' df <- jsonFile(sqlCtx, "/path/to/file.json")
    +#' cov <- cov(df, "title", "gender")
    +#' }
    +setMethod("cov",
    +          signature(x = "DataFrame", col1 = "character", col2 = "character"),
    --- End diff --
    
    @NarineK, thank you for your comments. Your suggestion would require extending the Scala DataFrame API, so I would prefer that you file a new JIRA for it. @shivaram, what do you think?



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-142467569
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42875/



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8869#discussion_r40967930
  
    --- Diff: R/pkg/DESCRIPTION ---
    @@ -23,6 +23,7 @@ Collate:
         'column.R'
         'group.R'
         'DataFrame.R'
    +    'DataFrameStatFunctions.R'
    --- End diff --
    
    Can we use a shorter name, like stats.R?



[GitHub] spark pull request: [SPARK-10752][SPARKR] Implement corr() and cov...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8869#issuecomment-146049574
  
    Merged build finished. Test FAILed.

