Posted to reviews@spark.apache.org by brkyvz <gi...@git.apache.org> on 2014/08/26 03:49:50 UTC

[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

GitHub user brkyvz opened a pull request:

    https://github.com/apache/spark/pull/2123

    [SPARK-2839][MLlib] Stats Toolkit documentation updated

    Documentation updated for the Statistics Toolkit of MLlib. @mengxr @atalwalkar
    
    https://issues.apache.org/jira/browse/SPARK-2839

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/brkyvz/spark StatsLib-Docs

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2123.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2123
    
----
commit fec4d9d8b9569000125e5e3778d8a5521f4f0b72
Author: Burak <br...@gmail.com>
Date:   2014-08-26T01:44:43Z

    [SPARK-2830][MLlib] Stats Toolkit documentation updated

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695734
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in statistics. In MLlib
    +we provide the flexibility to calculate pairwise correlations among many series. The supported
    +correlation methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    +
    +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +Support for `RowMatrix` operations in Python does not currently exist, but will be added in
    +future releases.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +seriesX = ... # a series
    +seriesY = ... # must have the same number of partitions and cardinality as seriesX
    +
    +# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +# method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(seriesX, seriesY, method="pearson")
    +
    +data = ... # an RDD of Vectors
    +# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +# If a method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(data, method="pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
    +
    +## Stratified Sampling
    +
    +Unlike the other statistics functions, which reside in MLlib, the stratified sampling methods
    +`sampleByKey` and `sampleByKeyExact` can be accessed in `PairRDDFunctions` in Spark core, as
    +stratified sampling is tightly coupled with the pair-RDD data type and the function signatures
    +conform to the other *ByKey* methods in `PairRDDFunctions`. A separate method exists for exact
    +sample size support because it requires significantly more resources than the per-stratum simple
    +random sampling used in `sampleByKey`.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.rdd.PairRDDFunctions
    +
    +val sc: SparkContext = ...
    +
    +val data = ... // an RDD[(K,V)] of any key value pairs
    +val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +val sample = data.sampleByKeyExact(withReplacement=false, fractions)
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaPairRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaPairRDD<K,V> data = ... // an RDD of any key value pairs
    +java.util.Map<K,Object> fractions = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +JavaPairRDD<K,V> sample = data.sampleByKeyExact(false, fractions);
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
     
     ## Summary Statistics 
     
     ### Multivariate summary statistics
     
    -We provide column summary statistics for `RowMatrix` (note: this functionality is not currently supported in `IndexedRowMatrix` or `CoordinateMatrix`). 
    -If the number of columns is not large, e.g., on the order of thousands, then the 
    -covariance matrix can also be computed as a local matrix, which requires $\mathcal{O}(n^2)$ storage where $n$ is the
    -number of columns. The total CPU time is $\mathcal{O}(m n^2)$, where $m$ is the number of rows,
    -and is faster if the rows are sparse.
    +We provide column summary statistics for `RDD[Vector]` through the static function `colStats` 
    +available in `Statistics`.
     
     <div class="codetabs">
     <div data-lang="scala" markdown="1">
     
    -[`computeColumnSummaryStatistics()`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) returns an instance of
    +[`colStats()`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of
     [`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
     which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
     total count.
     
     {% highlight scala %}
    -import org.apache.spark.mllib.linalg.Matrix
    -import org.apache.spark.mllib.linalg.distributed.RowMatrix
    -import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
     
    -val mat: RowMatrix = ... // a RowMatrix
    +val mat: RDD[Vector] = ... // an RDD of Vectors
     
     // Compute column summary statistics.
    -val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
    +val summary: MultivariateStatisticalSummary = Statistics.colStats(mat)
     println(summary.mean) // a dense vector containing the mean value for each column
     println(summary.variance) // column-wise variance
     println(summary.numNonzeros) // number of nonzeros in each column
     
    -// Compute the covariance matrix.
    -val cov: Matrix = mat.computeCovariance()
     {% endhighlight %}
     </div>
     
     <div data-lang="java" markdown="1">
     
    -[`RowMatrix#computeColumnSummaryStatistics`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#computeColumnSummaryStatistics()) returns an instance of
    +[`colStats()`](api/java/org/apache/spark/mllib/stat/Statistics.html) returns an instance of
     [`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
     which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
     total count.
     
     {% highlight java %}
    -import org.apache.spark.mllib.linalg.Matrix;
    -import org.apache.spark.mllib.linalg.distributed.RowMatrix;
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.Vector;
     import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
     
    -RowMatrix mat = ... // a RowMatrix
    +JavaRDD<Vector> mat = ... // an RDD of Vectors
     
     // Compute column summary statistics.
    -MultivariateStatisticalSummary summary = mat.computeColumnSummaryStatistics();
    +MultivariateStatisticalSummary summary = Statistics.colStats(mat.rdd());
     System.out.println(summary.mean()); // a dense vector containing the mean value for each column
     System.out.println(summary.variance()); // column-wise variance
     System.out.println(summary.numNonzeros()); // number of nonzeros in each column
     
    -// Compute the covariance matrix.
    -Matrix cov = mat.computeCovariance();
     {% endhighlight %}
     </div>
    +
    +<div data-lang="python" markdown="1">
    +[`colStats()`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of
    +[`MultivariateStatisticalSummary`](api/python/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary$),
    +which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
    +total count.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +mat = ... # an RDD of Vectors
    +
    +# Compute column summary statistics.
    +summary = Statistics.colStats(mat)
    +print summary.mean()
    +print summary.variance()
    +print summary.numNonzeros()
    +
    +{% endhighlight %}
     </div>
     
    +</div>
    +
    +
    +## Hypothesis Testing
    +
    +Hypothesis testing is a powerful tool in statistics for determining whether a result is
    +statistically significant, i.e., unlikely to have occurred by chance. MLlib currently supports
    +Pearson's chi-squared ( $\chi^2$) tests for goodness of fit and independence. The input data type
    +determines whether the goodness-of-fit or the independence test is conducted: the goodness-of-fit
    +test requires an input of type `Vector`, whereas the independence test requires a `Matrix` as input.
    +
    +MLlib also supports the input type `RDD[LabeledPoint]` to enable feature selection via chi-squared 
    +independence tests.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +run Pearson's chi-squared tests. The following example demonstrates how to run and interpret 
    +hypothesis tests.
     
    -## Hypothesis Testing 
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics._
    +
    +val sc: SparkContext = ...
    +
    +val vec: Vector = ... // a vector composed of the frequencies of events
    +
    +// compute the goodness of fit. If a second vector to test against is not supplied as a parameter, 
    +// the test runs against a uniform distribution.  
    +val goodnessOfFitTestResult = Statistics.chiSqTest(vec)
    +println(goodnessOfFitTestResult) // summary of the test including the p-value, degrees of freedom, 
    +                                 // test statistic, the method used, and the null hypothesis.
    +
    +val mat: Matrix = ... // a contingency matrix
    +
    +// conduct Pearson's independence test on the input contingency matrix
    +val independenceTestResult = Statistics.chiSqTest(mat) 
    +println(independenceTestResult) // summary of the test including the p-value, degrees of freedom...
    +
    +val obs: RDD[LabeledPoint] = ... // (feature, label) pairs.
    +
    +// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +// the independence test. Returns an array containing the ChiSquaredTestResult for every feature 
    +// against the label.
    +val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
    +var i = 1
    +featureTestResults.foreach{ result =>
    +    println(s"Column $i: \n$result")
    +    i += 1
    +} // summary of the test 
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +run Pearson's chi-squared tests. The following example demonstrates how to run and interpret 
    +hypothesis tests.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.regression.LabeledPoint;
    +import org.apache.spark.mllib.stat.Statistics;
    +import org.apache.spark.mllib.stat.test.ChiSqTestResult;
    +
    +JavaSparkContext jsc = ...
    +
    +Vector vec = ... // a vector composed of the frequencies of events
    +
    +// compute the goodness of fit. If a second vector to test against is not supplied as a parameter, 
    +// the test runs against a uniform distribution.  
    +ChiSqTestResult goodnessOfFitTestResult = Statistics.chiSqTest(vec);
    +// summary of the test including the p-value, degrees of freedom, test statistic, the method used, 
    +// and the null hypothesis.
    +System.out.println(goodnessOfFitTestResult);
    +
    +Matrix mat = ... // a contingency matrix
    +
    +// conduct Pearson's independence test on the input contingency matrix
    +ChiSqTestResult independenceTestResult = Statistics.chiSqTest(mat);
    +// summary of the test including the p-value, degrees of freedom...
    +System.out.println(independenceTestResult);
    +
    +JavaRDD<LabeledPoint> obs = ... // (feature, label) pairs.
    +
    +// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +// the independence test. Returns an array containing the ChiSquaredTestResult for every feature 
    +// against the label.
    +ChiSqTestResult[] featureTestResults = Statistics.chiSqTest(obs.rdd());
    +int i = 1;
    +for (ChiSqTestResult individualSummary:featureTestResults){
    +    System.out.println("Column " + i++ + ":");
    --- End diff --
    
    put `i++` outside the function and use `i` here
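    
    One way the loop might read after that change, reusing `obs` and the imports from the example
    above (a sketch for illustration only, not the final wording of the docs):
    
        ChiSqTestResult[] featureTestResults = Statistics.chiSqTest(obs.rdd());
        int i = 1;
        for (ChiSqTestResult result : featureTestResults) {
            // increment i in the loop body rather than inside the println call
            System.out.println("Column " + i + ":");
            System.out.println(result);
            i++;
        }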



[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695713
  
    --- Diff: docs/mllib-stats.md ---
    +
    +// Get an exact sample from each stratum
    +JavaPairRDD<K,V> sample = data.sampleByKeyExact(false, fractions);
    +
    +{% endhighlight %}
    +</div>
    +
    --- End diff --
    
    We have `sampleByKey` implemented in pyspark. Do you mind adding it? Thanks!
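    
    For reference, such a tab could look roughly like the sketch below, assuming pyspark's
    `sampleByKey(withReplacement, fractions, seed=None)` and mirroring the variable names in the
    Scala and Java examples above:
    
        sc = ...  # SparkContext
    
        data = ...  # an RDD of any key-value pairs
        fractions = ...  # specify the desired fraction for each key, e.g. {"a": 0.5, "b": 0.1}
    
        # Get an approximate sample from each stratum; sampleByKey uses per-stratum
        # simple random sampling and does not guarantee the exact sample size
        approxSample = data.sampleByKey(False, fractions)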



[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz closed the pull request at:

    https://github.com/apache/spark/pull/2123



[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695706
  
    --- Diff: docs/mllib-stats.md ---
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.rdd.PairRDDFunctions
    --- End diff --
    
    `import org.apache.spark.SparkContext._`, which provides implicit conversion from `RDD` to `PairRDDFunctions`
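    
    With that change the example could be trimmed to something like the following sketch;
    `SparkContext._` brings the implicit conversion into scope, so the explicit `PairRDDFunctions`
    import is no longer needed:
    
        import org.apache.spark.SparkContext
        import org.apache.spark.SparkContext._  // implicit conversion from RDD[(K, V)] to PairRDDFunctions
    
        val sc: SparkContext = ...
    
        val data = ... // an RDD[(K, V)] of any key-value pairs
        val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
    
        // Get an exact sample from each stratum
        val sample = data.sampleByKeyExact(withReplacement = false, fractions)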



[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz closed the pull request at:

    https://github.com/apache/spark/pull/2123



[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695731
  
    --- Diff: docs/mllib-stats.md ---
    +Matrix mat = ... // a contingency matrix
    +
    +// conduct Pearson's independence test on the input contingency matrix
    +ChiSqTestResult independenceTestResult = Statistics.chiSqTest(mat);
    +// summary of the test including the p-value, degrees of freedom...
    +System.out.println(independenceTestResult);
    +
    +JavaRDD<LabeledPoint> obs = ... // (feature, label) pairs.
    --- End diff --
    
    `(feature, label) pairs` may confuse readers. We can simply say `an RDD of labeled points`
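    
    For illustration, an RDD of labeled points might be built like this (a sketch; assumes the
    imports from the Java example above, e.g. `org.apache.spark.mllib.linalg.*`, plus
    `java.util.Arrays`):
    
        JavaRDD<LabeledPoint> obs = jsc.parallelize(Arrays.asList(
            new LabeledPoint(1.0, Vectors.dense(new double[]{1.0, 0.0, 3.0})),
            new LabeledPoint(0.0, Vectors.dense(new double[]{2.0, 1.0, 1.0}))));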



[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695721
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    +methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    +
    +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +Support for `RowMatrix` operations in python currently don't exist, but will be added in future 
    +releases.  
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +seriesX = ... # a series
    +seriesY = ... # must have the same number of partitions and cardinality as seriesX
    +
    +# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +# method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(seriesX, seriesY, method="pearson")
    +
    +data = ... # an RDD of Vectors
    +# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +# If a method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(data, method="pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
    +
    +## Stratified Sampling
    +
    +Unlike the other statistics functions, which reside in MLlib, the stratified sampling methods, 
    +`sampleByKey` and `sampleByKeyExact`, can be accessed in `PairRDDFunctions` in core, as stratified 
    +sampling is tightly coupled with the PairRDD data type, and the function signature conforms to the 
    +other *ByKey* methods in `PairRDDFunctions`. A separate method for exact sample size support exists 
    +because it requires significantly more resources than the per-stratum simple random sampling used in 
    +`sampleByKey`.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.rdd.PairRDDFunctions
    +
    +val sc: SparkContext = ...
    +
    +val data = ... // an RDD[(K,V)] of any key value pairs
    +val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +val sample = data.sampleByKeyExact(withReplacement=false, fractions)
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaPairRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaPairRDD<K,V> data = ... // an RDD of any key value pairs
    +java.util.Map<K,Object> fractions = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +JavaPairRDD<K,V> sample = data.sampleByKeyExact(false, fractions);
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
     
     ## Summary Statistics 
     
     ### Multivariate summary statistics
     
    -We provide column summary statistics for `RowMatrix` (note: this functionality is not currently supported in `IndexedRowMatrix` or `CoordinateMatrix`). 
    -If the number of columns is not large, e.g., on the order of thousands, then the 
    -covariance matrix can also be computed as a local matrix, which requires $\mathcal{O}(n^2)$ storage where $n$ is the
    -number of columns. The total CPU time is $\mathcal{O}(m n^2)$, where $m$ is the number of rows,
    -and is faster if the rows are sparse.
    +We provide column summary statistics for `RDD[Vector]` through the static function `colStats` 
    +available in `Statistics`.
     
     <div class="codetabs">
     <div data-lang="scala" markdown="1">
     
    -[`computeColumnSummaryStatistics()`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) returns an instance of
    +[`colStats()`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of
     [`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
     which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
     total count.
     
     {% highlight scala %}
    -import org.apache.spark.mllib.linalg.Matrix
    -import org.apache.spark.mllib.linalg.distributed.RowMatrix
    -import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
     
    -val mat: RowMatrix = ... // a RowMatrix
    +val mat: RDD[Vector] = ... // an RDD of Vectors
     
     // Compute column summary statistics.
    -val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
    +val summary: MultivariateStatisticalSummary = Statistics.colStats(mat)
     println(summary.mean) // a dense vector containing the mean value for each column
     println(summary.variance) // column-wise variance
     println(summary.numNonzeros) // number of nonzeros in each column
     
    -// Compute the covariance matrix.
    -val cov: Matrix = mat.computeCovariance()
     {% endhighlight %}
     </div>
     
     <div data-lang="java" markdown="1">
     
    -[`RowMatrix#computeColumnSummaryStatistics`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#computeColumnSummaryStatistics()) returns an instance of
    +[`colStats()`](api/java/org/apache/spark/mllib/stat/Statistics.html) returns an instance of
     [`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
     which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
     total count.
     
     {% highlight java %}
    -import org.apache.spark.mllib.linalg.Matrix;
    -import org.apache.spark.mllib.linalg.distributed.RowMatrix;
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.Vector;
     import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
     
    -RowMatrix mat = ... // a RowMatrix
    +JavaRDD<Vector> mat = ... // an RDD of Vectors
     
     // Compute column summary statistics.
    -MultivariateStatisticalSummary summary = mat.computeColumnSummaryStatistics();
    +MultivariateStatisticalSummary summary = Statistics.colStats(mat.rdd());
     System.out.println(summary.mean()); // a dense vector containing the mean value for each column
     System.out.println(summary.variance()); // column-wise variance
     System.out.println(summary.numNonzeros()); // number of nonzeros in each column
     
    -// Compute the covariance matrix.
    -Matrix cov = mat.computeCovariance();
     {% endhighlight %}
     </div>
    +
    +<div data-lang="python" markdown="1">
    +[`colStats()`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of
    +[`MultivariateStatisticalSummary`](api/python/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary$),
    +which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
    +total count.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +mat = ... # an RDD of Vectors
    +
    +# Compute column summary statistics.
    +summary = Statistics.colStats(mat)
    +print summary.mean()
    +print summary.variance()
    +print summary.numNonzeros()
    +
    +{% endhighlight %}
     </div>
     
    +</div>
    +
    +
    +## Hypothesis Testing
    +
    +Hypothesis testing is a power tool in statistics to determine whether a result is statistically 
    --- End diff --
    
    `power` -> `powerful`
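
    For reference, a minimal, self-contained Scala sketch of the chi-squared calls described in the
    excerpt above. The numbers are made up purely for illustration and are not part of the patch;
    both calls are local computations, so no SparkContext is required:

        import org.apache.spark.mllib.linalg.{Matrices, Vectors}
        import org.apache.spark.mllib.stat.Statistics

        // Goodness of fit: observed event frequencies, tested against a uniform distribution
        // because no expected distribution is supplied.
        val observed = Vectors.dense(0.3, 0.35, 0.25, 0.1)
        val goodnessOfFitTestResult = Statistics.chiSqTest(observed)
        println(goodnessOfFitTestResult) // p-value, degrees of freedom, statistic, method, null hypothesis

        // Independence test on a 3x2 contingency matrix (values are column-major).
        val contingency = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
        val independenceTestResult = Statistics.chiSqTest(contingency)
        println(independenceTestResult)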




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695700
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    +methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    +
    +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +Support for `RowMatrix` operations in python currently don't exist, but will be added in future 
    --- End diff --
    
    We can remove this sentence.
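
    Relatedly, the `Statistics.colStats` summary-statistics entry point added by the same patch
    (quoted earlier in the thread) can be exercised with a minimal Scala sketch; the toy rows, app
    name, and local master below are illustrative assumptions only, not part of the patch:

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.mllib.linalg.Vectors
        import org.apache.spark.mllib.stat.Statistics

        val sc = new SparkContext(new SparkConf().setAppName("colstats-sketch").setMaster("local"))

        // Three observations with three columns each (toy numbers).
        val rows = sc.parallelize(Seq(
          Vectors.dense(1.0, 10.0, 100.0),
          Vectors.dense(2.0, 20.0, 200.0),
          Vectors.dense(3.0, 30.0, 300.0)))

        val summary = Statistics.colStats(rows)
        println(summary.mean)        // column-wise means
        println(summary.variance)    // column-wise variances
        println(summary.numNonzeros) // per-column count of non-zero entries
        println(summary.count)       // total number of rows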




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2123#issuecomment-53381105
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19200/consoleFull) for   PR 2123 at commit [`213fe3f`](https://github.com/apache/spark/commit/213fe3f31f708ff0ee56d56e36644b51c0bba56e).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  shift # Ignore main class (org.apache.spark.deploy.SparkSubmit) and use our own`
      * `    $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main "$`
      * `    $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main "$`
      * `case class SparkListenerTaskStart(stageId: Int, stageAttemptId: Int, taskInfo: TaskInfo)`
      * `In multiclass classification, all `$2^`
      * `public final class JavaDecisionTree `
      * `class KMeansModel (val clusterCenters: Array[Vector]) extends Serializable `
      * `class BoundedFloat(float):`
      * `class JoinedRow2 extends Row `
      * `class JoinedRow3 extends Row `
      * `class JoinedRow4 extends Row `
      * `class JoinedRow5 extends Row `
      * `class GenericRow(protected[sql] val values: Array[Any]) extends Row `
      * `abstract class MutableValue extends Serializable `
      * `final class MutableInt extends MutableValue `
      * `final class MutableFloat extends MutableValue `
      * `final class MutableBoolean extends MutableValue `
      * `final class MutableDouble extends MutableValue `
      * `final class MutableShort extends MutableValue `
      * `final class MutableLong extends MutableValue `
      * `final class MutableByte extends MutableValue `
      * `final class MutableAny extends MutableValue `
      * `final class SpecificMutableRow(val values: Array[MutableValue]) extends MutableRow `
      * `case class CountDistinct(expressions: Seq[Expression]) extends PartialAggregate `
      * `case class CollectHashSet(expressions: Seq[Expression]) extends AggregateExpression `
      * `case class CollectHashSetFunction(`
      * `case class CombineSetsAndCount(inputSet: Expression) extends AggregateExpression `
      * `case class CombineSetsAndCountFunction(`
      * `case class CountDistinctFunction(`
      * `case class MaxOf(left: Expression, right: Expression) extends Expression `
      * `case class NewSet(elementType: DataType) extends LeafExpression `
      * `case class AddItemToSet(item: Expression, set: Expression) extends Expression `
      * `case class CombineSets(left: Expression, right: Expression) extends BinaryExpression `
      * `case class CountSet(child: Expression) extends UnaryExpression `
      * `case class ExplainCommand(plan: LogicalPlan, extended: Boolean = false) extends Command `





[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695712
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    +methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    +
    +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +Support for `RowMatrix` operations in python currently don't exist, but will be added in future 
    +releases.  
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +seriesX = ... # a series
    +seriesY = ... # must have the same number of partitions and cardinality as seriesX
    +
    +# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +# method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(seriesX, seriesY, method="pearson")
    +
    +data = ... # an RDD of Vectors
    +# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +# If a method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(data, method="pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
    +
    +## Stratified Sampling
    +
    +Unlike the other statistics functions, which reside in MLLib, stratified sampling methods, 
    +`sampleByKey` and `sampleByKeyExact`, can be accessed in `PairRDDFunctions` in core, as stratified 
    +sampling is tightly coupled with the PairRDD data type, and the function signature conforms to the 
    +other *ByKey* methods in PairRDDFunctions. A separate method for exact sample size support exists 
    +as it requires significant more resources than the per-stratum simple random sampling used in 
    +`sampleByKey`.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.rdd.PairRDDFunctions
    +
    +val sc: SparkContext = ...
    +
    +val data = ... // an RDD[(K,V)] of any key value pairs
    +val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +val sample = data.sampleByKeyExact(withReplacement=false, fractions)
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaPairRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaPairRDD<K,V> data = ... // an RDD of any key value pairs
    +java.util.Map<K,Object> fractions = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +JavaPairRDD<K,V> sample = data.sampleByKeyExact(false, fractions);
    --- End diff --
    
    ditto, space after `,`
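
    As a concrete illustration of the `sampleByKeyExact` usage quoted above, here is a small
    self-contained Scala sketch; the keys, fractions, app name, and local master are made up for
    illustration only:

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.SparkContext._ // pair-RDD implicits (needed on older Spark versions)

        val sc = new SparkContext(new SparkConf().setAppName("stratified-sampling-sketch").setMaster("local"))

        // A tiny keyed dataset with two strata, "a" and "b".
        val data = sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5), ("b", 6)))

        // Desired sampling fraction per stratum.
        val fractions = Map("a" -> 0.5, "b" -> 1.0)

        // Exact per-stratum sample sizes: ceil(f_k * n_k) items for each key k,
        // sampling without replacement (first argument).
        val exactSample = data.sampleByKeyExact(false, fractions)
        exactSample.collect().foreach(println)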




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695698
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    +methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    +
    +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    --- End diff --
    
    The Python link doesn't work on my computer. `api/python/pyspark.mllib.stat.Statistics-class.html` works. Could you verify the link on your computer?




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695722
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    +methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    +
    +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +Support for `RowMatrix` operations in python currently don't exist, but will be added in future 
    +releases.  
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +seriesX = ... # a series
    +seriesY = ... # must have the same number of partitions and cardinality as seriesX
    +
    +# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +# method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(seriesX, seriesY, method="pearson")
    +
    +data = ... # an RDD of Vectors
    +# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +# If a method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(data, method="pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
    +
    +## Stratified Sampling
    +
    +Unlike the other statistics functions, which reside in MLLib, stratified sampling methods, 
    +`sampleByKey` and `sampleByKeyExact`, can be accessed in `PairRDDFunctions` in core, as stratified 
    +sampling is tightly coupled with the PairRDD data type, and the function signature conforms to the 
    +other *ByKey* methods in PairRDDFunctions. A separate method for exact sample size support exists 
    +as it requires significant more resources than the per-stratum simple random sampling used in 
    +`sampleByKey`.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.rdd.PairRDDFunctions
    +
    +val sc: SparkContext = ...
    +
    +val data = ... // an RDD[(K,V)] of any key value pairs
    +val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +val sample = data.sampleByKeyExact(withReplacement=false, fractions)
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaPairRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaPairRDD<K,V> data = ... // an RDD of any key value pairs
    +java.util.Map<K,Object> fractions = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +JavaPairRDD<K,V> sample = data.sampleByKeyExact(false, fractions);
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
     
     ## Summary Statistics 
     
     ### Multivariate summary statistics
     
    -We provide column summary statistics for `RowMatrix` (note: this functionality is not currently supported in `IndexedRowMatrix` or `CoordinateMatrix`). 
    -If the number of columns is not large, e.g., on the order of thousands, then the 
    -covariance matrix can also be computed as a local matrix, which requires $\mathcal{O}(n^2)$ storage where $n$ is the
    -number of columns. The total CPU time is $\mathcal{O}(m n^2)$, where $m$ is the number of rows,
    -and is faster if the rows are sparse.
    +We provide column summary statistics for `RDD[Vector]` through the static function `colStats` 
    +available in `Statistics`.
     
     <div class="codetabs">
     <div data-lang="scala" markdown="1">
     
    -[`computeColumnSummaryStatistics()`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) returns an instance of
    +[`colStats()`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of
     [`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
     which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
     total count.
     
     {% highlight scala %}
    -import org.apache.spark.mllib.linalg.Matrix
    -import org.apache.spark.mllib.linalg.distributed.RowMatrix
    -import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
     
    -val mat: RowMatrix = ... // a RowMatrix
    +val mat: RDD[Vector] = ... // an RDD of Vectors
     
     // Compute column summary statistics.
    -val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
    +val summary: MultivariateStatisticalSummary = Statistics.colStats(mat)
     println(summary.mean) // a dense vector containing the mean value for each column
     println(summary.variance) // column-wise variance
     println(summary.numNonzeros) // number of nonzeros in each column
     
    -// Compute the covariance matrix.
    -val cov: Matrix = mat.computeCovariance()
     {% endhighlight %}
     </div>
     
     <div data-lang="java" markdown="1">
     
    -[`RowMatrix#computeColumnSummaryStatistics`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#computeColumnSummaryStatistics()) returns an instance of
    +[`colStats()`](api/java/org/apache/spark/mllib/stat/Statistics.html) returns an instance of
     [`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
     which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
     total count.
     
     {% highlight java %}
    -import org.apache.spark.mllib.linalg.Matrix;
    -import org.apache.spark.mllib.linalg.distributed.RowMatrix;
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.Vector;
     import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
     
    -RowMatrix mat = ... // a RowMatrix
    +JavaRDD<Vector> mat = ... // an RDD of Vectors
     
     // Compute column summary statistics.
    -MultivariateStatisticalSummary summary = mat.computeColumnSummaryStatistics();
    +MultivariateStatisticalSummary summary = Statistics.colStats(mat.rdd());
     System.out.println(summary.mean()); // a dense vector containing the mean value for each column
     System.out.println(summary.variance()); // column-wise variance
     System.out.println(summary.numNonzeros()); // number of nonzeros in each column
     
    -// Compute the covariance matrix.
    -Matrix cov = mat.computeCovariance();
     {% endhighlight %}
     </div>
    +
    +<div data-lang="python" markdown="1">
    +[`colStats()`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of
    +[`MultivariateStatisticalSummary`](api/python/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary$),
    +which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
    +total count.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +mat = ... # an RDD of Vectors
    +
    +# Compute column summary statistics.
    +summary = Statistics.colStats(mat)
    +print summary.mean()
    +print summary.variance()
    +print summary.numNonzeros()
    +
    +{% endhighlight %}
     </div>
     
    +</div>
    +
    +
    +## Hypothesis Testing
    +
    +Hypothesis testing is a power tool in statistics to determine whether a result is statistically 
    +significant, whether this result occurred by chance or not. MLlib currently supports Pearson's 
    +chi-squared ( $\chi^2$) tests for goodness of fit and independence. The input data types determine 
    +whether the goodness of fit or the independence test is conducted. The goodness of fit test requires 
    +an input type of `Vector`, whereas the independence test requires a `Matrix` as input.
    +
    +MLlib also supports the input type `RDD[LabeledPoint]` to enable feature selection via chi-squared 
    +independence tests.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +run Pearson's chi-squared tests. The following example demonstrates how to run and interpret 
    +hypothesis tests.
     
    -## Hypothesis Testing 
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics._
    +
    +val sc: SparkContext = ...
    +
    +val vec: Vector = ... // a vector composed of the frequencies of events
    +
    +// compute the goodness of fit. If a second vector to test against is not supplied as a parameter, 
    +// the test runs against a uniform distribution.  
    +val goodnessOfFitTestResult = Statistics.chiSqTest(vec)
    +println(goodnessOfFitTestResult) // summary of the test including the p-value, degrees of freedom, 
    +                                 // test statistic, the method used, and the null hypothesis.
    +
    +val mat: Matrix = ... // a contingency matrix
    +
    +// conduct Pearson's independence test on the input contingency matrix
    +val independenceTestResult = Statistics.chiSqTest(mat) 
    +println(independenceTestResult) // summary of the test including the p-value, degrees of freedom...
    +
    +val obs: RDD[LabeledPoint] = ... // (feature, label) pairs.
    +
    +// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +// the independence test. Returns an array containing the ChiSquaredTestResult for every feature 
    +// against the label.
    +val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
    +var i: Integer = 1
    --- End diff --
    
    remove `Integer`
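
    For completeness, a small self-contained Scala sketch of the per-feature chi-squared loop
    discussed above, with the loop counter declared as a plain `var i = 1` as suggested; the toy
    (feature, label) data, app name, and local master are illustrative assumptions only:

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.mllib.linalg.Vectors
        import org.apache.spark.mllib.regression.LabeledPoint
        import org.apache.spark.mllib.stat.Statistics

        val sc = new SparkContext(new SparkConf().setAppName("chisq-feature-sketch").setMaster("local"))

        // Toy (feature, label) pairs: two categorical features and a binary label.
        val obs = sc.parallelize(Seq(
          LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
          LabeledPoint(1.0, Vectors.dense(0.0, 1.0)),
          LabeledPoint(1.0, Vectors.dense(0.0, 1.0)),
          LabeledPoint(0.0, Vectors.dense(1.0, 0.0))))

        // One ChiSqTestResult per feature, testing independence of each feature from the label.
        val featureTestResults = Statistics.chiSqTest(obs)
        var i = 1
        featureTestResults.foreach { result =>
          println(s"Column $i:\n$result")
          i += 1
        }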




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695708
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    +methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    +
    +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +Support for `RowMatrix` operations in python currently don't exist, but will be added in future 
    +releases.  
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +seriesX = ... # a series
    +seriesY = ... # must have the same number of partitions and cardinality as seriesX
    +
    +# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +# method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(seriesX, seriesY, method="pearson")
    +
    +data = ... # an RDD of Vectors
    +# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +# If a method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(data, method="pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
    +
    +## Stratified Sampling
    +
    +Unlike the other statistics functions, which reside in MLLib, stratified sampling methods, 
    +`sampleByKey` and `sampleByKeyExact`, can be accessed in `PairRDDFunctions` in core, as stratified 
    +sampling is tightly coupled with the PairRDD data type, and the function signature conforms to the 
    +other *ByKey* methods in PairRDDFunctions. A separate method for exact sample size support exists 
    +as it requires significant more resources than the per-stratum simple random sampling used in 
    +`sampleByKey`.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.rdd.PairRDDFunctions
    +
    +val sc: SparkContext = ...
    +
    +val data = ... // an RDD[(K,V)] of any key value pairs
    +val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +val sample = data.sampleByKeyExact(withReplacement=false, fractions)
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaPairRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaPairRDD<K,V> data = ... // an RDD of any key value pairs
    --- End diff --
    
    space after `,`




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695697
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    +methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    --- End diff --
    
    What's the difference between `srdd` and `rdd`? If they are the same, we should use `rdd()`
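
    For reference, a minimal self-contained Scala sketch of the `Statistics.corr` calls that the
    Java snippet above invokes; the series values, app name, and local master below are made up
    for illustration only:

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.mllib.linalg.Vectors
        import org.apache.spark.mllib.stat.Statistics

        val sc = new SparkContext(new SparkConf().setAppName("correlation-sketch").setMaster("local"))

        // Two series of the same length (and, created this way, the same number of partitions).
        val seriesX = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0, 5.0))
        val seriesY = sc.parallelize(Array(2.0, 4.0, 6.0, 8.0, 10.0))
        println(Statistics.corr(seriesX, seriesY, "pearson")) // 1.0 for perfectly linear series

        // Correlation matrix over an RDD[Vector]; each Vector is one observation (a row).
        val rows = sc.parallelize(Seq(
          Vectors.dense(1.0, 10.0, 100.0),
          Vectors.dense(2.0, 20.0, 200.0),
          Vectors.dense(5.0, 33.0, 366.0)))
        println(Statistics.corr(rows, "spearman"))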




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695726
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    +methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    +
    +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +Support for `RowMatrix` operations in Python does not currently exist, but will be added in future
    +releases.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +seriesX = ... # a series
    +seriesY = ... # must have the same number of partitions and cardinality as seriesX
    +
    +# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +# method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(seriesX, seriesY, method="pearson")
    +
    +data = ... # an RDD of Vectors
    +# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +# If a method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(data, method="pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
    +
    +## Stratified Sampling
    +
    +Unlike the other statistics functions, which reside in MLlib, the stratified sampling methods
    +`sampleByKey` and `sampleByKeyExact` can be accessed through `PairRDDFunctions` in core, as stratified
    +sampling is tightly coupled with the pair RDD data type and their signatures conform to the
    +other *ByKey* methods in `PairRDDFunctions`. A separate method for exact sample size support exists
    +because it requires significantly more resources than the per-stratum simple random sampling used in
    +`sampleByKey`.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.rdd.PairRDDFunctions
    +
    +val sc: SparkContext = ...
    +
    +val data = ... // an RDD[(K,V)] of any key value pairs
    +val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +val sample = data.sampleByKeyExact(withReplacement=false, fractions)
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaPairRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaPairRDD<K,V> data = ... // an RDD of any key value pairs
    +java.util.Map<K,Object> fractions = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +JavaPairRDD<K,V> sample = data.sampleByKeyExact(false, fractions);
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
     
     ## Summary Statistics 
     
     ### Multivariate summary statistics
     
    -We provide column summary statistics for `RowMatrix` (note: this functionality is not currently supported in `IndexedRowMatrix` or `CoordinateMatrix`). 
    -If the number of columns is not large, e.g., on the order of thousands, then the 
    -covariance matrix can also be computed as a local matrix, which requires $\mathcal{O}(n^2)$ storage where $n$ is the
    -number of columns. The total CPU time is $\mathcal{O}(m n^2)$, where $m$ is the number of rows,
    -and is faster if the rows are sparse.
    +We provide column summary statistics for `RDD[Vector]` through the static function `colStats` 
    +available in `Statistics`.
     
     <div class="codetabs">
     <div data-lang="scala" markdown="1">
     
    -[`computeColumnSummaryStatistics()`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) returns an instance of
    +[`colStats()`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of
     [`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
     which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
     total count.
     
     {% highlight scala %}
    -import org.apache.spark.mllib.linalg.Matrix
    -import org.apache.spark.mllib.linalg.distributed.RowMatrix
    -import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
     
    -val mat: RowMatrix = ... // a RowMatrix
    +val mat: RDD[Vector] = ... // an RDD of Vectors
     
     // Compute column summary statistics.
    -val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
    +val summary: MultivariateStatisticalSummary = Statistics.colStats(mat)
     println(summary.mean) // a dense vector containing the mean value for each column
     println(summary.variance) // column-wise variance
     println(summary.numNonzeros) // number of nonzeros in each column
     
    -// Compute the covariance matrix.
    -val cov: Matrix = mat.computeCovariance()
     {% endhighlight %}
     </div>
     
     <div data-lang="java" markdown="1">
     
    -[`RowMatrix#computeColumnSummaryStatistics`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#computeColumnSummaryStatistics()) returns an instance of
    +[`colStats()`](api/java/org/apache/spark/mllib/stat/Statistics.html) returns an instance of
     [`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
     which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
     total count.
     
     {% highlight java %}
    -import org.apache.spark.mllib.linalg.Matrix;
    -import org.apache.spark.mllib.linalg.distributed.RowMatrix;
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.Vector;
     import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
     
    -RowMatrix mat = ... // a RowMatrix
    +JavaRDD<Vector> mat = ... // an RDD of Vectors
     
     // Compute column summary statistics.
    -MultivariateStatisticalSummary summary = mat.computeColumnSummaryStatistics();
    +MultivariateStatisticalSummary summary = Statistics.colStats(mat.rdd());
     System.out.println(summary.mean()); // a dense vector containing the mean value for each column
     System.out.println(summary.variance()); // column-wise variance
     System.out.println(summary.numNonzeros()); // number of nonzeros in each column
     
    -// Compute the covariance matrix.
    -Matrix cov = mat.computeCovariance();
     {% endhighlight %}
     </div>
    +
    +<div data-lang="python" markdown="1">
    +[`colStats()`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of
    +[`MultivariateStatisticalSummary`](api/python/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary$),
    +which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
    +total count.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +mat = ... # an RDD of Vectors
    +
    +# Compute column summary statistics.
    +summary = Statistics.colStats(mat)
    +print summary.mean()
    +print summary.variance()
    +print summary.numNonzeros()
    +
    +{% endhighlight %}
     </div>
     
    +</div>
    +
    +
    +## Hypothesis Testing
    +
    +Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically
    +significant, i.e., whether or not it occurred by chance. MLlib currently supports Pearson's
    +chi-squared ( $\chi^2$) tests for goodness of fit and independence. The input data types determine
    +whether the goodness of fit or the independence test is conducted. The goodness of fit test requires
    +an input type of `Vector`, whereas the independence test requires a `Matrix` as input.
    +
    +MLlib also supports the input type `RDD[LabeledPoint]` to enable feature selection via chi-squared 
    +independence tests.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +run Pearson's chi-squared tests. The following example demonstrates how to run and interpret 
    +hypothesis tests.
     
    -## Hypothesis Testing 
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics._
    +
    +val sc: SparkContext = ...
    +
    +val vec: Vector = ... // a vector composed of the frequencies of events
    +
    +// compute the goodness of fit. If a second vector to test against is not supplied as a parameter, 
    +// the test runs against a uniform distribution.  
    +val goodnessOfFitTestResult = Statistics.chiSqTest(vec)
    +println(goodnessOfFitTestResult) // summary of the test including the p-value, degrees of freedom, 
    +                                 // test statistic, the method used, and the null hypothesis.
    +
    +val mat: Matrix = ... // a contingency matrix
    +
    +// conduct Pearson's independence test on the input contingency matrix
    +val independenceTestResult = Statistics.chiSqTest(mat) 
    +println(independenceTestResult) // summary of the test including the p-value, degrees of freedom...
    +
    +val obs: RDD[LabeledPoint] = ... // (feature, label) pairs.
    +
    +// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +// the independence test. Returns an array containing the ChiSquaredTestResult for every feature 
    +// against the label.
    +val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
    +var i: Integer = 1
    +featureTestResults.foreach{ result =>
    +    println(s"Column $i: \n$result")
    --- End diff --
    
    remove space after `:`




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695724
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    +methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    +
    +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +Support for `RowMatrix` operations in Python does not currently exist, but will be added in future
    +releases.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +seriesX = ... # a series
    +seriesY = ... # must have the same number of partitions and cardinality as seriesX
    +
    +# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +# method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(seriesX, seriesY, method="pearson")
    +
    +data = ... # an RDD of Vectors
    +# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +# If a method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(data, method="pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
    +
    +## Stratified Sampling
    +
    +Unlike the other statistics functions, which reside in MLlib, the stratified sampling methods
    +`sampleByKey` and `sampleByKeyExact` can be accessed through `PairRDDFunctions` in core, as stratified
    +sampling is tightly coupled with the pair RDD data type and their signatures conform to the
    +other *ByKey* methods in `PairRDDFunctions`. A separate method for exact sample size support exists
    +because it requires significantly more resources than the per-stratum simple random sampling used in
    +`sampleByKey`.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.rdd.PairRDDFunctions
    +
    +val sc: SparkContext = ...
    +
    +val data = ... // an RDD[(K,V)] of any key value pairs
    +val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +val sample = data.sampleByKeyExact(withReplacement=false, fractions)
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaPairRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaPairRDD<K,V> data = ... // an RDD of any key value pairs
    +java.util.Map<K,Object> fractions = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +JavaPairRDD<K,V> sample = data.sampleByKeyExact(false, fractions);
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
     
     ## Summary Statistics 
     
     ### Multivariate summary statistics
     
    -We provide column summary statistics for `RowMatrix` (note: this functionality is not currently supported in `IndexedRowMatrix` or `CoordinateMatrix`). 
    -If the number of columns is not large, e.g., on the order of thousands, then the 
    -covariance matrix can also be computed as a local matrix, which requires $\mathcal{O}(n^2)$ storage where $n$ is the
    -number of columns. The total CPU time is $\mathcal{O}(m n^2)$, where $m$ is the number of rows,
    -and is faster if the rows are sparse.
    +We provide column summary statistics for `RDD[Vector]` through the static function `colStats` 
    +available in `Statistics`.
     
     <div class="codetabs">
     <div data-lang="scala" markdown="1">
     
    -[`computeColumnSummaryStatistics()`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) returns an instance of
    +[`colStats()`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of
     [`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
     which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
     total count.
     
     {% highlight scala %}
    -import org.apache.spark.mllib.linalg.Matrix
    -import org.apache.spark.mllib.linalg.distributed.RowMatrix
    -import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
     
    -val mat: RowMatrix = ... // a RowMatrix
    +val mat: RDD[Vector] = ... // an RDD of Vectors
     
     // Compute column summary statistics.
    -val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
    +val summary: MultivariateStatisticalSummary = Statistics.colStats(mat)
     println(summary.mean) // a dense vector containing the mean value for each column
     println(summary.variance) // column-wise variance
     println(summary.numNonzeros) // number of nonzeros in each column
     
    -// Compute the covariance matrix.
    -val cov: Matrix = mat.computeCovariance()
     {% endhighlight %}
     </div>
     
     <div data-lang="java" markdown="1">
     
    -[`RowMatrix#computeColumnSummaryStatistics`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#computeColumnSummaryStatistics()) returns an instance of
    +[`colStats()`](api/java/org/apache/spark/mllib/stat/Statistics.html) returns an instance of
     [`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
     which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
     total count.
     
     {% highlight java %}
    -import org.apache.spark.mllib.linalg.Matrix;
    -import org.apache.spark.mllib.linalg.distributed.RowMatrix;
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.Vector;
     import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
     
    -RowMatrix mat = ... // a RowMatrix
    +JavaRDD<Vector> mat = ... // an RDD of Vectors
     
     // Compute column summary statistics.
    -MultivariateStatisticalSummary summary = mat.computeColumnSummaryStatistics();
    +MultivariateStatisticalSummary summary = Statistics.colStats(mat.rdd());
     System.out.println(summary.mean()); // a dense vector containing the mean value for each column
     System.out.println(summary.variance()); // column-wise variance
     System.out.println(summary.numNonzeros()); // number of nonzeros in each column
     
    -// Compute the covariance matrix.
    -Matrix cov = mat.computeCovariance();
     {% endhighlight %}
     </div>
    +
    +<div data-lang="python" markdown="1">
    +[`colStats()`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of
    +[`MultivariateStatisticalSummary`](api/python/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary$),
    +which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
    +total count.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +mat = ... # an RDD of Vectors
    +
    +# Compute column summary statistics.
    +summary = Statistics.colStats(mat)
    +print summary.mean()
    +print summary.variance()
    +print summary.numNonzeros()
    +
    +{% endhighlight %}
     </div>
     
    +</div>
    +
    +
    +## Hypothesis Testing
    +
    +Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically
    +significant, i.e., whether or not it occurred by chance. MLlib currently supports Pearson's
    +chi-squared ( $\chi^2$) tests for goodness of fit and independence. The input data types determine
    +whether the goodness of fit or the independence test is conducted. The goodness of fit test requires
    +an input type of `Vector`, whereas the independence test requires a `Matrix` as input.
    +
    +MLlib also supports the input type `RDD[LabeledPoint]` to enable feature selection via chi-squared 
    +independence tests.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +run Pearson's chi-squared tests. The following example demonstrates how to run and interpret 
    +hypothesis tests.
     
    -## Hypothesis Testing 
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics._
    +
    +val sc: SparkContext = ...
    +
    +val vec: Vector = ... // a vector composed of the frequencies of events
    +
    +// compute the goodness of fit. If a second vector to test against is not supplied as a parameter, 
    +// the test runs against a uniform distribution.  
    +val goodnessOfFitTestResult = Statistics.chiSqTest(vec)
    +println(goodnessOfFitTestResult) // summary of the test including the p-value, degrees of freedom, 
    +                                 // test statistic, the method used, and the null hypothesis.
    +
    +val mat: Matrix = ... // a contingency matrix
    +
    +// conduct Pearson's independence test on the input contingency matrix
    +val independenceTestResult = Statistics.chiSqTest(mat) 
    +println(independenceTestResult) // summary of the test including the p-value, degrees of freedom...
    +
    +val obs: RDD[LabeledPoint] = ... // (feature, label) pairs.
    +
    +// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +// the independence test. Returns an array containing the ChiSquaredTestResult for every feature 
    +// against the label.
    +val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
    +var i: Integer = 1
    +featureTestResults.foreach{ result =>
    --- End diff --
    
    space after `foreach`




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695715
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    +methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    +
    +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +Support for `RowMatrix` operations in Python does not currently exist, but will be added in future
    +releases.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +seriesX = ... # a series
    +seriesY = ... # must have the same number of partitions and cardinality as seriesX
    +
    +# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +# method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(seriesX, seriesY, method="pearson")
    +
    +data = ... # an RDD of Vectors
    +# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +# If a method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(data, method="pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
    +
    +## Stratified Sampling
    +
    +Unlike the other statistics functions, which reside in MLlib, the stratified sampling methods
    +`sampleByKey` and `sampleByKeyExact` can be accessed through `PairRDDFunctions` in core, as stratified
    +sampling is tightly coupled with the pair RDD data type and their signatures conform to the
    +other *ByKey* methods in `PairRDDFunctions`. A separate method for exact sample size support exists
    +because it requires significantly more resources than the per-stratum simple random sampling used in
    +`sampleByKey`.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.rdd.PairRDDFunctions
    +
    +val sc: SparkContext = ...
    +
    +val data = ... // an RDD[(K,V)] of any key value pairs
    +val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +val sample = data.sampleByKeyExact(withReplacement=false, fractions)
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaPairRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaPairRDD<K,V> data = ... // an RDD of any key value pairs
    +java.util.Map<K,Object> fractions = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +JavaPairRDD<K,V> sample = data.sampleByKeyExact(false, fractions);
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
     
     ## Summary Statistics 
     
     ### Multivariate summary statistics
     
    -We provide column summary statistics for `RowMatrix` (note: this functionality is not currently supported in `IndexedRowMatrix` or `CoordinateMatrix`). 
    -If the number of columns is not large, e.g., on the order of thousands, then the 
    -covariance matrix can also be computed as a local matrix, which requires $\mathcal{O}(n^2)$ storage where $n$ is the
    -number of columns. The total CPU time is $\mathcal{O}(m n^2)$, where $m$ is the number of rows,
    -and is faster if the rows are sparse.
    +We provide column summary statistics for `RDD[Vector]` through the static function `colStats` 
    --- End diff --
    
    remove `static`
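
    (In Scala, `colStats` is a method on the `Statistics` singleton object rather than a static
    function; it only reads as a static call from the Java side. A minimal sketch of that Java
    usage, reusing the names from the diff above:)

        import org.apache.spark.api.java.JavaRDD;
        import org.apache.spark.mllib.linalg.Vector;
        import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
        import org.apache.spark.mllib.stat.Statistics;

        JavaRDD<Vector> mat = ... // an RDD of Vectors

        // Looks like a static call from Java, but resolves to a method on the Statistics object.
        MultivariateStatisticalSummary summary = Statistics.colStats(mat.rdd());
        System.out.println(summary.mean());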




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2123#issuecomment-53375426
  
    @brkyvz The Python links do not work on my computer. It may be due to my pydoc version. Could you double check the links on your computer?




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695832
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    +methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    --- End diff --
    
    `srdd` uses `scala.Double` whereas `rdd` uses `java.lang.Double`. Using `.rdd()` throws a compilation error.
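
    For reference, a rough Java sketch of the difference (assuming the Spark 1.x Java API; the data
    values below are only illustrative, the doc example keeps its `...` placeholders):

        import java.util.Arrays;

        import org.apache.spark.api.java.JavaDoubleRDD;
        import org.apache.spark.api.java.JavaSparkContext;
        import org.apache.spark.mllib.stat.Statistics;

        JavaSparkContext jsc = ... // as in the examples in the diff

        JavaDoubleRDD seriesX = jsc.parallelizeDoubles(Arrays.asList(1.0, 2.0, 3.0));
        JavaDoubleRDD seriesY = jsc.parallelizeDoubles(Arrays.asList(2.0, 4.0, 6.0));

        // srdd() exposes the underlying RDD of Scala primitive doubles, which is what
        // Statistics.corr expects for the two-series overload, so this compiles:
        Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");

        // rdd() yields an RDD of boxed java.lang.Double instead, so the line below
        // would not compile:
        // Double broken = Statistics.corr(seriesX.rdd(), seriesY.rdd(), "pearson");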




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695733
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    +methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    +
    +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +Support for `RowMatrix` operations in Python does not currently exist, but will be added in future
    +releases.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +seriesX = ... # a series
    +seriesY = ... # must have the same number of partitions and cardinality as seriesX
    +
    +# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +# method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(seriesX, seriesY, method="pearson")
    +
    +data = ... # an RDD of Vectors
    +# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +# If a method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(data, method="pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
    +
    +## Stratified Sampling
    +
    +Unlike the other statistics functions, which reside in MLlib, the stratified sampling methods
    +`sampleByKey` and `sampleByKeyExact` can be accessed through `PairRDDFunctions` in core, as stratified
    +sampling is tightly coupled with the pair RDD data type and their signatures conform to the
    +other *ByKey* methods in `PairRDDFunctions`. A separate method for exact sample size support exists
    +because it requires significantly more resources than the per-stratum simple random sampling used in
    +`sampleByKey`.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.rdd.PairRDDFunctions
    +
    +val sc: SparkContext = ...
    +
    +val data = ... // an RDD[(K,V)] of any key value pairs
    +val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +val sample = data.sampleByKeyExact(withReplacement=false, fractions)
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaPairRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaPairRDD<K,V> data = ... // an RDD of any key value pairs
    +java.util.Map<K,Object> fractions = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +JavaPairRDD<K,V> sample = data.sampleByKeyExact(false, fractions);
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
     
     ## Summary Statistics 
     
     ### Multivariate summary statistics
     
    -We provide column summary statistics for `RowMatrix` (note: this functionality is not currently supported in `IndexedRowMatrix` or `CoordinateMatrix`). 
    -If the number of columns is not large, e.g., on the order of thousands, then the 
    -covariance matrix can also be computed as a local matrix, which requires $\mathcal{O}(n^2)$ storage where $n$ is the
    -number of columns. The total CPU time is $\mathcal{O}(m n^2)$, where $m$ is the number of rows,
    -and is faster if the rows are sparse.
    +We provide column summary statistics for `RDD[Vector]` through the static function `colStats` 
    +available in `Statistics`.
     
     <div class="codetabs">
     <div data-lang="scala" markdown="1">
     
    -[`computeColumnSummaryStatistics()`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) returns an instance of
    +[`colStats()`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of
     [`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
     which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
     total count.
     
     {% highlight scala %}
    -import org.apache.spark.mllib.linalg.Matrix
    -import org.apache.spark.mllib.linalg.distributed.RowMatrix
    -import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
     
    -val mat: RowMatrix = ... // a RowMatrix
    +val mat: RDD[Vector] = ... // an RDD of Vectors
     
     // Compute column summary statistics.
    -val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
    +val summary: MultivariateStatisticalSummary = Statistics.colStats(mat)
     println(summary.mean) // a dense vector containing the mean value for each column
     println(summary.variance) // column-wise variance
     println(summary.numNonzeros) // number of nonzeros in each column
     
    -// Compute the covariance matrix.
    -val cov: Matrix = mat.computeCovariance()
     {% endhighlight %}
     </div>
     
     <div data-lang="java" markdown="1">
     
    -[`RowMatrix#computeColumnSummaryStatistics`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#computeColumnSummaryStatistics()) returns an instance of
    +[`colStats()`](api/java/org/apache/spark/mllib/stat/Statistics.html) returns an instance of
     [`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
     which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
     total count.
     
     {% highlight java %}
    -import org.apache.spark.mllib.linalg.Matrix;
    -import org.apache.spark.mllib.linalg.distributed.RowMatrix;
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.Vector;
     import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
     
    -RowMatrix mat = ... // a RowMatrix
    +JavaRDD<Vector> mat = ... // an RDD of Vectors
     
     // Compute column summary statistics.
    -MultivariateStatisticalSummary summary = mat.computeColumnSummaryStatistics();
    +MultivariateStatisticalSummary summary = Statistics.colStats(mat.rdd());
     System.out.println(summary.mean()); // a dense vector containing the mean value for each column
     System.out.println(summary.variance()); // column-wise variance
     System.out.println(summary.numNonzeros()); // number of nonzeros in each column
     
    -// Compute the covariance matrix.
    -Matrix cov = mat.computeCovariance();
     {% endhighlight %}
     </div>
    +
    +<div data-lang="python" markdown="1">
    +[`colStats()`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of
    +[`MultivariateStatisticalSummary`](api/python/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary$),
    +which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
    +total count.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +mat = ... # an RDD of Vectors
    +
    +# Compute column summary statistics.
    +summary = Statistics.colStats(mat)
    +print summary.mean()
    +print summary.variance()
    +print summary.numNonzeros()
    +
    +{% endhighlight %}
     </div>
     
    +</div>
    +
    +
    +## Hypothesis Testing
    +
    +Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically
    +significant, i.e., whether or not it occurred by chance. MLlib currently supports Pearson's
    +chi-squared ( $\chi^2$) tests for goodness of fit and independence. The input data types determine
    +whether the goodness of fit or the independence test is conducted. The goodness of fit test requires
    +an input type of `Vector`, whereas the independence test requires a `Matrix` as input.
    +
    +MLlib also supports the input type `RDD[LabeledPoint]` to enable feature selection via chi-squared 
    +independence tests.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +run Pearson's chi-squared tests. The following example demonstrates how to run and interpret 
    +hypothesis tests.
     
    -## Hypothesis Testing 
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics._
    +
    +val sc: SparkContext = ...
    +
    +val vec: Vector = ... // a vector composed of the frequencies of events
    +
    +// compute the goodness of fit. If a second vector to test against is not supplied as a parameter, 
    +// the test runs against a uniform distribution.  
    +val goodnessOfFitTestResult = Statistics.chiSqTest(vec)
    +println(goodnessOfFitTestResult) // summary of the test including the p-value, degrees of freedom, 
    +                                 // test statistic, the method used, and the null hypothesis.
    +
    +val mat: Matrix = ... // a contingency matrix
    +
    +// conduct Pearson's independence test on the input contingency matrix
    +val independenceTestResult = Statistics.chiSqTest(mat) 
    +println(independenceTestResult) // summary of the test including the p-value, degrees of freedom...
    +
    +val obs: RDD[LabeledPoint] = ... // (feature, label) pairs.
    +
    +// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +// the independence test. Returns an array containing the ChiSquaredTestResult for every feature 
    +// against the label.
    +val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
    +var i: Integer = 1
    +featureTestResults.foreach{ result =>
    +    println(s"Column $i: \n$result")
    +    i += 1
    +} // summary of the test 
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +run Pearson's chi-squared tests. The following example demonstrates how to run and interpret 
    +hypothesis tests.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.regression.LabeledPoint;
    +import org.apache.spark.mllib.stat.Statistics;
    +import org.apache.spark.mllib.stat.test.ChiSqTestResult;
    +
    +JavaSparkContext jsc = ...
    +
    +Vector vec = ... // a vector composed of the frequencies of events
    +
    +// compute the goodness of fit. If a second vector to test against is not supplied as a parameter, 
    +// the test runs against a uniform distribution.  
    +ChiSqTestResult goodnessOfFitTestResult = Statistics.chiSqTest(vec);
    +// summary of the test including the p-value, degrees of freedom, test statistic, the method used, 
    +// and the null hypothesis.
    +System.out.println(goodnessOfFitTestResult);
    +
    +Matrix mat = ... // a contingency matrix
    +
    +// conduct Pearson's independence test on the input contingency matrix
    +ChiSqTestResult independenceTestResult = Statistics.chiSqTest(mat);
    +// summary of the test including the p-value, degrees of freedom...
    +System.out.println(independenceTestResult);
    +
    +JavaRDD<LabeledPoint> obs = ... // (feature, label) pairs.
    +
    +// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +// the independence test. Returns an array containing the ChiSquaredTestResult for every feature 
    +// against the label.
    +ChiSqTestResult[] featureTestResults = Statistics.chiSqTest(obs.rdd());
    +int i = 1;
    +for (ChiSqTestResult individualSummary:featureTestResults){
    --- End diff --
    
    space after `:` and `)`, `individualSummary` -> `result`
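    i.e. something like `for (ChiSqTestResult result : featureTestResults) {`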




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695718
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    +methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    +
    +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +Support for `RowMatrix` operations in Python does not currently exist, but will be added in future 
    +releases.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +seriesX = ... # a series
    +seriesY = ... # must have the same number of partitions and cardinality as seriesX
    +
    +# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +# method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(seriesX, seriesY, method="pearson")
    +
    +data = ... # an RDD of Vectors
    +# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +# If a method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(data, method="pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
    +
    +## Stratified Sampling
    +
    +Unlike the other statistics functions, which reside in MLlib, stratified sampling methods, 
    +`sampleByKey` and `sampleByKeyExact`, can be accessed in `PairRDDFunctions` in core, as stratified 
    +sampling is tightly coupled with the PairRDD data type, and the function signature conforms to the 
    +other *ByKey* methods in PairRDDFunctions. A separate method for exact sample size support exists 
    +as it requires significantly more resources than the per-stratum simple random sampling used in 
    +`sampleByKey`.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.rdd.PairRDDFunctions
    +
    +val sc: SparkContext = ...
    +
    +val data = ... // an RDD[(K,V)] of any key value pairs
    +val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +val sample = data.sampleByKeyExact(withReplacement=false, fractions)
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaPairRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaPairRDD<K,V> data = ... // an RDD of any key value pairs
    +java.util.Map<K,Object> fractions = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +JavaPairRDD<K,V> sample = data.sampleByKeyExact(false, fractions);
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
     
     ## Summary Statistics 
     
     ### Multivariate summary statistics
     
    -We provide column summary statistics for `RowMatrix` (note: this functionality is not currently supported in `IndexedRowMatrix` or `CoordinateMatrix`). 
    -If the number of columns is not large, e.g., on the order of thousands, then the 
    -covariance matrix can also be computed as a local matrix, which requires $\mathcal{O}(n^2)$ storage where $n$ is the
    -number of columns. The total CPU time is $\mathcal{O}(m n^2)$, where $m$ is the number of rows,
    -and is faster if the rows are sparse.
    +We provide column summary statistics for `RDD[Vector]` through the static function `colStats` 
    +available in `Statistics`.
     
     <div class="codetabs">
     <div data-lang="scala" markdown="1">
     
    -[`computeColumnSummaryStatistics()`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) returns an instance of
    +[`colStats()`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of
     [`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
     which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
     total count.
     
     {% highlight scala %}
    -import org.apache.spark.mllib.linalg.Matrix
    -import org.apache.spark.mllib.linalg.distributed.RowMatrix
    -import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
     
    -val mat: RowMatrix = ... // a RowMatrix
    +val mat: RDD[Vector] = ... // an RDD of Vectors
     
     // Compute column summary statistics.
    -val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
    +val summary: MultivariateStatisticalSummary = Statistics.colStats(mat)
     println(summary.mean) // a dense vector containing the mean value for each column
     println(summary.variance) // column-wise variance
     println(summary.numNonzeros) // number of nonzeros in each column
     
    -// Compute the covariance matrix.
    -val cov: Matrix = mat.computeCovariance()
     {% endhighlight %}
     </div>
     
     <div data-lang="java" markdown="1">
     
    -[`RowMatrix#computeColumnSummaryStatistics`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#computeColumnSummaryStatistics()) returns an instance of
    +[`colStats()`](api/java/org/apache/spark/mllib/stat/Statistics.html) returns an instance of
     [`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
     which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
     total count.
     
     {% highlight java %}
    -import org.apache.spark.mllib.linalg.Matrix;
    -import org.apache.spark.mllib.linalg.distributed.RowMatrix;
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.Vector;
     import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
     
    -RowMatrix mat = ... // a RowMatrix
    +JavaRDD<Vector> mat = ... // an RDD of Vectors
     
     // Compute column summary statistics.
    -MultivariateStatisticalSummary summary = mat.computeColumnSummaryStatistics();
    +MultivariateStatisticalSummary summary = Statistics.colStats(mat.rdd());
     System.out.println(summary.mean()); // a dense vector containing the mean value for each column
     System.out.println(summary.variance()); // column-wise variance
     System.out.println(summary.numNonzeros()); // number of nonzeros in each column
     
    -// Compute the covariance matrix.
    -Matrix cov = mat.computeCovariance();
     {% endhighlight %}
     </div>
    +
    +<div data-lang="python" markdown="1">
    +[`colStats()`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of
    +[`MultivariateStatisticalSummary`](api/python/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary$),
    +which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
    +total count.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +mat = ... # an RDD of Vectors
    +
    +# Compute column summary statistics.
    +summary = Statistics.colStats(mat)
    +print summary.mean()
    +print summary.variance()
    +print summary.numNonzeros()
    +
    +{% endhighlight %}
     </div>
     
    +</div>
    +
    +
    +## Hypothesis Testing
    --- End diff --
    
    `Testing` -> `testing`




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695716
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    +methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    +
    +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +Support for `RowMatrix` operations in Python does not currently exist, but will be added in future 
    +releases.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +seriesX = ... # a series
    +seriesY = ... # must have the same number of partitions and cardinality as seriesX
    +
    +# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +# method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(seriesX, seriesY, method="pearson")
    +
    +data = ... # an RDD of Vectors
    +# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +# If a method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(data, method="pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
    +
    +## Stratified Sampling
    +
    +Unlike the other statistics functions, which reside in MLlib, stratified sampling methods, 
    +`sampleByKey` and `sampleByKeyExact`, can be accessed in `PairRDDFunctions` in core, as stratified 
    +sampling is tightly coupled with the PairRDD data type, and the function signature conforms to the 
    +other *ByKey* methods in PairRDDFunctions. A separate method for exact sample size support exists 
    +as it requires significantly more resources than the per-stratum simple random sampling used in 
    +`sampleByKey`.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.rdd.PairRDDFunctions
    +
    +val sc: SparkContext = ...
    +
    +val data = ... // an RDD[(K,V)] of any key value pairs
    +val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +val sample = data.sampleByKeyExact(withReplacement=false, fractions)
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaPairRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaPairRDD<K,V> data = ... // an RDD of any key value pairs
    +java.util.Map<K,Object> fractions = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +JavaPairRDD<K,V> sample = data.sampleByKeyExact(false, fractions);
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
     
     ## Summary Statistics 
     
     ### Multivariate summary statistics
     
    -We provide column summary statistics for `RowMatrix` (note: this functionality is not currently supported in `IndexedRowMatrix` or `CoordinateMatrix`). 
    -If the number of columns is not large, e.g., on the order of thousands, then the 
    -covariance matrix can also be computed as a local matrix, which requires $\mathcal{O}(n^2)$ storage where $n$ is the
    -number of columns. The total CPU time is $\mathcal{O}(m n^2)$, where $m$ is the number of rows,
    -and is faster if the rows are sparse.
    +We provide column summary statistics for `RDD[Vector]` through the static function `colStats` 
    +available in `Statistics`.
     
     <div class="codetabs">
     <div data-lang="scala" markdown="1">
     
    -[`computeColumnSummaryStatistics()`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) returns an instance of
    +[`colStats()`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of
     [`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
     which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
     total count.
     
     {% highlight scala %}
    -import org.apache.spark.mllib.linalg.Matrix
    -import org.apache.spark.mllib.linalg.distributed.RowMatrix
    -import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
     
    -val mat: RowMatrix = ... // a RowMatrix
    +val mat: RDD[Vector] = ... // an RDD of Vectors
    --- End diff --
    
    rename `mat` to `vectors` or `observations` to avoid confusion?
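    e.g. `val observations: RDD[Vector] = ...` and then `Statistics.colStats(observations)` below.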




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2123#issuecomment-53380977
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19200/consoleFull) for   PR 2123 at commit [`213fe3f`](https://github.com/apache/spark/commit/213fe3f31f708ff0ee56d56e36644b51c0bba56e).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz commented on the pull request:

    https://github.com/apache/spark/pull/2123#issuecomment-53381256
  
    ??????




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695714
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    +methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    +
    +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +Support for `RowMatrix` operations in Python does not currently exist, but will be added in future 
    +releases.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +seriesX = ... # a series
    +seriesY = ... # must have the same number of partitions and cardinality as seriesX
    +
    +# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +# method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(seriesX, seriesY, method="pearson")
    +
    +data = ... # an RDD of Vectors
    +# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +# If a method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(data, method="pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
    +
    +## Stratified Sampling
    +
    +Unlike the other statistics functions, which reside in MLlib, stratified sampling methods, 
    +`sampleByKey` and `sampleByKeyExact`, can be accessed in `PairRDDFunctions` in core, as stratified 
    +sampling is tightly coupled with the PairRDD data type, and the function signature conforms to the 
    +other *ByKey* methods in PairRDDFunctions. A separate method for exact sample size support exists 
    +as it requires significantly more resources than the per-stratum simple random sampling used in 
    +`sampleByKey`.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.rdd.PairRDDFunctions
    +
    +val sc: SparkContext = ...
    +
    +val data = ... // an RDD[(K,V)] of any key value pairs
    +val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +val sample = data.sampleByKeyExact(withReplacement=false, fractions)
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaPairRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaPairRDD<K,V> data = ... // an RDD of any key value pairs
    +java.util.Map<K,Object> fractions = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +JavaPairRDD<K,V> sample = data.sampleByKeyExact(false, fractions);
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
     
     ## Summary Statistics 
    --- End diff --
    
    `Summary statistics` (I would recommend moving this section to the beginning.)




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695704
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    +methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    +
    +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +Support for `RowMatrix` operations in Python does not currently exist, but will be added in future 
    +releases.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +seriesX = ... # a series
    +seriesY = ... # must have the same number of partitions and cardinality as seriesX
    +
    +# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +# method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(seriesX, seriesY, method="pearson")
    +
    +data = ... # an RDD of Vectors
    +# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +# If a method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(data, method="pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
    +
    +## Stratified Sampling
    +
    +Unlike the other statistics functions, which reside in MLlib, stratified sampling methods, 
    +`sampleByKey` and `sampleByKeyExact`, can be accessed in `PairRDDFunctions` in core, as stratified 
    +sampling is tightly coupled with the PairRDD data type, and the function signature conforms to the 
    --- End diff --
    
    `PairRDD` -> `RDD of key-value pairs` (since there is no `PairRDD` type in Spark)




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695707
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    +methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    +
    +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +Support for `RowMatrix` operations in Python does not currently exist, but will be added in future 
    +releases.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +seriesX = ... # a series
    +seriesY = ... # must have the same number of partitions and cardinality as seriesX
    +
    +# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +# method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(seriesX, seriesY, method="pearson")
    +
    +data = ... # an RDD of Vectors
    +# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +# If a method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(data, method="pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
    +
    +## Stratified Sampling
    +
    +Unlike the other statistics functions, which reside in MLlib, stratified sampling methods, 
    +`sampleByKey` and `sampleByKeyExact`, can be accessed in `PairRDDFunctions` in core, as stratified 
    +sampling is tightly coupled with the PairRDD data type, and the function signature conforms to the 
    +other *ByKey* methods in PairRDDFunctions. A separate method for exact sample size support exists 
    +as it requires significantly more resources than the per-stratum simple random sampling used in 
    +`sampleByKey`.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.rdd.PairRDDFunctions
    +
    +val sc: SparkContext = ...
    +
    +val data = ... // an RDD[(K,V)] of any key value pairs
    +val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +val sample = data.sampleByKeyExact(withReplacement=false, fractions)
    --- End diff --
    
    add spaces around `=`
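    i.e. `val sample = data.sampleByKeyExact(withReplacement = false, fractions)`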




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695710
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    +methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    +
    +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +Support for `RowMatrix` operations in Python does not currently exist, but will be added in future 
    +releases.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +seriesX = ... # a series
    +seriesY = ... # must have the same number of partitions and cardinality as seriesX
    +
    +# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +# method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(seriesX, seriesY, method="pearson")
    +
    +data = ... # an RDD of Vectors
    +# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +# If a method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(data, method="pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
    +
    +## Stratified Sampling
    +
    +Unlike the other statistics functions, which reside in MLlib, stratified sampling methods, 
    +`sampleByKey` and `sampleByKeyExact`, can be accessed in `PairRDDFunctions` in core, as stratified 
    +sampling is tightly coupled with the PairRDD data type, and the function signature conforms to the 
    +other *ByKey* methods in PairRDDFunctions. A separate method for exact sample size support exists 
    +as it requires significantly more resources than the per-stratum simple random sampling used in 
    +`sampleByKey`.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.rdd.PairRDDFunctions
    +
    +val sc: SparkContext = ...
    +
    +val data = ... // an RDD[(K,V)] of any key value pairs
    +val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +val sample = data.sampleByKeyExact(withReplacement=false, fractions)
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired 
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$. 
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample 
    +size, whereas sampling with replacement requires two additional passes.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaPairRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaPairRDD<K,V> data = ... // an RDD of any key value pairs
    +java.util.Map<K,Object> fractions = ... // specify the exact fraction desired from each key
    --- End diff --
    
    ditto, space after `,`. We can consider `import java.util.Map;` first.
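    i.e. `import java.util.Map;` at the top, and then `Map<K, Object> fractions = ...` here.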




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695702
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    +methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    +
    +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +Support for `RowMatrix` operations in Python does not currently exist, but will be added in future 
    +releases.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +seriesX = ... # a series
    +seriesY = ... # must have the same number of partitions and cardinality as seriesX
    +
    +# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +# method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(seriesX, seriesY, method="pearson")
    +
    +data = ... # an RDD of Vectors
    +# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +# If a method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(data, method="pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
    +
    +## Stratified Sampling
    --- End diff --
    
    `Stratified sampling` (to match other section titles)




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695694
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    --- End diff --
    
    `Correlations`




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695695
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    --- End diff --
    
    `calculate correlation between` -> `calculate pairwise correlations among`




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2123#discussion_r16695703
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation Calculation
    +
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate correlation between many series. The supported correlation
    +methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or 
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +// method is not specified, Pearson's method will be used by default. 
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    +
    +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to 
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +Support for `RowMatrix` operations in Python does not currently exist, but will be added in future 
    +releases.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +seriesX = ... # a series
    +seriesY = ... # must have the same number of partitions and cardinality as seriesX
    +
    +# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
    +# method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(seriesX, seriesY, method="pearson")
    +
    +data = ... # an RDD of Vectors
    +# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +# If a method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(data, method="pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
    +
    +## Stratified Sampling
    +
    +Unlike the other statistics functions, which reside in MLlib, stratified sampling methods, 
    +`sampleByKey` and `sampleByKeyExact`, can be accessed in `PairRDDFunctions` in core, as stratified 
    --- End diff --
    
    `PairRDDFunctions` is in Scala only. We should avoid it here. Also, `sampleByKeyExact` is not available in Python. It may be useful to describe what keys and values mean in an RDD of key-value pairs in stratified sampling.
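    
    To make the key/value roles concrete, here is a minimal PySpark sketch (the data and app name are made up for illustration): the key identifies the stratum and the value is the record drawn from it. It assumes `RDD.sampleByKey(withReplacement, fractions)` is the Python-side entry point and, as noted above, has no exact-size counterpart in Python:
    
    ```python
    from pyspark import SparkContext
    
    sc = SparkContext(appName="StratifiedSamplingSketch")
    
    # an RDD of (key, value) pairs: the key is the stratum label,
    # the value is the record drawn from that stratum
    data = sc.parallelize([("a", 1), ("a", 2), ("b", 3), ("b", 4), ("b", 5)])
    
    # desired sampling fraction for each stratum (key)
    fractions = {"a": 0.5, "b": 0.8}
    
    # per-stratum simple random sampling without replacement (approximate sample size)
    approx_sample = data.sampleByKey(False, fractions)
    print approx_sample.collect()
    ```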




[GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...

Posted by brkyvz <gi...@git.apache.org>.
GitHub user brkyvz reopened a pull request:

    https://github.com/apache/spark/pull/2123

    [SPARK-2839][MLlib] Stats Toolkit documentation updated

    Documentation updated for the Statistics Toolkit of MLlib. @mengxr @atalwalkar
    
    https://issues.apache.org/jira/browse/SPARK-2839

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/brkyvz/spark StatsLib-Docs

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2123.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2123
    
----
commit fec4d9d8b9569000125e5e3778d8a5521f4f0b72
Author: Burak <br...@gmail.com>
Date:   2014-08-26T01:44:43Z

    [SPARK-2830][MLlib] Stats Toolkit documentation updated

commit 213fe3f31f708ff0ee56d56e36644b51c0bba56e
Author: Burak <br...@gmail.com>
Date:   2014-08-26T06:28:13Z

    [SPARK-2839][MLlib] Modifications made according to review

----

