Posted to reviews@spark.apache.org by jkbradley <gi...@git.apache.org> on 2014/08/10 20:37:28 UTC

[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/1878

    [SPARK-2850] [mllib] MLlib stats examples + small fixes

    Added examples for statistical summarization:
    * Scala: StatisticalSummary.scala
    ** Tests: correlation, MultivariateOnlineSummarizer
    * python: statistical_summary.py
    ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
    
    Added examples for random and sampled RDDs:
    * Scala: RandomAndSampledRDDs.scala
    * python: random_and_sampled_rdds.py
    * Both test:
    ** RandomRDDGenerators.normalRDD, normalVectorRDD
    ** RDD.sample, takeSample, sampleByKey
    
    Added sc.stop() to all examples.
    
    CorrelationSuite.scala
    * Added 1 test for RDDs with only 1 value
    
    RowMatrix.scala
    * numCols(): Added check for numRows = 0, with error message.
    * computeCovariance(): Added check for numRows <= 1, with error message.
    
    Python SparseVector (pyspark/mllib/linalg.py)
    * Added toDense() function
    
    python/run-tests script
    * Added stat.py (doc test)
    
    CC: @mengxr @dorx  Main changes were examples to show usage across APIs.
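
    As a rough sketch of the kind of usage these examples demonstrate (the data below is made up
    and this is not the exact example code in the PR), the Scala side looks roughly like:

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.mllib.linalg.Vectors
        import org.apache.spark.mllib.stat.{MultivariateOnlineSummarizer, Statistics}

        object StatsUsageSketch {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("StatsUsageSketch"))
            val data = sc.parallelize(Seq(
              Vectors.dense(1.0, 10.0),
              Vectors.dense(2.0, 20.0),
              Vectors.dense(3.0, 30.0)))

            // Column-wise summary statistics, accumulated with MultivariateOnlineSummarizer
            // (Scala/Java only, which is why the Python example sticks to correlation).
            val summary = data.aggregate(new MultivariateOnlineSummarizer)(
              (s, v) => s.add(v),
              (s1, s2) => s1.merge(s2))
            println(s"mean: ${summary.mean}, variance: ${summary.variance}, count: ${summary.count}")

            // Pairwise column correlations (Pearson by default).
            println(Statistics.corr(data))

            sc.stop()
          }
        }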

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark mllib-stats-api-check

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1878.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1878
    
----
commit ee918e9e165a02dc55235877484502baaaf906e0
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-07T21:34:11Z

    Added examples for statistical summarization:
    * Scala: StatisticalSummary.scala
    ** Tests: correlation, MultivariateOnlineSummarizer
    * python: statistical_summary.py
    ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
    
    Added sc.stop() to all examples.
    
    CorrelationSuite.scala
    * Added 1 test for RDDs with only 1 value
    
    Python SparseVector (pyspark/mllib/linalg.py)
    * Added toDense() function
    
    python/run-tests script
    * Added stat.py (doc test)

commit 064985bd59b854bbca70290256348177415b5bda
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-07T23:34:38Z

    Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check

commit 8195c78a312087ee18375b745600946e47fcdd46
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-08T01:42:52Z

    Added examples for random and sampled RDDs:
    * Scala: RandomAndSampledRDDs.scala
    * python: random_and_sampled_rdds.py
    * Both test:
    ** RandomRDDGenerators.normalRDD, normalVectorRDD
    ** RDD.sample, takeSample, sampleByKey

commit 65e4ebc8c07c7fb4bf76f80c11b28f790362533e
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-10T17:36:10Z

    Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check

commit ab48f6eb01541309ffa2d86febb0a039f435a60a
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-10T18:26:03Z

    RowMatrix.scala
    * numCols(): Added check for numRows = 0, with error message.
    * computeCovariance(): Added check for numRows <= 1, with error message.

----
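
The RowMatrix changes in the last commit above come down to failing fast with a clear message.
A minimal, hypothetical sketch of such guards (not the actual patch; the messages are made up):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // numCols(): refuse to infer the column count from an empty rows RDD.
    def numColsChecked(rows: RDD[Vector]): Int = {
      val first = rows.take(1)
      require(first.nonEmpty, "Cannot determine the number of columns: the rows RDD is empty.")
      first.head.size
    }

    // computeCovariance(): sample covariance divides by n - 1, so it needs more than one row.
    def requireEnoughRows(numRows: Long): Unit =
      require(numRows > 1, s"Cannot compute covariance with $numRows row(s); at least 2 are needed.")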




[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52555036
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18782/consoleFull) for   PR 1878 at commit [`ea5c047`](https://github.com/apache/spark/commit/ea5c0470a12b0048160ed4b3281c3048004230b3).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52535155
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18770/consoleFull) for   PR 1878 at commit [`dafebe2`](https://github.com/apache/spark/commit/dafebe2233aa925f3210ccf59b1ccd71774aed26).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1878#discussion_r16094836
  
    --- Diff: examples/src/main/python/mllib/random_and_sampled_rdds.py ---
    @@ -0,0 +1,88 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Randomly generated and sampled RDDs.
    +"""
    +
    +import sys
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.random import RandomRDDGenerators
    +from pyspark.mllib.util import MLUtils
    +
    +
    +
    +if __name__ == "__main__":
    +    if len(sys.argv) not in [1, 2]:
    +        print >> sys.stderr, "Usage: logistic_regression <libsvm data file>"
    +        exit(-1)
    +    if len(sys.argv) == 2:
    +        datapath = sys.argv[1]
    +    else:
    +        datapath = 'data/mllib/sample_binary_classification_data.txt'
    +
    +    sc = SparkContext(appName="PythonRandomAndSampledRDDs")
    +
    +    points = MLUtils.loadLibSVMFile(sc, datapath)
    +
    +    numExamples = 10000 # number of examples to generate
    +    fraction = 0.1 # fraction of data to sample
    +
    +    # Example: RandomRDDGenerators
    +    normalRDD = RandomRDDGenerators.normalRDD(sc, numExamples)
    +    print 'Generated RDD of %d examples sampled from a unit normal distribution' % normalRDD.count()
    +    normalVectorRDD = RandomRDDGenerators.normalVectorRDD(sc, numRows = numExamples, numCols = 2)
    +    print 'Generated RDD of %d examples of length-2 vectors.' % normalVectorRDD.count()
    +
    +    print ''
    +
    +    # Example: RDD.sample() and RDD.takeSample()
    +    exactSampleSize = int(numExamples * fraction)
    +    print 'Sampling RDD using fraction %g.  Expected sample size = %d.' \
    +        % (fraction, exactSampleSize)
    +    sampledRDD = normalRDD.sample(withReplacement = True, fraction = fraction)
    +    print '  RDD.sample(): sample has %d examples' % sampledRDD.count()
    +    sampledArray = normalRDD.takeSample(withReplacement = True, num = exactSampleSize)
    +    print '  RDD.takeSample(): sample has %d examples' % len(sampledArray)
    +
    +    print ''
    +
    +    # Example: RDD.sampleByKey()
    +    examples = MLUtils.loadLibSVMFile(sc, datapath)
    +    sizeA = examples.count()
    +    print 'Loaded data with %d examples from file: %s' % (sizeA, datapath)
    +    keyedRDD = examples.map(lambda lp: (int(lp.label), lp.features))
    +    print '  Keyed data using label (Int) as key ==> Orig'
    +    #  Count examples per label in original data.
    +    keyCountsA = keyedRDD.countByKey()
    +    #  Subsample, and count examples per label in sampled data.
    +    fractions = {}
    +    for k in keyCountsA.keys():
    +        fractions[k] = fraction
    +    sampledByKeyRDD = \
    +        keyedRDD.sampleByKey(withReplacement = True, fractions = fractions)#, exact = True)
    --- End diff --
    
    remove `#, exact = True)` because we don't support it in Python




[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52449697
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18729/consoleFull) for   PR 1878 at commit [`4e5d15e`](https://github.com/apache/spark/commit/4e5d15ef333ec468c872fa24adea98486b168ded).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")`
      * `  case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")`
      * `  case class Params(input: String = "data/mllib/sample_binary_classification_data.txt")`





[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52450830
  
    @mengxr  Yes, it occurred in local mode.  I would not expect the parts of the code that use the timer to be parallelized, though.




[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52560856
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18782/consoleFull) for   PR 1878 at commit [`ea5c047`](https://github.com/apache/spark/commit/ea5c0470a12b0048160ed4b3281c3048004230b3).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52416428
  
    @mengxr Thanks for the comments!  Updated accordingly.




[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-51731530
  
    Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52541989
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18770/consoleFull) for   PR 1878 at commit [`dafebe2`](https://github.com/apache/spark/commit/dafebe2233aa925f3210ccf59b1ccd71774aed26).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")`
      * `  case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")`
      * `  case class Params(input: String = "data/mllib/sample_binary_classification_data.txt")`





[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52435269
  
    @mengxr  It looks like the failures are in other tests; how best to proceed?  With respect to the case class Params, is it OK to have them public since they are in examples?  (Other examples have them public too.)




[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52447106
  
    I saw a hotfix just merged in. Let's try Jenkins again.




[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52428670
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18704/consoleFull) for   PR 1878 at commit [`4e5d15e`](https://github.com/apache/spark/commit/4e5d15ef333ec468c872fa24adea98486b168ded).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52450396
  
    Are you using local mode? All executors are running inside the same JVM under local mode. They may use the same timer instance.




[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52456910
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18738/consoleFull) for   PR 1878 at commit [`60c72d9`](https://github.com/apache/spark/commit/60c72d98b20525e328a791830b5132d42d167202).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")`
      * `  case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")`
      * `  case class Params(input: String = "data/mllib/sample_binary_classification_data.txt")`





[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52447202
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18729/consoleFull) for   PR 1878 at commit [`4e5d15e`](https://github.com/apache/spark/commit/4e5d15ef333ec468c872fa24adea98486b168ded).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52416455
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18700/consoleFull) for   PR 1878 at commit [`32173b7`](https://github.com/apache/spark/commit/32173b7a2ebf4a7118faf42cccf6c6e9af073842).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52534308
  
    Jenkins, test this please.




[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1878#discussion_r16165816
  
    --- Diff: examples/src/main/python/mllib/random_and_sampled_rdds.py ---
    @@ -0,0 +1,88 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Randomly generated and sampled RDDs.
    --- End diff --
    
    Sure, I can separate them.  I'll call them random_rdds.py and sampled_rdds.py




[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52549345
  
    Driver suite test failed...merging with updated master and trying again.




[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52430001
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18704/consoleFull) for   PR 1878 at commit [`4e5d15e`](https://github.com/apache/spark/commit/4e5d15ef333ec468c872fa24adea98486b168ded).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")`
      * `  case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")`
      * `  case class Params(input: String = "data/mllib/sample_binary_classification_data.txt")`





[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-51731591
  
    QA tests have started for PR 1878. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18291/consoleFull




[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52537882
  
    @mengxr Hopefully ready, pending Jenkins.




[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1878#discussion_r16338266
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/SampledRDDs.scala ---
    @@ -0,0 +1,115 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.mllib.util.MLUtils
    +import scopt.OptionParser
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.SparkContext._
    +
    +/**
    + * An example app for randomly generated and sampled RDDs. Run with
    + * {{{
    + * bin/run-example org.apache.spark.examples.mllib.SampledRDDs
    + * }}}
    + * If you use it as a template to create your own app, please use `spark-submit` to submit your app.
    + */
    +object SampledRDDs {
    +
    +  case class Params(input: String = "data/mllib/sample_binary_classification_data.txt")
    +
    +  def main(args: Array[String]) {
    +    val defaultParams = Params()
    +
    +    val parser = new OptionParser[Params]("SampledRDDs") {
    +      head("SampledRDDs: an example app for randomly generated and sampled RDDs.")
    +      opt[String]("input")
    +        .text(s"Input path to labeled examples in LIBSVM format, default: ${defaultParams.input}")
    +        .action((x, c) => c.copy(input = x))
    +      note(
    +        """
    +        |For example, the following command runs this app:
    +        |
    +        | bin/spark-submit --class org.apache.spark.examples.mllib.SampledRDDs \
    +        |  examples/target/scala-*/spark-examples-*.jar
    +        """.stripMargin)
    +    }
    +
    +    parser.parse(args, defaultParams).map { params =>
    +      run(params)
    +    } getOrElse {
    +      sys.exit(1)
    +    }
    +  }
    +
    +  def run(params: Params) {
    +    val conf = new SparkConf().setAppName(s"SampledRDDs with $params")
    +    val sc = new SparkContext(conf)
    +
    +    val fraction = 0.1 // fraction of data to sample
    +
    +    val examples = MLUtils.loadLibSVMFile(sc, params.input)
    +    val numExamples = examples.count()
    +    println(s"Loaded data with $numExamples examples from file: ${params.input}")
    +
    +    // Example: RDD.sample() and RDD.takeSample()
    +    val expectedSampleSize = (numExamples * fraction).toInt
    +    println(s"Sampling RDD using fraction $fraction.  Expected sample size = $expectedSampleSize.")
    +    val sampledRDD = examples.sample(withReplacement = true, fraction = fraction)
    +    println(s"  RDD.sample(): sample has ${sampledRDD.count()} examples")
    +    val sampledArray = examples.takeSample(withReplacement = true, num = expectedSampleSize)
    +    println(s"  RDD.takeSample(): sample has ${sampledArray.size} examples")
    +
    +    println()
    +
    +    // Example: RDD.sampleByKey() and RDD.sampleByKeyExact()
    +    val keyedRDD = examples.map { lp => (lp.label.toInt, lp.features) }
    +    println(s"  Keyed data using label (Int) as key ==> Orig")
    +    //  Count examples per label in original data.
    +    val keyCounts = keyedRDD.countByKey()
    +
    +    //  Subsample, and count examples per label in sampled data. (approximate)
    +    val fractions = keyCounts.keys.map((_, fraction)).toMap
    +    val sampledByKeyRDD = keyedRDD.sampleByKey(withReplacement = true, fractions = fractions)
    +    val keyCountsB = sampledByKeyRDD.countByKey()
    +    val sizeB = keyCountsB.values.sum
    +    println(s"  Sampled $sizeB examples using approximate stratified sampling (by label)." +
    +      " ==> Approx Sample")
    +
    +    //  Subsample, and count examples per label in sampled data. (approximate)
    +    val sampledByKeyRDDExact =
    +      keyedRDD.sampleByKeyExact(withReplacement = true, fractions = fractions)
    +    val keyCountsBExact = sampledByKeyRDDExact.countByKey()
    +    val sizeBExact = keyCountsBExact.values.sum
    +    println(s"  Sampled $sizeBExact examples using exact stratified sampling (by label)." +
    +      " ==> Exact Sample")
    +
    +    //  Compare samples
    +    println(s"   \tFractions of examples with key")
    +    println(s"Key\tOrig\tApprox Sample\tExact Sample")
    +    keyCounts.keys.toSeq.sorted.foreach { key =>
    +      val origFrac = keyCounts(key) / numExamples.toDouble
    +      val approxFrac = keyCountsB(key) / sizeB.toDouble
    --- End diff --
    
    There is a chance that `keyCountsB` doesn't contain `key`. It is safer to use `keyCountsB.getOrElse` here.
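
    A self-contained sketch of the suggested guard (the counts here are made up): keys that
    sampling happened to drop contribute 0 instead of throwing a NoSuchElementException.

        val keyCounts  = Map(0 -> 90L, 1 -> 10L)  // per-label counts in the original data
        val keyCountsB = Map(0 -> 9L)              // label 1 was dropped entirely by sampling
        val sizeB = keyCountsB.values.sum
        keyCounts.keys.toSeq.sorted.foreach { key =>
          val approxFrac = keyCountsB.getOrElse(key, 0L) / sizeB.toDouble
          println(s"$key\t$approxFrac")
        }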




[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52417298
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18700/consoleFull) for   PR 1878 at commit [`32173b7`](https://github.com/apache/spark/commit/32173b7a2ebf4a7118faf42cccf6c6e9af073842).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52449554
  
    @mengxr  I just found a bug: timer.start("chooseSplits") at DecisionTree.scala line 1245 is sometimes called twice before the matching timer.stop("chooseSplits") a few lines below is called.  Perhaps copies of the timer are being made?  I had thought the timer object would stay on the master, since it does not appear in the distributed operations.  It happens very rarely; I've run the code many times and have only just hit it.  Any ideas?
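
    For reference, a hypothetical sketch (this is not Spark's TimeTracker) of a timer whose
    start() fails fast on a double start, which would surface the race instead of silently
    overwriting the previous start time:

        import scala.collection.mutable

        class StrictTimer {
          private val starts = mutable.Map.empty[String, Long]
          private val totals = mutable.Map.empty[String, Long]

          def start(name: String): Unit = {
            require(!starts.contains(name), s"Timer '$name' started twice without a stop().")
            starts(name) = System.nanoTime()
          }

          def stop(name: String): Unit = {
            val t0 = starts.remove(name).getOrElse(
              sys.error(s"Timer '$name' stopped without a matching start()."))
            totals(name) = totals.getOrElse(name, 0L) + (System.nanoTime() - t0)
          }
        }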




[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-51723167
  
    QA tests have started for PR 1878. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18283/consoleFull




[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1878#discussion_r16094834
  
    --- Diff: examples/src/main/python/mllib/random_and_sampled_rdds.py ---
    @@ -0,0 +1,88 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Randomly generated and sampled RDDs.
    +"""
    +
    +import sys
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.random import RandomRDDGenerators
    +from pyspark.mllib.util import MLUtils
    +
    +
    +
    +if __name__ == "__main__":
    +    if len(sys.argv) not in [1, 2]:
    +        print >> sys.stderr, "Usage: logistic_regression <libsvm data file>"
    +        exit(-1)
    +    if len(sys.argv) == 2:
    +        datapath = sys.argv[1]
    +    else:
    +        datapath = 'data/mllib/sample_binary_classification_data.txt'
    +
    +    sc = SparkContext(appName="PythonRandomAndSampledRDDs")
    +
    +    points = MLUtils.loadLibSVMFile(sc, datapath)
    +
    +    numExamples = 10000 # number of examples to generate
    +    fraction = 0.1 # fraction of data to sample
    +
    +    # Example: RandomRDDGenerators
    +    normalRDD = RandomRDDGenerators.normalRDD(sc, numExamples)
    +    print 'Generated RDD of %d examples sampled from a unit normal distribution' % normalRDD.count()
    +    normalVectorRDD = RandomRDDGenerators.normalVectorRDD(sc, numRows = numExamples, numCols = 2)
    +    print 'Generated RDD of %d examples of length-2 vectors.' % normalVectorRDD.count()
    +
    +    print ''
    --- End diff --
    
    `print ''` -> `print`




[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1878#discussion_r16094844
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/RandomAndSampledRDDs.scala ---
    @@ -0,0 +1,110 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.mllib.random.RandomRDDGenerators
    +import org.apache.spark.mllib.util.MLUtils
    +import org.apache.spark.rdd.RDD
    +import scopt.OptionParser
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.SparkContext._
    +
    +/**
    + * An example app for randomly generated and sampled RDDs. Run with
    + * {{{
    + * bin/run-example org.apache.spark.examples.mllib.RandomAndSampledRDDs
    + * }}}
    + * If you use it as a template to create your own app, please use `spark-submit` to submit your app.
    + */
    +object RandomAndSampledRDDs extends App {
    --- End diff --
    
    ditto: It may be better if we separate random data generation and sampling.
    
    There are some caveats with `scala.App`. Maybe we should remove `extends App` and create `def main` explicitly.
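
    For reference, the suggested shape would be roughly the following skeleton (argument parsing
    and the actual example body omitted):

        object RandomAndSampledRDDs {
          def main(args: Array[String]): Unit = {
            // parse args, create the SparkContext, run the example, then call sc.stop()
          }
        }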




[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1878#discussion_r16094835
  
    --- Diff: examples/src/main/python/mllib/random_and_sampled_rdds.py ---
    @@ -0,0 +1,88 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Randomly generated and sampled RDDs.
    +"""
    +
    +import sys
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.random import RandomRDDGenerators
    +from pyspark.mllib.util import MLUtils
    +
    +
    +
    +if __name__ == "__main__":
    +    if len(sys.argv) not in [1, 2]:
    +        print >> sys.stderr, "Usage: logistic_regression <libsvm data file>"
    +        exit(-1)
    +    if len(sys.argv) == 2:
    +        datapath = sys.argv[1]
    +    else:
    +        datapath = 'data/mllib/sample_binary_classification_data.txt'
    +
    +    sc = SparkContext(appName="PythonRandomAndSampledRDDs")
    +
    +    points = MLUtils.loadLibSVMFile(sc, datapath)
    +
    +    numExamples = 10000 # number of examples to generate
    +    fraction = 0.1 # fraction of data to sample
    +
    +    # Example: RandomRDDGenerators
    +    normalRDD = RandomRDDGenerators.normalRDD(sc, numExamples)
    +    print 'Generated RDD of %d examples sampled from a unit normal distribution' % normalRDD.count()
    +    normalVectorRDD = RandomRDDGenerators.normalVectorRDD(sc, numRows = numExamples, numCols = 2)
    +    print 'Generated RDD of %d examples of length-2 vectors.' % normalVectorRDD.count()
    +
    +    print ''
    +
    +    # Example: RDD.sample() and RDD.takeSample()
    +    exactSampleSize = int(numExamples * fraction)
    --- End diff --
    
    `exactSampleSize` -> `expectedSampleSize`?




[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1878#discussion_r16337343
  
    --- Diff: python/pyspark/mllib/stat.py ---
    @@ -119,15 +119,15 @@ def corr(x, y=None, method=None):
             >>> rdd = sc.parallelize([Vectors.dense([1, 0, 0, -2]), Vectors.dense([4, 5, 0, 3]),
             ...                       Vectors.dense([6, 7, 0,  8]), Vectors.dense([9, 0, 0, 1])])
             >>> Statistics.corr(rdd)
    -        array([[ 1.        ,  0.05564149,         nan,  0.40047142],
    -               [ 0.05564149,  1.        ,         nan,  0.91359586],
    -               [        nan,         nan,  1.        ,         nan],
    -               [ 0.40047142,  0.91359586,         nan,  1.        ]])
    +        array([[ 1.        ,  0.05564149,         NaN,  0.40047142],
    --- End diff --
    
    I'm using 2.7.7, but using `NaN` results in an error in my local test. Maybe we should check the elements one by one in this case and use `isnan` on the `NaN` values.




[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52562946
  
    Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52512515
  
    @mengxr Thanks!  I'll send the tree fixes in the other PR I sent just now on treeAggregate(), and I will do the keyCount fix in this PR.




[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-51723114
  
    Q: Is the Python SparseVector.toDense() function too big an API update?




[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1878#discussion_r16094831
  
    --- Diff: examples/src/main/python/mllib/random_and_sampled_rdds.py ---
    @@ -0,0 +1,88 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Randomly generated and sampled RDDs.
    --- End diff --
    
    I don't quite understand why random data generation and sampling are put in a single example file. We can demo generating random uniform/normal/Gaussian/Poisson RDDs in one example, and then stratified sampling in another (e.g., sampling based on the label to re-balance positive/negative examples). A rough sketch of the latter is below.
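
    A rough sketch of what a stand-alone stratified-sampling example could look like (the object
    name, fractions, and data path here are illustrative, not the code in this PR):

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.SparkContext._
        import org.apache.spark.mllib.util.MLUtils

        object SampleByLabelSketch {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("SampleByLabelSketch"))
            val examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_binary_classification_data.txt")
            val keyed = examples.map(lp => (lp.label.toInt, lp.features))

            // Per-key sampling fractions: keep all positives, 10% of negatives.
            val fractions = Map(1 -> 1.0, 0 -> 0.1)
            val rebalanced = keyed.sampleByKey(withReplacement = false, fractions = fractions)
            rebalanced.countByKey().toSeq.sorted.foreach { case (label, n) =>
              println(s"label $label: $n sampled examples")
            }

            sc.stop()
          }
        }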




[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1878#discussion_r16094851
  
    --- Diff: python/pyspark/mllib/linalg.py ---
    @@ -160,6 +161,15 @@ def squared_distance(self, other):
                     j += 1
                 return result
     
    +    def toDense(self):
    +        """
    +        Returns a copy of this SparseVector as a 1-dimensional NumPy array.
    +        """
    +        arr = numpy.zeros(self.size)
    +        for i in range(self.indices.size):
    --- End diff --
    
    `range` -> `xrange`




[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1878#discussion_r16336930
  
    --- Diff: python/pyspark/mllib/stat.py ---
    @@ -119,15 +119,15 @@ def corr(x, y=None, method=None):
             >>> rdd = sc.parallelize([Vectors.dense([1, 0, 0, -2]), Vectors.dense([4, 5, 0, 3]),
             ...                       Vectors.dense([6, 7, 0,  8]), Vectors.dense([9, 0, 0, 1])])
             >>> Statistics.corr(rdd)
    -        array([[ 1.        ,  0.05564149,         nan,  0.40047142],
    -               [ 0.05564149,  1.        ,         nan,  0.91359586],
    -               [        nan,         nan,  1.        ,         nan],
    -               [ 0.40047142,  0.91359586,         nan,  1.        ]])
    +        array([[ 1.        ,  0.05564149,         NaN,  0.40047142],
    --- End diff --
    
    `float('nan')` returns `nan` instead of `NaN` on my machine. Which Python version are you using?


[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1878


[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52563357
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18789/consoleFull) for   PR 1878 at commit [`ea5c047`](https://github.com/apache/spark/commit/ea5c0470a12b0048160ed4b3281c3048004230b3).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52455091
  
    @jkbradley I tested the examples and found that `tree.py` is not included in the `run-tests` script. If we include it, it throws errors because `trainClassifier` needs at least three arguments, so we need to update both the unit tests and the example code; see the sketch below.
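
    For reference, a minimal sketch of such a call (the argument order, i.e. data / number of classes / categorical-features map, is assumed from the 1.1-era Python API; the data path is just a placeholder and `sc` is an existing SparkContext):

        from pyspark.mllib.tree import DecisionTree
        from pyspark.mllib.util import MLUtils

        # Placeholder path; any LIBSVM-format dataset would do here.
        data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
        # data, numClasses, categoricalFeaturesInfo (empty map: all features continuous)
        model = DecisionTree.trainClassifier(data, 2, {})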


[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52454050
  
    Jenkins, test this please.


[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52577785
  
    LGTM. Merged into master and branch-1.1. Thanks!


[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1878#discussion_r16094838
  
    --- Diff: examples/src/main/python/mllib/statistical_summary.py ---
    @@ -0,0 +1,60 @@
    +#
    --- End diff --
    
    `correlations.py` for `pearson` and `spearman`?
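
    For instance, a `correlations.py` example could exercise both methods on a couple of tiny made-up series (using the `Statistics.corr(x, y, method)` signature shown above; `sc` is an existing SparkContext):

        from pyspark.mllib.stat import Statistics

        seriesX = sc.parallelize([1.0, 2.0, 3.0, 4.0])
        seriesY = sc.parallelize([2.0, 4.0, 6.0, 9.0])
        print "Pearson:  %g" % Statistics.corr(seriesX, seriesY, method="pearson")
        print "Spearman: %g" % Statistics.corr(seriesX, seriesY, method="spearman")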


[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52454257
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18738/consoleFull) for   PR 1878 at commit [`60c72d9`](https://github.com/apache/spark/commit/60c72d98b20525e328a791830b5132d42d167202).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by loveconan1988 <gi...@git.apache.org>.
Github user loveconan1988 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1878#discussion_r16153416
  
    --- Diff: python/pyspark/mllib/linalg.py ---
    @@ -160,6 +161,15 @@ def squared_distance(self, other):
                     j += 1
                 return result
     
    +    def toDense(self):
    --- End diff --
    
    ------------------ Original Message ------------------
    From: "Xiangrui Meng" <no...@github.com>
    Sent: Tuesday, August 12, 2014, 12:08 PM
    To: "apache/spark" <sp...@noreply.github.com>
    Subject: Re: [spark] [SPARK-2850] [mllib] MLlib stats examples + small fixes (#1878)

    In python/pyspark/mllib/linalg.py:
    > @@ -160,6 +161,15 @@ def squared_distance(self, other):
    >                  j += 1
    >              return result
    >
    > +    def toDense(self):

    toDense -> toArray (compatible with Scala API)?

    —
    Reply to this email directly or view it on GitHub.


[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1878#discussion_r16337005
  
    --- Diff: python/pyspark/mllib/stat.py ---
    @@ -119,15 +119,15 @@ def corr(x, y=None, method=None):
             >>> rdd = sc.parallelize([Vectors.dense([1, 0, 0, -2]), Vectors.dense([4, 5, 0, 3]),
             ...                       Vectors.dense([6, 7, 0,  8]), Vectors.dense([9, 0, 0, 1])])
             >>> Statistics.corr(rdd)
    -        array([[ 1.        ,  0.05564149,         nan,  0.40047142],
    -               [ 0.05564149,  1.        ,         nan,  0.91359586],
    -               [        nan,         nan,  1.        ,         nan],
    -               [ 0.40047142,  0.91359586,         nan,  1.        ]])
    +        array([[ 1.        ,  0.05564149,         NaN,  0.40047142],
    --- End diff --
    
    Interesting. I'm on Python 2.7.8. Should we keep this doctest?
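
    If the printed form of NaN really does vary across environments, one option (just a sketch, not what this patch does) would be to have the doctest check individual entries rather than the array repr:

        >>> import numpy as np
        >>> m = Statistics.corr(rdd)
        >>> bool(np.isnan(m[0, 2]))
        True
        >>> "%.8f" % m[0, 1]
        '0.05564149'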


[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1878#discussion_r16094832
  
    --- Diff: examples/src/main/python/mllib/random_and_sampled_rdds.py ---
    @@ -0,0 +1,88 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Randomly generated and sampled RDDs.
    +"""
    +
    +import sys
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.random import RandomRDDGenerators
    +from pyspark.mllib.util import MLUtils
    +
    +
    +
    +if __name__ == "__main__":
    +    if len(sys.argv) not in [1, 2]:
    +        print >> sys.stderr, "Usage: logistic_regression <libsvm data file>"
    --- End diff --
    
    The help message needs to be updated; it still refers to `logistic_regression`.
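
    Something along these lines, for example (the exact wording in the final patch may differ):

        if len(sys.argv) not in [1, 2]:
            print >> sys.stderr, "Usage: random_and_sampled_rdds.py [<libsvm data file>]"
            exit(-1)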


[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1878#discussion_r16094850
  
    --- Diff: python/pyspark/mllib/linalg.py ---
    @@ -160,6 +161,15 @@ def squared_distance(self, other):
                     j += 1
                 return result
     
    +    def toDense(self):
    --- End diff --
    
    `toDense` -> `toArray` (compatible with Scala API)?
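
    Folding this and the earlier `xrange` comment together, the method would look roughly like the sketch below; the loop body is reconstructed here, since the quoted diff cuts off right after the `for` line:

        def toArray(self):
            """
            Returns a copy of this SparseVector as a 1-dimensional NumPy array.
            """
            arr = numpy.zeros(self.size)
            for i in xrange(self.indices.size):
                arr[self.indices[i]] = self.values[i]
            return arr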


[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52554379
  
    Jenkins, test this please.


[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1878#discussion_r16094833
  
    --- Diff: examples/src/main/python/mllib/random_and_sampled_rdds.py ---
    @@ -0,0 +1,88 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Randomly generated and sampled RDDs.
    +"""
    +
    +import sys
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.random import RandomRDDGenerators
    +from pyspark.mllib.util import MLUtils
    +
    +
    +
    +if __name__ == "__main__":
    +    if len(sys.argv) not in [1, 2]:
    +        print >> sys.stderr, "Usage: logistic_regression <libsvm data file>"
    +        exit(-1)
    +    if len(sys.argv) == 2:
    +        datapath = sys.argv[1]
    +    else:
    +        datapath = 'data/mllib/sample_binary_classification_data.txt'
    +
    +    sc = SparkContext(appName="PythonRandomAndSampledRDDs")
    +
    +    points = MLUtils.loadLibSVMFile(sc, datapath)
    +
    +    numExamples = 10000 # number of examples to generate
    +    fraction = 0.1 # fraction of data to sample
    +
    +    # Example: RandomRDDGenerators
    +    normalRDD = RandomRDDGenerators.normalRDD(sc, numExamples)
    +    print 'Generated RDD of %d examples sampled from a unit normal distribution' % normalRDD.count()
    --- End diff --
    
    `a unit` -> `the standard`
    
    We can also call `normalRDD.stats()` to get the basic statistics.
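
    For example (a sketch against the example's existing `normalRDD`; `stats()` returns a StatCounter):

        stats = normalRDD.stats()
        print 'count: %d, mean: %f, stdev: %f' % (stats.count(), stats.mean(), stats.stdev())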


[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52447067
  
    Jenkins, retest this please.


[GitHub] spark pull request: [SPARK-2850] [mllib] MLlib stats examples + sm...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1878#discussion_r16165936
  
    --- Diff: examples/src/main/python/mllib/random_and_sampled_rdds.py ---
    @@ -0,0 +1,88 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Randomly generated and sampled RDDs.
    +"""
    +
    +import sys
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.random import RandomRDDGenerators
    +from pyspark.mllib.util import MLUtils
    +
    +
    +
    +if __name__ == "__main__":
    +    if len(sys.argv) not in [1, 2]:
    +        print >> sys.stderr, "Usage: logistic_regression <libsvm data file>"
    +        exit(-1)
    +    if len(sys.argv) == 2:
    +        datapath = sys.argv[1]
    +    else:
    +        datapath = 'data/mllib/sample_binary_classification_data.txt'
    +
    +    sc = SparkContext(appName="PythonRandomAndSampledRDDs")
    +
    +    points = MLUtils.loadLibSVMFile(sc, datapath)
    +
    +    numExamples = 10000 # number of examples to generate
    +    fraction = 0.1 # fraction of data to sample
    +
    +    # Example: RandomRDDGenerators
    +    normalRDD = RandomRDDGenerators.normalRDD(sc, numExamples)
    +    print 'Generated RDD of %d examples sampled from a unit normal distribution' % normalRDD.count()
    --- End diff --
    
    This file shows off different functionality from `normalRDD.stats()`. `normalRDD.stats()` seems very similar to MultivariateStatisticalSummary / MultivariateOnlineSummarizer. Why don't `normalRDD.stats()` and statcounter.py follow the MultivariateStatisticalSummary / MultivariateOnlineSummarizer APIs (for which there are currently no Python APIs)?


[GitHub] spark pull request: [SPARK-2850] [SPARK-2626] [mllib] MLlib stats ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1878#issuecomment-52568358
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18789/consoleFull) for   PR 1878 at commit [`ea5c047`](https://github.com/apache/spark/commit/ea5c0470a12b0048160ed4b3281c3048004230b3).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.

