Posted to commits@spark.apache.org by yh...@apache.org on 2016/12/28 22:35:21 UTC
[10/25] spark-website git commit: Update 2.1.0 docs to include
https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-pmml-model-export.html
----------------------------------------------------------------------
diff --git a/site/docs/2.1.0/mllib-pmml-model-export.html b/site/docs/2.1.0/mllib-pmml-model-export.html
index 30815e0..3f2fd91 100644
--- a/site/docs/2.1.0/mllib-pmml-model-export.html
+++ b/site/docs/2.1.0/mllib-pmml-model-export.html
@@ -307,8 +307,8 @@
<ul id="markdown-toc">
- <li><a href="#sparkmllib-supported-models" id="markdown-toc-sparkmllib-supported-models"><code>spark.mllib</code> supported models</a></li>
- <li><a href="#examples" id="markdown-toc-examples">Examples</a></li>
+ <li><a href="#sparkmllib-supported-models"><code>spark.mllib</code> supported models</a></li>
+ <li><a href="#examples">Examples</a></li>
</ul>
<h2 id="sparkmllib-supported-models"><code>spark.mllib</code> supported models</h2>
@@ -353,32 +353,31 @@
<p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.clustering.KMeans"><code>KMeans</code> Scala docs</a> and <a href="api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$"><code>Vectors</code> Scala docs</a> for details on the API.</p>
- <p>Here is a complete example that builds a KMeansModel and prints it out in PMML format:</p>
- <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.clustering.KMeans</span>
-<span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span>
+ <p>Here is a complete example that builds a KMeansModel and prints it out in PMML format:
+<div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.clustering.KMeans</span>
+<span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span></p>
-<span class="c1">// Load and parse the data</span>
+ <p><span class="c1">// Load and parse the data</span>
<span class="k">val</span> <span class="n">data</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="o">(</span><span class="s">"data/mllib/kmeans_data.txt"</span><span class="o">)</span>
-<span class="k">val</span> <span class="n">parsedData</span> <span class="k">=</span> <span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">s</span> <span class="k">=></span> <span class="nc">Vectors</span><span class="o">.</span><span class="n">dense</span><span class="o">(</span><span class="n">s</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="sc">' '</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">toDouble</span><span class="o">))).</span><span class="n">cache</span><span class="o">()</span>
+<span class="k">val</span> <span class="n">parsedData</span> <span class="k">=</span> <span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">s</span> <span class="k">=></span> <span class="nc">Vectors</span><span class="o">.</span><span class="n">dense</span><span class="o">(</span><span class="n">s</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="sc">' '</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">toDouble</span><span class="o">))).</span><span class="n">cache</span><span class="o">()</span></p>
-<span class="c1">// Cluster the data into two classes using KMeans</span>
+ <p><span class="c1">// Cluster the data into two classes using KMeans</span>
<span class="k">val</span> <span class="n">numClusters</span> <span class="k">=</span> <span class="mi">2</span>
<span class="k">val</span> <span class="n">numIterations</span> <span class="k">=</span> <span class="mi">20</span>
-<span class="k">val</span> <span class="n">clusters</span> <span class="k">=</span> <span class="nc">KMeans</span><span class="o">.</span><span class="n">train</span><span class="o">(</span><span class="n">parsedData</span><span class="o">,</span> <span class="n">numClusters</span><span class="o">,</span> <span class="n">numIterations</span><span class="o">)</span>
+<span class="k">val</span> <span class="n">clusters</span> <span class="k">=</span> <span class="nc">KMeans</span><span class="o">.</span><span class="n">train</span><span class="o">(</span><span class="n">parsedData</span><span class="o">,</span> <span class="n">numClusters</span><span class="o">,</span> <span class="n">numIterations</span><span class="o">)</span></p>
-<span class="c1">// Export to PMML to a String in PMML format</span>
-<span class="n">println</span><span class="o">(</span><span class="s">"PMML Model:\n"</span> <span class="o">+</span> <span class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">)</span>
+ <p><span class="c1">// Export to PMML to a String in PMML format</span>
+<span class="n">println</span><span class="o">(</span><span class="s">"PMML Model:\n"</span> <span class="o">+</span> <span class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">)</span></p>
-<span class="c1">// Export the model to a local file in PMML format</span>
-<span class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">(</span><span class="s">"/tmp/kmeans.xml"</span><span class="o">)</span>
+ <p><span class="c1">// Export the model to a local file in PMML format</span>
+<span class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">(</span><span class="s">"/tmp/kmeans.xml"</span><span class="o">)</span></p>
-<span class="c1">// Export the model to a directory on a distributed file system in PMML format</span>
-<span class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">(</span><span class="n">sc</span><span class="o">,</span> <span class="s">"/tmp/kmeans"</span><span class="o">)</span>
+ <p><span class="c1">// Export the model to a directory on a distributed file system in PMML format</span>
+<span class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">(</span><span class="n">sc</span><span class="o">,</span> <span class="s">"/tmp/kmeans"</span><span class="o">)</span></p>
-<span class="c1">// Export the model to the OutputStream in PMML format</span>
+ <p><span class="c1">// Export the model to the OutputStream in PMML format</span>
<span class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">(</span><span class="nc">System</span><span class="o">.</span><span class="n">out</span><span class="o">)</span>
-</pre></div>
- <div><small>Find full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala" in the Spark repo.</small></div>
+</pre></div><div><small>Find full example code at “examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala” in the Spark repo.</small></div></p>
<p>For unsupported models, either you will not find a <code>.toPMML</code> method or an <code>IllegalArgumentException</code> will be thrown.</p>
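As an aside for readers of this diff: the PMML that `toPMML` emits for a clustering model is plain XML. Below is a pure-Python sketch (not Spark code) that assembles a minimal PMML-style `ClusteringModel` document for two made-up centroids, just to illustrate the shape of the output; element names follow the PMML clustering schema, and the centroid values are invented for the example.

```python
# Sketch: build a minimal PMML-style ClusteringModel document in pure
# Python, to show the kind of XML a k-means PMML export contains.
# The centroids below are made up for illustration.
import xml.etree.ElementTree as ET

centroids = [[0.1, 0.1, 0.1], [9.1, 9.1, 9.1]]  # two cluster centers

pmml = ET.Element("PMML", version="4.2")
model = ET.SubElement(pmml, "ClusteringModel",
                      modelName="k-means", functionName="clustering",
                      numberOfClusters=str(len(centroids)))
for i, center in enumerate(centroids):
    cluster = ET.SubElement(model, "Cluster", id=str(i))
    arr = ET.SubElement(cluster, "Array", n=str(len(center)), type="real")
    arr.text = " ".join(str(x) for x in center)  # space-separated reals

xml_text = ET.tostring(pmml, encoding="unicode")
print(xml_text)
```

Each `Cluster` element carries one centroid as a space-separated `Array`, which is the same information a consumer of the exported `/tmp/kmeans.xml` file would parse.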
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-statistics.html
----------------------------------------------------------------------
diff --git a/site/docs/2.1.0/mllib-statistics.html b/site/docs/2.1.0/mllib-statistics.html
index 4485ecf..f04924c 100644
--- a/site/docs/2.1.0/mllib-statistics.html
+++ b/site/docs/2.1.0/mllib-statistics.html
@@ -358,15 +358,15 @@
<ul id="markdown-toc">
- <li><a href="#summary-statistics" id="markdown-toc-summary-statistics">Summary statistics</a></li>
- <li><a href="#correlations" id="markdown-toc-correlations">Correlations</a></li>
- <li><a href="#stratified-sampling" id="markdown-toc-stratified-sampling">Stratified sampling</a></li>
- <li><a href="#hypothesis-testing" id="markdown-toc-hypothesis-testing">Hypothesis testing</a> <ul>
- <li><a href="#streaming-significance-testing" id="markdown-toc-streaming-significance-testing">Streaming Significance Testing</a></li>
+ <li><a href="#summary-statistics">Summary statistics</a></li>
+ <li><a href="#correlations">Correlations</a></li>
+ <li><a href="#stratified-sampling">Stratified sampling</a></li>
+ <li><a href="#hypothesis-testing">Hypothesis testing</a> <ul>
+ <li><a href="#streaming-significance-testing">Streaming Significance Testing</a></li>
</ul>
</li>
- <li><a href="#random-data-generation" id="markdown-toc-random-data-generation">Random data generation</a></li>
- <li><a href="#kernel-density-estimation" id="markdown-toc-kernel-density-estimation">Kernel density estimation</a></li>
+ <li><a href="#random-data-generation">Random data generation</a></li>
+ <li><a href="#kernel-density-estimation">Kernel density estimation</a></li>
</ul>
<p><code>\[
@@ -401,7 +401,7 @@ total count.</p>
<p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary"><code>MultivariateStatisticalSummary</code> Scala docs</a> for details on the API.</p>
- <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span>
+ <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span>
<span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.</span><span class="o">{</span><span class="nc">MultivariateStatisticalSummary</span><span class="o">,</span> <span class="nc">Statistics</span><span class="o">}</span>
<span class="k">val</span> <span class="n">observations</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="o">(</span>
@@ -430,7 +430,7 @@ total count.</p>
<p>Refer to the <a href="api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html"><code>MultivariateStatisticalSummary</code> Java docs</a> for details on the API.</p>
- <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
+ <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vector</span><span class="o">;</span>
@@ -463,19 +463,19 @@ total count.</p>
<p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.stat.MultivariateStatisticalSummary"><code>MultivariateStatisticalSummary</code> Python docs</a> for more details on the API.</p>
- <div class="highlight"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
+ <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">Statistics</span>
<span class="n">mat</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">(</span>
<span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">10.0</span><span class="p">,</span> <span class="mf">100.0</span><span class="p">]),</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">2.0</span><span class="p">,</span> <span class="mf">20.0</span><span class="p">,</span> <span class="mf">200.0</span><span class="p">]),</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">3.0</span><span class="p">,</span> <span class="mf">30.0</span><span class="p">,</span> <span class="mf">300.0</span><span class="p">])]</span>
-<span class="p">)</span> <span class="c"># an RDD of Vectors</span>
+<span class="p">)</span> <span class="c1"># an RDD of Vectors</span>
-<span class="c"># Compute column summary statistics.</span>
+<span class="c1"># Compute column summary statistics.</span>
<span class="n">summary</span> <span class="o">=</span> <span class="n">Statistics</span><span class="o">.</span><span class="n">colStats</span><span class="p">(</span><span class="n">mat</span><span class="p">)</span>
-<span class="k">print</span><span class="p">(</span><span class="n">summary</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span> <span class="c"># a dense vector containing the mean value for each column</span>
-<span class="k">print</span><span class="p">(</span><span class="n">summary</span><span class="o">.</span><span class="n">variance</span><span class="p">())</span> <span class="c"># column-wise variance</span>
-<span class="k">print</span><span class="p">(</span><span class="n">summary</span><span class="o">.</span><span class="n">numNonzeros</span><span class="p">())</span> <span class="c"># number of nonzeros in each column</span>
+<span class="k">print</span><span class="p">(</span><span class="n">summary</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span> <span class="c1"># a dense vector containing the mean value for each column</span>
+<span class="k">print</span><span class="p">(</span><span class="n">summary</span><span class="o">.</span><span class="n">variance</span><span class="p">())</span> <span class="c1"># column-wise variance</span>
+<span class="k">print</span><span class="p">(</span><span class="n">summary</span><span class="o">.</span><span class="n">numNonzeros</span><span class="p">())</span> <span class="c1"># number of nonzeros in each column</span>
</pre></div>
<div><small>Find full example code at "examples/src/main/python/mllib/summary_statistics_example.py" in the Spark repo.</small></div>
</div>
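For reference, the column summaries that `colStats` returns can be reproduced in plain Python on the same three rows used in the example above; note that Spark reports the unbiased (sample) variance, i.e. dividing by n - 1. This is a sketch of the arithmetic, not of Spark's distributed implementation.

```python
# Pure-Python sketch of colStats on the example's three rows:
# column-wise mean, unbiased variance, and nonzero counts.
rows = [[1.0, 10.0, 100.0], [2.0, 20.0, 200.0], [3.0, 30.0, 300.0]]
n = len(rows)
cols = list(zip(*rows))  # transpose rows into columns

mean = [sum(c) / n for c in cols]
variance = [sum((x - m) ** 2 for x in c) / (n - 1)
            for c, m in zip(cols, mean)]
num_nonzeros = [sum(1 for x in c if x != 0.0) for c in cols]

print(mean)          # [2.0, 20.0, 200.0]
print(variance)      # [1.0, 100.0, 10000.0]
print(num_nonzeros)  # [3, 3, 3]
```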
@@ -496,7 +496,7 @@ an <code>RDD[Vector]</code>, the output will be a <code>Double</code> or the cor
<p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.stat.Statistics$"><code>Statistics</code> Scala docs</a> for details on the API.</p>
- <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg._</span>
+ <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg._</span>
<span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.Statistics</span>
<span class="k">import</span> <span class="nn">org.apache.spark.rdd.RDD</span>
@@ -507,7 +507,7 @@ an <code>RDD[Vector]</code>, the output will be a <code>Double</code> or the cor
<span class="c1">// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a</span>
<span class="c1">// method is not specified, Pearson's method will be used by default.</span>
<span class="k">val</span> <span class="n">correlation</span><span class="k">:</span> <span class="kt">Double</span> <span class="o">=</span> <span class="nc">Statistics</span><span class="o">.</span><span class="n">corr</span><span class="o">(</span><span class="n">seriesX</span><span class="o">,</span> <span class="n">seriesY</span><span class="o">,</span> <span class="s">"pearson"</span><span class="o">)</span>
-<span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">"Correlation is: $correlation"</span><span class="o">)</span>
+<span class="n">println</span><span class="o">(</span><span class="s">s"Correlation is: </span><span class="si">$correlation</span><span class="s">"</span><span class="o">)</span>
<span class="k">val</span> <span class="n">data</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">Vector</span><span class="o">]</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="o">(</span>
<span class="nc">Seq</span><span class="o">(</span>
@@ -531,7 +531,7 @@ a <code>JavaRDD<Vector></code>, the output will be a <code>Double</code> o
<p>Refer to the <a href="api/java/org/apache/spark/mllib/stat/Statistics.html"><code>Statistics</code> Java docs</a> for details on the API.</p>
- <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
+ <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaDoubleRDD</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span>
@@ -577,23 +577,23 @@ an <code>RDD[Vector]</code>, the output will be a <code>Double</code> or the cor
<p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics"><code>Statistics</code> Python docs</a> for more details on the API.</p>
- <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">Statistics</span>
+ <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">Statistics</span>
-<span class="n">seriesX</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">])</span> <span class="c"># a series</span>
-<span class="c"># seriesY must have the same number of partitions and cardinality as seriesX</span>
+<span class="n">seriesX</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">])</span> <span class="c1"># a series</span>
+<span class="c1"># seriesY must have the same number of partitions and cardinality as seriesX</span>
<span class="n">seriesY</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="mf">11.0</span><span class="p">,</span> <span class="mf">22.0</span><span class="p">,</span> <span class="mf">33.0</span><span class="p">,</span> <span class="mf">33.0</span><span class="p">,</span> <span class="mf">555.0</span><span class="p">])</span>
-<span class="c"># Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method.</span>
-<span class="c"># If a method is not specified, Pearson's method will be used by default.</span>
-<span class="k">print</span><span class="p">(</span><span class="s">"Correlation is: "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">Statistics</span><span class="o">.</span><span class="n">corr</span><span class="p">(</span><span class="n">seriesX</span><span class="p">,</span> <span class="n">seriesY</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s">"pearson"</span><span class="p">)))</span>
+<span class="c1"># Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method.</span>
+<span class="c1"># If a method is not specified, Pearson's method will be used by default.</span>
+<span class="k">print</span><span class="p">(</span><span class="s2">"Correlation is: "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">Statistics</span><span class="o">.</span><span class="n">corr</span><span class="p">(</span><span class="n">seriesX</span><span class="p">,</span> <span class="n">seriesY</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s2">"pearson"</span><span class="p">)))</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">(</span>
<span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">10.0</span><span class="p">,</span> <span class="mf">100.0</span><span class="p">]),</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">2.0</span><span class="p">,</span> <span class="mf">20.0</span><span class="p">,</span> <span class="mf">200.0</span><span class="p">]),</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">5.0</span><span class="p">,</span> <span class="mf">33.0</span><span class="p">,</span> <span class="mf">366.0</span><span class="p">])]</span>
-<span class="p">)</span> <span class="c"># an RDD of Vectors</span>
+<span class="p">)</span> <span class="c1"># an RDD of Vectors</span>
-<span class="c"># calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.</span>
-<span class="c"># If a method is not specified, Pearson's method will be used by default.</span>
-<span class="k">print</span><span class="p">(</span><span class="n">Statistics</span><span class="o">.</span><span class="n">corr</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s">"pearson"</span><span class="p">))</span>
+<span class="c1"># calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.</span>
+<span class="c1"># If a method is not specified, Pearson's method will be used by default.</span>
+<span class="k">print</span><span class="p">(</span><span class="n">Statistics</span><span class="o">.</span><span class="n">corr</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s2">"pearson"</span><span class="p">))</span>
</pre></div>
<div><small>Find full example code at "examples/src/main/python/mllib/correlations_example.py" in the Spark repo.</small></div>
</div>
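The Pearson coefficient that `Statistics.corr` returns for the two series in the example is ordinary sample correlation: covariance of the deviations divided by the product of the deviation norms. A plain-Python sketch of that arithmetic, using the same series values:

```python
# Pure-Python sketch of the Pearson correlation computed by
# Statistics.corr for the example's two series.
import math

seriesX = [1.0, 2.0, 3.0, 3.0, 5.0]
seriesY = [11.0, 22.0, 33.0, 33.0, 555.0]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print("Correlation is:", pearson(seriesX, seriesY))
```

The large outlier (555.0) pulls the coefficient up toward 1, which is why the doc pairs this example with the rank-based "spearman" alternative.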
@@ -621,9 +621,9 @@ fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K
keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample
size, whereas sampling with replacement requires two additional passes.</p>
- <div class="highlight"><pre><span class="c1">// an RDD[(K, V)] of any key value pairs</span>
+ <div class="highlight"><pre><span></span><span class="c1">// an RDD[(K, V)] of any key value pairs</span>
<span class="k">val</span> <span class="n">data</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="o">(</span>
- <span class="nc">Seq</span><span class="o">((</span><span class="mi">1</span><span class="o">,</span> <span class="-Symbol">'a</span><span class="err">'</span><span class="o">),</span> <span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="-Symbol">'b</span><span class="err">'</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="-Symbol">'c</span><span class="err">'</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="-Symbol">'d</span><span class="err">'</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="-Symbol">'e</span><span class="err">'</span><span class="o">),</span> <span class="o">(</span><span class="mi">3</span><span class="o">,</span> <span class="-Symbol">'f</span><span class="err">'</span><sp
an class="o">)))</span>
+ <span class="nc">Seq</span><span class="o">((</span><span class="mi">1</span><span class="o">,</span> <span class="sc">'a'</span><span class="o">),</span> <span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="sc">'b'</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="sc">'c'</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="sc">'d'</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="sc">'e'</span><span class="o">),</span> <span class="o">(</span><span class="mi">3</span><span class="o">,</span> <span class="sc">'f'</span><span class="o">)))</span>
<span class="c1">// specify the exact fraction desired from each key</span>
<span class="k">val</span> <span class="n">fractions</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span><span class="mi">1</span> <span class="o">-></span> <span class="mf">0.1</span><span class="o">,</span> <span class="mi">2</span> <span class="o">-></span> <span class="mf">0.6</span><span class="o">,</span> <span class="mi">3</span> <span class="o">-></span> <span class="mf">0.3</span><span class="o">)</span>
@@ -643,7 +643,7 @@ fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K
keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample
size, whereas sampling with replacement requires two additional passes.</p>
- <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.*</span><span class="o">;</span>
+ <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.*</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">scala.Tuple2</span><span class="o">;</span>
@@ -678,10 +678,10 @@ set of keys.</p>
<p><em>Note:</em> <code>sampleByKeyExact()</code> is currently not supported in Python.</p>
- <div class="highlight"><pre><span class="c"># an RDD of any key value pairs</span>
-<span class="n">data</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([(</span><span class="mi">1</span><span class="p">,</span> <span class="s">'a'</span><span class="p">),</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">'b'</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s">'c'</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s">'d'</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s">'e'</span><span class="p">),</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s">'f'</span><span class="p">)])</span>
+ <div class="highlight"><pre><span></span><span class="c1"># an RDD of any key value pairs</span>
+<span class="n">data</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([(</span><span class="mi">1</span><span class="p">,</span> <span class="s1">'a'</span><span class="p">),</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s1">'b'</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s1">'c'</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s1">'d'</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s1">'e'</span><span class="p">),</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s1">'f'</span><span class="p">)])</span>
-<span class="c"># specify the exact fraction desired from each key as a dictionary</span>
+<span class="c1"># specify the exact fraction desired from each key as a dictionary</span>
<span class="n">fractions</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mi">2</span><span class="p">:</span> <span class="mf">0.6</span><span class="p">,</span> <span class="mi">3</span><span class="p">:</span> <span class="mf">0.3</span><span class="p">}</span>
<span class="n">approxSample</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">sampleByKey</span><span class="p">(</span><span class="bp">False</span><span class="p">,</span> <span class="n">fractions</span><span class="p">)</span>
@@ -708,7 +708,7 @@ independence tests.</p>
run Pearson’s chi-squared tests. The following example demonstrates how to run and interpret
hypothesis tests.</p>
- <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg._</span>
+ <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg._</span>
<span class="k">import</span> <span class="nn">org.apache.spark.mllib.regression.LabeledPoint</span>
<span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.Statistics</span>
<span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.test.ChiSqTestResult</span>
@@ -722,7 +722,7 @@ hypothesis tests.</p>
<span class="k">val</span> <span class="n">goodnessOfFitTestResult</span> <span class="k">=</span> <span class="nc">Statistics</span><span class="o">.</span><span class="n">chiSqTest</span><span class="o">(</span><span class="n">vec</span><span class="o">)</span>
<span class="c1">// summary of the test including the p-value, degrees of freedom, test statistic, the method</span>
<span class="c1">// used, and the null hypothesis.</span>
-<span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">"$goodnessOfFitTestResult\n"</span><span class="o">)</span>
+<span class="n">println</span><span class="o">(</span><span class="s">s"</span><span class="si">$goodnessOfFitTestResult</span><span class="s">\n"</span><span class="o">)</span>
<span class="c1">// a contingency matrix. Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))</span>
<span class="k">val</span> <span class="n">mat</span><span class="k">:</span> <span class="kt">Matrix</span> <span class="o">=</span> <span class="nc">Matrices</span><span class="o">.</span><span class="n">dense</span><span class="o">(</span><span class="mi">3</span><span class="o">,</span> <span class="mi">2</span><span class="o">,</span> <span class="nc">Array</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">3.0</span><span class="o">,</span> <span class="mf">5.0</span><span class="o">,</span> <span class="mf">2.0</span><span class="o">,</span> <span class="mf">4.0</span><span class="o">,</span> <span class="mf">6.0</span><span class="o">))</span>
@@ -730,7 +730,7 @@ hypothesis tests.</p>
<span class="c1">// conduct Pearson's independence test on the input contingency matrix</span>
<span class="k">val</span> <span class="n">independenceTestResult</span> <span class="k">=</span> <span class="nc">Statistics</span><span class="o">.</span><span class="n">chiSqTest</span><span class="o">(</span><span class="n">mat</span><span class="o">)</span>
<span class="c1">// summary of the test including the p-value, degrees of freedom</span>
-<span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">"$independenceTestResult\n"</span><span class="o">)</span>
+<span class="n">println</span><span class="o">(</span><span class="s">s"</span><span class="si">$independenceTestResult</span><span class="s">\n"</span><span class="o">)</span>
<span class="k">val</span> <span class="n">obs</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">LabeledPoint</span><span class="o">]</span> <span class="k">=</span>
<span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="o">(</span>
@@ -761,7 +761,7 @@ hypothesis tests.</p>
<p>Refer to the <a href="api/java/org/apache/spark/mllib/stat/test/ChiSqTestResult.html"><code>ChiSqTestResult</code> Java docs</a> for details on the API.</p>
- <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
+ <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.mllib.linalg.Matrices</span><span class="o">;</span>
@@ -793,9 +793,9 @@ hypothesis tests.</p>
<span class="c1">// an RDD of labeled points</span>
<span class="n">JavaRDD</span><span class="o"><</span><span class="n">LabeledPoint</span><span class="o">></span> <span class="n">obs</span> <span class="o">=</span> <span class="n">jsc</span><span class="o">.</span><span class="na">parallelize</span><span class="o">(</span>
<span class="n">Arrays</span><span class="o">.</span><span class="na">asList</span><span class="o">(</span>
- <span class="k">new</span> <span class="nf">LabeledPoint</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">,</span> <span class="mf">3.0</span><span class="o">)),</span>
- <span class="k">new</span> <span class="nf">LabeledPoint</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">2.0</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">)),</span>
- <span class="k">new</span> <span class="nf">LabeledPoint</span><span class="o">(-</span><span class="mf">1.0</span><span class="o">,</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(-</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">,</span> <span class="o">-</span><span class="mf">0.5</span><span class="o">))</span>
+ <span class="k">new</span> <span class="n">LabeledPoint</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">,</span> <span class="mf">3.0</span><span class="o">)),</span>
+ <span class="k">new</span> <span class="n">LabeledPoint</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">2.0</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">)),</span>
+ <span class="k">new</span> <span class="n">LabeledPoint</span><span class="o">(-</span><span class="mf">1.0</span><span class="o">,</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(-</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">,</span> <span class="o">-</span><span class="mf">0.5</span><span class="o">))</span>
<span class="o">)</span>
<span class="o">);</span>
@@ -820,42 +820,42 @@ hypothesis tests.</p>
<p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics"><code>Statistics</code> Python docs</a> for more details on the API.</p>
- <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.linalg</span> <span class="kn">import</span> <span class="n">Matrices</span><span class="p">,</span> <span class="n">Vectors</span>
+ <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.linalg</span> <span class="kn">import</span> <span class="n">Matrices</span><span class="p">,</span> <span class="n">Vectors</span>
<span class="kn">from</span> <span class="nn">pyspark.mllib.regression</span> <span class="kn">import</span> <span class="n">LabeledPoint</span>
<span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">Statistics</span>
-<span class="n">vec</span> <span class="o">=</span> <span class="n">Vectors</span><span class="o">.</span><span class="n">dense</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">)</span> <span class="c"># a vector composed of the frequencies of events</span>
+<span class="n">vec</span> <span class="o">=</span> <span class="n">Vectors</span><span class="o">.</span><span class="n">dense</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">)</span> <span class="c1"># a vector composed of the frequencies of events</span>
-<span class="c"># compute the goodness of fit. If a second vector to test against</span>
-<span class="c"># is not supplied as a parameter, the test runs against a uniform distribution.</span>
+<span class="c1"># compute the goodness of fit. If a second vector to test against</span>
+<span class="c1"># is not supplied as a parameter, the test runs against a uniform distribution.</span>
<span class="n">goodnessOfFitTestResult</span> <span class="o">=</span> <span class="n">Statistics</span><span class="o">.</span><span class="n">chiSqTest</span><span class="p">(</span><span class="n">vec</span><span class="p">)</span>
-<span class="c"># summary of the test including the p-value, degrees of freedom,</span>
-<span class="c"># test statistic, the method used, and the null hypothesis.</span>
-<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="si">%s</span><span class="se">\n</span><span class="s">"</span> <span class="o">%</span> <span class="n">goodnessOfFitTestResult</span><span class="p">)</span>
+<span class="c1"># summary of the test including the p-value, degrees of freedom,</span>
+<span class="c1"># test statistic, the method used, and the null hypothesis.</span>
+<span class="k">print</span><span class="p">(</span><span class="s2">"</span><span class="si">%s</span><span class="se">\n</span><span class="s2">"</span> <span class="o">%</span> <span class="n">goodnessOfFitTestResult</span><span class="p">)</span>
-<span class="n">mat</span> <span class="o">=</span> <span class="n">Matrices</span><span class="o">.</span><span class="n">dense</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">4.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">])</span> <span class="c"># a contingency matrix</span>
+<span class="n">mat</span> <span class="o">=</span> <span class="n">Matrices</span><span class="o">.</span><span class="n">dense</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">4.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">])</span> <span class="c1"># a contingency matrix</span>
-<span class="c"># conduct Pearson's independence test on the input contingency matrix</span>
+<span class="c1"># conduct Pearson's independence test on the input contingency matrix</span>
<span class="n">independenceTestResult</span> <span class="o">=</span> <span class="n">Statistics</span><span class="o">.</span><span class="n">chiSqTest</span><span class="p">(</span><span class="n">mat</span><span class="p">)</span>
-<span class="c"># summary of the test including the p-value, degrees of freedom,</span>
-<span class="c"># test statistic, the method used, and the null hypothesis.</span>
-<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="si">%s</span><span class="se">\n</span><span class="s">"</span> <span class="o">%</span> <span class="n">independenceTestResult</span><span class="p">)</span>
+<span class="c1"># summary of the test including the p-value, degrees of freedom,</span>
+<span class="c1"># test statistic, the method used, and the null hypothesis.</span>
+<span class="k">print</span><span class="p">(</span><span class="s2">"</span><span class="si">%s</span><span class="se">\n</span><span class="s2">"</span> <span class="o">%</span> <span class="n">independenceTestResult</span><span class="p">)</span>
<span class="n">obs</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">(</span>
<span class="p">[</span><span class="n">LabeledPoint</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">]),</span>
<span class="n">LabeledPoint</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">]),</span>
<span class="n">LabeledPoint</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="p">[</span><span class="o">-</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.5</span><span class="p">])]</span>
-<span class="p">)</span> <span class="c"># LabeledPoint(feature, label)</span>
+<span class="p">)</span> <span class="c1"># LabeledPoint(label, features)</span>
-<span class="c"># The contingency table is constructed from an RDD of LabeledPoint and used to conduct</span>
-<span class="c"># the independence test. Returns an array containing the ChiSquaredTestResult for every feature</span>
-<span class="c"># against the label.</span>
+<span class="c1"># The contingency table is constructed from an RDD of LabeledPoint and used to conduct</span>
+<span class="c1"># the independence test. Returns an array containing the ChiSquaredTestResult for every feature</span>
+<span class="c1"># against the label.</span>
<span class="n">featureTestResults</span> <span class="o">=</span> <span class="n">Statistics</span><span class="o">.</span><span class="n">chiSqTest</span><span class="p">(</span><span class="n">obs</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">result</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">featureTestResults</span><span class="p">):</span>
- <span class="k">print</span><span class="p">(</span><span class="s">"Column </span><span class="si">%d</span><span class="s">:</span><span class="se">\n</span><span class="si">%s</span><span class="s">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">result</span><span class="p">))</span>
+ <span class="k">print</span><span class="p">(</span><span class="s2">"Column </span><span class="si">%d</span><span class="s2">:</span><span class="se">\n</span><span class="si">%s</span><span class="s2">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">result</span><span class="p">))</span>
</pre></div>
<div><small>Find full example code at "examples/src/main/python/mllib/hypothesis_testing_example.py" in the Spark repo.</small></div>
</div>
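
For reference, the quantities the highlighted chi-squared example computes can be reproduced outside Spark with scipy (a stand-in used for illustration only; scipy is not part of the Spark docs or of this commit). The vector and contingency matrix below are the same values used in the example:

```python
import numpy as np
from scipy import stats

# Goodness-of-fit test of the event-frequency vector against a uniform
# distribution -- the scipy analogue of Statistics.chiSqTest(vec).
vec = np.array([0.1, 0.15, 0.2, 0.3, 0.25])
gof = stats.chisquare(vec)  # expected frequencies default to the uniform mean

# Pearson's independence test on the dense 3x2 contingency matrix
# ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)) from the example.
mat = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
chi2, p, dof, expected = stats.chi2_contingency(mat, correction=False)
```

Both tests report the statistic, degrees of freedom, and p-value, matching the summary fields described in the comments of the Spark snippet.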
@@ -879,7 +879,7 @@ and interpret the hypothesis tests.</p>
<p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.stat.Statistics$"><code>Statistics</code> Scala docs</a> for details on the API.</p>
- <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.Statistics</span>
+ <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.Statistics</span>
<span class="k">import</span> <span class="nn">org.apache.spark.rdd.RDD</span>
<span class="k">val</span> <span class="n">data</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">Double</span><span class="o">]</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="o">(</span><span class="nc">Seq</span><span class="o">(</span><span class="mf">0.1</span><span class="o">,</span> <span class="mf">0.15</span><span class="o">,</span> <span class="mf">0.2</span><span class="o">,</span> <span class="mf">0.3</span><span class="o">,</span> <span class="mf">0.25</span><span class="o">))</span> <span class="c1">// an RDD of sample data</span>
@@ -906,7 +906,7 @@ and interpret the hypothesis tests.</p>
<p>Refer to the <a href="api/java/org/apache/spark/mllib/stat/Statistics.html"><code>Statistics</code> Java docs</a> for details on the API.</p>
- <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
+ <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaDoubleRDD</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.mllib.stat.Statistics</span><span class="o">;</span>
@@ -929,16 +929,16 @@ and interpret the hypothesis tests.</p>
<p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics"><code>Statistics</code> Python docs</a> for more details on the API.</p>
- <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">Statistics</span>
+ <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">Statistics</span>
<span class="n">parallelData</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">])</span>
-<span class="c"># run a KS test for the sample versus a standard normal distribution</span>
-<span class="n">testResult</span> <span class="o">=</span> <span class="n">Statistics</span><span class="o">.</span><span class="n">kolmogorovSmirnovTest</span><span class="p">(</span><span class="n">parallelData</span><span class="p">,</span> <span class="s">"norm"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
-<span class="c"># summary of the test including the p-value, test statistic, and null hypothesis</span>
-<span class="c"># if our p-value indicates significance, we can reject the null hypothesis</span>
-<span class="c"># Note that the Scala functionality of calling Statistics.kolmogorovSmirnovTest with</span>
-<span class="c"># a lambda to calculate the CDF is not made available in the Python API</span>
+<span class="c1"># run a KS test for the sample versus a standard normal distribution</span>
+<span class="n">testResult</span> <span class="o">=</span> <span class="n">Statistics</span><span class="o">.</span><span class="n">kolmogorovSmirnovTest</span><span class="p">(</span><span class="n">parallelData</span><span class="p">,</span> <span class="s2">"norm"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
+<span class="c1"># summary of the test including the p-value, test statistic, and null hypothesis</span>
+<span class="c1"># if our p-value indicates significance, we can reject the null hypothesis</span>
+<span class="c1"># Note that the Scala functionality of calling Statistics.kolmogorovSmirnovTest with</span>
+<span class="c1"># a lambda to calculate the CDF is not made available in the Python API</span>
<span class="k">print</span><span class="p">(</span><span class="n">testResult</span><span class="p">)</span>
</pre></div>
<div><small>Find full example code at "examples/src/main/python/mllib/hypothesis_testing_kolmogorov_smirnov_test_example.py" in the Spark repo.</small></div>
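
The same one-sample, two-sided test can be sketched outside Spark with scipy (an illustrative stand-in, not part of the Spark docs or of this commit), using the sample from the example above:

```python
import numpy as np
from scipy import stats

# Two-sided one-sample Kolmogorov-Smirnov test of the sample against the
# standard normal N(0, 1) -- the scipy analogue of
# Statistics.kolmogorovSmirnovTest(parallelData, "norm", 0, 1).
sample = np.array([0.1, 0.15, 0.2, 0.3, 0.25])
result = stats.kstest(sample, "norm", args=(0, 1))
```

`result.statistic` is the maximum distance between the empirical CDF of the sample and the theoretical normal CDF; a small `result.pvalue` would let us reject the null hypothesis that the sample is drawn from N(0, 1).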
@@ -967,7 +967,7 @@ all prior batches.</li>
<p><a href="api/scala/index.html#org.apache.spark.mllib.stat.test.StreamingTest"><code>StreamingTest</code></a>
provides streaming hypothesis testing.</p>
- <div class="highlight"><pre><span class="k">val</span> <span class="n">data</span> <span class="k">=</span> <span class="n">ssc</span><span class="o">.</span><span class="n">textFileStream</span><span class="o">(</span><span class="n">dataDir</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="n">line</span> <span class="k">=></span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">","</span><span class="o">)</span> <span class="k">match</span> <span class="o">{</span>
+ <div class="highlight"><pre><span></span><span class="k">val</span> <span class="n">data</span> <span class="k">=</span> <span class="n">ssc</span><span class="o">.</span><span class="n">textFileStream</span><span class="o">(</span><span class="n">dataDir</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="n">line</span> <span class="k">=></span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">","</span><span class="o">)</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="nc">Array</span><span class="o">(</span><span class="n">label</span><span class="o">,</span> <span class="n">value</span><span class="o">)</span> <span class="k">=></span> <span class="nc">BinarySample</span><span class="o">(</span><span class="n">label</span><span class="o">.</span><span class="n">toBoolean</span><span class="o">,</span> <span class="n">value</span><span class="o">.</span><span class="n">toDouble</span><span class="o">)</span>
<span class="o">})</span>
@@ -986,7 +986,7 @@ provides streaming hypothesis testing.</p>
<p><a href="api/java/index.html#org.apache.spark.mllib.stat.test.StreamingTest"><code>StreamingTest</code></a>
provides streaming hypothesis testing.</p>
- <div class="highlight"><pre><span class="kn">import</span> <span class="nn">org.apache.spark.mllib.stat.test.BinarySample</span><span class="o">;</span>
+ <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">org.apache.spark.mllib.stat.test.BinarySample</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.mllib.stat.test.StreamingTest</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.mllib.stat.test.StreamingTestResult</span><span class="o">;</span>
@@ -997,11 +997,11 @@ provides streaming hypothesis testing.</p>
<span class="n">String</span><span class="o">[]</span> <span class="n">ts</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="na">split</span><span class="o">(</span><span class="s">","</span><span class="o">);</span>
<span class="kt">boolean</span> <span class="n">label</span> <span class="o">=</span> <span class="n">Boolean</span><span class="o">.</span><span class="na">parseBoolean</span><span class="o">(</span><span class="n">ts</span><span class="o">[</span><span class="mi">0</span><span class="o">]);</span>
<span class="kt">double</span> <span class="n">value</span> <span class="o">=</span> <span class="n">Double</span><span class="o">.</span><span class="na">parseDouble</span><span class="o">(</span><span class="n">ts</span><span class="o">[</span><span class="mi">1</span><span class="o">]);</span>
- <span class="k">return</span> <span class="k">new</span> <span class="nf">BinarySample</span><span class="o">(</span><span class="n">label</span><span class="o">,</span> <span class="n">value</span><span class="o">);</span>
+ <span class="k">return</span> <span class="k">new</span> <span class="n">BinarySample</span><span class="o">(</span><span class="n">label</span><span class="o">,</span> <span class="n">value</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">});</span>
-<span class="n">StreamingTest</span> <span class="n">streamingTest</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">StreamingTest</span><span class="o">()</span>
+<span class="n">StreamingTest</span> <span class="n">streamingTest</span> <span class="o">=</span> <span class="k">new</span> <span class="n">StreamingTest</span><span class="o">()</span>
<span class="o">.</span><span class="na">setPeacePeriod</span><span class="o">(</span><span class="mi">0</span><span class="o">)</span>
<span class="o">.</span><span class="na">setWindowSize</span><span class="o">(</span><span class="mi">0</span><span class="o">)</span>
<span class="o">.</span><span class="na">setTestMethod</span><span class="o">(</span><span class="s">"welch"</span><span class="o">);</span>
@@ -1028,7 +1028,7 @@ distribution <code>N(0, 1)</code>, and then map it to <code>N(1, 4)</code>.</p>
<p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$"><code>RandomRDDs</code> Scala docs</a> for details on the API.</p>
- <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">import</span> <span class="nn">org.apache.spark.SparkContext</span>
+ <figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span></span><span class="k">import</span> <span class="nn">org.apache.spark.SparkContext</span>
<span class="k">import</span> <span class="nn">org.apache.spark.mllib.random.RandomRDDs._</span>
<span class="k">val</span> <span class="n">sc</span><span class="k">:</span> <span class="kt">SparkContext</span> <span class="o">=</span> <span class="o">...</span>
@@ -1037,7 +1037,7 @@ distribution <code>N(0, 1)</code>, and then map it to <code>N(1, 4)</code>.</p>
<span class="c1">// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.</span>
<span class="k">val</span> <span class="n">u</span> <span class="k">=</span> <span class="n">normalRDD</span><span class="o">(</span><span class="n">sc</span><span class="o">,</span> <span class="mi">1000000L</span><span class="o">,</span> <span class="mi">10</span><span class="o">)</span>
<span class="c1">// Apply a transform to get a random double RDD following `N(1, 4)`.</span>
-<span class="k">val</span> <span class="n">v</span> <span class="k">=</span> <span class="n">u</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">x</span> <span class="k">=></span> <span class="mf">1.0</span> <span class="o">+</span> <span class="mf">2.0</span> <span class="o">*</span> <span class="n">x</span><span class="o">)</span></code></pre></div>
+<span class="k">val</span> <span class="n">v</span> <span class="k">=</span> <span class="n">u</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">x</span> <span class="k">=></span> <span class="mf">1.0</span> <span class="o">+</span> <span class="mf">2.0</span> <span class="o">*</span> <span class="n">x</span><span class="o">)</span></code></pre></figure>
</div>
@@ -1049,9 +1049,9 @@ distribution <code>N(0, 1)</code>, and then map it to <code>N(1, 4)</code>.</p>
<p>Refer to the <a href="api/java/org/apache/spark/mllib/random/RandomRDDs"><code>RandomRDDs</code> Java docs</a> for details on the API.</p>
- <div class="highlight"><pre><code class="language-java" data-lang="java"><span class="kn">import</span> <span class="nn">org.apache.spark.SparkContext</span><span class="o">;</span>
+ <figure class="highlight"><pre><code class="language-java" data-lang="java"><span></span><span class="kn">import</span> <span class="nn">org.apache.spark.SparkContext</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.api.JavaDoubleRDD</span><span class="o">;</span>
-<span class="kn">import</span> <span class="nn">static</span> <span class="n">org</span><span class="o">.</span><span class="na">apache</span><span class="o">.</span><span class="na">spark</span><span class="o">.</span><span class="na">mllib</span><span class="o">.</span><span class="na">random</span><span class="o">.</span><span class="na">RandomRDDs</span><span class="o">.*;</span>
+<span class="kn">import static</span> <span class="nn">org.apache.spark.mllib.random.RandomRDDs.*</span><span class="o">;</span>
<span class="n">JavaSparkContext</span> <span class="n">jsc</span> <span class="o">=</span> <span class="o">...</span>
@@ -1064,7 +1064,7 @@ distribution <code>N(0, 1)</code>, and then map it to <code>N(1, 4)</code>.</p>
<span class="kd">public</span> <span class="n">Double</span> <span class="nf">call</span><span class="o">(</span><span class="n">Double</span> <span class="n">x</span><span class="o">)</span> <span class="o">{</span>
<span class="k">return</span> <span class="mf">1.0</span> <span class="o">+</span> <span class="mf">2.0</span> <span class="o">*</span> <span class="n">x</span><span class="o">;</span>
<span class="o">}</span>
- <span class="o">});</span></code></pre></div>
+ <span class="o">});</span></code></pre></figure>
</div>
@@ -1076,15 +1076,15 @@ distribution <code>N(0, 1)</code>, and then map it to <code>N(1, 4)</code>.</p>
<p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.random.RandomRDDs"><code>RandomRDDs</code> Python docs</a> for more details on the API.</p>
- <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">pyspark.mllib.random</span> <span class="kn">import</span> <span class="n">RandomRDDs</span>
+ <figure class="highlight"><pre><code class="language-python" data-lang="python"><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.random</span> <span class="kn">import</span> <span class="n">RandomRDDs</span>
-<span class="n">sc</span> <span class="o">=</span> <span class="o">...</span> <span class="c"># SparkContext</span>
+<span class="n">sc</span> <span class="o">=</span> <span class="o">...</span> <span class="c1"># SparkContext</span>
-<span class="c"># Generate a random double RDD that contains 1 million i.i.d. values drawn from the</span>
-<span class="c"># standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.</span>
+<span class="c1"># Generate a random double RDD that contains 1 million i.i.d. values drawn from the</span>
+<span class="c1"># standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.</span>
<span class="n">u</span> <span class="o">=</span> <span class="n">RandomRDDs</span><span class="o">.</span><span class="n">normalRDD</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="il">1000000L</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
-<span class="c"># Apply a transform to get a random double RDD following `N(1, 4)`.</span>
-<span class="n">v</span> <span class="o">=</span> <span class="n">u</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="mf">1.0</span> <span class="o">+</span> <span class="mf">2.0</span> <span class="o">*</span> <span class="n">x</span><span class="p">)</span></code></pre></div>
+<span class="c1"># Apply a transform to get a random double RDD following `N(1, 4)`.</span>
+<span class="n">v</span> <span class="o">=</span> <span class="n">u</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="mf">1.0</span> <span class="o">+</span> <span class="mf">2.0</span> <span class="o">*</span> <span class="n">x</span><span class="p">)</span></code></pre></figure>
</div>
</div>
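
The transform shown in all three tabs relies on the identity that if `X ~ N(0, 1)` then `1.0 + 2.0 * X ~ N(1, 4)` (mean shifted by 1, standard deviation scaled by 2). A local NumPy sketch of the same draw-and-map, without a SparkContext (NumPy here is an illustrative stand-in, not part of the Spark docs or of this commit):

```python
import numpy as np

# Draw 1 million i.i.d. values from the standard normal N(0, 1), then map
# them to N(1, 4) -- mirroring normalRDD(sc, 1000000L, 10) followed by
# u.map(x => 1.0 + 2.0 * x). A fixed seed keeps the sketch reproducible.
rng = np.random.default_rng(42)
u = rng.standard_normal(1_000_000)
v = 1.0 + 2.0 * u  # mean 1.0, standard deviation 2.0, variance 4.0
```

The sample mean of `v` is close to 1 and its sample standard deviation close to 2, as the target distribution N(1, 4) predicts.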
@@ -1107,7 +1107,7 @@ to do so.</p>
<p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity"><code>KernelDensity</code> Scala docs</a> for details on the API.</p>
- <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.KernelDensity</span>
+ <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.KernelDensity</span>
<span class="k">import</span> <span class="nn">org.apache.spark.rdd.RDD</span>
<span class="c1">// an RDD of sample data</span>
@@ -1132,7 +1132,7 @@ to do so.</p>
<p>Refer to the <a href="api/java/org/apache/spark/mllib/stat/KernelDensity.html"><code>KernelDensity</code> Java docs</a> for details on the API.</p>
- <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
+ <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.mllib.stat.KernelDensity</span><span class="o">;</span>
@@ -1143,7 +1143,7 @@ to do so.</p>
<span class="c1">// Construct the density estimator with the sample data</span>
<span class="c1">// and a standard deviation for the Gaussian kernels</span>
-<span class="n">KernelDensity</span> <span class="n">kd</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">KernelDensity</span><span class="o">().</span><span class="na">setSample</span><span class="o">(</span><span class="n">data</span><span class="o">).</span><span class="na">setBandwidth</span><span class="o">(</span><span class="mf">3.0</span><span class="o">);</span>
+<span class="n">KernelDensity</span> <span class="n">kd</span> <span class="o">=</span> <span class="k">new</span> <span class="n">KernelDensity</span><span class="o">().</span><span class="na">setSample</span><span class="o">(</span><span class="n">data</span><span class="o">).</span><span class="na">setBandwidth</span><span class="o">(</span><span class="mf">3.0</span><span class="o">);</span>
<span class="c1">// Find density estimates for the given values</span>
<span class="kt">double</span><span class="o">[]</span> <span class="n">densities</span> <span class="o">=</span> <span class="n">kd</span><span class="o">.</span><span class="na">estimate</span><span class="o">(</span><span class="k">new</span> <span class="kt">double</span><span class="o">[]{-</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">2.0</span><span class="o">,</span> <span class="mf">5.0</span><span class="o">});</span>
@@ -1160,18 +1160,18 @@ to do so.</p>
<p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity"><code>KernelDensity</code> Python docs</a> for more details on the API.</p>
- <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">KernelDensity</span>
+ <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">KernelDensity</span>
-<span class="c"># an RDD of sample data</span>
+<span class="c1"># an RDD of sample data</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">4.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">,</span> <span class="mf">7.0</span><span class="p">,</span> <span class="mf">8.0</span><span class="p">,</span> <span class="mf">9.0</span><span class="p">,</span> <span class="mf">9.0</span><span class="p">])</span>
-<span class="c"># Construct the density estimator with the sample data and a standard deviation for the Gaussian</span>
-<span class="c"># kernels</span>
+<span class="c1"># Construct the density estimator with the sample data and a standard deviation for the Gaussian</span>
+<span class="c1"># kernels</span>
<span class="n">kd</span> <span class="o">=</span> <span class="n">KernelDensity</span><span class="p">()</span>
<span class="n">kd</span><span class="o">.</span><span class="n">setSample</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">kd</span><span class="o">.</span><span class="n">setBandwidth</span><span class="p">(</span><span class="mf">3.0</span><span class="p">)</span>
-<span class="c"># Find density estimates for the given values</span>
+<span class="c1"># Find density estimates for the given values</span>
<span class="n">densities</span> <span class="o">=</span> <span class="n">kd</span><span class="o">.</span><span class="n">estimate</span><span class="p">([</span><span class="o">-</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">])</span>
</pre></div>
<div><small>Find full example code at "examples/src/main/python/mllib/kernel_density_estimation_example.py" in the Spark repo.</small></div>
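For readers skimming the hunks above: the estimate that `KernelDensity` computes is the average of Gaussian kernels centered at each sample point, evaluated at the query points. A minimal plain-Python re-implementation (for illustration only; it is not Spark's distributed API, and the helper name is ours) looks like:

```python
import math

def gaussian_kde_estimate(sample, bandwidth, points):
    """Mean of Gaussian pdfs centered at each sample, evaluated at `points`."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    return [
        sum(norm * math.exp(-((x - s) ** 2) / (2.0 * bandwidth ** 2))
            for s in sample) / len(sample)
        for x in points
    ]

# Same sample data and bandwidth as the documented PySpark example.
data = [1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 9.0, 9.0]
densities = gaussian_kde_estimate(data, 3.0, [-1.0, 2.0, 5.0])
print(densities)
```

Because the sample mass sits between 1 and 9, the estimate at 5.0 should exceed the one at -1.0, which is a quick sanity check on any KDE implementation.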