Posted to commits@spark.apache.org by yh...@apache.org on 2016/12/28 22:35:21 UTC

[10/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294

http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-pmml-model-export.html
----------------------------------------------------------------------
diff --git a/site/docs/2.1.0/mllib-pmml-model-export.html b/site/docs/2.1.0/mllib-pmml-model-export.html
index 30815e0..3f2fd91 100644
--- a/site/docs/2.1.0/mllib-pmml-model-export.html
+++ b/site/docs/2.1.0/mllib-pmml-model-export.html
@@ -307,8 +307,8 @@
                     
 
                     <ul id="markdown-toc">
-  <li><a href="#sparkmllib-supported-models" id="markdown-toc-sparkmllib-supported-models"><code>spark.mllib</code> supported models</a></li>
-  <li><a href="#examples" id="markdown-toc-examples">Examples</a></li>
+  <li><a href="#sparkmllib-supported-models"><code>spark.mllib</code> supported models</a></li>
+  <li><a href="#examples">Examples</a></li>
 </ul>
 
 <h2 id="sparkmllib-supported-models"><code>spark.mllib</code> supported models</h2>
@@ -353,32 +353,31 @@
 
     <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.clustering.KMeans"><code>KMeans</code> Scala docs</a> and <a href="api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$"><code>Vectors</code> Scala docs</a> for details on the API.</p>
 
-    <p>Here a complete example of building a KMeansModel and print it out in PMML format:</p>
-    <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.clustering.KMeans</span>
-<span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span>
+    <p>Here a complete example of building a KMeansModel and print it out in PMML format:
+&lt;div class="highlight"&gt;&lt;pre&gt;<span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.clustering.KMeans</span>
+<span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span></p>
 
-<span class="c1">// Load and parse the data</span>
+    <p><span class="c1">// Load and parse the data</span>
 <span class="k">val</span> <span class="n">data</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="o">(</span><span class="s">&quot;data/mllib/kmeans_data.txt&quot;</span><span class="o">)</span>
-<span class="k">val</span> <span class="n">parsedData</span> <span class="k">=</span> <span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">s</span> <span class="k">=&gt;</span> <span class="nc">Vectors</span><span class="o">.</span><span class="n">dense</span><span class="o">(</span><span class="n">s</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="sc">&#39; &#39;</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">toDouble</span><span class="o">))).</span><span class="n">cache</span><span class="o">()</span>
+<span class="k">val</span> <span class="n">parsedData</span> <span class="k">=</span> <span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">s</span> <span class="k">=&gt;</span> <span class="nc">Vectors</span><span class="o">.</span><span class="n">dense</span><span class="o">(</span><span class="n">s</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="sc">&#39; &#39;</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">toDouble</span><span class="o">))).</span><span class="n">cache</span><span class="o">()</span></p>
 
-<span class="c1">// Cluster the data into two classes using KMeans</span>
+    <p><span class="c1">// Cluster the data into two classes using KMeans</span>
 <span class="k">val</span> <span class="n">numClusters</span> <span class="k">=</span> <span class="mi">2</span>
 <span class="k">val</span> <span class="n">numIterations</span> <span class="k">=</span> <span class="mi">20</span>
-<span class="k">val</span> <span class="n">clusters</span> <span class="k">=</span> <span class="nc">KMeans</span><span class="o">.</span><span class="n">train</span><span class="o">(</span><span class="n">parsedData</span><span class="o">,</span> <span class="n">numClusters</span><span class="o">,</span> <span class="n">numIterations</span><span class="o">)</span>
+<span class="k">val</span> <span class="n">clusters</span> <span class="k">=</span> <span class="nc">KMeans</span><span class="o">.</span><span class="n">train</span><span class="o">(</span><span class="n">parsedData</span><span class="o">,</span> <span class="n">numClusters</span><span class="o">,</span> <span class="n">numIterations</span><span class="o">)</span></p>
 
-<span class="c1">// Export to PMML to a String in PMML format</span>
-<span class="n">println</span><span class="o">(</span><span class="s">&quot;PMML Model:\n&quot;</span> <span class="o">+</span> <span class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">)</span>
+    <p><span class="c1">// Export to PMML to a String in PMML format</span>
+<span class="n">println</span><span class="o">(</span><span class="s">&quot;PMML Model:\n&quot;</span> <span class="o">+</span> <span class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">)</span></p>
 
-<span class="c1">// Export the model to a local file in PMML format</span>
-<span class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">(</span><span class="s">&quot;/tmp/kmeans.xml&quot;</span><span class="o">)</span>
+    <p><span class="c1">// Export the model to a local file in PMML format</span>
+<span class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">(</span><span class="s">&quot;/tmp/kmeans.xml&quot;</span><span class="o">)</span></p>
 
-<span class="c1">// Export the model to a directory on a distributed file system in PMML format</span>
-<span class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">(</span><span class="n">sc</span><span class="o">,</span> <span class="s">&quot;/tmp/kmeans&quot;</span><span class="o">)</span>
+    <p><span class="c1">// Export the model to a directory on a distributed file system in PMML format</span>
+<span class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">(</span><span class="n">sc</span><span class="o">,</span> <span class="s">&quot;/tmp/kmeans&quot;</span><span class="o">)</span></p>
 
-<span class="c1">// Export the model to the OutputStream in PMML format</span>
+    <p><span class="c1">// Export the model to the OutputStream in PMML format</span>
 <span class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">(</span><span class="nc">System</span><span class="o">.</span><span class="n">out</span><span class="o">)</span>
-</pre></div>
-    <div><small>Find full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala" in the Spark repo.</small></div>
+&lt;/pre&gt;&lt;/div&gt;&lt;div&gt;<small>Find full example code at &#8220;examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala&#8221; in the Spark repo.</small>&lt;/div&gt;</p>
 
     <p>For unsupported models, either you will not find a <code>.toPMML</code> method or an <code>IllegalArgumentException</code> will be thrown.</p>
 

http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-statistics.html
----------------------------------------------------------------------
diff --git a/site/docs/2.1.0/mllib-statistics.html b/site/docs/2.1.0/mllib-statistics.html
index 4485ecf..f04924c 100644
--- a/site/docs/2.1.0/mllib-statistics.html
+++ b/site/docs/2.1.0/mllib-statistics.html
@@ -358,15 +358,15 @@
                     
 
                     <ul id="markdown-toc">
-  <li><a href="#summary-statistics" id="markdown-toc-summary-statistics">Summary statistics</a></li>
-  <li><a href="#correlations" id="markdown-toc-correlations">Correlations</a></li>
-  <li><a href="#stratified-sampling" id="markdown-toc-stratified-sampling">Stratified sampling</a></li>
-  <li><a href="#hypothesis-testing" id="markdown-toc-hypothesis-testing">Hypothesis testing</a>    <ul>
-      <li><a href="#streaming-significance-testing" id="markdown-toc-streaming-significance-testing">Streaming Significance Testing</a></li>
+  <li><a href="#summary-statistics">Summary statistics</a></li>
+  <li><a href="#correlations">Correlations</a></li>
+  <li><a href="#stratified-sampling">Stratified sampling</a></li>
+  <li><a href="#hypothesis-testing">Hypothesis testing</a>    <ul>
+      <li><a href="#streaming-significance-testing">Streaming Significance Testing</a></li>
     </ul>
   </li>
-  <li><a href="#random-data-generation" id="markdown-toc-random-data-generation">Random data generation</a></li>
-  <li><a href="#kernel-density-estimation" id="markdown-toc-kernel-density-estimation">Kernel density estimation</a></li>
+  <li><a href="#random-data-generation">Random data generation</a></li>
+  <li><a href="#kernel-density-estimation">Kernel density estimation</a></li>
 </ul>
 
 <p><code>\[
@@ -401,7 +401,7 @@ total count.</p>
 
     <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary"><code>MultivariateStatisticalSummary</code> Scala docs</a> for details on the API.</p>
 
-    <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span>
+    <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span>
 <span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.</span><span class="o">{</span><span class="nc">MultivariateStatisticalSummary</span><span class="o">,</span> <span class="nc">Statistics</span><span class="o">}</span>
 
 <span class="k">val</span> <span class="n">observations</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="o">(</span>
@@ -430,7 +430,7 @@ total count.</p>
 
     <p>Refer to the <a href="api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html"><code>MultivariateStatisticalSummary</code> Java docs</a> for details on the API.</p>
 
-    <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
+    <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
 
 <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span>
 <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vector</span><span class="o">;</span>
@@ -463,19 +463,19 @@ total count.</p>
 
     <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.stat.MultivariateStatisticalSummary"><code>MultivariateStatisticalSummary</code> Python docs</a> for more details on the API.</p>
 
-    <div class="highlight"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
+    <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
 
 <span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">Statistics</span>
 
 <span class="n">mat</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">(</span>
     <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">10.0</span><span class="p">,</span> <span class="mf">100.0</span><span class="p">]),</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">2.0</span><span class="p">,</span> <span class="mf">20.0</span><span class="p">,</span> <span class="mf">200.0</span><span class="p">]),</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">3.0</span><span class="p">,</span> <span class="mf">30.0</span><span class="p">,</span> <span class="mf">300.0</span><span class="p">])]</span>
-<span class="p">)</span>  <span class="c"># an RDD of Vectors</span>
+<span class="p">)</span>  <span class="c1"># an RDD of Vectors</span>
 
-<span class="c"># Compute column summary statistics.</span>
+<span class="c1"># Compute column summary statistics.</span>
 <span class="n">summary</span> <span class="o">=</span> <span class="n">Statistics</span><span class="o">.</span><span class="n">colStats</span><span class="p">(</span><span class="n">mat</span><span class="p">)</span>
-<span class="k">print</span><span class="p">(</span><span class="n">summary</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>  <span class="c"># a dense vector containing the mean value for each column</span>
-<span class="k">print</span><span class="p">(</span><span class="n">summary</span><span class="o">.</span><span class="n">variance</span><span class="p">())</span>  <span class="c"># column-wise variance</span>
-<span class="k">print</span><span class="p">(</span><span class="n">summary</span><span class="o">.</span><span class="n">numNonzeros</span><span class="p">())</span>  <span class="c"># number of nonzeros in each column</span>
+<span class="k">print</span><span class="p">(</span><span class="n">summary</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>  <span class="c1"># a dense vector containing the mean value for each column</span>
+<span class="k">print</span><span class="p">(</span><span class="n">summary</span><span class="o">.</span><span class="n">variance</span><span class="p">())</span>  <span class="c1"># column-wise variance</span>
+<span class="k">print</span><span class="p">(</span><span class="n">summary</span><span class="o">.</span><span class="n">numNonzeros</span><span class="p">())</span>  <span class="c1"># number of nonzeros in each column</span>
 </pre></div>
     <div><small>Find full example code at "examples/src/main/python/mllib/summary_statistics_example.py" in the Spark repo.</small></div>
   </div>
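For readers skimming this diff: the quantities that `Statistics.colStats` reports per column can be sketched in plain Python as below; the helper name is illustrative, not the MLlib API, and the variance is the unbiased (sample) variance, matching what `colStats` returns.

```python
def col_stats(rows):
    """Column-wise mean, sample variance, and nonzero count for dense rows."""
    n = len(rows)
    ncols = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(ncols)]
    # unbiased sample variance, hence the n - 1 denominator
    variances = [sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)
                 for j in range(ncols)]
    nnz = [sum(1 for r in rows if r[j] != 0.0) for j in range(ncols)]
    return means, variances, nnz

rows = [[1.0, 10.0, 100.0], [2.0, 20.0, 200.0], [3.0, 30.0, 300.0]]
means, variances, nnz = col_stats(rows)
print(means)      # [2.0, 20.0, 200.0]
print(variances)  # [1.0, 100.0, 10000.0]
print(nnz)        # [3, 3, 3]
```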
@@ -496,7 +496,7 @@ an <code>RDD[Vector]</code>, the output will be a <code>Double</code> or the cor
 
     <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.stat.Statistics$"><code>Statistics</code> Scala docs</a> for details on the API.</p>
 
-    <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg._</span>
+    <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg._</span>
 <span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.Statistics</span>
 <span class="k">import</span> <span class="nn">org.apache.spark.rdd.RDD</span>
 
@@ -507,7 +507,7 @@ an <code>RDD[Vector]</code>, the output will be a <code>Double</code> or the cor
 <span class="c1">// compute the correlation using Pearson&#39;s method. Enter &quot;spearman&quot; for Spearman&#39;s method. If a</span>
 <span class="c1">// method is not specified, Pearson&#39;s method will be used by default.</span>
 <span class="k">val</span> <span class="n">correlation</span><span class="k">:</span> <span class="kt">Double</span> <span class="o">=</span> <span class="nc">Statistics</span><span class="o">.</span><span class="n">corr</span><span class="o">(</span><span class="n">seriesX</span><span class="o">,</span> <span class="n">seriesY</span><span class="o">,</span> <span class="s">&quot;pearson&quot;</span><span class="o">)</span>
-<span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">&quot;Correlation is: $correlation&quot;</span><span class="o">)</span>
+<span class="n">println</span><span class="o">(</span><span class="s">s&quot;Correlation is: </span><span class="si">$correlation</span><span class="s">&quot;</span><span class="o">)</span>
 
 <span class="k">val</span> <span class="n">data</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">Vector</span><span class="o">]</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="o">(</span>
   <span class="nc">Seq</span><span class="o">(</span>
@@ -531,7 +531,7 @@ a <code>JavaRDD&lt;Vector&gt;</code>, the output will be a <code>Double</code> o
 
     <p>Refer to the <a href="api/java/org/apache/spark/mllib/stat/Statistics.html"><code>Statistics</code> Java docs</a> for details on the API.</p>
 
-    <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
+    <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
 
 <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaDoubleRDD</span><span class="o">;</span>
 <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span>
@@ -577,23 +577,23 @@ an <code>RDD[Vector]</code>, the output will be a <code>Double</code> or the cor
 
     <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics"><code>Statistics</code> Python docs</a> for more details on the API.</p>
 
-    <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">Statistics</span>
+    <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">Statistics</span>
 
-<span class="n">seriesX</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">])</span>  <span class="c"># a series</span>
-<span class="c"># seriesY must have the same number of partitions and cardinality as seriesX</span>
+<span class="n">seriesX</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">])</span>  <span class="c1"># a series</span>
+<span class="c1"># seriesY must have the same number of partitions and cardinality as seriesX</span>
 <span class="n">seriesY</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="mf">11.0</span><span class="p">,</span> <span class="mf">22.0</span><span class="p">,</span> <span class="mf">33.0</span><span class="p">,</span> <span class="mf">33.0</span><span class="p">,</span> <span class="mf">555.0</span><span class="p">])</span>
 
-<span class="c"># Compute the correlation using Pearson&#39;s method. Enter &quot;spearman&quot; for Spearman&#39;s method.</span>
-<span class="c"># If a method is not specified, Pearson&#39;s method will be used by default.</span>
-<span class="k">print</span><span class="p">(</span><span class="s">&quot;Correlation is: &quot;</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">Statistics</span><span class="o">.</span><span class="n">corr</span><span class="p">(</span><span class="n">seriesX</span><span class="p">,</span> <span class="n">seriesY</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s">&quot;pearson&quot;</span><span class="p">)))</span>
+<span class="c1"># Compute the correlation using Pearson&#39;s method. Enter &quot;spearman&quot; for Spearman&#39;s method.</span>
+<span class="c1"># If a method is not specified, Pearson&#39;s method will be used by default.</span>
+<span class="k">print</span><span class="p">(</span><span class="s2">&quot;Correlation is: &quot;</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">Statistics</span><span class="o">.</span><span class="n">corr</span><span class="p">(</span><span class="n">seriesX</span><span class="p">,</span> <span class="n">seriesY</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s2">&quot;pearson&quot;</span><span class="p">)))</span>
 
 <span class="n">data</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">(</span>
     <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">10.0</span><span class="p">,</span> <span class="mf">100.0</span><span class="p">]),</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">2.0</span><span class="p">,</span> <span class="mf">20.0</span><span class="p">,</span> <span class="mf">200.0</span><span class="p">]),</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">5.0</span><span class="p">,</span> <span class="mf">33.0</span><span class="p">,</span> <span class="mf">366.0</span><span class="p">])]</span>
-<span class="p">)</span>  <span class="c"># an RDD of Vectors</span>
+<span class="p">)</span>  <span class="c1"># an RDD of Vectors</span>
 
-<span class="c"># calculate the correlation matrix using Pearson&#39;s method. Use &quot;spearman&quot; for Spearman&#39;s method.</span>
-<span class="c"># If a method is not specified, Pearson&#39;s method will be used by default.</span>
-<span class="k">print</span><span class="p">(</span><span class="n">Statistics</span><span class="o">.</span><span class="n">corr</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s">&quot;pearson&quot;</span><span class="p">))</span>
+<span class="c1"># calculate the correlation matrix using Pearson&#39;s method. Use &quot;spearman&quot; for Spearman&#39;s method.</span>
+<span class="c1"># If a method is not specified, Pearson&#39;s method will be used by default.</span>
+<span class="k">print</span><span class="p">(</span><span class="n">Statistics</span><span class="o">.</span><span class="n">corr</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s2">&quot;pearson&quot;</span><span class="p">))</span>
 </pre></div>
     <div><small>Find full example code at "examples/src/main/python/mllib/correlations_example.py" in the Spark repo.</small></div>
   </div>
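A minimal pure-Python Pearson correlation, mirroring what `Statistics.corr(seriesX, seriesY, "pearson")` computes for the two series in the example above; this is an illustrative sketch, not the MLlib implementation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

series_x = [1.0, 2.0, 3.0, 3.0, 5.0]
series_y = [11.0, 22.0, 33.0, 33.0, 555.0]
r = pearson(series_x, series_y)
print("Correlation is: " + str(r))  # roughly 0.85 for this data
```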
@@ -621,9 +621,9 @@ fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K
 keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample
 size, whereas sampling with replacement requires two additional passes.</p>
 
-    <div class="highlight"><pre><span class="c1">// an RDD[(K, V)] of any key value pairs</span>
+    <div class="highlight"><pre><span></span><span class="c1">// an RDD[(K, V)] of any key value pairs</span>
 <span class="k">val</span> <span class="n">data</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="o">(</span>
-  <span class="nc">Seq</span><span class="o">((</span><span class="mi">1</span><span class="o">,</span> <span class="-Symbol">&#39;a</span><span class="err">&#39;</span><span class="o">),</span> <span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="-Symbol">&#39;b</span><span class="err">&#39;</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="-Symbol">&#39;c</span><span class="err">&#39;</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="-Symbol">&#39;d</span><span class="err">&#39;</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="-Symbol">&#39;e</span><span class="err">&#39;</span><span class="o">),</span> <span class="o">(</span><span class="mi">3</span><span class="o">,</span> <span class="-Symbol">&#39;f</span><span class="err">&#39;</span><span class="o">)))</span>
+  <span class="nc">Seq</span><span class="o">((</span><span class="mi">1</span><span class="o">,</span> <span class="sc">&#39;a&#39;</span><span class="o">),</span> <span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="sc">&#39;b&#39;</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="sc">&#39;c&#39;</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="sc">&#39;d&#39;</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="sc">&#39;e&#39;</span><span class="o">),</span> <span class="o">(</span><span class="mi">3</span><span class="o">,</span> <span class="sc">&#39;f&#39;</span><span class="o">)))</span>
 
 <span class="c1">// specify the exact fraction desired from each key</span>
 <span class="k">val</span> <span class="n">fractions</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span><span class="mi">1</span> <span class="o">-&gt;</span> <span class="mf">0.1</span><span class="o">,</span> <span class="mi">2</span> <span class="o">-&gt;</span> <span class="mf">0.6</span><span class="o">,</span> <span class="mi">3</span> <span class="o">-&gt;</span> <span class="mf">0.3</span><span class="o">)</span>
@@ -643,7 +643,7 @@ fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K
 keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample
 size, whereas sampling with replacement requires two additional passes.</p>
 
-    <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.*</span><span class="o">;</span>
+    <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.*</span><span class="o">;</span>
 
 <span class="kn">import</span> <span class="nn">scala.Tuple2</span><span class="o">;</span>
 
@@ -678,10 +678,10 @@ set of keys.</p>
 
     <p><em>Note:</em> <code>sampleByKeyExact()</code> is currently not supported in Python.</p>
 
-    <div class="highlight"><pre><span class="c"># an RDD of any key value pairs</span>
-<span class="n">data</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([(</span><span class="mi">1</span><span class="p">,</span> <span class="s">&#39;a&#39;</span><span class="p">),</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">&#39;b&#39;</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s">&#39;c&#39;</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s">&#39;d&#39;</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s">&#39;e&#39;</span><span class="p">),</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s">&#39;f&#39;</span><span class="p">)])</span>
+    <div class="highlight"><pre><span></span><span class="c1"># an RDD of any key value pairs</span>
+<span class="n">data</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([(</span><span class="mi">1</span><span class="p">,</span> <span class="s1">&#39;a&#39;</span><span class="p">),</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s1">&#39;b&#39;</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s1">&#39;c&#39;</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s1">&#39;d&#39;</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s1">&#39;e&#39;</span><span class="p">),</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s1">&#39;f&#39;</span><span class="p">)])</span>
 
-<span class="c"># specify the exact fraction desired from each key as a dictionary</span>
+<span class="c1"># specify the exact fraction desired from each key as a dictionary</span>
 <span class="n">fractions</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mi">2</span><span class="p">:</span> <span class="mf">0.6</span><span class="p">,</span> <span class="mi">3</span><span class="p">:</span> <span class="mf">0.3</span><span class="p">}</span>
 
 <span class="n">approxSample</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">sampleByKey</span><span class="p">(</span><span class="bp">False</span><span class="p">,</span> <span class="n">fractions</span><span class="p">)</span>
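The per-key Bernoulli sampling behind `sampleByKey(False, fractions)` can be sketched in plain Python as follows; the function and variable names are illustrative, and unlike `sampleByKeyExact` this approximate variant makes no guarantee about the exact sample size per key.

```python
import random

def sample_by_key(pairs, fractions, seed=42):
    """Keep each (k, v) pair independently with probability fractions[k]."""
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]

data = [(1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f')]
fractions = {1: 0.1, 2: 0.6, 3: 0.3}
approx_sample = sample_by_key(data, fractions)
print(approx_sample)
```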
@@ -708,7 +708,7 @@ independence tests.</p>
 run Pearson&#8217;s chi-squared tests. The following example demonstrates how to run and interpret
 hypothesis tests.</p>
 
-    <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg._</span>
+    <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg._</span>
 <span class="k">import</span> <span class="nn">org.apache.spark.mllib.regression.LabeledPoint</span>
 <span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.Statistics</span>
 <span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.test.ChiSqTestResult</span>
@@ -722,7 +722,7 @@ hypothesis tests.</p>
 <span class="k">val</span> <span class="n">goodnessOfFitTestResult</span> <span class="k">=</span> <span class="nc">Statistics</span><span class="o">.</span><span class="n">chiSqTest</span><span class="o">(</span><span class="n">vec</span><span class="o">)</span>
 <span class="c1">// summary of the test including the p-value, degrees of freedom, test statistic, the method</span>
 <span class="c1">// used, and the null hypothesis.</span>
-<span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">&quot;$goodnessOfFitTestResult\n&quot;</span><span class="o">)</span>
+<span class="n">println</span><span class="o">(</span><span class="s">s&quot;</span><span class="si">$goodnessOfFitTestResult</span><span class="s">\n&quot;</span><span class="o">)</span>
 
 <span class="c1">// a contingency matrix. Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))</span>
 <span class="k">val</span> <span class="n">mat</span><span class="k">:</span> <span class="kt">Matrix</span> <span class="o">=</span> <span class="nc">Matrices</span><span class="o">.</span><span class="n">dense</span><span class="o">(</span><span class="mi">3</span><span class="o">,</span> <span class="mi">2</span><span class="o">,</span> <span class="nc">Array</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">3.0</span><span class="o">,</span> <span class="mf">5.0</span><span class="o">,</span> <span class="mf">2.0</span><span class="o">,</span> <span class="mf">4.0</span><span class="o">,</span> <span class="mf">6.0</span><span class="o">))</span>
@@ -730,7 +730,7 @@ hypothesis tests.</p>
 <span class="c1">// conduct Pearson&#39;s independence test on the input contingency matrix</span>
 <span class="k">val</span> <span class="n">independenceTestResult</span> <span class="k">=</span> <span class="nc">Statistics</span><span class="o">.</span><span class="n">chiSqTest</span><span class="o">(</span><span class="n">mat</span><span class="o">)</span>
 <span class="c1">// summary of the test including the p-value, degrees of freedom</span>
-<span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">&quot;$independenceTestResult\n&quot;</span><span class="o">)</span>
+<span class="n">println</span><span class="o">(</span><span class="s">s&quot;</span><span class="si">$independenceTestResult</span><span class="s">\n&quot;</span><span class="o">)</span>
 
 <span class="k">val</span> <span class="n">obs</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">LabeledPoint</span><span class="o">]</span> <span class="k">=</span>
   <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="o">(</span>
@@ -761,7 +761,7 @@ hypothesis tests.</p>
 
     <p>Refer to the <a href="api/java/org/apache/spark/mllib/stat/test/ChiSqTestResult.html"><code>ChiSqTestResult</code> Java docs</a> for details on the API.</p>
 
-    <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
+    <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
 
 <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span>
 <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.linalg.Matrices</span><span class="o">;</span>
@@ -793,9 +793,9 @@ hypothesis tests.</p>
 <span class="c1">// an RDD of labeled points</span>
 <span class="n">JavaRDD</span><span class="o">&lt;</span><span class="n">LabeledPoint</span><span class="o">&gt;</span> <span class="n">obs</span> <span class="o">=</span> <span class="n">jsc</span><span class="o">.</span><span class="na">parallelize</span><span class="o">(</span>
   <span class="n">Arrays</span><span class="o">.</span><span class="na">asList</span><span class="o">(</span>
-    <span class="k">new</span> <span class="nf">LabeledPoint</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">,</span> <span class="mf">3.0</span><span class="o">)),</span>
-    <span class="k">new</span> <span class="nf">LabeledPoint</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">2.0</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">)),</span>
-    <span class="k">new</span> <span class="nf">LabeledPoint</span><span class="o">(-</span><span class="mf">1.0</span><span class="o">,</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(-</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">,</span> <span class="o">-</span><span class="mf">0.5</span><span class="o">))</span>
+    <span class="k">new</span> <span class="n">LabeledPoint</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">,</span> <span class="mf">3.0</span><span class="o">)),</span>
+    <span class="k">new</span> <span class="n">LabeledPoint</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">2.0</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">)),</span>
+    <span class="k">new</span> <span class="n">LabeledPoint</span><span class="o">(-</span><span class="mf">1.0</span><span class="o">,</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(-</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">,</span> <span class="o">-</span><span class="mf">0.5</span><span class="o">))</span>
   <span class="o">)</span>
 <span class="o">);</span>
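The Scala and Java hunks above exercise `Statistics.chiSqTest` on the contingency matrix `Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))`, which is column-major, i.e. rows `(1,2), (3,4), (5,6)`. A hedged stdlib-Python sketch of the Pearson independence statistic that test reports (`chi2_independence` is an assumed helper name, not part of any Spark API):

```python
def chi2_independence(table):
    """Pearson's chi-squared independence statistic and degrees of
    freedom for a contingency table given as a list of rows."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = float(sum(row_sums))
    stat = 0.0
    for row, r in zip(table, row_sums):
        for obs, c in zip(row, col_sums):
            exp = r * c / total          # expected count under independence
            stat += (obs - exp) ** 2 / exp
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# rows of the column-major dense(3, 2, [1,3,5,2,4,6]) matrix
stat, dof = chi2_independence([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

For this matrix the statistic works out to 14/99 ≈ 0.1414 on 2 degrees of freedom, matching what the doc's `println(s"$independenceTestResult\n")` summarizes.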
 
@@ -820,42 +820,42 @@ hypothesis tests.</p>
 
     <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics"><code>Statistics</code> Python docs</a> for more details on the API.</p>
 
-    <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.linalg</span> <span class="kn">import</span> <span class="n">Matrices</span><span class="p">,</span> <span class="n">Vectors</span>
+    <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.linalg</span> <span class="kn">import</span> <span class="n">Matrices</span><span class="p">,</span> <span class="n">Vectors</span>
 <span class="kn">from</span> <span class="nn">pyspark.mllib.regression</span> <span class="kn">import</span> <span class="n">LabeledPoint</span>
 <span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">Statistics</span>
 
-<span class="n">vec</span> <span class="o">=</span> <span class="n">Vectors</span><span class="o">.</span><span class="n">dense</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">)</span>  <span class="c"># a vector composed of the frequencies of events</span>
+<span class="n">vec</span> <span class="o">=</span> <span class="n">Vectors</span><span class="o">.</span><span class="n">dense</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">)</span>  <span class="c1"># a vector composed of the frequencies of events</span>
 
-<span class="c"># compute the goodness of fit. If a second vector to test against</span>
-<span class="c"># is not supplied as a parameter, the test runs against a uniform distribution.</span>
+<span class="c1"># compute the goodness of fit. If a second vector to test against</span>
+<span class="c1"># is not supplied as a parameter, the test runs against a uniform distribution.</span>
 <span class="n">goodnessOfFitTestResult</span> <span class="o">=</span> <span class="n">Statistics</span><span class="o">.</span><span class="n">chiSqTest</span><span class="p">(</span><span class="n">vec</span><span class="p">)</span>
 
-<span class="c"># summary of the test including the p-value, degrees of freedom,</span>
-<span class="c"># test statistic, the method used, and the null hypothesis.</span>
-<span class="k">print</span><span class="p">(</span><span class="s">&quot;</span><span class="si">%s</span><span class="se">\n</span><span class="s">&quot;</span> <span class="o">%</span> <span class="n">goodnessOfFitTestResult</span><span class="p">)</span>
+<span class="c1"># summary of the test including the p-value, degrees of freedom,</span>
+<span class="c1"># test statistic, the method used, and the null hypothesis.</span>
+<span class="k">print</span><span class="p">(</span><span class="s2">&quot;</span><span class="si">%s</span><span class="se">\n</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="n">goodnessOfFitTestResult</span><span class="p">)</span>
 
-<span class="n">mat</span> <span class="o">=</span> <span class="n">Matrices</span><span class="o">.</span><span class="n">dense</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">4.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">])</span>  <span class="c"># a contingency matrix</span>
+<span class="n">mat</span> <span class="o">=</span> <span class="n">Matrices</span><span class="o">.</span><span class="n">dense</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">4.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">])</span>  <span class="c1"># a contingency matrix</span>
 
-<span class="c"># conduct Pearson&#39;s independence test on the input contingency matrix</span>
+<span class="c1"># conduct Pearson&#39;s independence test on the input contingency matrix</span>
 <span class="n">independenceTestResult</span> <span class="o">=</span> <span class="n">Statistics</span><span class="o">.</span><span class="n">chiSqTest</span><span class="p">(</span><span class="n">mat</span><span class="p">)</span>
 
-<span class="c"># summary of the test including the p-value, degrees of freedom,</span>
-<span class="c"># test statistic, the method used, and the null hypothesis.</span>
-<span class="k">print</span><span class="p">(</span><span class="s">&quot;</span><span class="si">%s</span><span class="se">\n</span><span class="s">&quot;</span> <span class="o">%</span> <span class="n">independenceTestResult</span><span class="p">)</span>
+<span class="c1"># summary of the test including the p-value, degrees of freedom,</span>
+<span class="c1"># test statistic, the method used, and the null hypothesis.</span>
+<span class="k">print</span><span class="p">(</span><span class="s2">&quot;</span><span class="si">%s</span><span class="se">\n</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="n">independenceTestResult</span><span class="p">)</span>
 
 <span class="n">obs</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">(</span>
     <span class="p">[</span><span class="n">LabeledPoint</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">]),</span>
      <span class="n">LabeledPoint</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">]),</span>
      <span class="n">LabeledPoint</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="p">[</span><span class="o">-</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.5</span><span class="p">])]</span>
-<span class="p">)</span>  <span class="c"># LabeledPoint(feature, label)</span>
+<span class="p">)</span>  <span class="c1"># LabeledPoint(feature, label)</span>
 
-<span class="c"># The contingency table is constructed from an RDD of LabeledPoint and used to conduct</span>
-<span class="c"># the independence test. Returns an array containing the ChiSquaredTestResult for every feature</span>
-<span class="c"># against the label.</span>
+<span class="c1"># The contingency table is constructed from an RDD of LabeledPoint and used to conduct</span>
+<span class="c1"># the independence test. Returns an array containing the ChiSquaredTestResult for every feature</span>
+<span class="c1"># against the label.</span>
 <span class="n">featureTestResults</span> <span class="o">=</span> <span class="n">Statistics</span><span class="o">.</span><span class="n">chiSqTest</span><span class="p">(</span><span class="n">obs</span><span class="p">)</span>
 
 <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">result</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">featureTestResults</span><span class="p">):</span>
-    <span class="k">print</span><span class="p">(</span><span class="s">&quot;Column </span><span class="si">%d</span><span class="s">:</span><span class="se">\n</span><span class="si">%s</span><span class="s">&quot;</span> <span class="o">%</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">result</span><span class="p">))</span>
+    <span class="k">print</span><span class="p">(</span><span class="s2">&quot;Column </span><span class="si">%d</span><span class="s2">:</span><span class="se">\n</span><span class="si">%s</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">result</span><span class="p">))</span>
 </pre></div>
     <div><small>Find full example code at "examples/src/main/python/mllib/hypothesis_testing_example.py" in the Spark repo.</small></div>
   </div>
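The Python tab above ends with `Statistics.chiSqTest(vec)` run against a uniform distribution. The statistic it reports is straightforward to reproduce without Spark; a sketch under the same inputs (`chi2_goodness_of_fit` is an assumed name for illustration):

```python
def chi2_goodness_of_fit(observed, expected=None):
    """Pearson's chi-squared goodness-of-fit statistic; when no expected
    vector is supplied, the test runs against a uniform distribution,
    as in Statistics.chiSqTest(vec)."""
    if expected is None:
        expected = [sum(observed) / len(observed)] * len(observed)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1

# the event-frequency vector from the example
stat, dof = chi2_goodness_of_fit([0.1, 0.15, 0.2, 0.3, 0.25])
```

Here each expected frequency is 0.2, giving a statistic of 0.125 on 4 degrees of freedom; the p-value then comes from the chi-squared CDF, which Spark computes for you.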
@@ -879,7 +879,7 @@ and interpret the hypothesis tests.</p>
 
     <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.stat.Statistics$"><code>Statistics</code> Scala docs</a> for details on the API.</p>
 
-    <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.Statistics</span>
+    <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.Statistics</span>
 <span class="k">import</span> <span class="nn">org.apache.spark.rdd.RDD</span>
 
 <span class="k">val</span> <span class="n">data</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">Double</span><span class="o">]</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="o">(</span><span class="nc">Seq</span><span class="o">(</span><span class="mf">0.1</span><span class="o">,</span> <span class="mf">0.15</span><span class="o">,</span> <span class="mf">0.2</span><span class="o">,</span> <span class="mf">0.3</span><span class="o">,</span> <span class="mf">0.25</span><span class="o">))</span>  <span class="c1">// an RDD of sample data</span>
@@ -906,7 +906,7 @@ and interpret the hypothesis tests.</p>
 
     <p>Refer to the <a href="api/java/org/apache/spark/mllib/stat/Statistics.html"><code>Statistics</code> Java docs</a> for details on the API.</p>
 
-    <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
+    <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
 
 <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaDoubleRDD</span><span class="o">;</span>
 <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.stat.Statistics</span><span class="o">;</span>
@@ -929,16 +929,16 @@ and interpret the hypothesis tests.</p>
 
     <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics"><code>Statistics</code> Python docs</a> for more details on the API.</p>
 
-    <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">Statistics</span>
+    <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">Statistics</span>
 
 <span class="n">parallelData</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">])</span>
 
-<span class="c"># run a KS test for the sample versus a standard normal distribution</span>
-<span class="n">testResult</span> <span class="o">=</span> <span class="n">Statistics</span><span class="o">.</span><span class="n">kolmogorovSmirnovTest</span><span class="p">(</span><span class="n">parallelData</span><span class="p">,</span> <span class="s">&quot;norm&quot;</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
-<span class="c"># summary of the test including the p-value, test statistic, and null hypothesis</span>
-<span class="c"># if our p-value indicates significance, we can reject the null hypothesis</span>
-<span class="c"># Note that the Scala functionality of calling Statistics.kolmogorovSmirnovTest with</span>
-<span class="c"># a lambda to calculate the CDF is not made available in the Python API</span>
+<span class="c1"># run a KS test for the sample versus a standard normal distribution</span>
+<span class="n">testResult</span> <span class="o">=</span> <span class="n">Statistics</span><span class="o">.</span><span class="n">kolmogorovSmirnovTest</span><span class="p">(</span><span class="n">parallelData</span><span class="p">,</span> <span class="s2">&quot;norm&quot;</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
+<span class="c1"># summary of the test including the p-value, test statistic, and null hypothesis</span>
+<span class="c1"># if our p-value indicates significance, we can reject the null hypothesis</span>
+<span class="c1"># Note that the Scala functionality of calling Statistics.kolmogorovSmirnovTest with</span>
+<span class="c1"># a lambda to calculate the CDF is not made available in the Python API</span>
 <span class="k">print</span><span class="p">(</span><span class="n">testResult</span><span class="p">)</span>
 </pre></div>
     <div><small>Find full example code at "examples/src/main/python/mllib/hypothesis_testing_kolmogorov_smirnov_test_example.py" in the Spark repo.</small></div>
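The Kolmogorov-Smirnov example above tests `[0.1, 0.15, 0.2, 0.3, 0.25]` against a standard normal. The KS statistic itself is the largest gap between the empirical CDF and the theoretical CDF, which can be sketched with the stdlib only (`ks_statistic_norm` is an assumed name; this mirrors, but is not, `Statistics.kolmogorovSmirnovTest(parallelData, "norm", 0, 1)`):

```python
import math

def ks_statistic_norm(sample, mu=0.0, sigma=1.0):
    """One-sample KS statistic: sup |F_n(x) - Phi((x - mu) / sigma)|,
    evaluated just before and just after each sorted sample point."""
    xs = sorted(sample)
    n = len(xs)
    cdf = lambda x: 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    return max(max(cdf(x) - i / n, (i + 1) / n - cdf(x))
               for i, x in enumerate(xs))

d = ks_statistic_norm([0.1, 0.15, 0.2, 0.3, 0.25])
```

For this sample the supremum is attained at the smallest point, d ≈ 0.5398, large enough that the reported p-value lets us reject the null hypothesis that the data are standard normal.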
@@ -967,7 +967,7 @@ all prior batches.</li>
     <p><a href="api/scala/index.html#org.apache.spark.mllib.stat.test.StreamingTest"><code>StreamingTest</code></a>
 provides streaming hypothesis testing.</p>
 
-    <div class="highlight"><pre><span class="k">val</span> <span class="n">data</span> <span class="k">=</span> <span class="n">ssc</span><span class="o">.</span><span class="n">textFileStream</span><span class="o">(</span><span class="n">dataDir</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="n">line</span> <span class="k">=&gt;</span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">&quot;,&quot;</span><span class="o">)</span> <span class="k">match</span> <span class="o">{</span>
+    <div class="highlight"><pre><span></span><span class="k">val</span> <span class="n">data</span> <span class="k">=</span> <span class="n">ssc</span><span class="o">.</span><span class="n">textFileStream</span><span class="o">(</span><span class="n">dataDir</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="n">line</span> <span class="k">=&gt;</span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">&quot;,&quot;</span><span class="o">)</span> <span class="k">match</span> <span class="o">{</span>
   <span class="k">case</span> <span class="nc">Array</span><span class="o">(</span><span class="n">label</span><span class="o">,</span> <span class="n">value</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="nc">BinarySample</span><span class="o">(</span><span class="n">label</span><span class="o">.</span><span class="n">toBoolean</span><span class="o">,</span> <span class="n">value</span><span class="o">.</span><span class="n">toDouble</span><span class="o">)</span>
 <span class="o">})</span>
 
@@ -986,7 +986,7 @@ provides streaming hypothesis testing.</p>
     <p><a href="api/java/index.html#org.apache.spark.mllib.stat.test.StreamingTest"><code>StreamingTest</code></a>
 provides streaming hypothesis testing.</p>
 
-    <div class="highlight"><pre><span class="kn">import</span> <span class="nn">org.apache.spark.mllib.stat.test.BinarySample</span><span class="o">;</span>
+    <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">org.apache.spark.mllib.stat.test.BinarySample</span><span class="o">;</span>
 <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.stat.test.StreamingTest</span><span class="o">;</span>
 <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.stat.test.StreamingTestResult</span><span class="o">;</span>
 
@@ -997,11 +997,11 @@ provides streaming hypothesis testing.</p>
       <span class="n">String</span><span class="o">[]</span> <span class="n">ts</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="na">split</span><span class="o">(</span><span class="s">&quot;,&quot;</span><span class="o">);</span>
       <span class="kt">boolean</span> <span class="n">label</span> <span class="o">=</span> <span class="n">Boolean</span><span class="o">.</span><span class="na">parseBoolean</span><span class="o">(</span><span class="n">ts</span><span class="o">[</span><span class="mi">0</span><span class="o">]);</span>
       <span class="kt">double</span> <span class="n">value</span> <span class="o">=</span> <span class="n">Double</span><span class="o">.</span><span class="na">parseDouble</span><span class="o">(</span><span class="n">ts</span><span class="o">[</span><span class="mi">1</span><span class="o">]);</span>
-      <span class="k">return</span> <span class="k">new</span> <span class="nf">BinarySample</span><span class="o">(</span><span class="n">label</span><span class="o">,</span> <span class="n">value</span><span class="o">);</span>
+      <span class="k">return</span> <span class="k">new</span> <span class="n">BinarySample</span><span class="o">(</span><span class="n">label</span><span class="o">,</span> <span class="n">value</span><span class="o">);</span>
     <span class="o">}</span>
   <span class="o">});</span>
 
-<span class="n">StreamingTest</span> <span class="n">streamingTest</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">StreamingTest</span><span class="o">()</span>
+<span class="n">StreamingTest</span> <span class="n">streamingTest</span> <span class="o">=</span> <span class="k">new</span> <span class="n">StreamingTest</span><span class="o">()</span>
   <span class="o">.</span><span class="na">setPeacePeriod</span><span class="o">(</span><span class="mi">0</span><span class="o">)</span>
   <span class="o">.</span><span class="na">setWindowSize</span><span class="o">(</span><span class="mi">0</span><span class="o">)</span>
   <span class="o">.</span><span class="na">setTestMethod</span><span class="o">(</span><span class="s">&quot;welch&quot;</span><span class="o">);</span>
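The streaming-test builder above selects `setTestMethod("welch")`, i.e. Welch's unequal-variance t-test on the two boolean-labeled groups. A hedged sketch of the statistic being computed per batch (plain Python, fixed toy groups; `welch_t` is an assumed name, not the Spark implementation):

```python
import math

def welch_t(a, b):
    """Welch's t statistic: (m1 - m2) / sqrt(s1^2/n1 + s2^2/n2),
    using unbiased (n - 1 denominator) sample variances."""
    def mean_var(xs):
        m = sum(xs) / len(xs)
        v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
        return m, v
    m1, v1 = mean_var(a)
    m2, v2 = mean_var(b)
    return (m1 - m2) / math.sqrt(v1 / len(a) + v2 / len(b))

# two toy groups standing in for the false/true BinarySample streams
t = welch_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Unlike Student's t-test, Welch's variant does not assume equal variances across the two groups, which is why it is the safer default for A/B-style streaming comparisons.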
@@ -1028,7 +1028,7 @@ distribution <code>N(0, 1)</code>, and then map it to <code>N(1, 4)</code>.</p>
 
     <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$"><code>RandomRDDs</code> Scala docs</a> for details on the API.</p>
 
-    <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">import</span> <span class="nn">org.apache.spark.SparkContext</span>
+    <figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span></span><span class="k">import</span> <span class="nn">org.apache.spark.SparkContext</span>
 <span class="k">import</span> <span class="nn">org.apache.spark.mllib.random.RandomRDDs._</span>
 
 <span class="k">val</span> <span class="n">sc</span><span class="k">:</span> <span class="kt">SparkContext</span> <span class="o">=</span> <span class="o">...</span>
@@ -1037,7 +1037,7 @@ distribution <code>N(0, 1)</code>, and then map it to <code>N(1, 4)</code>.</p>
 <span class="c1">// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.</span>
 <span class="k">val</span> <span class="n">u</span> <span class="k">=</span> <span class="n">normalRDD</span><span class="o">(</span><span class="n">sc</span><span class="o">,</span> <span class="mi">1000000L</span><span class="o">,</span> <span class="mi">10</span><span class="o">)</span>
 <span class="c1">// Apply a transform to get a random double RDD following `N(1, 4)`.</span>
-<span class="k">val</span> <span class="n">v</span> <span class="k">=</span> <span class="n">u</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">x</span> <span class="k">=&gt;</span> <span class="mf">1.0</span> <span class="o">+</span> <span class="mf">2.0</span> <span class="o">*</span> <span class="n">x</span><span class="o">)</span></code></pre></div>
+<span class="k">val</span> <span class="n">v</span> <span class="k">=</span> <span class="n">u</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">x</span> <span class="k">=&gt;</span> <span class="mf">1.0</span> <span class="o">+</span> <span class="mf">2.0</span> <span class="o">*</span> <span class="n">x</span><span class="o">)</span></code></pre></figure>
 
   </div>
 
@@ -1049,9 +1049,9 @@ distribution <code>N(0, 1)</code>, and then map it to <code>N(1, 4)</code>.</p>
 
     <p>Refer to the <a href="api/java/org/apache/spark/mllib/random/RandomRDDs"><code>RandomRDDs</code> Java docs</a> for details on the API.</p>
 
-    <div class="highlight"><pre><code class="language-java" data-lang="java"><span class="kn">import</span> <span class="nn">org.apache.spark.SparkContext</span><span class="o">;</span>
+    <figure class="highlight"><pre><code class="language-java" data-lang="java"><span></span><span class="kn">import</span> <span class="nn">org.apache.spark.SparkContext</span><span class="o">;</span>
 <span class="kn">import</span> <span class="nn">org.apache.spark.api.JavaDoubleRDD</span><span class="o">;</span>
-<span class="kn">import</span> <span class="nn">static</span> <span class="n">org</span><span class="o">.</span><span class="na">apache</span><span class="o">.</span><span class="na">spark</span><span class="o">.</span><span class="na">mllib</span><span class="o">.</span><span class="na">random</span><span class="o">.</span><span class="na">RandomRDDs</span><span class="o">.*;</span>
+<span class="kn">import static</span> <span class="nn">org.apache.spark.mllib.random.RandomRDDs.*</span><span class="o">;</span>
 
 <span class="n">JavaSparkContext</span> <span class="n">jsc</span> <span class="o">=</span> <span class="o">...</span>
 
@@ -1064,7 +1064,7 @@ distribution <code>N(0, 1)</code>, and then map it to <code>N(1, 4)</code>.</p>
     <span class="kd">public</span> <span class="n">Double</span> <span class="nf">call</span><span class="o">(</span><span class="n">Double</span> <span class="n">x</span><span class="o">)</span> <span class="o">{</span>
       <span class="k">return</span> <span class="mf">1.0</span> <span class="o">+</span> <span class="mf">2.0</span> <span class="o">*</span> <span class="n">x</span><span class="o">;</span>
     <span class="o">}</span>
-  <span class="o">});</span></code></pre></div>
+  <span class="o">});</span></code></pre></figure>
 
   </div>
 
@@ -1076,15 +1076,15 @@ distribution <code>N(0, 1)</code>, and then map it to <code>N(1, 4)</code>.</p>
 
     <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.random.RandomRDDs"><code>RandomRDDs</code> Python docs</a> for more details on the API.</p>
 
-    <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">pyspark.mllib.random</span> <span class="kn">import</span> <span class="n">RandomRDDs</span>
+    <figure class="highlight"><pre><code class="language-python" data-lang="python"><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.random</span> <span class="kn">import</span> <span class="n">RandomRDDs</span>
 
-<span class="n">sc</span> <span class="o">=</span> <span class="o">...</span> <span class="c"># SparkContext</span>
+<span class="n">sc</span> <span class="o">=</span> <span class="o">...</span> <span class="c1"># SparkContext</span>
 
-<span class="c"># Generate a random double RDD that contains 1 million i.i.d. values drawn from the</span>
-<span class="c"># standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.</span>
+<span class="c1"># Generate a random double RDD that contains 1 million i.i.d. values drawn from the</span>
+<span class="c1"># standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.</span>
 <span class="n">u</span> <span class="o">=</span> <span class="n">RandomRDDs</span><span class="o">.</span><span class="n">normalRDD</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="il">1000000L</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
-<span class="c"># Apply a transform to get a random double RDD following `N(1, 4)`.</span>
-<span class="n">v</span> <span class="o">=</span> <span class="n">u</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="mf">1.0</span> <span class="o">+</span> <span class="mf">2.0</span> <span class="o">*</span> <span class="n">x</span><span class="p">)</span></code></pre></div>
+<span class="c1"># Apply a transform to get a random double RDD following `N(1, 4)`.</span>
+<span class="n">v</span> <span class="o">=</span> <span class="n">u</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="mf">1.0</span> <span class="o">+</span> <span class="mf">2.0</span> <span class="o">*</span> <span class="n">x</span><span class="p">)</span></code></pre></figure>
 
   </div>
 </div>
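As an aside to the patch above: the hunk's `u.map(lambda x: 1.0 + 2.0 * x)` relies on the fact that an affine transform of a standard normal draw `x ~ N(0, 1)` yields `1.0 + 2.0 * x ~ N(1, 4)` (mean 1, standard deviation 2). A minimal plain-Python sketch of that reasoning, without a SparkContext (the list `u` here is just a stand-in for `RandomRDDs.normalRDD`):

```python
import random
import statistics

random.seed(42)

# Stand-in for RandomRDDs.normalRDD(sc, 1000000, 10):
# i.i.d. draws from the standard normal distribution N(0, 1).
u = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# Stand-in for u.map(lambda x: 1.0 + 2.0 * x):
# shift and scale each draw, giving samples from N(1, 4).
v = [1.0 + 2.0 * x for x in u]

print(statistics.mean(v))   # close to 1.0
print(statistics.stdev(v))  # close to 2.0
```

With 100,000 samples the sample mean and standard deviation land within a few hundredths of the theoretical values 1.0 and 2.0.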
@@ -1107,7 +1107,7 @@ to do so.</p>
 
     <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity"><code>KernelDensity</code> Scala docs</a> for details on the API.</p>
 
-    <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.KernelDensity</span>
+    <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.KernelDensity</span>
 <span class="k">import</span> <span class="nn">org.apache.spark.rdd.RDD</span>
 
 <span class="c1">// an RDD of sample data</span>
@@ -1132,7 +1132,7 @@ to do so.</p>
 
     <p>Refer to the <a href="api/java/org/apache/spark/mllib/stat/KernelDensity.html"><code>KernelDensity</code> Java docs</a> for details on the API.</p>
 
-    <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
+    <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
 
 <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span>
 <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.stat.KernelDensity</span><span class="o">;</span>
@@ -1143,7 +1143,7 @@ to do so.</p>
 
 <span class="c1">// Construct the density estimator with the sample data</span>
 <span class="c1">// and a standard deviation for the Gaussian kernels</span>
-<span class="n">KernelDensity</span> <span class="n">kd</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">KernelDensity</span><span class="o">().</span><span class="na">setSample</span><span class="o">(</span><span class="n">data</span><span class="o">).</span><span class="na">setBandwidth</span><span class="o">(</span><span class="mf">3.0</span><span class="o">);</span>
+<span class="n">KernelDensity</span> <span class="n">kd</span> <span class="o">=</span> <span class="k">new</span> <span class="n">KernelDensity</span><span class="o">().</span><span class="na">setSample</span><span class="o">(</span><span class="n">data</span><span class="o">).</span><span class="na">setBandwidth</span><span class="o">(</span><span class="mf">3.0</span><span class="o">);</span>
 
 <span class="c1">// Find density estimates for the given values</span>
 <span class="kt">double</span><span class="o">[]</span> <span class="n">densities</span> <span class="o">=</span> <span class="n">kd</span><span class="o">.</span><span class="na">estimate</span><span class="o">(</span><span class="k">new</span> <span class="kt">double</span><span class="o">[]{-</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">2.0</span><span class="o">,</span> <span class="mf">5.0</span><span class="o">});</span>
@@ -1160,18 +1160,18 @@ to do so.</p>
 
     <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity"><code>KernelDensity</code> Python docs</a> for more details on the API.</p>
 
-    <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">KernelDensity</span>
+    <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">KernelDensity</span>
 
-<span class="c"># an RDD of sample data</span>
+<span class="c1"># an RDD of sample data</span>
 <span class="n">data</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">4.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">,</span> <span class="mf">7.0</span><span class="p">,</span> <span class="mf">8.0</span><span class="p">,</span> <span class="mf">9.0</span><span class="p">,</span> <span class="mf">9.0</span><span class="p">])</span>
 
-<span class="c"># Construct the density estimator with the sample data and a standard deviation for the Gaussian</span>
-<span class="c"># kernels</span>
+<span class="c1"># Construct the density estimator with the sample data and a standard deviation for the Gaussian</span>
+<span class="c1"># kernels</span>
 <span class="n">kd</span> <span class="o">=</span> <span class="n">KernelDensity</span><span class="p">()</span>
 <span class="n">kd</span><span class="o">.</span><span class="n">setSample</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
 <span class="n">kd</span><span class="o">.</span><span class="n">setBandwidth</span><span class="p">(</span><span class="mf">3.0</span><span class="p">)</span>
 
-<span class="c"># Find density estimates for the given values</span>
+<span class="c1"># Find density estimates for the given values</span>
 <span class="n">densities</span> <span class="o">=</span> <span class="n">kd</span><span class="o">.</span><span class="n">estimate</span><span class="p">([</span><span class="o">-</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">])</span>
 </pre></div>
     <div><small>Find full example code at "examples/src/main/python/mllib/kernel_density_estimation_example.py" in the Spark repo.</small></div>
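For readers following the KernelDensity hunks above: the estimator averages a Gaussian kernel of bandwidth `h` centered at each sample point. This is a hypothetical plain-Python sketch of that computation (not Spark's implementation), using the same sample data and bandwidth 3.0 as the `pyspark` example in the diff:

```python
import math

def estimate(sample, bandwidth, points):
    """Gaussian kernel density estimate of `sample` at each of `points`."""
    n = len(sample)
    h = bandwidth
    out = []
    for x in points:
        # Sum the standard normal pdf evaluated at (x - xi) / h for each
        # sample point xi, then normalize by n * h.
        total = sum(
            math.exp(-0.5 * ((x - xi) / h) ** 2) / math.sqrt(2.0 * math.pi)
            for xi in sample
        )
        out.append(total / (n * h))
    return out

data = [1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 9.0, 9.0]
densities = estimate(data, 3.0, [-1.0, 2.0, 5.0])
print(densities)  # densities rise toward the middle of the data, lowest at -1.0
```

The estimate at 5.0 (near the center of the data) comes out larger than the estimate at -1.0 (outside the data's range), matching the intuition behind the `kd.estimate([-1.0, 2.0, 5.0])` call in the example.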


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org