Posted to commits@spark.apache.org by pw...@apache.org on 2013/09/25 02:14:46 UTC

svn commit: r2978 [12/12] - in /dev/incubator/spark/spark-0.8.0-incubating-rc6-docs: ./ css/ img/ js/ js/vendor/

Added: dev/incubator/spark/spark-0.8.0-incubating-rc6-docs/tuning.html
==============================================================================
--- dev/incubator/spark/spark-0.8.0-incubating-rc6-docs/tuning.html (added)
+++ dev/incubator/spark/spark-0.8.0-incubating-rc6-docs/tuning.html Wed Sep 25 00:14:43 2013
@@ -0,0 +1,466 @@
+<!DOCTYPE html>
+<!--[if lt IE 7]>      <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
+<!--[if IE 7]>         <html class="no-js lt-ie9 lt-ie8"> <![endif]-->
+<!--[if IE 8]>         <html class="no-js lt-ie9"> <![endif]-->
+<!--[if gt IE 8]><!--> <html class="no-js"> <!--<![endif]-->
+    <head>
+        <meta charset="utf-8">
+        <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
+        <title>Tuning Spark - Spark 0.8.0 Documentation</title>
+        <meta name="description" content="">
+
+        <link rel="stylesheet" href="css/bootstrap.min.css">
+        <style>
+            body {
+                padding-top: 60px;
+                padding-bottom: 40px;
+            }
+        </style>
+        <meta name="viewport" content="width=device-width">
+        <link rel="stylesheet" href="css/bootstrap-responsive.min.css">
+        <link rel="stylesheet" href="css/main.css">
+
+        <script src="js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script>
+        
+        <link rel="stylesheet" href="css/pygments-default.css">
+
+        <!-- Google analytics script -->
+        <script type="text/javascript">
+          /*
+          var _gaq = _gaq || [];
+          _gaq.push(['_setAccount', 'UA-32518208-1']);
+          _gaq.push(['_trackPageview']);
+
+          (function() {
+            var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
+            ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
+            var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
+          })();
+          */
+        </script>
+
+    </head>
+    <body>
+        <!--[if lt IE 7]>
+            <p class="chromeframe">You are using an outdated browser. <a href="http://browsehappy.com/">Upgrade your browser today</a> or <a href="http://www.google.com/chromeframe/?redirect=true">install Google Chrome Frame</a> to better experience this site.</p>
+        <![endif]-->
+
+        <!-- This code is taken from http://twitter.github.com/bootstrap/examples/hero.html -->
+
+        <div class="navbar navbar-fixed-top" id="topbar">
+            <div class="navbar-inner">
+                <div class="container">
+                    <div class="brand"><a href="index.html">
+                      <img src="img/spark-logo-hd.png" style="height:50px;"/></a><span class="version">0.8.0</span>
+                    </div>
+                    <ul class="nav">
+                        <!--TODO(andyk): Add class="active" attribute to li some how.-->
+                        <li><a href="index.html">Overview</a></li>
+
+                        <li class="dropdown">
+                            <a href="#" class="dropdown-toggle" data-toggle="dropdown">Programming Guides<b class="caret"></b></a>
+                            <ul class="dropdown-menu">
+                                <li><a href="quick-start.html">Quick Start</a></li>
+                                <li><a href="scala-programming-guide.html">Spark in Scala</a></li>
+                                <li><a href="java-programming-guide.html">Spark in Java</a></li>
+                                <li><a href="python-programming-guide.html">Spark in Python</a></li>
+                                <li class="divider"></li>
+                                <li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
+                                <li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
+                                <li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
+                            </ul>
+                        </li>
+                        
+                        <li class="dropdown">
+                            <a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a>
+                            <ul class="dropdown-menu">
+                                <li><a href="api/core/index.html">Spark Core for Java/Scala</a></li>
+                                <li><a href="api/pyspark/index.html">Spark Core for Python</a></li>
+                                <li class="divider"></li>
+                                <li><a href="api/streaming/index.html">Spark Streaming</a></li>
+                                <li><a href="api/mllib/index.html">MLlib (Machine Learning)</a></li>
+                                <li><a href="api/bagel/index.html">Bagel (Pregel on Spark)</a></li>
+                            </ul>
+                        </li>
+
+                        <li class="dropdown">
+                            <a href="#" class="dropdown-toggle" data-toggle="dropdown">Deploying<b class="caret"></b></a>
+                            <ul class="dropdown-menu">
+                                <li><a href="cluster-overview.html">Overview</a></li>
+                                <li><a href="ec2-scripts.html">Amazon EC2</a></li>
+                                <li><a href="spark-standalone.html">Standalone Mode</a></li>
+                                <li><a href="running-on-mesos.html">Mesos</a></li>
+                                <li><a href="running-on-yarn.html">YARN</a></li>
+                            </ul>
+                        </li>
+
+                        <li class="dropdown">
+                            <a href="api.html" class="dropdown-toggle" data-toggle="dropdown">More<b class="caret"></b></a>
+                            <ul class="dropdown-menu">
+                                <li><a href="configuration.html">Configuration</a></li>
+                                <li><a href="monitoring.html">Monitoring</a></li>
+                                <li><a href="tuning.html">Tuning Guide</a></li>
+                                <li><a href="hadoop-third-party-distributions.html">Running with CDH/HDP</a></li>
+                                <li><a href="hardware-provisioning.html">Hardware Provisioning</a></li>
+                                <li><a href="job-scheduling.html">Job Scheduling</a></li>
+                                <li class="divider"></li>
+                                <li><a href="building-with-maven.html">Building Spark with Maven</a></li>
+                                <li><a href="https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark">Contributing to Spark</a></li>
+                            </ul>
+                        </li>
+                    </ul>
+                    <!--<p class="navbar-text pull-right"><span class="version-text">v0.8.0</span></p>-->
+                </div>
+            </div>
+        </div>
+
+        <div class="container" id="content">
+          <h1 class="title">Tuning Spark</h1>
+
+          <ul id="markdown-toc">
+  <li><a href="#data-serialization">Data Serialization</a></li>
+  <li><a href="#memory-tuning">Memory Tuning</a>    <ul>
+      <li><a href="#determining-memory-consumption">Determining Memory Consumption</a></li>
+      <li><a href="#tuning-data-structures">Tuning Data Structures</a></li>
+      <li><a href="#serialized-rdd-storage">Serialized RDD Storage</a></li>
+      <li><a href="#garbage-collection-tuning">Garbage Collection Tuning</a></li>
+    </ul>
+  </li>
+  <li><a href="#other-considerations">Other Considerations</a>    <ul>
+      <li><a href="#level-of-parallelism">Level of Parallelism</a></li>
+      <li><a href="#memory-usage-of-reduce-tasks">Memory Usage of Reduce Tasks</a></li>
+      <li><a href="#broadcasting-large-variables">Broadcasting Large Variables</a></li>
+    </ul>
+  </li>
+  <li><a href="#summary">Summary</a></li>
+</ul>
+
+<p>Because of the in-memory nature of most Spark computations, Spark programs can be bottlenecked
+by any resource in the cluster: CPU, network bandwidth, or memory.
+Most often, if the data fits in memory, the bottleneck is network bandwidth, but sometimes, you
+also need to do some tuning, such as
+<a href="scala-programming-guide.html#rdd-persistence">storing RDDs in serialized form</a>, to
+decrease memory usage.
+This guide will cover two main topics: data serialization, which is crucial for good network
+performance and can also reduce memory use, and memory tuning. We also sketch several smaller topics.</p>
+
+<h1 id="data-serialization">Data Serialization</h1>
+
+<p>Serialization plays an important role in the performance of any distributed application.
+Formats that are slow to serialize objects into, or consume a large number of
+bytes, will greatly slow down the computation.
+Often, this will be the first thing you should tune to optimize a Spark application.
+Spark aims to strike a balance between convenience (allowing you to work with any Java type
+in your operations) and performance. It provides two serialization libraries:</p>
+
+<ul>
+  <li><a href="http://docs.oracle.com/javase/6/docs/api/java/io/Serializable.html">Java serialization</a>:
+By default, Spark serializes objects using Java&#8217;s <code>ObjectOutputStream</code> framework, and can work
+with any class you create that implements
+<a href="http://docs.oracle.com/javase/6/docs/api/java/io/Serializable.html"><code>java.io.Serializable</code></a>.
+You can also control the performance of your serialization more closely by extending
+<a href="http://docs.oracle.com/javase/6/docs/api/java/io/Externalizable.html"><code>java.io.Externalizable</code></a>.
+Java serialization is flexible but often quite slow, and leads to large
+serialized formats for many classes.</li>
+  <li><a href="http://code.google.com/p/kryo/">Kryo serialization</a>: Spark can also use
+the Kryo library (version 2) to serialize objects more quickly. Kryo is significantly
+faster and more compact than Java serialization (often as much as 10x), but does not support all
+<code>Serializable</code> types and requires you to <em>register</em> the classes you&#8217;ll use in the program in advance
+for best performance.</li>
+</ul>
+
+<p>You can switch to using Kryo by calling <code>System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")</code>
+<em>before</em> creating your SparkContext. The only reason Kryo is not the default is the custom
+registration requirement, but we recommend trying it in any network-intensive application.</p>
+
+<p>Finally, to register your classes with Kryo, create a public class that extends
+<a href="api/core/index.html#org.apache.spark.serializer.KryoRegistrator"><code>org.apache.spark.serializer.KryoRegistrator</code></a> and set the
+<code>spark.kryo.registrator</code> system property to point to it, as follows:</p>
+
+<div class="highlight"><pre><code class="scala"><span class="k">import</span> <span class="nn">com.esotericsoftware.kryo.Kryo</span>
+<span class="k">import</span> <span class="nn">org.apache.spark.serializer.KryoRegistrator</span>
+
+<span class="k">class</span> <span class="nc">MyRegistrator</span> <span class="k">extends</span> <span class="nc">KryoRegistrator</span> <span class="o">{</span>
+  <span class="k">override</span> <span class="k">def</span> <span class="n">registerClasses</span><span class="o">(</span><span class="n">kryo</span><span class="k">:</span> <span class="kt">Kryo</span><span class="o">)</span> <span class="o">{</span>
+    <span class="n">kryo</span><span class="o">.</span><span class="n">register</span><span class="o">(</span><span class="n">classOf</span><span class="o">[</span><span class="kt">MyClass1</span><span class="o">])</span>
+    <span class="n">kryo</span><span class="o">.</span><span class="n">register</span><span class="o">(</span><span class="n">classOf</span><span class="o">[</span><span class="kt">MyClass2</span><span class="o">])</span>
+  <span class="o">}</span>
+<span class="o">}</span>
+
+<span class="c1">// Make sure to set these properties *before* creating a SparkContext!</span>
+<span class="nc">System</span><span class="o">.</span><span class="n">setProperty</span><span class="o">(</span><span class="s">&quot;spark.serializer&quot;</span><span class="o">,</span> <span class="s">&quot;org.apache.spark.serializer.KryoSerializer&quot;</span><span class="o">)</span>
+<span class="nc">System</span><span class="o">.</span><span class="n">setProperty</span><span class="o">(</span><span class="s">&quot;spark.kryo.registrator&quot;</span><span class="o">,</span> <span class="s">&quot;mypackage.MyRegistrator&quot;</span><span class="o">)</span>
+<span class="k">val</span> <span class="n">sc</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">SparkContext</span><span class="o">(...)</span>
+</code></pre></div>
+
+<p>The <a href="http://code.google.com/p/kryo/">Kryo documentation</a> describes more advanced
+registration options, such as adding custom serialization code.</p>
+
+<p>If your objects are large, you may also need to increase the <code>spark.kryoserializer.buffer.mb</code>
+system property. The default is 32, but this value needs to be large enough to hold the <em>largest</em>
+object you will serialize.</p>
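+
+<p>For example, a minimal sketch of raising the buffer (the value 64 is only illustrative; pick one large
+enough for your largest object), set alongside the other Kryo properties before creating the SparkContext:</p>
+
+<pre><code>System.setProperty("spark.kryoserializer.buffer.mb", "64")  // illustrative value; must fit the largest serialized object
+</code></pre>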
+
+<p>Finally, if you don&#8217;t register your classes, Kryo will still work, but it will have to store the
+full class name with each object, which is wasteful.</p>
+
+<h1 id="memory-tuning">Memory Tuning</h1>
+
+<p>There are three considerations in tuning memory usage: the <em>amount</em> of memory used by your objects
+(you may want your entire dataset to fit in memory), the <em>cost</em> of accessing those objects, and the
+overhead of <em>garbage collection</em> (if you have high turnover in terms of objects).</p>
+
+<p>By default, Java objects are fast to access, but can easily consume a factor of 2-5x more space
+than the &#8220;raw&#8221; data inside their fields. This is for several reasons:</p>
+
+<ul>
+  <li>Each distinct Java object has an &#8220;object header&#8221;, which is about 16 bytes and contains information
+such as a pointer to its class. For an object with very little data in it (say one <code>Int</code> field), this
+can be bigger than the data.</li>
+  <li>Java Strings have about 40 bytes of overhead over the raw string data (since they store it in an
+array of <code>Char</code>s and keep extra data such as the length), and store each character
+as <em>two</em> bytes due to Unicode. Thus a 10-character string can easily consume 60 bytes.</li>
+  <li>Common collection classes, such as <code>HashMap</code> and <code>LinkedList</code>, use linked data structures, where
+there is a &#8220;wrapper&#8221; object for each entry (e.g. <code>Map.Entry</code>). This object not only has a header,
+but also pointers (typically 8 bytes each) to the next object in the list.</li>
+  <li>Collections of primitive types often store them as &#8220;boxed&#8221; objects such as <code>java.lang.Integer</code>.</li>
+</ul>
+
+<p>This section will discuss how to determine the memory usage of your objects, and how to improve
+it &#8211; either by changing your data structures, or by storing data in a serialized format.
+We will then cover tuning Spark&#8217;s cache size and the Java garbage collector.</p>
+
+<h2 id="determining-memory-consumption">Determining Memory Consumption</h2>
+
+<p>The best way to determine how much memory your dataset will require is to create an RDD, put it into the cache, and look at the SparkContext logs on your driver program. The logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD. You will see messages like this:</p>
+
+<pre><code>INFO BlockManagerMasterActor: Added rdd_0_1 in memory on mbk.local:50311 (size: 717.5 KB, free: 332.3 MB)
+</code></pre>
+
+<p>This means that partition 1 of RDD 0 consumed 717.5 KB.</p>
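+
+<p>A minimal sketch of this workflow is below; the input path and RDD name are hypothetical, and the
+<code>count()</code> is only there to force the RDD to be computed and cached so the log lines appear:</p>
+
+<pre><code>val lines = sc.textFile("hdfs://...")   // hypothetical input
+lines.cache()                            // ask Spark to keep the partitions in memory
+lines.count()                            // materialize the RDD; the driver log then reports each partition's size
+</code></pre>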
+
+<h2 id="tuning-data-structures">Tuning Data Structures</h2>
+
+<p>The first way to reduce memory consumption is to avoid the Java features that add overhead, such as
+pointer-based data structures and wrapper objects. There are several ways to do this:</p>
+
+<ol>
+  <li>Design your data structures to prefer arrays of objects, and primitive types, instead of the
+standard Java or Scala collection classes (e.g. <code>HashMap</code>). The <a href="http://fastutil.di.unimi.it">fastutil</a>
+library provides convenient collection classes for primitive types that are compatible with the
+Java standard library (a brief sketch follows this list).</li>
+  <li>Avoid nested structures with a lot of small objects and pointers when possible.</li>
+  <li>Consider using numeric IDs or enumeration objects instead of strings for keys.</li>
+  <li>If you have less than 32 GB of RAM, set the JVM flag <code>-XX:+UseCompressedOops</code> to make pointers be
+four bytes instead of eight. Also, on Java 7 or later, try <code>-XX:+UseCompressedStrings</code> to store
+ASCII strings as just 8 bits per character. You can add these options in
+<a href="configuration.html#environment-variables-in-spark-envsh"><code>spark-env.sh</code></a>.</li>
+</ol>
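+
+<p>To make the first point concrete, here is a minimal sketch contrasting a generic Scala map, which boxes
+its keys and values and wraps each entry in an object, with a fastutil map specialized for primitives
+(the stored values are purely illustrative):</p>
+
+<pre><code>import it.unimi.dsi.fastutil.ints.Int2IntOpenHashMap
+import scala.collection.mutable
+
+// Generic collection: every Int key and value is boxed, and each entry gets a wrapper object.
+val boxed = new mutable.HashMap[Int, Int]()
+boxed(1) = 42
+
+// Primitive-specialized collection: keys and values live in flat int arrays, with no per-entry objects.
+val primitive = new Int2IntOpenHashMap()
+primitive.put(1, 42)
+</code></pre>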
+
+<h2 id="serialized-rdd-storage">Serialized RDD Storage</h2>
+
+<p>When your objects are still too large to efficiently store despite this tuning, a much simpler way
+to reduce memory usage is to store them in <em>serialized</em> form, using the serialized StorageLevels in
+the <a href="scala-programming-guide.html#rdd-persistence">RDD persistence API</a>, such as <code>MEMORY_ONLY_SER</code>.
+Spark will then store each RDD partition as one large byte array.
+The only downside of storing data in serialized form is slower access times, due to having to
+deserialize each object on the fly.
+We highly recommend <a href="#data-serialization">using Kryo</a> if you want to cache data in serialized form, as
+it leads to much smaller sizes than Java serialization (and certainly than raw Java objects).</p>
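+
+<p>A minimal sketch, assuming <code>rdd</code> is an already-built RDD you want to cache in serialized form:</p>
+
+<pre><code>import org.apache.spark.storage.StorageLevel
+
+// Each cached partition is stored as a single serialized byte array rather than as many Java objects.
+rdd.persist(StorageLevel.MEMORY_ONLY_SER)
+</code></pre>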
+
+<h2 id="garbage-collection-tuning">Garbage Collection Tuning</h2>
+
+<p>JVM garbage collection can be a problem when you have large &#8220;churn&#8221; in terms of the RDDs
+stored by your program. (It is usually not a problem in programs that just read an RDD once
+and then run many operations on it.) When Java needs to evict old objects to make room for new ones, it will
+need to trace through all your Java objects and find the unused ones. The main point to remember here is
+that <em>the cost of garbage collection is proportional to the number of Java objects</em>, so using data
+structures with fewer objects (e.g. an array of <code>Int</code>s instead of a <code>LinkedList</code>) greatly lowers
+this cost. An even better method is to persist objects in serialized form, as described above: then
+there is only <em>one</em> object (a byte array) per RDD partition. If GC is a problem, the first
+thing to try is <a href="#serialized-rdd-storage">serialized caching</a>.</p>
+
+<p>GC can also be a problem due to interference between your tasks&#8217; working memory (the
+amount of space needed to run the task) and the RDDs cached on your nodes. We will discuss how to control
+the space allocated to the RDD cache to mitigate this.</p>
+
+<p><strong>Measuring the Impact of GC</strong></p>
+
+<p>The first step in GC tuning is to collect statistics on how frequently garbage collection occurs and the amount of
+time spent in GC. This can be done by adding <code>-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</code> to your
+<code>SPARK_JAVA_OPTS</code> environment variable. Next time your Spark job is run, you will see messages printed in the worker&#8217;s logs
+each time a garbage collection occurs. Note these logs will be on your cluster&#8217;s worker nodes (in the <code>stdout</code> files in
+their work directories), <em>not</em> on your driver program.</p>
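+
+<p>For instance, the corresponding line in <code>conf/spark-env.sh</code> might look like the following sketch
+(appending to any options you already set):</p>
+
+<pre><code>export SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
+</code></pre>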
+
+<p><strong>Cache Size Tuning</strong></p>
+
+<p>One important configuration parameter for GC is the amount of memory that should be used for caching RDDs.
+By default, Spark uses 66% of the configured executor memory (<code>spark.executor.memory</code> or <code>SPARK_MEM</code>) to
+cache RDDs. This means that the remaining 34% of memory is available for any objects created during task execution.</p>
+
+<p>If your tasks slow down and you find that your JVM is garbage-collecting frequently or running out of
+memory, lowering this value will help reduce memory consumption. To change it to, say, 50%, you can call
+<code>System.setProperty("spark.storage.memoryFraction", "0.5")</code>. Combined with serialized caching,
+using a smaller cache should be sufficient to mitigate most garbage collection problems.
+If you are interested in further tuning the Java GC, continue reading below.</p>
+
+<p><strong>Advanced GC Tuning</strong></p>
+
+<p>To further tune garbage collection, we first need to understand some basic information about memory management in the JVM:</p>
+
+<ul>
+  <li>
+    <p>Java heap space is divided into two regions, Young and Old. The Young generation is meant to hold short-lived objects
+while the Old generation is intended for objects with longer lifetimes.</p>
+  </li>
+  <li>
+    <p>The Young generation is further divided into three regions [Eden, Survivor1, Survivor2].</p>
+  </li>
+  <li>
+    <p>A simplified description of the garbage collection procedure: When Eden is full, a minor GC is run on Eden and objects
+that are alive from Eden and Survivor1 are copied to Survivor2. The Survivor regions are swapped. If an object is old
+enough or Survivor2 is full, it is moved to Old. Finally, when Old is close to full, a full GC is invoked.</p>
+  </li>
+</ul>
+
+<p>The goal of GC tuning in Spark is to ensure that only long-lived RDDs are stored in the Old generation and that
+the Young generation is sufficiently sized to store short-lived objects. This will help avoid full GCs to collect
+temporary objects created during task execution. Some steps which may be useful are:</p>
+
+<ul>
+  <li>
+    <p>Check if there are too many garbage collections by collecting GC stats. If a full GC is invoked multiple times
+before a task completes, it means that there isn&#8217;t enough memory available for executing tasks.</p>
+  </li>
+  <li>
+    <p>In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of memory used for caching.
+This can be done using the <code>spark.storage.memoryFraction</code> property. It is better to cache fewer objects than to slow
+down task execution!</p>
+  </li>
+  <li>
+    <p>If there are too many minor collections but not many major GCs, allocating more memory for Eden would help. You
+can set the size of Eden to be an over-estimate of how much memory each task will need. If the size of Eden
+is determined to be <code>E</code>, then you can set the size of the Young generation using the option <code>-Xmn=4/3*E</code>. (The scaling
+up by 4/3 is to account for space used by survivor regions as well.)</p>
+  </li>
+  <li>
+    <p>As an example, if your task is reading data from HDFS, the amount of memory used by the task can be estimated from
+the size of the data block read from HDFS. Note that a decompressed block is often 2 or 3 times the
+size of the on-disk block. So if we wish to have 3 or 4 tasks&#8217; worth of working space, and the HDFS block size is 64 MB,
+we can estimate the size of Eden to be <code>4*3*64MB</code> (see the worked numbers after this list).</p>
+  </li>
+  <li>
+    <p>Monitor how the frequency and time taken by garbage collection changes with the new settings.</p>
+  </li>
+</ul>
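+
+<p>Plugging in the numbers from the HDFS example above (all of them illustrative):</p>
+
+<pre><code>E    = 4 tasks * 3 (decompression factor) * 64 MB  = 768 MB   (estimated Eden size)
+-Xmn = 4/3 * E                                     = 1024 MB  (Young generation size)
+</code></pre>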
+
+<p>Our experience suggests that the effect of GC tuning depends on your application and the amount of memory available.
+There are <a href="http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html">many more tuning options</a> described online,
+but at a high level, managing how frequently full GC takes place can help in reducing the overhead.</p>
+
+<h1 id="other-considerations">Other Considerations</h1>
+
+<h2 id="level-of-parallelism">Level of Parallelism</h2>
+
+<p>Clusters will not be fully utilized unless you set the level of parallelism for each operation high
+enough. Spark automatically sets the number of &#8220;map&#8221; tasks to run on each file according to its size
+(though you can control it through optional parameters to <code>SparkContext.textFile</code>, etc), and for
+distributed &#8220;reduce&#8221; operations, such as <code>groupByKey</code> and <code>reduceByKey</code>, it uses the largest
+parent RDD&#8217;s number of partitions. You can pass the level of parallelism as a second argument
+(see the <a href="api/core/index.html#org.apache.spark.rdd.PairRDDFunctions"><code>spark.PairRDDFunctions</code></a> documentation),
+or set the system property <code>spark.default.parallelism</code> to change the default.
+In general, we recommend 2-3 tasks per CPU core in your cluster.</p>
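+
+<p>A minimal sketch of both approaches; the figures are illustrative and assume a cluster with roughly 16-24 cores:</p>
+
+<pre><code>// Raise the default level of parallelism (roughly 2-3 tasks per core), before creating the SparkContext.
+System.setProperty("spark.default.parallelism", "48")
+
+// Or pass the level of parallelism directly to an operation.
+val counts = pairs.reduceByKey(_ + _, 48)   // `pairs` is a hypothetical key-value RDD
+</code></pre>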
+
+<h2 id="memory-usage-of-reduce-tasks">Memory Usage of Reduce Tasks</h2>
+
+<p>Sometimes, you will get an OutOfMemoryError not because your RDDs don&#8217;t fit in memory, but because the
+working set of one of your tasks, such as one of the reduce tasks in <code>groupByKey</code>, was too large.
+Spark&#8217;s shuffle operations (<code>sortByKey</code>, <code>groupByKey</code>, <code>reduceByKey</code>, <code>join</code>, etc) build a hash table
+within each task to perform the grouping, which can often be large. The simplest fix here is to
+<em>increase the level of parallelism</em>, so that each task&#8217;s input set is smaller. Spark can efficiently
+support tasks as short as 200 ms, because it reuses one worker JVM across many tasks and has
+a low task launching cost, so you can safely increase the level of parallelism to more than the
+number of cores in your cluster.</p>
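+
+<p>For example, if a <code>groupByKey</code> over a hypothetical <code>pairs</code> RDD is running out of memory,
+a sketch of the fix is simply to request more partitions so each task builds a smaller hash table:</p>
+
+<pre><code>val grouped = pairs.groupByKey(400)   // 400 is illustrative; more partitions than cores is fine
+</code></pre>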
+
+<h2 id="broadcasting-large-variables">Broadcasting Large Variables</h2>
+
+<p>Using the <a href="scala-programming-guide.html#broadcast-variables">broadcast functionality</a>
+available in <code>SparkContext</code> can greatly reduce the size of each serialized task, and the cost
+of launching a job over a cluster. If your tasks use any large object from the driver program
+inside of them (e.g. a static lookup table), consider turning it into a broadcast variable.
+Spark prints the serialized size of each task on the master, so you can look at that to
+decide whether your tasks are too large; in general tasks larger than about 20 KB are probably
+worth optimizing.</p>
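+
+<p>A minimal sketch, assuming <code>lookupTable</code> is a large map built on the driver and <code>ids</code> is an RDD:</p>
+
+<pre><code>// Ship the table to each executor once, instead of serializing it into every task closure.
+val broadcastTable = sc.broadcast(lookupTable)
+val resolved = ids.map(id =&gt; broadcastTable.value.getOrElse(id, "unknown"))
+</code></pre>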
+
+<h1 id="summary">Summary</h1>
+
+<p>This has been a short guide to point out the main concerns you should know about when tuning a
+Spark application &#8211; most importantly, data serialization and memory tuning. For most programs,
+switching to Kryo serialization and persisting data in serialized form will solve most common
+performance issues. Feel free to ask on the
+<a href="http://groups.google.com/group/spark-users">Spark mailing list</a> about other tuning best practices.</p>
+
+
+            <footer>
+              <hr>
+              <p style="text-align: center; veritcal-align: middle; color: #999;">
+                Apache Spark is an effort undergoing incubation at the Apache Software Foundation.
+                <a href="http://incubator.apache.org">
+                  <img style="margin-left: 20px;" src="img/incubator-logo.png" />
+                </a>
+              </p>
+            </footer>
+
+        </div> <!-- /container -->
+
+        <script src="js/vendor/jquery-1.8.0.min.js"></script>
+        <script src="js/vendor/bootstrap.min.js"></script>
+        <script src="js/main.js"></script>
+        
+        <!-- A script to fix internal hash links because we have an overlapping top bar.
+             Based on https://github.com/twitter/bootstrap/issues/193#issuecomment-2281510 -->
+        <script>
+          $(function() {
+            function maybeScrollToHash() {
+              if (window.location.hash && $(window.location.hash).length) {
+                var newTop = $(window.location.hash).offset().top - $('#topbar').height() - 5;
+                $(window).scrollTop(newTop);
+              }
+            }
+            $(window).bind('hashchange', function() {
+              maybeScrollToHash();
+            });
+            // Scroll now too in case we had opened the page on a hash, but wait 1 ms because some browsers
+            // will try to do *their* initial scroll after running the onReady handler.
+            setTimeout(function() { maybeScrollToHash(); }, 1)
+          })
+        </script>
+
+    </body>
+</html>