Posted to commits@spark.apache.org by we...@apache.org on 2017/03/07 19:32:40 UTC
[2/3] spark git commit: [SPARK-19516][DOC] update public doc to use SparkSession instead of SparkContext

http://git-wip-us.apache.org/repos/asf/spark/blob/d69aeeaf/docs/quick-start.md
----------------------------------------------------------------------
diff --git a/docs/quick-start.md b/docs/quick-start.md
index aa4319a..b88ae5f 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -10,12 +10,13 @@ description: Quick start tutorial for Spark SPARK_VERSION_SHORT
 This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark's
 interactive shell (in Python or Scala),
 then show how to write applications in Java, Scala, and Python.
-See the [programming guide](programming-guide.html) for a more complete reference.
 
 To follow along with this guide, first download a packaged release of Spark from the
 [Spark website](http://spark.apache.org/downloads.html). Since we won't be using HDFS,
 you can download a package for any version of Hadoop.
 
+Note that, before Spark 2.0, the main programming interface of Spark was the Resilient Distributed Dataset (RDD). After Spark 2.0, RDDs are replaced by Datasets, which are strongly typed like RDDs but have richer optimizations under the hood. The RDD interface is still supported, and you can find a more complete reference in the [RDD programming guide](rdd-programming-guide.html). However, we highly recommend switching to Dataset, which has better performance than RDD. See the [SQL programming guide](sql-programming-guide.html) for more information about Dataset.
+
 # Interactive Analysis with the Spark Shell
 
 ## Basics
@@ -29,28 +30,28 @@ or Python. Start it by running the following in the Spark directory:
 
     ./bin/spark-shell
 
-Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let's make a new RDD from the text of the README file in the Spark source directory:
+Spark's primary abstraction is a distributed collection of items called a Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files) or by transforming other Datasets. Let's make a new Dataset from the text of the README file in the Spark source directory:
 
 {% highlight scala %}
-scala> val textFile = sc.textFile("README.md")
-textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:25
+scala> val textFile = spark.read.textFile("README.md")
+textFile: org.apache.spark.sql.Dataset[String] = [value: string]
 {% endhighlight %}
 
-RDDs have _[actions](programming-guide.html#actions)_, which return values, and _[transformations](programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:
+You can get values from a Dataset directly by calling actions, or transform the Dataset to get a new one. For more details, please read the _[API doc](api/scala/index.html#org.apache.spark.sql.Dataset)_.
 
 {% highlight scala %}
-scala> textFile.count() // Number of items in this RDD
+scala> textFile.count() // Number of items in this Dataset
 res0: Long = 126 // May be different from yours as README.md will change over time, similar to other outputs
 
-scala> textFile.first() // First item in this RDD
+scala> textFile.first() // First item in this Dataset
 res1: String = # Apache Spark
 {% endhighlight %}
 
-Now let's use a transformation. We will use the [`filter`](programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.
+Now let's transform this Dataset to a new one. We call `filter` to return a new Dataset with a subset of the items in the file.
 
 {% highlight scala %}
 scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
-linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:27
+linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]
 {% endhighlight %}
 
 We can chain together transformations and actions:
@@ -65,32 +66,32 @@ res3: Long = 15
 
     ./bin/pyspark
 
-Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let's make a new RDD from the text of the README file in the Spark source directory:
+Spark's primary abstraction is a distributed collection of items called a Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files) or by transforming other Datasets. Due to Python's dynamic nature, we don't need the Dataset to be strongly typed in Python. As a result, all Datasets in Python are Dataset[Row], and we call such a Dataset a `DataFrame` to be consistent with the data frame concept in Pandas and R. Let's make a new DataFrame from the text of the README file in the Spark source directory:
 
 {% highlight python %}
->>> textFile = sc.textFile("README.md")
+>>> textFile = spark.read.text("README.md")
 {% endhighlight %}
 
-RDDs have _[actions](programming-guide.html#actions)_, which return values, and _[transformations](programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:
+You can get values from a DataFrame directly by calling actions, or transform the DataFrame to get a new one. For more details, please read the _[API doc](api/python/index.html#pyspark.sql.DataFrame)_.
 
 {% highlight python %}
->>> textFile.count()  # Number of items in this RDD
+>>> textFile.count()  # Number of rows in this DataFrame
 126
 
->>> textFile.first()  # First item in this RDD
-u'# Apache Spark'
+>>> textFile.first()  # First row in this DataFrame
+Row(value=u'# Apache Spark')
 {% endhighlight %}
 
-Now let's use a transformation. We will use the [`filter`](programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.
+Now let's transform this DataFrame to a new one. We call `filter` to return a new DataFrame with a subset of the lines in the file.
 
 {% highlight python %}
->>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)
+>>> linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
 {% endhighlight %}
 
 We can chain together transformations and actions:
 
 {% highlight python %}
->>> textFile.filter(lambda line: "Spark" in line).count()  # How many lines contain "Spark"?
+>>> textFile.filter(textFile.value.contains("Spark")).count()  # How many lines contain "Spark"?
 15
 {% endhighlight %}
 
@@ -98,8 +99,8 @@ We can chain together transformations and actions:
 </div>
 
 
-## More on RDD Operations
-RDD actions and transformations can be used for more complex computations. Let's say we want to find the line with the most words:
+## More on Dataset Operations
+Dataset actions and transformations can be used for more complex computations. Let's say we want to find the line with the most words:
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -109,7 +110,7 @@ scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a
 res4: Long = 15
 {% endhighlight %}
 
-This first maps a line to an integer value, creating a new RDD. `reduce` is called on that RDD to find the largest line count. The arguments to `map` and `reduce` are Scala function literals (closures), and can use any language feature or Scala/Java library. For example, we can easily call functions declared elsewhere. We'll use `Math.max()` function to make this code easier to understand:
+This first maps a line to an integer value, creating a new Dataset. `reduce` is called on that Dataset to find the largest word count. The arguments to `map` and `reduce` are Scala function literals (closures), and can use any language feature or Scala/Java library. For example, we can easily call functions declared elsewhere. We'll use the `Math.max()` function to make this code easier to understand:
 
 {% highlight scala %}
 scala> import java.lang.Math
@@ -122,11 +123,11 @@ res5: Int = 15
 One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:
 
 {% highlight scala %}
-scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
-wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:28
+scala> val wordCounts = textFile.flatMap(line => line.split(" ")).groupByKey(identity).count()
+wordCounts: org.apache.spark.sql.Dataset[(String, Long)] = [value: string, count(1): bigint]
 {% endhighlight %}
 
-Here, we combined the [`flatMap`](programming-guide.html#transformations), [`map`](programming-guide.html#transformations), and [`reduceByKey`](programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the [`collect`](programming-guide.html#actions) action:
+Here, we call `flatMap` to transform a Dataset of lines to a Dataset of words, and then combine `groupByKey` and `count` to compute the per-word counts in the file as a Dataset of (String, Long) pairs. To collect the word counts in our shell, we can call `collect`:
 
 {% highlight scala %}
 scala> wordCounts.collect()
@@ -137,37 +138,24 @@ res6: Array[(String, Int)] = Array((means,1), (under,2), (this,3), (Because,1),
 <div data-lang="python" markdown="1">
 
 {% highlight python %}
->>> textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)
-15
+>>> from pyspark.sql.functions import *
+>>> textFile.select(size(split(textFile.value, "\s+")).name("numWords")).agg(max(col("numWords"))).collect()
+[Row(max(numWords)=15)]
 {% endhighlight %}
 
-This first maps a line to an integer value, creating a new RDD. `reduce` is called on that RDD to find the largest line count. The arguments to `map` and `reduce` are Python [anonymous functions (lambdas)](https://docs.python.org/2/reference/expressions.html#lambda),
-but we can also pass any top-level Python function we want.
-For example, we'll define a `max` function to make this code easier to understand:
-
-{% highlight python %}
->>> def max(a, b):
-...     if a > b:
-...         return a
-...     else:
-...         return b
-...
-
->>> textFile.map(lambda line: len(line.split())).reduce(max)
-15
-{% endhighlight %}
+This first maps a line to an integer value and aliases it as "numWords", creating a new DataFrame. `agg` is called on that DataFrame to find the largest word count. The arguments to `select` and `agg` are both _[Column](api/python/index.html#pyspark.sql.Column)_ objects; we can use `df.colName` to get a column from a DataFrame. We can also import `pyspark.sql.functions`, which provides many convenient functions for building a new Column from an old one.
 
 One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:
 
 {% highlight python %}
->>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
+>>> wordCounts = textFile.select(explode(split(textFile.value, "\s+")).alias("word")).groupBy("word").count()
 {% endhighlight %}
 
-Here, we combined the [`flatMap`](programming-guide.html#transformations), [`map`](programming-guide.html#transformations), and [`reduceByKey`](programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (string, int) pairs. To collect the word counts in our shell, we can use the [`collect`](programming-guide.html#actions) action:
+Here, we use the `explode` function in `select` to transform a DataFrame of lines into a DataFrame of words, and then combine `groupBy` and `count` to compute the per-word counts in the file as a DataFrame of 2 columns: "word" and "count". To collect the word counts in our shell, we can call `collect`:
 
 {% highlight python %}
 >>> wordCounts.collect()
-[(u'and', 9), (u'A', 1), (u'webpage', 1), (u'README', 1), (u'Note', 1), (u'"local"', 1), (u'variable', 1), ...]
+[Row(word=u'online', count=1), Row(word=u'graphs', count=1), ...]
 {% endhighlight %}
 
 </div>
@@ -181,7 +169,7 @@ Spark also supports pulling data sets into a cluster-wide in-memory cache. This
 
 {% highlight scala %}
 scala> linesWithSpark.cache()
-res7: linesWithSpark.type = MapPartitionsRDD[2] at filter at <console>:27
+res7: linesWithSpark.type = [value: string]
 
 scala> linesWithSpark.count()
 res8: Long = 15
@@ -193,7 +181,7 @@ res9: Long = 15
 It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is
 that these same functions can be used on very large data sets, even when they are striped across
 tens or hundreds of nodes. You can also do this interactively by connecting `bin/spark-shell` to
-a cluster, as described in the [programming guide](programming-guide.html#initializing-spark).
+a cluster, as described in the [RDD programming guide](rdd-programming-guide.html#using-the-shell).
 
 </div>
 <div data-lang="python" markdown="1">
@@ -211,7 +199,7 @@ a cluster, as described in the [programming guide](programming-guide.html#initia
 It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is
 that these same functions can be used on very large data sets, even when they are striped across
 tens or hundreds of nodes. You can also do this interactively by connecting `bin/pyspark` to
-a cluster, as described in the [programming guide](programming-guide.html#initializing-spark).
+a cluster, as described in the [RDD programming guide](rdd-programming-guide.html#using-the-shell).
 
 </div>
 </div>
@@ -228,20 +216,17 @@ named `SimpleApp.scala`:
 
 {% highlight scala %}
 /* SimpleApp.scala */
-import org.apache.spark.SparkContext
-import org.apache.spark.SparkContext._
-import org.apache.spark.SparkConf
+import org.apache.spark.sql.SparkSession
 
 object SimpleApp {
   def main(args: Array[String]) {
     val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
-    val conf = new SparkConf().setAppName("Simple Application")
-    val sc = new SparkContext(conf)
-    val logData = sc.textFile(logFile, 2).cache()
+    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
+    val logData = spark.read.textFile(logFile).cache()
     val numAs = logData.filter(line => line.contains("a")).count()
     val numBs = logData.filter(line => line.contains("b")).count()
     println(s"Lines with a: $numAs, Lines with b: $numBs")
-    sc.stop()
+    spark.stop()
   }
 }
 {% endhighlight %}
@@ -251,16 +236,13 @@ Subclasses of `scala.App` may not work correctly.
 
 This program just counts the number of lines containing 'a' and the number containing 'b' in the
 Spark README. Note that you'll need to replace YOUR_SPARK_HOME with the location where Spark is
-installed. Unlike the earlier examples with the Spark shell, which initializes its own SparkContext,
-we initialize a SparkContext as part of the program.
+installed. Unlike the earlier examples with the Spark shell, which initializes its own SparkSession,
+we initialize a SparkSession as part of the program.
 
-We pass the SparkContext constructor a 
-[SparkConf](api/scala/index.html#org.apache.spark.SparkConf)
-object which contains information about our
-application. 
+We call `SparkSession.builder` to construct a `SparkSession`, then set the application name, and finally call `getOrCreate` to get the `SparkSession` instance.
 
-Our application depends on the Spark API, so we'll also include an sbt configuration file, 
-`build.sbt`, which explains that Spark is a dependency. This file also adds a repository that 
+Our application depends on the Spark API, so we'll also include an sbt configuration file,
+`build.sbt`, which explains that Spark is a dependency. This file also adds a repository that
 Spark depends on:
 
 {% highlight scala %}
@@ -270,7 +252,7 @@ version := "1.0"
 
 scalaVersion := "{{site.SCALA_VERSION}}"
 
-libraryDependencies += "org.apache.spark" %% "spark-core" % "{{site.SPARK_VERSION}}"
+libraryDependencies += "org.apache.spark" %% "spark-sql" % "{{site.SPARK_VERSION}}"
 {% endhighlight %}
 
 For sbt to work correctly, we'll need to layout `SimpleApp.scala` and `build.sbt`
@@ -309,34 +291,28 @@ We'll create a very simple Spark application, `SimpleApp.java`:
 
 {% highlight java %}
 /* SimpleApp.java */
-import org.apache.spark.api.java.*;
-import org.apache.spark.SparkConf;
-import org.apache.spark.api.java.function.Function;
+import org.apache.spark.sql.SparkSession;
 
 public class SimpleApp {
   public static void main(String[] args) {
     String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system
-    SparkConf conf = new SparkConf().setAppName("Simple Application");
-    JavaSparkContext sc = new JavaSparkContext(conf);
-    JavaRDD<String> logData = sc.textFile(logFile).cache();
+    SparkSession spark = SparkSession.builder().appName("Simple Application").getOrCreate();
+    Dataset<String> logData = spark.read().textFile(logFile).cache();
 
     long numAs = logData.filter(s -> s.contains("a")).count();
     long numBs = logData.filter(s -> s.contains("b")).count();
 
     System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
-    
-    sc.stop();
+
+    spark.stop();
   }
 }
 {% endhighlight %}
 
-This program just counts the number of lines containing 'a' and the number containing 'b' in a text
-file. Note that you'll need to replace YOUR_SPARK_HOME with the location where Spark is installed.
-As with the Scala example, we initialize a SparkContext, though we use the special
-`JavaSparkContext` class to get a Java-friendly one. We also create RDDs (represented by
-`JavaRDD`) and run transformations on them. Finally, we pass functions to Spark by creating classes
-that extend `spark.api.java.function.Function`. The
-[Spark programming guide](programming-guide.html) describes these differences in more detail.
+This program just counts the number of lines containing 'a' and the number containing 'b' in the
+Spark README. Note that you'll need to replace YOUR_SPARK_HOME with the location where Spark is
+installed. Unlike the earlier examples with the Spark shell, which initializes its own SparkSession,
+we initialize a SparkSession as part of the program.
 
 To build the program, we also write a Maven `pom.xml` file that lists Spark as a dependency.
 Note that Spark artifacts are tagged with a Scala version.
@@ -352,7 +328,7 @@ Note that Spark artifacts are tagged with a Scala version.
   <dependencies>
     <dependency> <!-- Spark dependency -->
       <groupId>org.apache.spark</groupId>
-      <artifactId>spark-core_{{site.SCALA_BINARY_VERSION}}</artifactId>
+      <artifactId>spark-sql_{{site.SCALA_BINARY_VERSION}}</artifactId>
       <version>{{site.SPARK_VERSION}}</version>
     </dependency>
   </dependencies>
@@ -395,27 +371,25 @@ As an example, we'll create a simple Spark application, `SimpleApp.py`:
 
 {% highlight python %}
 """SimpleApp.py"""
-from pyspark import SparkContext
+from pyspark.sql import SparkSession
 
 logFile = "YOUR_SPARK_HOME/README.md"  # Should be some file on your system
-sc = SparkContext("local", "Simple App")
-logData = sc.textFile(logFile).cache()
+spark = SparkSession.builder.appName("Simple App").master("local").getOrCreate()
+logData = spark.read.text(logFile).cache()
 
-numAs = logData.filter(lambda s: 'a' in s).count()
-numBs = logData.filter(lambda s: 'b' in s).count()
+numAs = logData.filter(logData.value.contains('a')).count()
+numBs = logData.filter(logData.value.contains('b')).count()
 
 print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
 
-sc.stop()
+spark.stop()
 {% endhighlight %}
 
 
 This program just counts the number of lines containing 'a' and the number containing 'b' in a
 text file.
 Note that you'll need to replace YOUR_SPARK_HOME with the location where Spark is installed.
-As with the Scala and Java examples, we use a SparkContext to create RDDs.
-We can pass Python functions to Spark, which are automatically serialized along with any variables
-that they reference.
+As with the Scala and Java examples, we use a SparkSession to create Datasets.
 For applications that use custom classes or third-party libraries, we can also add code
 dependencies to `spark-submit` through its `--py-files` argument by packaging them into a
 .zip file (see `spark-submit --help` for details).
@@ -438,8 +412,7 @@ Lines with a: 46, Lines with b: 23
 # Where to Go from Here
 Congratulations on running your first Spark application!
 
-* For an in-depth overview of the API, start with the [Spark programming guide](programming-guide.html),
-  or see "Programming Guides" menu for other components.
+* For an in-depth overview of the API, start with the [RDD programming guide](rdd-programming-guide.html) and the [SQL programming guide](sql-programming-guide.html), or see the "Programming Guides" menu for other components.
 * For running applications on a cluster, head to the [deployment overview](cluster-overview.html).
 * Finally, Spark includes several samples in the `examples` directory
 ([Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples),


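For readers without a Spark installation at hand, the word-count pipeline that the patched Python section builds from `split`, `explode`, `groupBy`, and `count` can be approximated in plain Python. This is only a rough sketch of what the pipeline computes, not part of the commit: the helper name `word_counts` and the sample lines are hypothetical, and edge cases such as empty lines may not match Spark exactly.

```python
import re
from collections import Counter


def word_counts(lines):
    # Hypothetical helper: tallies words across lines, roughly like
    # select(explode(split(value, "\s+"))).groupBy("word").count().
    counts = Counter()
    for line in lines:
        # split(value, "\s+"): break the line on runs of whitespace
        words = re.split(r"\s+", line)
        # explode yields one record per word; groupBy + count tallies them
        counts.update(words)
    return counts


# Made-up sample input standing in for the lines of README.md
sample = ["# Apache Spark", "Spark is a fast engine", "a fast engine"]
counts = word_counts(sample)
print(counts["Spark"], counts["fast"], counts["#"])  # prints: 2 2 1
```

Unlike the DataFrame version, this runs eagerly on a single machine; the point is only to show the shape of the computation that Spark distributes.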